@ShangkunLi
Collaborator

Sorry for such a large PR.

Counter Classification

We classify counters into three types:

  1. root: no parent, has child(ren)
  2. relay: has parent, has child(ren)
  3. leaf: has parent, no children

Every counter op must be mapped onto the tile array, but only a leaf counter has self-increment logic in its FU. Root and relay counters each have only a register that stores the counter value; that value is updated by the off-array affine controller.
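
The classification above can be sketched in a few lines of C++. This is only an illustration: `CounterType`, `classifyCounter`, `incrementsInFU`, and `counterTypeName` are hypothetical names, not code from this PR. The sketch assumes that a counter with neither parent nor children is treated as a leaf, which matches the single-counter example below where the only counter is tagged "leaf".

```cpp
#include <cassert>
#include <string>

// Hypothetical model of a counter's position in the counter hierarchy.
enum class CounterType { Root, Relay, Leaf };

// Classifies a counter from its parent/child relationships.
// Assumption: a lone counter (no parent, no children) counts as a leaf.
CounterType classifyCounter(bool hasParent, bool hasChildren) {
  if (!hasChildren) {
    return CounterType::Leaf;
  }
  return hasParent ? CounterType::Relay : CounterType::Root;
}

// Only leaf counters increment themselves inside the FU; root and relay
// counters hold a register that the off-array affine controller updates.
bool incrementsInFU(CounterType type) { return type == CounterType::Leaf; }

// Returns the string used by the counter_type attribute in the IR examples.
std::string counterTypeName(CounterType type) {
  switch (type) {
    case CounterType::Root:
      return "root";
    case CounterType::Relay:
      return "relay";
    case CounterType::Leaf:
      return "leaf";
  }
  assert(false && "unhandled CounterType");
  return "";
}
```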

Task Classification

We classify tasks into two categories:

  1. task with taskflow.counter:
module {
  func.func @_Z6kernelPiS_S_(%arg0: memref<?xi32>, %arg1: memref<?xi32>, %arg2: memref<?xi32>) -> i32 attributes {llvm.linkage = #llvm.linkage<external>} {
    %c0_i32 = arith.constant 0 : i32
    %value_outputs = "taskflow.task"(%arg0, %arg2, %c0_i32) <{operandSegmentSizes = array<i32: 2, 1>, resultSegmentSizes = array<i32: 0, 1>, task_name = "Task_0"}> ({
    ^bb0(%arg3: memref<?xi32>, %arg4: memref<?xi32>, %arg5: i32):
      %0 = taskflow.counter attributes {counter_id = 0 : i32, counter_type = "leaf", lower_bound = 0 : index, step = 1 : index, upper_bound = 32 : index} : index
      %1 = "taskflow.hyperblock"(%0, %arg5) <{operandSegmentSizes = array<i32: 1, 1>}> ({
      ^bb0(%arg6: index, %arg7: i32):
        %2 = memref.load %arg3[%arg6] : memref<?xi32>
        %3 = memref.load %arg4[%arg6] : memref<?xi32>
        %4 = arith.muli %2, %3 : i32
        %5 = arith.addi %arg7, %4 : i32
        taskflow.hyperblock.yield iter_args_next(%5 : i32) results(%5 : i32)
      }) : (index, i32) -> i32
      "taskflow.yield"(%1) <{operandSegmentSizes = array<i32: 0, 1>}> : (i32) -> ()
    }) : (memref<?xi32>, memref<?xi32>, i32) -> i32
    return %value_outputs : i32
  }
}

This kind of task is driven by the counter and is also terminated by the root counter (or by the leaf counter when it is the only counter).
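
As a rough software analogy for the counter-driven task above, the sketch below lets a counter object own the induction variable and invokes the hyperblock body once per counter value, carrying the accumulator between invocations. `LeafCounter`, `hyperblockBody`, and `runCounterDrivenTask` are hypothetical names, and the sequential loop only mimics the semantics; it says nothing about how the hardware schedules iterations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a leaf taskflow.counter: it owns the bounds and step.
struct LeafCounter {
  std::size_t lowerBound;
  std::size_t upperBound;
  std::size_t step;
};

// One hyperblock invocation: takes the counter value and the carried
// accumulator and returns the next accumulator (the iter_args_next value).
int hyperblockBody(const std::vector<int>& lhs, const std::vector<int>& rhs,
                   std::size_t index, int acc) {
  return acc + lhs[index] * rhs[index];
}

// The counter drives the body and also decides when the task terminates.
int runCounterDrivenTask(const LeafCounter& counter,
                         const std::vector<int>& lhs,
                         const std::vector<int>& rhs, int init) {
  assert(counter.step > 0 && "a zero step would never terminate");
  assert(counter.upperBound <= lhs.size() && counter.upperBound <= rhs.size());
  int acc = init;
  for (std::size_t i = counter.lowerBound; i < counter.upperBound;
       i += counter.step) {
    acc = hyperblockBody(lhs, rhs, i, acc);
  }
  return acc;
}
```

Note that the body itself contains no loop control; all iteration state lives in the counter.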

Such tasks fall into two sub-categories:
a. hyperblock with yield results: we introduce an extract_predicate op that extracts the predicate bit from the root counter, and we apply grant_predicate to the return value with that bit.
b. hyperblock without yield results: the hyperblock execution terminates when the root counter sends a signal to the controller.

  2. task without taskflow.counter:
module {
  func.func @_Z6kernelPiS_S_(%arg0: memref<?xi32>, %arg1: memref<?xi32>, %arg2: memref<?xi32>) -> i32 attributes {llvm.linkage = #llvm.linkage<external>} {
    %c0_i32 = arith.constant 0 : i32
    %value_outputs = "taskflow.task"(%arg0, %arg2, %c0_i32) <{operandSegmentSizes = array<i32: 2, 1>, resultSegmentSizes = array<i32: 0, 1>, task_name = "Task_0"}> ({
    ^bb0(%arg3: memref<?xi32>, %arg4: memref<?xi32>, %arg5: i32):
      %0 = neura.kernel inputs(%arg3, %arg4, %arg5 : memref<?xi32>, memref<?xi32>, i32) {
      ^bb0(%arg6: memref<?xi32>, %arg7: memref<?xi32>, %arg8: i32):
        %c0 = arith.constant 0 : index
        %1 = builtin.unrealized_conversion_cast %c0 : index to i64
        %c32 = arith.constant 32 : index
        %c1 = arith.constant 1 : index
        llvm.br ^bb1(%1, %arg8 : i64, i32)
      ^bb1(%2: i64, %3: i32):  // 2 preds: ^bb0, ^bb2
        %4 = builtin.unrealized_conversion_cast %2 : i64 to index
        %5 = arith.cmpi slt, %4, %c32 : index
        llvm.cond_br %5, ^bb2, ^bb3
      ^bb2:  // pred: ^bb1
        %6 = memref.load %arg6[%4] : memref<?xi32>
        %7 = memref.load %arg7[%4] : memref<?xi32>
        %8 = arith.muli %6, %7 : i32
        %9 = arith.addi %3, %8 : i32
        %10 = arith.addi %4, %c1 : index
        %11 = builtin.unrealized_conversion_cast %10 : index to i64
        llvm.br ^bb1(%11, %9 : i64, i32)
      ^bb3:  // pred: ^bb1
        neura.yield results(%3 : i32)
      } : i32
      "taskflow.yield"(%0) <{operandSegmentSizes = array<i32: 0, 1>}> : (i32) -> ()
    }) : (memref<?xi32>, memref<?xi32>, i32) -> i32
    return %value_outputs : i32
  }
}

This kind of task is self-driven, so we handle it with an existing method, similar to how func::FuncOp is handled.
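
For comparison, here is a sketch of what "self-driven" means in plain C++: the induction variable, the bound check, and the increment all live inside the kernel body, mirroring blocks ^bb1 to ^bb3 of the IR above. `selfDrivenKernel` is a hypothetical name, and the function is a hand-written approximation of the IR, not the original source of the kernel.

```cpp
#include <cassert>

// Sketch of a self-driven kernel: loop control is part of the body itself.
int selfDrivenKernel(const int* lhs, const int* rhs, int init) {
  assert(lhs != nullptr && rhs != nullptr);
  int acc = init;       // %arg8, carried as %3
  long long i = 0;      // %c0 cast to i64, carried as %2
  while (i < 32) {      // ^bb1: compare against %c32 and branch
    acc += lhs[i] * rhs[i];  // ^bb2: two loads, multiply, accumulate
    i += 1;                  // ^bb2: add %c1 and branch back to ^bb1
  }
  return acc;           // ^bb3: neura.yield results(%3)
}
```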

Taskflow to Neura Conversion

  1. We redefine neura.kernel with the IsolatedFromAbove trait.
  2. We implement the convert-taskflow-to-neura pass to convert each taskflow.hyperblock into a neura.kernel.
  3. If the source taskflow.task has taskflow.counter ops outside the hyperblock, we embed them into the entry block of the neura.kernel as neura.counter ops.
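
The shape of step 3 can be illustrated with a toy model that represents ops as plain strings. It does not use the MLIR API: `ToyTask`, `ToyKernel`, and `convertTaskToKernel` are hypothetical, and placing the counters ahead of the hyperblock ops in the entry block is an assumption made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a taskflow.task: counters sit outside the hyperblock.
struct ToyTask {
  std::size_t numCounters = 0;
  std::vector<std::string> hyperblockOps;
};

// Toy stand-in for a neura.kernel: one flat entry block of op names.
struct ToyKernel {
  std::vector<std::string> entryBlockOps;
};

// Embeds each outer counter into the kernel's entry block as "neura.counter",
// then appends the hyperblock ops unchanged.
ToyKernel convertTaskToKernel(const ToyTask& task) {
  ToyKernel kernel;
  for (std::size_t i = 0; i < task.numCounters; ++i) {
    kernel.entryBlockOps.push_back("neura.counter");
  }
  for (const std::string& op : task.hyperblockOps) {
    kernel.entryBlockOps.push_back(op);
  }
  assert(kernel.entryBlockOps.size() ==
         task.numCounters + task.hyperblockOps.size());
  return kernel;
}
```

Because the kernel is IsolatedFromAbove, everything it needs, counters included, has to be defined inside it or passed in as an input, which is why the counters move into the entry block.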

taskflow.task Mapping

  • Each taskflow.task is converted to a task that contains one neura.kernel
  • The neura.kernel is mapped onto the tile array

@tancheng
Contributor

Can a task be driven by multiple counters? What does the IR look like when root, relay, and leaf counters co-exist?

for (OpOperand &use : result.getUses()) {
  Operation *user = use.getOwner();

  // Case 1: Operand of a branch/cond_br → grant_once
Collaborator

Is Case 1 the only case here? If so, is live_out_non_arg_values redundant?
