[Transform][Vectorization] canonicalize vector with physical vector #100
Conversation
Example: given a matmul + ReLU function:

func.func @fc_relu(%lhs: tensor<512x512xf32>, %rhs: tensor<512x512xf32>,
                   %bias: tensor<512x512xf32>, %output: tensor<512x512xf32>)
                   -> tensor<512x512xf32> {
  %matmul = linalg.matmul ins(%lhs, %rhs: tensor<512x512xf32>, tensor<512x512xf32>)
                          outs(%output: tensor<512x512xf32>) -> tensor<512x512xf32>
  // Elementwise addition.
  %biased = linalg.elemwise_binary { fun = #linalg.binary_fn<add> }
    ins(%matmul, %bias : tensor<512x512xf32>, tensor<512x512xf32>)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  // Elementwise max with 0 (ReLU).
  %c0f = arith.constant 0.0 : f32
  // expected-remark @below {{elementwise binary}}
  %relued = linalg.elemwise_binary { fun = #linalg.binary_fn<max_signed> }
    ins(%biased, %c0f : tensor<512x512xf32>, f32)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  func.return %relued : tensor<512x512xf32>
}

// -----// IR Dump After LowerToTileVector (lower-to-tile-vector) //----- //
func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<512x512xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  %c0 = arith.constant 0 : index
  // The matmul op is left as-is; brgemm lowering will optimize it.
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = vector.transfer_read %0[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %2 = vector.transfer_read %arg2[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %3 = arith.addf %1, %2 : vector<512x512xf32>
  %4 = arith.maximumf %3, %cst : vector<512x512xf32>
  %5 = vector.transfer_write %4, %arg3[%c0, %c0] {in_bounds = [true, true]} : vector<512x512xf32>, tensor<512x512xf32>
  return %5 : tensor<512x512xf32>
}

// -----// IR Dump After CPUPhysicalRegisterPass (CPU-physical-register-pass) //----- //
func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %c16 = arith.constant 16 : index
  %c512 = arith.constant 512 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %cst = arith.constant dense<0.000000e+00> : vector<16xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  // The matmul op is left as-is; brgemm lowering will optimize it.
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = scf.for %arg4 = %c0 to %c512 step %c1 iter_args(%arg5 = %arg3) -> (tensor<512x512xf32>) {
    %2 = scf.for %arg6 = %c0 to %c512 step %c16 iter_args(%arg7 = %arg5) -> (tensor<512x512xf32>) {
      %3 = vector.transfer_read %arg2[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %4 = vector.transfer_read %0[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %5 = arith.addf %4, %3 : vector<16xf32>
      %6 = arith.maximumf %5, %cst : vector<16xf32>
      %7 = vector.transfer_write %6, %arg7[%arg4, %arg6] {in_bounds = [true]} : vector<16xf32>, tensor<512x512xf32>
      scf.yield %7 : tensor<512x512xf32>
    }
    scf.yield %2 : tensor<512x512xf32>
  }
  return %1 : tensor<512x512xf32>
}
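The second dump shows the canonicalization this PR performs: the 512x512 tile-level add + ReLU is rewritten into an scf.for nest that reads, computes, and writes one physical-register-sized (16-lane) chunk at a time. Here is a minimal Python sketch of that loop structure (illustrative only; the function name and the shrunk 8x4 sizes are assumptions for the sketch, not taken from the PR):

```python
# Sketch of the loop nest produced by CPUPhysicalRegisterPass:
# process the elementwise tail (add + ReLU) one LANES-wide chunk
# at a time instead of as one huge 512x512 vector op.
N, LANES = 8, 4  # shrunk from 512 / 16 so the sketch runs instantly


def add_relu_tiled(a, b):
    out = [[0.0] * N for _ in range(N)]
    for i in range(N):                # outer scf.for, step 1 (rows)
        for j in range(0, N, LANES):  # inner scf.for, step LANES (cols)
            va = a[i][j:j + LANES]    # vector.transfer_read
            vb = b[i][j:j + LANES]    # vector.transfer_read
            vsum = [x + y for x, y in zip(va, vb)]  # arith.addf
            vrelu = [max(x, 0.0) for x in vsum]     # arith.maximumf
            out[i][j:j + LANES] = vrelu             # vector.transfer_write
    return out
```

Each inner-loop iteration touches exactly one physical-register-sized vector of data, which is what lets every vector value in the transformed IR map to a single hardware register.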
@BRUCE11111, do we need a microkernel definition for this first?
Hi~ Peter! Thanks for the suggestion! What does "microkernel" mean here, and why do you think we need it?
I think Petr's question comes from your example: to fully handle matmul lowering, we need a microkernel definition to provide brgemm lowering. Matmul lowering and brgemm lowering are not part of this PR. Please consider providing another example, such as RMSNorm, to avoid confusion.
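For context on the term used above: brgemm (batch-reduce GEMM) is a microkernel that accumulates a batch of small matrix products into a single output tile, i.e. C += sum over b of A[b] x B[b]. A rough Python sketch of those semantics follows (my reading of the term as commonly used; the function name is hypothetical and this is not the repo's microkernel implementation):

```python
def brgemm(a_batch, b_batch, c):
    # Batch-reduce GEMM: c[i][j] += sum over batches and k of
    # a_batch[b][i][k] * b_batch[b][k][j].
    for a, b in zip(a_batch, b_batch):
        rows, inner, cols = len(a), len(b), len(b[0])
        for i in range(rows):
            for j in range(cols):
                acc = 0.0
                for k in range(inner):          # reduce over the inner dim
                    acc += a[i][k] * b[k][j]
                c[i][j] += acc                  # accumulate into the C tile
    return c
```

A real microkernel keeps the C tile resident in registers across the whole batch, which is why the matmul in the example is deliberately left untouched for a later brgemm lowering.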
Waiting for the community's PR to merge in order to fix the remaining CI errors.
Performance data:
lmontigny left a comment:
Approved. Can we have an iterative process with smaller PRs to review in the future?
Okay~ Thanks~
Tracking issue 331

Tasks:
- vector.multi_reduction with the graph compiler reduce implementation.
- vector.transpose with the graph compiler transpose implementation.
- vector.broadcast.
- vector.shapecast with the graph compiler reorder implementation.

Performance data:
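To illustrate the first task: vector.multi_reduction combines the elements of an n-D vector along a set of reduction dimensions, so lowering it to the graph compiler's reduce means emitting an explicit accumulation loop. A minimal Python sketch of an add-reduction along the innermost dimension (illustrative semantics only, not the planned implementation):

```python
def multi_reduction_add(v2d):
    # Reduce a 2-D vector along dim 1, e.g. vector<MxNxf32> -> vector<Mxf32>.
    out = []
    for row in v2d:
        acc = 0.0          # neutral element of addf
        for x in row:      # sequential reduce along the inner dimension
            acc += x
        out.append(acc)
    return out
```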