[GPU] Add MLP test and linalg.fill lowering in 'linalg-to-xegpu' #220
LongshengDu merged 28 commits into main
Conversation
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
// CHECK: Unranked Memref base@{{(0x)?[-0-9a-fA-F]*}}
// CHECK-SAME: rank = 1 offset = 0 sizes = [32] strides = [4096] data =
// CHECK-NEXT: [8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344]
// CHECK-NEXT: [8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344]
// CHECK-NEXT: [17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625]
Computing this with numpy gives a different result:
import numpy as np
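# Reference for the MLP in the test: p12 = arg4 + relu(arg2 + arg0 @ arg1) @ arg3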
arg0 = np.ones(shape=(32, 4096), dtype="float16")
arg1 = np.ones(shape=(4096, 4096), dtype="float16")
arg2 = np.ones(shape=(32, 4096), dtype="float16")
arg3 = np.ones(shape=(4096, 4096), dtype="float16")
arg4 = np.ones(shape=(32, 4096), dtype="float16")
arg0[:] = 0.01
arg1[:] = 0.01
arg2[:] = 0.02
arg3[:] = 0.01
arg4[:] = 0.02
p2 = np.dot(arg0, arg1)
p4 = arg2 + p2
p5 = np.zeros(shape=(32, 4096), dtype="float16")
p7 = np.maximum(p5, p4)
p10 = np.dot(p7, arg3)
p12 = arg4 + p10
print(p12)
# array([[17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# ...,
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62]], dtype=float16)

P.S. Running this test with this and this being fixed (and with 2D tiling sizes, default-tile-size=matmul:{16,16}) makes the test produce a result that is equivalent to the numpy one.
The temporary ref data was produced by gc-cpu-pipeline; I will replace it with accurate values. Currently the bf16/f16 matmul in gc-cpu-pipeline uses a naive lowering (it does not do the reduction in f32 and then cast to bf16/f16), so the values are incorrect.
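To make the precision point concrete, here is a minimal numpy sketch (not the pipeline's actual code; it assumes only that the naive lowering keeps the accumulator in f16) comparing an f16-accumulated reduction with a reduction done in f32 and cast back at the end. The 4096 terms of 0.01 * 0.01 mirror one output element of the first matmul in the test:

import numpy as np

# Sketch only (assumption: the naive lowering accumulates in f16).
a = np.full(4096, 0.01, dtype=np.float16)
b = np.full(4096, 0.01, dtype=np.float16)

acc_f16 = np.float16(0.0)
for x, y in zip(a, b):
    acc_f16 = np.float16(acc_f16 + x * y)  # reduce directly in f16

acc_f32 = a.astype(np.float32) @ b.astype(np.float32)  # reduce in f32, cast once at the end
print(acc_f16, np.float16(acc_f32))  # the f16 accumulator stalls far below the f32 result (~0.41)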
Another problem I encountered is that a larger matmul requires k-tiling, but the current tiling of the reduction axis does not add a reduce op to compensate, which leads to correctness issues as well; I will avoid k-tiling for now (only use 2D tiling sizes).
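For reference, a minimal numpy sketch of the k-tiling point (the tile size KT=16 and the f32 dtype are illustrative assumptions, not the pass's parameters): partial products over K tiles must be accumulated, otherwise only the last tile's contribution survives.

import numpy as np

# Sketch only: tiling the reduction (K) axis is correct only with an explicit
# accumulation across the K tiles.
M, K, N, KT = 32, 4096, 4096, 16
a = np.full((M, K), 0.01, dtype=np.float32)
b = np.full((K, N), 0.01, dtype=np.float32)

acc = np.zeros((M, N), dtype=np.float32)
no_reduce = np.zeros((M, N), dtype=np.float32)
for k0 in range(0, K, KT):
    partial = a[:, k0:k0 + KT] @ b[k0:k0 + KT, :]
    acc += partial        # compensating reduce across K tiles
    no_reduce = partial   # without it, only the last tile's partial remains

print(np.allclose(acc, a @ b), np.allclose(no_reduce, a @ b))  # True False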
%15 = linalg.max ins(%13, %12 : tensor<32x4096xf16>, tensor<32x4096xf16>)
                 outs(%14 : tensor<32x4096xf16>) -> tensor<32x4096xf16>
return %15 : tensor<32x4096xf16>
The insert-gpu-allocs pass from IMEX seems to go crazy if we allocate & return a buffer in a function that contains gpu.launch (linalg_mlp in our case). For me it crashes the program at the end. It works fine, though, if we pass the final output buffer into linalg_mlp, e.g.

func.func @main() {
  %out = tensor.empty() : <...>
  func.call @linalg_mlp(..., %out)
}

If you haven't figured out a fix for this yet, we may simply stick to the 'pass the output buffer to the func' model, as it should be enough for our testing. We won't use insert-gpu-allocs in our final pipeline anyway.
Thanks a lot! I encountered this problem and haven't figured out why. I will try your suggestion.
@@ -0,0 +1,75 @@
// RUN: gc-opt %s --pass-pipeline='builtin.module(func.func(iterative-tiling-and-fusion{use-cost-model=0 default-tile-size=matmul:{16,16}}),eliminate-empty-tensors,empty-tensor-to-alloc-tensor,one-shot-bufferize{bufferize-function-boundaries=1 function-boundary-type-conversion=identity-layout-map},drop-equivalent-buffer-results,func.func(finalizing-bufferize),canonicalize,cse,drop-equivalent-buffer-results,expand-realloc,canonicalize,ownership-based-buffer-deallocation,canonicalize,buffer-deallocation-simplification,bufferization-lower-deallocations,cse,canonicalize,convert-bufferization-to-memref,func.func(scf-forall-to-parallel),func.func(linalg-to-xegpu{stages=1 dpas-tile=8,16,16 k-tile=16}),xegpu-fold-alias-ops,func.func(convert-linalg-to-parallel-loops),func.func(gpu-map-parallel-loops),func.func(convert-parallel-loops-to-gpu),func.func(insert-gpu-allocs),gpu-kernel-outlining,canonicalize,set-spirv-capabilities{client-api=opencl},gpu.module(set-spirv-abi-attrs{client-api=opencl}),lower-affine,imex-vector-linearize,gpu.module(convert-xegpu-to-vc),reconcile-unrealized-casts,bf16-to-gpu,gpu.module(convert-func-to-spirv),gpu.module(convert-vector-to-spirv),imex-convert-gpu-to-spirv,spirv.module(spirv-lower-abi-attrs,spirv-update-vce),func.func(llvm-request-c-wrappers),serialize-spirv,convert-vector-to-scf,convert-gpu-to-gpux,convert-scf-to-cf,convert-cf-to-llvm,convert-vector-to-llvm,convert-index-to-llvm,convert-arith-to-llvm,convert-func-to-llvm,convert-math-to-llvm,convert-gpux-to-llvm,convert-index-to-llvm,expand-strided-metadata,lower-affine,finalize-memref-to-llvm,reconcile-unrealized-casts)' \
Does this test pass on your machine? For me it fails with the following error:
incorrect lowering for 'linalg.fill'?
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:14:10: error: 'func.call' op operand type mismatch: expected operand type 'vector<16x16xf16>', but provided 'vector<256xf16>' for operand number 9
%4 = linalg.add ins(%arg2, %2 : tensor<32x4096xf16>, tensor<32x4096xf16>)
^
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:14:10: note: see current operation: "func.call"(%275, %276, %277, %278, %279, %280, %281, %282, %274, %247) <{callee = @llvm.genx.raw.sends2.noresult.i1.v8i32.v128i32}> : (i8, i8, i1, i8, i8, i8, i32, i32, vector<8xi32>, vector<256xf16>) -> ()
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:26:11: error: 'func.call' op operand type mismatch: expected operand type 'vector<16x16xf16>', but provided 'vector<256xf16>' for operand number 9
%12 = linalg.add ins(%arg4, %10 : tensor<32x4096xf16>, tensor<32x4096xf16>)
^
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:26:11: note: see current operation: "func.call"(%275, %276, %277, %278, %279, %280, %281, %282, %274, %247) <{callee = @llvm.genx.raw.sends2.noresult.i1.v8i32.v128i32}> : (i8, i8, i1, i8, i8, i8, i32, i32, vector<8xi32>, vector<256xf16>) -> ()
If I remove all linalg.fill ops from the test, it then fails with another error caused by double deallocations added by the insert-gpu-allocs pass. This can be fixed with this patch to IMEX: Menooker/mlir-extensions#3 (have you applied this patch to your IMEX build? If so, we should probably merge it and update the IMEX version).
free() problem
0. Program arguments: /home/jovyan/graph-compiler/build/bin/gc-cpu-runner -e main --entry-point-result=void --shared-libs=/home/jovyan/llvm/llvm-gc-master-patches-install/lib/libmlir_runner_utils.so,/home/jovyan/llvm/llvm-gc-master-patches-install/lib/libmlir_c_runner_utils.so,/home/jovyan/graph-compiler/build/lib/libGcOpenclRuntime.so
#0 0x0000562ee351c2a0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3b82a0)
#1 0x0000562ee35193af llvm::sys::RunSignalHandlers() (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3b53af)
#2 0x0000562ee3519505 SignalHandler(int) Signals.cpp:0:0
#3 0x00007f2d1e2716ac (/usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so+0x5436ac)
#4 0x00007f2d544cf520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#5 0x00007f2d545323fe __libc_free (/lib/x86_64-linux-gnu/libc.so.6+0xa53fe)
#6 0x00007f2d54a097aa
#7 0x00007f2d54a0a09b
#8 0x00007f2d54a0a441
#9 0x0000562ee3ad5a0c compileAndExecute((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#10 0x0000562ee3ad5ead compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#11 0x0000562ee3ad7473 mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x973473)
#12 0x0000562ee34546c0 std::vector<std::unique_ptr<mlir::DialectExtensionBase, std::default_delete<mlir::DialectExtensionBase>>, std::allocator<std::unique_ptr<mlir::DialectExtensionBase, std::default_delete<mlir::DialectExtensionBase>>>>::~vector() /usr/include/c++/11/bits/stl_vector.h:680:15
#13 0x0000562ee34546c0 mlir::DialectRegistry::~DialectRegistry() /home/jovyan/llvm/llvm-gc-master-patches-install/include/mlir/IR/DialectRegistry.h:139:7
#14 0x0000562ee34546c0 main /home/jovyan/graph-compiler/src/gc-cpu-runner/gc-cpu-runner.cpp:46:1
#15 0x00007f2d544b6d90 (/lib/x86_64-linux-gnu/libc.so.6+0x29d90)
#16 0x00007f2d544b6e40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e40)
#17 0x0000562ee3505195 _start (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3a1195)
After removing linalg.fill and applying the patch above to IMEX the test passes for me.
%7 = linalg.max ins(%5, %4 : tensor<32x4096xf16>, tensor<32x4096xf16>)
                outs(%6 : tensor<32x4096xf16>) -> tensor<32x4096xf16>
%8 = tensor.empty() : tensor<32x4096xf16>
do you use it anywhere?
%0 = tensor.generate {
^bb0(%i : index, %j : index):
  tensor.yield %cst0 : f16
} : tensor<32x4096xf16>
why not?
Suggested change:
- %0 = tensor.generate {
- ^bb0(%i : index, %j : index):
-   tensor.yield %cst0 : f16
- } : tensor<32x4096xf16>
+ %0 = arith.constant dense<%cst0> : tensor<32x4096xf16>
Used to generate more "random" data; I will see if it is still needed.
There is a bug inside IMEX for constant vector store_nd: the code produces a 2D tile in the gpu kernel args but reuses the llvm.genx.raw.sends2.noresult.* functions that have the same name as for a 1D tile, thus producing the argument type mismatch error.
.github/workflows/build-llvm.yml (Outdated)
  - uses: actions/checkout@v4
    with:
-     repository: Menooker/mlir-extensions
+     repository: LongshengDu/mlir-extensions
We need something like a staging branch in IMEX, since most of the changes should land in main anyway. cc @Garra1980
I think anyone (with our access rights) can create a branch in the IMEX repo. May we simply create a GC-dev branch in intel/mlir-extensions and that would be it?
that's exactly what I'm suggesting
I think anyone (with our access rights) can create a branch in the IMEX repo
Just tried to do so and I think we don't have enough rights for this :)
Need some help from @Garra1980
Here's the branch https://github.com/intel/mlir-extensions/tree/gc-staging
cmake/imex.cmake (Outdated)
  # TODO: Change to main https://github.com/intel/mlir-extensions when all the
  # required functionality is merged.
- gc_fetch_content(imex "${IMEX_HASH}" https://github.com/Menooker/mlir-extensions
+ gc_fetch_content(imex "${IMEX_HASH}" https://github.com/LongshengDu/mlir-extensions
May we merge the fix to https://github.com/Menooker/mlir-extensions/tree/dev to keep things consistent? We already have some of our patches there.
Sure, @Menooker, @LongshengDu, could you please apply the patch?
This PR aims to demonstrate more complex workloads on GPU. We want to incorporate multi-level tiling for larger matmuls (3 nested tiling loops), with pre-fusion (fill) and post-fusion (add/relu) to better exhibit real-life workloads.
Added linalg.fill lowering support to 'linalg-to-xegpu'. However, we do not want to keep expanding 'linalg-to-xegpu' in the future; other demos may need to check the supported dtypes/ops in 'linalg-to-xegpu'.
Depends on #201
Tracking #219