[GPU] Add MLP test and linalg.fill lowering in 'linalg-to-xegpu' #220
LongshengDu merged 28 commits into main
Conversation
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
// CHECK: Unranked Memref base@{{(0x)?[-0-9a-fA-F]*}}
// CHECK-SAME: rank = 1 offset = 0 sizes = [32] strides = [4096] data =
// CHECK-NEXT: [8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344]
// CHECK-NEXT: [8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344, 8.02344]
// CHECK-NEXT: [17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625, 17.625]
Computing this with numpy gives a different result:
import numpy as np
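# Reference for the MLP in the test: p12 = arg4 + relu(arg2 + arg0 @ arg1) @ arg3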
arg0 = np.ones(shape=(32, 4096), dtype="float16")
arg1 = np.ones(shape=(4096, 4096), dtype="float16")
arg2 = np.ones(shape=(32, 4096), dtype="float16")
arg3 = np.ones(shape=(4096, 4096), dtype="float16")
arg4 = np.ones(shape=(32, 4096), dtype="float16")
arg0[:] = 0.01
arg1[:] = 0.01
arg2[:] = 0.02
arg3[:] = 0.01
arg4[:] = 0.02
p2 = np.dot(arg0, arg1)
p4 = arg2 + p2
p5 = np.zeros(shape=(32, 4096), dtype="float16")
p7 = np.maximum(p5, p4)
p10 = np.dot(p7, arg3)
p12 = arg4 + p10
print(p12)
# array([[17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# ...,
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62],
# [17.62, 17.62, 17.62, ..., 17.62, 17.62, 17.62]], dtype=float16)

P.S. Running this test with this and this being fixed (and with 2D tiling sizes, default-tile-size=matmul:{16,16}) makes the test produce a result that is equivalent to the numpy one.
The temporary ref data was produced by gc-cpu-pipeline; I will replace it with accurate values. Currently the bf16/f16 matmul in gc-cpu-pipeline uses a naive lowering (it does not do the reduction in f32 and then cast to bf16/f16), so the values are incorrect.
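To make the precision point concrete, here is a minimal numpy sketch (not the pipeline's actual code; it assumes only that the naive lowering keeps the accumulator in f16) comparing an f16-accumulated reduction with a reduction done in f32 and cast back at the end. The 4096 terms of 0.01 * 0.01 mirror one output element of the first matmul in the test:

import numpy as np

# Sketch only (assumption: the naive lowering accumulates in f16).
a = np.full(4096, 0.01, dtype=np.float16)
b = np.full(4096, 0.01, dtype=np.float16)

acc_f16 = np.float16(0.0)
for x, y in zip(a, b):
    acc_f16 = np.float16(acc_f16 + x * y)  # reduce directly in f16

acc_f32 = a.astype(np.float32) @ b.astype(np.float32)  # reduce in f32, cast once at the end
print(acc_f16, np.float16(acc_f32))  # the f16 accumulator stalls far below the f32 result (~0.41)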
Another problem I encountered is that a larger matmul requires k-tiling, but the current tiling of the reduction axis does not add a reduce op to compensate, which leads to correctness issues as well; I will avoid k-tiling for now (only use 2D tiling sizes).
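For reference, a minimal numpy sketch of the k-tiling point (the tile size KT=16 and the f32 dtype are illustrative assumptions, not the pass's parameters): partial products over K tiles must be accumulated, otherwise only the last tile's contribution survives.

import numpy as np

# Sketch only: tiling the reduction (K) axis is correct only with an explicit
# accumulation across the K tiles.
M, K, N, KT = 32, 4096, 4096, 16
a = np.full((M, K), 0.01, dtype=np.float32)
b = np.full((K, N), 0.01, dtype=np.float32)

acc = np.zeros((M, N), dtype=np.float32)
no_reduce = np.zeros((M, N), dtype=np.float32)
for k0 in range(0, K, KT):
    partial = a[:, k0:k0 + KT] @ b[k0:k0 + KT, :]
    acc += partial        # compensating reduce across K tiles
    no_reduce = partial   # without it, only the last tile's partial remains

print(np.allclose(acc, a @ b), np.allclose(no_reduce, a @ b))  # True False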
%15 = linalg.max ins(%13, %12 : tensor<32x4096xf16>, tensor<32x4096xf16>)
                 outs(%14 : tensor<32x4096xf16>) -> tensor<32x4096xf16>
return %15 : tensor<32x4096xf16>
The insert-gpu-allocs pass from IMEX seems to go crazy if we allocate & return a buffer in a function that contains gpu.launch (linalg_mlp in our case). For me it crashes the program at the end. It works fine, though, if we pass the final output buffer into linalg_mlp, e.g.

func.func @main() {
  %out = tensor.empty() : <...>
  func.call @linalg_mlp(..., %out)
}

If you haven't figured out a fix for this yet, we may simply stick to the 'pass the output buffer to the func' model, as it should be enough for our testing. We won't use insert-gpu-allocs in our final pipeline anyway.
Thanks a lot! I encountered this problem and haven't figured out why. I will try your suggestion.
@@ -0,0 +1,75 @@
// RUN: gc-opt %s --pass-pipeline='builtin.module(func.func(iterative-tiling-and-fusion{use-cost-model=0 default-tile-size=matmul:{16,16}}),eliminate-empty-tensors,empty-tensor-to-alloc-tensor,one-shot-bufferize{bufferize-function-boundaries=1 function-boundary-type-conversion=identity-layout-map},drop-equivalent-buffer-results,func.func(finalizing-bufferize),canonicalize,cse,drop-equivalent-buffer-results,expand-realloc,canonicalize,ownership-based-buffer-deallocation,canonicalize,buffer-deallocation-simplification,bufferization-lower-deallocations,cse,canonicalize,convert-bufferization-to-memref,func.func(scf-forall-to-parallel),func.func(linalg-to-xegpu{stages=1 dpas-tile=8,16,16 k-tile=16}),xegpu-fold-alias-ops,func.func(convert-linalg-to-parallel-loops),func.func(gpu-map-parallel-loops),func.func(convert-parallel-loops-to-gpu),func.func(insert-gpu-allocs),gpu-kernel-outlining,canonicalize,set-spirv-capabilities{client-api=opencl},gpu.module(set-spirv-abi-attrs{client-api=opencl}),lower-affine,imex-vector-linearize,gpu.module(convert-xegpu-to-vc),reconcile-unrealized-casts,bf16-to-gpu,gpu.module(convert-func-to-spirv),gpu.module(convert-vector-to-spirv),imex-convert-gpu-to-spirv,spirv.module(spirv-lower-abi-attrs,spirv-update-vce),func.func(llvm-request-c-wrappers),serialize-spirv,convert-vector-to-scf,convert-gpu-to-gpux,convert-scf-to-cf,convert-cf-to-llvm,convert-vector-to-llvm,convert-index-to-llvm,convert-arith-to-llvm,convert-func-to-llvm,convert-math-to-llvm,convert-gpux-to-llvm,convert-index-to-llvm,expand-strided-metadata,lower-affine,finalize-memref-to-llvm,reconcile-unrealized-casts)' \
Does this test pass on your machine? For me it fails with the following error:
incorrect lowering for 'linalg.fill'?
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:14:10: error: 'func.call' op operand type mismatch: expected operand type 'vector<16x16xf16>', but provided 'vector<256xf16>' for operand number 9
%4 = linalg.add ins(%arg2, %2 : tensor<32x4096xf16>, tensor<32x4096xf16>)
^
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:14:10: note: see current operation: "func.call"(%275, %276, %277, %278, %279, %280, %281, %282, %274, %247) <{callee = @llvm.genx.raw.sends2.noresult.i1.v8i32.v128i32}> : (i8, i8, i1, i8, i8, i8, i32, i32, vector<8xi32>, vector<256xf16>) -> ()
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:26:11: error: 'func.call' op operand type mismatch: expected operand type 'vector<16x16xf16>', but provided 'vector<256xf16>' for operand number 9
%12 = linalg.add ins(%arg4, %10 : tensor<32x4096xf16>, tensor<32x4096xf16>)
^
/home/jovyan/graph-compiler/test/mlir/test/gc/gpu-runner/XeGPU/f16_mlp_32x4096x4096x4096.mlir:26:11: note: see current operation: "func.call"(%275, %276, %277, %278, %279, %280, %281, %282, %274, %247) <{callee = @llvm.genx.raw.sends2.noresult.i1.v8i32.v128i32}> : (i8, i8, i1, i8, i8, i8, i32, i32, vector<8xi32>, vector<256xf16>) -> ()
If I remove all linalg.fill ops from the test, it then fails with another error caused by double deallocations added by the insert-gpu-allocs pass. This can be fixed with this patch to IMEX: Menooker/mlir-extensions#3 (have you applied this patch to your IMEX build? If so, we should probably merge it and update the IMEX version).
free() problem
0. Program arguments: /home/jovyan/graph-compiler/build/bin/gc-cpu-runner -e main --entry-point-result=void --shared-libs=/home/jovyan/llvm/llvm-gc-master-patches-install/lib/libmlir_runner_utils.so,/home/jovyan/llvm/llvm-gc-master-patches-install/lib/libmlir_c_runner_utils.so,/home/jovyan/graph-compiler/build/lib/libGcOpenclRuntime.so
#0 0x0000562ee351c2a0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3b82a0)
#1 0x0000562ee35193af llvm::sys::RunSignalHandlers() (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3b53af)
#2 0x0000562ee3519505 SignalHandler(int) Signals.cpp:0:0
#3 0x00007f2d1e2716ac (/usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so+0x5436ac)
#4 0x00007f2d544cf520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#5 0x00007f2d545323fe __libc_free (/lib/x86_64-linux-gnu/libc.so.6+0xa53fe)
#6 0x00007f2d54a097aa
#7 0x00007f2d54a0a09b
#8 0x00007f2d54a0a441
#9 0x0000562ee3ad5a0c compileAndExecute((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#10 0x0000562ee3ad5ead compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) JitRunner.cpp:0:0
#11 0x0000562ee3ad7473 mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x973473)
#12 0x0000562ee34546c0 std::vector<std::unique_ptr<mlir::DialectExtensionBase, std::default_delete<mlir::DialectExtensionBase>>, std::allocator<std::unique_ptr<mlir::DialectExtensionBase, std::default_delete<mlir::DialectExtensionBase>>>>::~vector() /usr/include/c++/11/bits/stl_vector.h:680:15
#13 0x0000562ee34546c0 mlir::DialectRegistry::~DialectRegistry() /home/jovyan/llvm/llvm-gc-master-patches-install/include/mlir/IR/DialectRegistry.h:139:7
#14 0x0000562ee34546c0 main /home/jovyan/graph-compiler/src/gc-cpu-runner/gc-cpu-runner.cpp:46:1
#15 0x00007f2d544b6d90 (/lib/x86_64-linux-gnu/libc.so.6+0x29d90)
#16 0x00007f2d544b6e40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e40)
#17 0x0000562ee3505195 _start (/home/jovyan/graph-compiler/build/bin/gc-cpu-runner+0x3a1195)
After removing linalg.fill and applying the patch above to IMEX the test passes for me.
%7 = linalg.max ins(%5, %4 : tensor<32x4096xf16>, tensor<32x4096xf16>)
                outs(%6 : tensor<32x4096xf16>) -> tensor<32x4096xf16>
%8 = tensor.empty() : tensor<32x4096xf16>
do you use it anywhere?
%0 = tensor.generate {
^bb0(%i : index, %j : index):
  tensor.yield %cst0 : f16
} : tensor<32x4096xf16>
why not?
Suggested change:
- %0 = tensor.generate {
- ^bb0(%i : index, %j : index):
-   tensor.yield %cst0 : f16
- } : tensor<32x4096xf16>
+ %0 = arith.constant dense<%cst0> : tensor<32x4096xf16>
Used to generate more "random" data; I will see if it is still needed.
There is a bug inside IMEX for constant vector store_nd: the code produces a 2D tile in the gpu kernel args but reuses the llvm.genx.raw.sends2.noresult.* functions that have the same name as for a 1D tile, thus producing the argument type mismatch error.
.github/workflows/build-llvm.yml (Outdated)
  - uses: actions/checkout@v4
    with:
-     repository: Menooker/mlir-extensions
+     repository: LongshengDu/mlir-extensions
We need something like a staging branch in IMEX, since most of the changes should land in main anyway. cc @Garra1980
I think anyone (with our access rights) can create a branch in the IMEX repo. May we simply create a GC-dev branch in intel/mlir-extensions and that would be it?
that's exactly what I'm suggesting
I think anyone (with our access rights) can create a branch in the IMEX repo
Just tried to do so and I think we don't have enough rights for this :)
Need some help from @Garra1980
Here's the branch https://github.com/intel/mlir-extensions/tree/gc-staging
cmake/imex.cmake (Outdated)
  # TODO: Change to main https://github.com/intel/mlir-extensions when all the
  # required functionality is merged.
- gc_fetch_content(imex "${IMEX_HASH}" https://github.com/Menooker/mlir-extensions
+ gc_fetch_content(imex "${IMEX_HASH}" https://github.com/LongshengDu/mlir-extensions
May we merge the fix to https://github.com/Menooker/mlir-extensions/tree/dev to keep things consistent? We already have some of our patches there.
Sure, @Menooker, @LongshengDu, could you please apply the patch?
This PR aims to demonstrate more complex workloads on GPU. We want to incorporate multi-level tiling for larger matmuls (3 nested tiling loops), with pre-fusion (fill) and post-fusion (add/relu) to better exhibit real-life workloads.
Added linalg.fill lowering support to 'linalg-to-xegpu'. However, we do not want to keep expanding 'linalg-to-xegpu' in the future; other demos may need to check the supported dtypes/ops in 'linalg-to-xegpu'.
Depends on #201
Tracking #219