
Conversation

@apozharski

Currently, Ahead-of-Time (AOT) compilation for any package using CUDA.jl is impossible (with or without --trim). As far as I can tell, this is due to several things:

  1. Many methods in CUDA.jl that are visible in the global method table llvmcall LLVM IR containing llvm.nvvm intrinsics, which therefore leaks into the AOT step that generates host assembly, leading to compilation failures (a minimal sketch of this pattern follows this list).
  2. It seems odd that these methods are being compiled in at all, but the common denominator appears to be that, even though no MethodInstances exist for them, the methods are concrete (unsure if that is the correct Julia terminology): they are either defined to take concrete types (for example the clock methods) or to take no arguments at all.
  3. Another issue is caused specifically by the wrapper in libcudadevrt.jl. The "concrete" methods in that file contain ccalls to external functions which do not exist until they are dynamically loaded.
  4. Unfortunately, even after fixing all of the above in the current state of the PR, there still seem to be calls to (at least) gpu_malloc, gpu_signal_exception, and gpu_report_oom in the produced LLVM IR. These functions don't exist on the host, as they would only be provided by the run-time compilation of the device runtime. This is likely a problem more fixable in GPUCompiler.jl, but maybe @vchuravy or @maleadt you have a better understanding of that.
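
To make point 1 concrete, here is a minimal, hypothetical sketch (the function names are made up, not actual CUDA.jl source) of the problematic pattern, plus the rough shape of the fix via an overlay method table, which is what the @device_function changes in the diff further down boil down to:

using CUDA  # for CUDA.method_table

# The pattern from point 1: a fully concrete, zero-argument method in the global method
# table whose body llvmcalls an NVVM intrinsic. Host AOT compilation can pick it up even
# though no MethodInstance is reachable from user code, and the CPU backend then fails
# instruction selection ("Cannot select: intrinsic %llvm.nvvm...").
laneid_leaky() = ccall("llvm.nvvm.read.ptx.sreg.laneid", llvmcall, UInt32, ())

# Rough shape of the fix: put the llvmcall body only in CUDA's overlay method table, so
# the global table that the AOT step walks never sees it. (The real @device_function
# macro, shown in the utils.jl hunk of the diff below, also emits a regular definition
# that errors on the host.)
Base.Experimental.@overlay CUDA.method_table laneid_device() =
    ccall("llvm.nvvm.read.ptx.sreg.laneid", llvmcall, UInt32, ())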

The relevant issues and other PRs:

@codecov

codecov bot commented Dec 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.78%. Comparing base (9eb1085) to head (3b831b0).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2998       +/-   ##
===========================================
- Coverage   89.15%   75.78%   -13.38%     
===========================================
  Files         148      148               
  Lines       12885    12844       -41     
===========================================
- Hits        11488     9734     -1754     
- Misses       1397     3110     +1713     


apozharski marked this pull request as ready for review on December 18, 2025, 11:47
@github-actions
Contributor

github-actions bot commented Dec 18, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/device/intrinsics/atomics.jl b/src/device/intrinsics/atomics.jl
index 0874adf77..1a6d64784 100644
--- a/src/device/intrinsics/atomics.jl
+++ b/src/device/intrinsics/atomics.jl
@@ -151,7 +151,7 @@ for A in (AS.Generic, AS.Global, AS.Shared), T in (:Int16, :UInt16)
     end
 
     intr = "atom$scope.cas.b16 \$0, [\$1], \$2, \$3;"
-    @eval @device_function @inline atomic_cas!(ptr::LLVMPtr{$T,$A}, cmp::$T, val::$T) =
+    @eval @device_function @inline atomic_cas!(ptr::LLVMPtr{$T, $A}, cmp::$T, val::$T) =
         @asmcall($intr, "=h,l,h,h", true, $T, Tuple{Core.LLVMPtr{$T,$A},$T,$T}, ptr, cmp, val)
 end
 
@@ -172,7 +172,7 @@ for A in (AS.Generic, AS.Global, AS.Shared)
         nb = sizeof(T)*8
         fn = Symbol("atomic_$(op)!")
         intr = "llvm.nvvm.atomic.load.$op.$nb.p$(convert(Int, A))i$nb"
-        @eval @device_function @inline $fn(ptr::LLVMPtr{$T,$A}, val::$T) =
+        @eval @device_function @inline $fn(ptr::LLVMPtr{$T, $A}, val::$T) =
             @typed_ccall($intr, llvmcall, $T, (LLVMPtr{$T,$A}, $T), ptr, val)
     end
 end
@@ -192,7 +192,7 @@ for A in (AS.Generic, AS.Global, AS.Shared), T in (:Float16,)
     end
 
     intr = "atom$scope.add.noftz.f16 \$0, [\$1], \$2;"
-    @eval @device_function @inline atomic_add!(ptr::LLVMPtr{$T,$A}, val::$T) =
+    @eval @device_function @inline atomic_add!(ptr::LLVMPtr{$T, $A}, val::$T) =
         @asmcall($intr, "=h,l,h", true, $T, Tuple{Core.LLVMPtr{$T,$A},$T}, ptr, val)
 end
 
diff --git a/src/device/intrinsics/cooperative_groups.jl b/src/device/intrinsics/cooperative_groups.jl
index 0c07e3117..4bf553ce9 100644
--- a/src/device/intrinsics/cooperative_groups.jl
+++ b/src/device/intrinsics/cooperative_groups.jl
@@ -24,7 +24,7 @@ Noteworthy missing functionality:
 module CG
 
 using ..CUDA
-using ..CUDA: i32, Aligned, alignment, @device_function
+    using ..CUDA: i32, Aligned, alignment, @device_function
 
 import ..LLVM
 using ..LLVM.Interop
@@ -73,7 +73,7 @@ const grid_workspace = Ptr{grid_workspace_st}
     end
 end
 
-@device_function function get_grid_workspace()
+    @device_function function get_grid_workspace()
     # interpret the address from envreg 1 and 2 as the driver's grid workspace
     hi = ccall("llvm.nvvm.read.ptx.sreg.envreg1", llvmcall, UInt32, ())
     lo = ccall("llvm.nvvm.read.ptx.sreg.envreg2", llvmcall, UInt32, ())
@@ -561,13 +561,13 @@ end
 
 ## pipeline operations
 
-@device_function pipeline_commit() =
-    ccall("llvm.nvvm.cp.async.commit.group", llvmcall, Cvoid, ())
+    @device_function pipeline_commit() =
+        ccall("llvm.nvvm.cp.async.commit.group", llvmcall, Cvoid, ())
 
-@device_function pipeline_wait_prior(n) =
+    @device_function pipeline_wait_prior(n) =
     ccall("llvm.nvvm.cp.async.wait.group", llvmcall, Cvoid, (Int32,), n)
 
-@device_function @generated function pipeline_memcpy_async(dst::LLVMPtr{T}, src::LLVMPtr{T}) where T
+    @device_function @generated function pipeline_memcpy_async(dst::LLVMPtr{T}, src::LLVMPtr{T}) where {T}
     size_and_align = sizeof(T)
     size_and_align in (4, 8, 16) || :(return error($"Unsupported size $size_and_align"))
     intr = "llvm.nvvm.cp.async.ca.shared.global.$(sizeof(T))"
diff --git a/src/device/intrinsics/indexing.jl b/src/device/intrinsics/indexing.jl
index 6661591ec..b77da46cd 100644
--- a/src/device/intrinsics/indexing.jl
+++ b/src/device/intrinsics/indexing.jl
@@ -43,7 +43,7 @@ for dim in (:x, :y, :z)
     # Thread index
     fn = Symbol("threadIdx_$dim")
     intr = Symbol("tid.$dim")
-    @eval @device_function @inline $fn() = _index($(Val(intr)), $(Val(0:max_block_size[dim]-1))) + 1i32
+    @eval @device_function @inline $fn() = _index($(Val(intr)), $(Val(0:(max_block_size[dim] - 1)))) + 1i32
 
     # Block size (#threads per block)
     fn = Symbol("blockDim_$dim")
@@ -53,7 +53,7 @@ for dim in (:x, :y, :z)
     # Block index
     fn = Symbol("blockIdx_$dim")
     intr = Symbol("ctaid.$dim")
-    @eval @device_function @inline $fn() = _index($(Val(intr)), $(Val(0:max_grid_size[dim]-1))) + 1i32
+    @eval @device_function @inline $fn() = _index($(Val(intr)), $(Val(0:(max_grid_size[dim] - 1)))) + 1i32
 
     # Grid size (#blocks per grid)
     fn = Symbol("gridDim_$dim")
diff --git a/src/device/intrinsics/synchronization.jl b/src/device/intrinsics/synchronization.jl
index dcee17b10..241a26ba5 100644
--- a/src/device/intrinsics/synchronization.jl
+++ b/src/device/intrinsics/synchronization.jl
@@ -67,7 +67,7 @@ end # @device_functions
 
 export barrier_sync
 
-@device_function barrier_sync(id=0) =
+@device_function barrier_sync(id = 0) =
     ccall("llvm.nvvm.barrier.sync", llvmcall, Cvoid, (Int32,), id)
 
 
diff --git a/src/device/intrinsics/warp.jl b/src/device/intrinsics/warp.jl
index 63913c5aa..ed60d495f 100644
--- a/src/device/intrinsics/warp.jl
+++ b/src/device/intrinsics/warp.jl
@@ -26,7 +26,7 @@ for (name, mode, mask, offset) in (("_up",   :up,   UInt32(0x00), src->src),
     for (T,typ) in ((Int32, "i32"), (UInt32, "i32"), (Float32, "f32"))
         intrinsic = "llvm.nvvm.shfl.sync.$mode.$typ"
         @eval begin
-            @device_function @inline $fname(mask, val::$T, src, width=$ws) =
+            @device_function @inline $fname(mask, val::$T, src, width = $ws) =
                 ccall($intrinsic, llvmcall, $T,
                       (UInt32, $T, UInt32, UInt32),
                       mask, val, $(offset(:src)), pack(width, $mask))
diff --git a/src/device/intrinsics/wmma.jl b/src/device/intrinsics/wmma.jl
index 2998a5566..6faecfd55 100644
--- a/src/device/intrinsics/wmma.jl
+++ b/src/device/intrinsics/wmma.jl
@@ -2,7 +2,7 @@ export WMMA
 module WMMA
 
 import ..LLVM
-using ..CUDA: AS, @device_function
+    using ..CUDA: AS, @device_function
 using Core: LLVMPtr
 
 ################################################################################
@@ -200,10 +200,10 @@ for ops in all_ldst_ops,
     ptr_ty = :(LLVMPtr{$arr_ty, $addr_space_int})
 
     if sz == 1
-        @eval @device_function $func_name(src_addr, stride) = tuple(ccall($ccall_name, llvmcall, $frag_ty, ($ptr_ty, Int32), src_addr, stride))
+            @eval @device_function $func_name(src_addr, stride) = tuple(ccall($ccall_name, llvmcall, $frag_ty, ($ptr_ty, Int32), src_addr, stride))
     else
         struct_ty = Symbol("LLVMStruct$sz")
-        @eval @device_function $func_name(src_addr, stride) = convert(NTuple{$sz, $frag_ty}, ccall($ccall_name, llvmcall, $struct_ty{$frag_ty}, ($ptr_ty, Int32), src_addr, stride))
+            @eval @device_function $func_name(src_addr, stride) = convert(NTuple{$sz, $frag_ty}, ccall($ccall_name, llvmcall, $struct_ty{$frag_ty}, ($ptr_ty, Int32), src_addr, stride))
     end
     @eval export $func_name
     @eval @doc (@doc llvm_wmma_load) $func_name
@@ -270,7 +270,7 @@ export llvm_wmma_store
 
     ptr_ty = :(LLVMPtr{$arr_ty, $addr_space_int})
 
-    @eval @device_function $func_name(dst_addr, data, stride) = ccall($ccall_name, llvmcall, Nothing, ($ptr_ty, $(frag_types...), Int32), dst_addr, $(frag_vars...), stride)
+        @eval @device_function $func_name(dst_addr, data, stride) = ccall($ccall_name, llvmcall, Nothing, ($ptr_ty, $(frag_types...), Int32), dst_addr, $(frag_vars...), stride)
     @eval export $func_name
     @eval @doc (@doc llvm_wmma_store) $func_name
 end
@@ -347,10 +347,10 @@ for ops in all_wmma_ops,
     c_vars = ntuple(i -> :(c[$i]), c_sz)
 
     if d_sz == 1
-        @eval @device_function $func_name(a, b, c) = tuple(ccall($ccall_name, llvmcall, $d_frag_ty, ($(a_types...), $(b_types...), $(c_types...)), $(a_vars...), $(b_vars...), $(c_vars...)))
+            @eval @device_function $func_name(a, b, c) = tuple(ccall($ccall_name, llvmcall, $d_frag_ty, ($(a_types...), $(b_types...), $(c_types...)), $(a_vars...), $(b_vars...), $(c_vars...)))
     else
         struct_ty = Symbol("LLVMStruct$d_sz")
-        @eval @device_function $func_name(a, b, c) = convert(NTuple{$d_sz, $d_frag_ty}, ccall($ccall_name, llvmcall, $struct_ty{$d_frag_ty}, ($(a_types...), $(b_types...), $(c_types...)), $(a_vars...), $(b_vars...), $(c_vars...)))
+            @eval @device_function $func_name(a, b, c) = convert(NTuple{$d_sz, $d_frag_ty}, ccall($ccall_name, llvmcall, $struct_ty{$d_frag_ty}, ($(a_types...), $(b_types...), $(c_types...)), $(a_vars...), $(b_vars...), $(c_vars...)))
     end
     @eval export $func_name
     @eval @doc (@doc llvm_wmma_mma) $func_name
diff --git a/src/device/pointer.jl b/src/device/pointer.jl
index e38adf52e..df4852bb2 100644
--- a/src/device/pointer.jl
+++ b/src/device/pointer.jl
@@ -33,7 +33,8 @@ for T in LDGTypes
     typ = Symbol(class, width)
 
     intr = "llvm.nvvm.ldg.global.$class.$typ.p1$typ"
-    @eval @device_function @inline function pointerref_ldg(base_ptr::LLVMPtr{$T,AS.Global}, i::Integer,
+    @eval @device_function @inline function pointerref_ldg(
+            base_ptr::LLVMPtr{$T, AS.Global}, i::Integer,
                                           ::Val{align}) where align
         offset = i-one(i) # in elements
         ptr = base_ptr + offset*sizeof($T)
@@ -52,7 +53,8 @@ for (N, T) in ((4, Float32), (2, Float64), (4, Int8), (4, Int16), (4, Int32), (2
     typ = Symbol(class, width)
 
     intr = "llvm.nvvm.ldg.global.$class.v$N$typ.p1v$N$typ"
-    @eval @device_function @inline function pointerref_ldg(base_ptr::LLVMPtr{NTuple{$N, Base.VecElement{$T}},AS.Global}, i::Integer,
+    @eval @device_function @inline function pointerref_ldg(
+            base_ptr::LLVMPtr{NTuple{$N, Base.VecElement{$T}}, AS.Global}, i::Integer,
                                           ::Val{align}) where align
         offset = i-one(i) # in elements
         ptr = base_ptr + offset*$N*sizeof($T)
diff --git a/src/device/texture.jl b/src/device/texture.jl
index f675ff583..e28b60810 100644
--- a/src/device/texture.jl
+++ b/src/device/texture.jl
@@ -67,7 +67,7 @@ for (dispatch_rettyp, julia_rettyp, llvm_rettyp) in
                     NTuple{4,$dispatch_rettyp}})
 
     # tex1D only supports array memory
-    @eval @device_function tex(texObject::CuDeviceTexture{<:$eltyp,1,ArrayMemorySource}, x::Number) =
+    @eval @device_function tex(texObject::CuDeviceTexture{<:$eltyp, 1, ArrayMemorySource}, x::Number) =
         Tuple(ccall($("llvm.nvvm.tex.unified.1d.$llvm_rettyp.f32"), llvmcall,
                     $julia_rettyp, (CUtexObject, Float32), texObject, x))
 
@@ -78,7 +78,7 @@ for (dispatch_rettyp, julia_rettyp, llvm_rettyp) in
         julia_sig = ntuple(_->Float32, dims)
         julia_params = ntuple(i->:($(julia_args[i])::Number), dims)
 
-        @eval @device_function tex(texObject::CuDeviceTexture{<:$eltyp,$dims}, $(julia_params...)) =
+        @eval @device_function tex(texObject::CuDeviceTexture{<:$eltyp, $dims}, $(julia_params...)) =
             Tuple(ccall($("llvm.nvvm.tex.unified.$llvm_dim.$llvm_rettyp.f32"), llvmcall,
                         $julia_rettyp, (CUtexObject, $(julia_sig...)), texObject, $(julia_args...)))
     end
diff --git a/src/device/utils.jl b/src/device/utils.jl
index b899c6708..5245ae1c4 100644
--- a/src/device/utils.jl
+++ b/src/device/utils.jl
@@ -19,7 +19,7 @@ macro device_override(ex)
         end)
     else
         esc(quote
-            Base.Experimental.@overlay($(CUDA).method_table, $ex)
+                Base.Experimental.@overlay($(CUDA).method_table, $ex)
         end)
     end
 end
@@ -37,7 +37,7 @@ macro device_function(ex)
         $(combinedef(def))
 
         # NOTE: no use of `@consistent_overlay` here because the regular function errors
-        Base.Experimental.@overlay($(CUDA).method_table, $ex)
+            Base.Experimental.@overlay($(CUDA).method_table, $ex)
     end)
 end
 

@KSepetanc

KSepetanc commented Dec 18, 2025

Can confirm. The error Cannot select: intrinsic %llvm.nvvm does not appear with these changes any more. Thank you!!

The remaining issue is, as you state, point 4. Previously it was observed in a comment when trying to AOT-compile GPUCompiler.

I assume GPUCompiler.jl/src/runtime.jl:85 and its FIXME note need to be checked.

@apozharski
Author

I assume GPUCompiler.jl/src/runtime.jl:85 and its FIXME note need to be checked.

I started digging into this yesterday and I think this is a bit of a red herring. If this were the core issue, I would expect the methods that leak through to be either all of the ones that need to be specified by the target, or only the ones that are given defaults. The two that are currently causing issues are signal_exception and gc_pool_alloc (which contains calls to malloc and report_oom); the first is provided by the target, and the latter has a default implementation. I suspect there is something a little more subtle going on. I have some more time today to look at this, so hopefully I will have at least a draft fix for that part soon.
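
For anyone trying to reproduce this, here is a small, purely illustrative helper (not part of this PR; the file path and the textual .ll dump are assumptions) that scans emitted IR for the runtime symbols mentioned above:

# hypothetical helper: scan a textual LLVM IR dump for leftover device-runtime symbols
function leaked_runtime_calls(ir_path::AbstractString)
    # names taken from this thread; "gc_pool_alloc" also matches a gpu_-prefixed variant
    wanted = ("gpu_malloc", "gpu_signal_exception", "gpu_report_oom", "gc_pool_alloc")
    hits = String[]
    for (lineno, line) in enumerate(eachline(ir_path))
        for sym in wanted
            occursin(sym, line) && push!(hits, "$lineno: $(strip(line))")
        end
    end
    return hits
end

# e.g. leaked_runtime_calls("app.ll")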

@michel2323
Member

I've started to mark all the methods in CUDA.jl as device methods. I still have to clean it up. Using @apozharski 's branch https://github.com/apozharski/GPUCompiler.jl/tree/ap/aot-compilation I can now get past the compilation. However, the next hurdle is that CUDA_Driver_jll.jl tries to run julia, and I get this error from the __init__() here: https://github.com/JuliaBinaryWrappers/CUDA_Driver_jll.jl/blob/039474600cf989a464f195dc573326e8cfa6a6cc/src/wrappers/x86_64-linux-gnu.jl#L9

Core.InitError(mod=:CUDA_Driver_jll, error=Base.TaskFailedException(task=Core.Task(next=nothing, queue=nothing, storage=nothing, donenotify=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.IntrusiveLinkedList{Core.Task}(head=nothing, tail=nothing), lock=Base.Threads.SpinLock(owned=0)), result=Base.IOError(msg="could not spawn setenv(`/home/michel/git/ExaPF.jl/app/build/bin/julia -C native -g1 -O0 --compile=min -t1 --startup-file=no -e '...'

@apozharski
Author

I've started to mark all the methods in CUDA.jl as device methods. I still have to clean it up. Using @apozharski 's branch https://github.com/apozharski/GPUCompiler.jl/tree/ap/aot-compilation I can now get past the compilation. However, the next hurdle is that CUDA_Driver_jll.jl tries to run julia, and I get this error from the __init__() here: https://github.com/JuliaBinaryWrappers/CUDA_Driver_jll.jl/blob/039474600cf989a464f195dc573326e8cfa6a6cc/src/wrappers/x86_64-linux-gnu.jl#L9

Core.InitError(mod=:CUDA_Driver_jll, error=Base.TaskFailedException(task=Core.Task(next=nothing, queue=nothing, storage=nothing, donenotify=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.IntrusiveLinkedList{Core.Task}(head=nothing, tail=nothing), lock=Base.Threads.SpinLock(owned=0)), result=Base.IOError(msg="could not spawn setenv(`/home/michel/git/ExaPF.jl/app/build/bin/julia -C native -g1 -O0 --compile=min -t1 --startup-file=no -e '...'

@michel2323 the workaround I found for this while trying to compile libMad is to bundle the julia executable (see the latest commit on my fork of JuliaC). Ideally CUDA_Driver_jll would not fork to another julia process, as that is likely to make relocatability very annoying, but this seems to work.
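
For reference, the bundling idea amounts to something like the following sketch (an illustration only, not what the JuliaC fork actually does; build/bin is an assumed output layout taken from the error message above):

# copy the running julia executable next to the AOT-compiled app, so code that spawns
# julia at run time (like CUDA_Driver_jll's __init__) can find it
app_bin = joinpath("build", "bin")   # hypothetical app output directory
mkpath(app_bin)
cp(joinpath(Sys.BINDIR, Base.julia_exename()),
   joinpath(app_bin, Base.julia_exename()); force = true)

In practice the libraries that ship next to the julia executable likely need to come along as well for the spawned process to start.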

apozharski force-pushed the ap/labeling-device-functions branch from 3b831b0 to 784cad3 on December 23, 2025, 18:22
@michel2323
Member

Nice! Thank you! @apozharski
