-
Notifications
You must be signed in to change notification settings - Fork 48
Add async background warmup to reduce first-kernel latency #721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The first GPU kernel in a Metal.jl session takes ~1.75s due to one-time JIT compilation of GPUCompiler internals. This adds async background warmup during package initialization to reduce this to 0.035-0.20s—a 9-50x improvement in perceived first-kernel latency. Implementation: - Start minimal kernel compilation in background during __init__() - Add Metal.warmup() API for explicit synchronization - Add "warmup" preference to disable if needed Key findings from investigation: - Overhead is JIT compilation, not memory page faults - Size-independent: 1KB and 512MB arrays have same delay - Concurrent compilations share initialization (lock serialization) - User kernel benefits even if warmup hasn't completed
|
Your PR no longer requires formatting changes. Thank you for your contribution! |
|
I think this is the wrong approach. A task started in the background can negativly impact perceived latency, by blocking the REPL as an example. There is https://github.com/JuliaGPU/GPUCompiler.jl/blob/e4a697f3b77f5c4ccb3a63354731c022648026d7/src/jlgen.jl#L681 to allow for precompilation of compiler jobs which would warm up the infrastructure and allow you to move this work to precompilation time. |
|
It's async + measurements provided in PR description |
|
Julia uses a cooperative task runtime, so saying that something is |
|
Updated PR description to address these concerns. |
src/warmup.jl
Outdated
| export warmup | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should fix the benchmark error.
| export warmup |
- Remove `export warmup` to avoid benchmark API change detection (warmup still accessible via Metal.warmup()) - Only run async warmup when Threads.nthreads() > 1 to address vchuravy's concern about blocking the REPL on single-threaded Julia - Update docstring to reflect these changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metal Benchmarks
Details
| Benchmark suite | Current: d4db4a1 | Previous: 67d668c | Ratio |
|---|---|---|---|
latency/precompile |
24439509417 ns |
24820843000 ns |
0.98 |
latency/ttfp |
2324493979 ns |
2257593833 ns |
1.03 |
latency/import |
1433921333 ns |
1431203750 ns |
1.00 |
integration/metaldevrt |
837666.5 ns |
834875 ns |
1.00 |
integration/byval/slices=1 |
1586208 ns |
1525666.5 ns |
1.04 |
integration/byval/slices=3 |
20564666.5 ns |
8498958 ns |
2.42 |
integration/byval/reference |
1586374.5 ns |
1538166 ns |
1.03 |
integration/byval/slices=2 |
2743833 ns |
2552562 ns |
1.07 |
kernel/indexing |
490834 ns |
593833 ns |
0.83 |
kernel/indexing_checked |
495520.5 ns |
575750 ns |
0.86 |
kernel/launch |
12750 ns |
11250 ns |
1.13 |
kernel/rand |
522709 ns |
557187.5 ns |
0.94 |
array/construct |
6375 ns |
6000 ns |
1.06 |
array/broadcast |
542709 ns |
591209 ns |
0.92 |
array/random/randn/Float32 |
921188 ns |
836917 ns |
1.10 |
array/random/randn!/Float32 |
583834 ns |
619542 ns |
0.94 |
array/random/rand!/Int64 |
535542 ns |
548834 ns |
0.98 |
array/random/rand!/Float32 |
545083 ns |
593333 ns |
0.92 |
array/random/rand/Int64 |
955375 ns |
735667 ns |
1.30 |
array/random/rand/Float32 |
813687 ns |
631792 ns |
1.29 |
array/accumulate/Int64/1d |
1313042 ns |
1237125 ns |
1.06 |
array/accumulate/Int64/dims=1 |
1876000 ns |
1795625 ns |
1.04 |
array/accumulate/Int64/dims=2 |
2239000 ns |
2130458 ns |
1.05 |
array/accumulate/Int64/dims=1L |
12308208 ns |
11609562.5 ns |
1.06 |
array/accumulate/Int64/dims=2L |
9569834 ns |
9610834 ns |
1.00 |
array/accumulate/Float32/1d |
1087583 ns |
1111187.5 ns |
0.98 |
array/accumulate/Float32/dims=1 |
1635146 ns |
1518146 ns |
1.08 |
array/accumulate/Float32/dims=2 |
1999958 ns |
1836167 ns |
1.09 |
array/accumulate/Float32/dims=1L |
10442708 ns |
9757375 ns |
1.07 |
array/accumulate/Float32/dims=2L |
7382333 ns |
7203562.5 ns |
1.02 |
array/reductions/reduce/Int64/1d |
1351917 ns |
1498333 ns |
0.90 |
array/reductions/reduce/Int64/dims=1 |
1124583 ns |
1076542 ns |
1.04 |
array/reductions/reduce/Int64/dims=2 |
1159958.5 ns |
1129417 ns |
1.03 |
array/reductions/reduce/Int64/dims=1L |
2043479.5 ns |
2002083.5 ns |
1.02 |
array/reductions/reduce/Int64/dims=2L |
3846792 ns |
4214895.5 ns |
0.91 |
array/reductions/reduce/Float32/1d |
787417 ns |
991375 ns |
0.79 |
array/reductions/reduce/Float32/dims=1 |
799250 ns |
827000 ns |
0.97 |
array/reductions/reduce/Float32/dims=2 |
836875 ns |
833917 ns |
1.00 |
array/reductions/reduce/Float32/dims=1L |
1325854.5 ns |
1305125 ns |
1.02 |
array/reductions/reduce/Float32/dims=2L |
1817417 ns |
1788375 ns |
1.02 |
array/reductions/mapreduce/Int64/1d |
1334604.5 ns |
1549292 ns |
0.86 |
array/reductions/mapreduce/Int64/dims=1 |
1119708 ns |
1085333 ns |
1.03 |
array/reductions/mapreduce/Int64/dims=2 |
1155875 ns |
1201959 ns |
0.96 |
array/reductions/mapreduce/Int64/dims=1L |
2020417 ns |
2019583 ns |
1.00 |
array/reductions/mapreduce/Int64/dims=2L |
3650958.5 ns |
3628521 ns |
1.01 |
array/reductions/mapreduce/Float32/1d |
793458 ns |
1036542 ns |
0.77 |
array/reductions/mapreduce/Float32/dims=1 |
802270.5 ns |
819667 ns |
0.98 |
array/reductions/mapreduce/Float32/dims=2 |
814333 ns |
843917 ns |
0.96 |
array/reductions/mapreduce/Float32/dims=1L |
1354667 ns |
1280500 ns |
1.06 |
array/reductions/mapreduce/Float32/dims=2L |
1840083 ns |
1784500 ns |
1.03 |
array/private/copyto!/gpu_to_gpu |
573291 ns |
635375 ns |
0.90 |
array/private/copyto!/cpu_to_gpu |
675145.5 ns |
786625 ns |
0.86 |
array/private/copyto!/gpu_to_cpu |
732000 ns |
773833 ns |
0.95 |
array/private/iteration/findall/int |
1578083 ns |
1620458 ns |
0.97 |
array/private/iteration/findall/bool |
1468209 ns |
1430125 ns |
1.03 |
array/private/iteration/findfirst/int |
2091125 ns |
2024937.5 ns |
1.03 |
array/private/iteration/findfirst/bool |
2028916.5 ns |
2010916 ns |
1.01 |
array/private/iteration/scalar |
3334625 ns |
5600375 ns |
0.60 |
array/private/iteration/logical |
2675125 ns |
2504521 ns |
1.07 |
array/private/iteration/findmin/1d |
2249020.5 ns |
2209917 ns |
1.02 |
array/private/iteration/findmin/2d |
1536791.5 ns |
1498584 ns |
1.03 |
array/private/copy |
868542 ns |
558312.5 ns |
1.56 |
array/shared/copyto!/gpu_to_gpu |
84187.5 ns |
82042 ns |
1.03 |
array/shared/copyto!/cpu_to_gpu |
83979.5 ns |
79750 ns |
1.05 |
array/shared/copyto!/gpu_to_cpu |
84083 ns |
82125 ns |
1.02 |
array/shared/iteration/findall/int |
1578396 ns |
1600354 ns |
0.99 |
array/shared/iteration/findall/bool |
1479833 ns |
1452458 ns |
1.02 |
array/shared/iteration/findfirst/int |
1699563 ns |
1621520.5 ns |
1.05 |
array/shared/iteration/findfirst/bool |
1631667 ns |
1607916.5 ns |
1.01 |
array/shared/iteration/scalar |
203708 ns |
202916 ns |
1.00 |
array/shared/iteration/logical |
2309417 ns |
2386416.5 ns |
0.97 |
array/shared/iteration/findmin/1d |
1882604.5 ns |
1799396 ns |
1.05 |
array/shared/iteration/findmin/2d |
1552271.5 ns |
1500416.5 ns |
1.03 |
array/shared/copy |
215291.5 ns |
230791 ns |
0.93 |
array/permutedims/4d |
2497729.5 ns |
2358000 ns |
1.06 |
array/permutedims/2d |
1205896 ns |
1133208 ns |
1.06 |
array/permutedims/3d |
1798750 ns |
1645604 ns |
1.09 |
metal/synchronization/stream |
19167 ns |
18500 ns |
1.04 |
metal/synchronization/context |
19750 ns |
19625 ns |
1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
The warmup task is intentionally skipped when Threads.nthreads() == 1 to avoid blocking the main thread. Updated tests to: - Check thread count before testing task existence - Test that _warmup_task[] === nothing on single thread - Only run multi-threaded specific tests when nthreads > 1 - API tests (warmup() calls) work in both modes
Removed thread count checks and internal state inspection (_warmup_task[]). Tests now verify: - warmup() returns nothing regardless of configuration - Multiple warmup calls are safe - Kernels compile and execute correctly after warmup This makes tests robust across all thread configurations without branching on implementation details.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #721 +/- ##
==========================================
- Coverage 80.96% 80.54% -0.42%
==========================================
Files 62 63 +1
Lines 2837 2858 +21
==========================================
+ Hits 2297 2302 +5
- Misses 540 556 +16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
src/initialization.jl
Outdated
| # Only run with multiple threads - with a single thread, the async task would | ||
| # block the main thread due to Julia's cooperative task runtime. | ||
| return if functional() && _warmup_enabled && Threads.nthreads() > 1 | ||
| _warmup_task[] = errormonitor(@async _warmup_compilation()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| _warmup_task[] = errormonitor(@async _warmup_compilation()) | |
| _warmup_task[] = errormonitor(Threads.@spawn _warmup_compilation()) |
@async is pinned to the same thread as parent.
|
I overlooked that |
Summary
The first GPU kernel in a Metal.jl session takes ~1.75 seconds due to one-time JIT compilation of the GPU compilation pipeline (GPUCompiler, LLVM passes, etc.). This PR introduces async background warmup during package initialization to reduce this to 0.035-0.20 seconds—a 9-50x improvement in perceived first-kernel latency.
Problem
Users experience a jarring ~2 second delay on their first GPU operation:
This causes:
Root Cause Analysis
The delay was previously attributed to memory page faults on large arrays. Investigation revealed this is incorrect—the actual cause is JIT compilation:
Solution
Start a minimal kernel compilation in the background during
__init__()when multiple threads are available. By the time users run their first kernel, most or all initialization is complete.Key Discovery
Concurrent compilations share the one-time initialization overhead:
The user kernel benefits from shared initialization even when warmup hasn't completed, due to lock serialization in
mtlfunction.Changes
New Files
src/warmup.jl- Warmup kernel and publicMetal.warmup()APItest/warmup.jl- Unit tests for warmup functionalityModified Files
src/initialization.jl- Add warmup task startup in__init__()src/Metal.jl- Include warmup moduleAPI Additions
Note:
warmupis not exported to avoid namespace pollution. Call viaMetal.warmup().Preferences
Users can disable warmup via
LocalPreferences.toml:Performance
Trade-offs
What does the user lose? Nothing meaningful:
The background CPU usage is practically unnoticeable on modern Apple Silicon Macs (8+ cores). Benchmarks show <2% overhead on concurrent CPU workloads—well within measurement noise. The compilation work would happen anyway on the user's first kernel; we're simply shifting it to run earlier in the background while the user's code is still setting up.
Users who need to measure cold-start compilation (debugging/profiling) can disable via preference.
Why This Matters
Misleading Benchmarks Lead to Wasted Debugging Time
Without warmup, users comparing CPU vs GPU performance get dramatically wrong conclusions:
A user unaware of this one-time JIT cost might:
First Impressions for New Users
(Highly relevant for computational scientists with specializations in biology, neuroscience, chemistry, etc. who might not know or care about compilation mechanics despite being the target audience for Julia)
When someone evaluates Metal.jl for the first time:
This 2-second hang on a trivial operation creates a poor first impression, especially compared to frameworks like PyTorch or CUDA.jl where GPU operations feel instant. With async warmup, the experience becomes what users expect—responsive from the first interaction.
Testing
All existing tests pass. New tests added:
Metal.warmup()API works correctlyCommunity Concerns
Single-threaded REPL blocking
Concern: In single-threaded mode, Julia's cooperative scheduling means JIT compilation doesn't yield, potentially blocking the REPL during warmup.
Response: Metal.jl users are pursuing GPU computing on Apple Silicon. It's reasonable to expect they've explored CPU parallelism first (setting
-t autoorJULIA_NUM_THREADS), which is typically a prerequisite step before GPU for real end users in scientific computing work.Default to old behaviour: Warmup only runs when
Threads.nthreads() > 1(i.e., when Julia is started with-t autoorJULIA_NUM_THREADS > 1).With a single thread, Julia's cooperative task runtime means an async task would block the main thread during JIT compilation, potentially hurting perceived REPL latency. To avoid this, Metal.jl warmup is skipped entirely in single-threaded mode—users get the same behaviour as before this PR (assuming this helps with perceived responsiveness for these niche users).
This addresses @vchuravy's concern about REPL blocking while still providing the optimization for the common case (multi-threaded Julia for Metal.jl users on Apple Silicon).