
Conversation

@KaanKesginLW (Contributor) commented Dec 5, 2025

Summary

The first GPU kernel in a Metal.jl session takes ~1.75 seconds because of one-time JIT compilation of the GPU compilation machinery itself (GPUCompiler internals, LLVM passes, etc.). This PR introduces an async background warmup during package initialization that reduces this to 0.035-0.20 seconds, a 9-50x improvement in perceived first-kernel latency.

Problem

Users experience a jarring ~2 second delay on their first GPU operation:

using Metal
a = MtlArray(rand(Float32, 1024, 1024))
@time fill!(a, 1.0f0)  # 1.75s - unexpected!
@time fill!(a, 2.0f0)  # 0.001s - fast as expected

This causes:

  • Misleading benchmark results (first iteration 50x slower)
  • Poor first impressions for new users evaluating Metal.jl
  • Confusion ("is this a memory issue? a bug?")

Root Cause Analysis

The delay was previously attributed to memory page faults on large arrays. Investigation revealed this is incorrect—the actual cause is JIT compilation:

Evidence | Finding
1KB array | Same 1.75s delay as 512MB
Storage mode | No difference (Private vs Shared)
Compilation stages | check_method (0.2s) + LLVM IR gen (1.1s) + AIR (0.1s)

Solution

Start a minimal kernel compilation in the background during __init__() when multiple threads are available. By the time users run their first kernel, most or all initialization is complete.
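
In outline, the mechanism looks roughly like this (a sketch of the approach, not the PR's literal diff; the real code lives in src/initialization.jl and src/warmup.jl):

function __init__()
    # ... existing Metal.jl initialization ...

    # Only spawn the warmup when the runtime is functional, the preference
    # allows it, and more than one thread is available (see Community
    # Concerns below).
    if functional() && _warmup_enabled && Threads.nthreads() > 1
        _warmup_task[] = errormonitor(Threads.@spawn _warmup_compilation())
    end
end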

Key Discovery

Concurrent compilations share the one-time initialization overhead:

Warmup kernel:  1.620s
User kernel:    0.196s  (concurrent, not 1.7s!)
Total wall:     1.808s

The user kernel benefits from shared initialization even when warmup hasn't completed, due to lock serialization in mtlfunction.

Changes

New Files

  • src/warmup.jl - Warmup kernel and public Metal.warmup() API
  • test/warmup.jl - Unit tests for warmup functionality

Modified Files

  • src/initialization.jl - Add warmup task startup in __init__()
  • src/Metal.jl - Include warmup module

API Additions

Metal.warmup(; blocking=true)  # Wait for warmup to complete
Metal.warmup(blocking=false)   # Return immediately

Note: warmup is not exported to avoid namespace pollution. Call via Metal.warmup().
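
The PR description does not reproduce src/warmup.jl; based on the names it references (_warmup_task, _warmup_compilation), the API plausibly has roughly this shape (a hedged sketch, not the merged implementation):

const _warmup_task = Ref{Union{Task,Nothing}}(nothing)

# Compile and run a trivial one-element kernel so the GPUCompiler pipeline
# (method checking, LLVM IR generation, AIR codegen) is exercised once.
function _warmup_compilation()
    a = MtlArray{Float32}(undef, 1)   # the 4 bytes mentioned under Trade-offs
    fill!(a, 0f0)
    return nothing
end

"""
    Metal.warmup(; blocking=true)

Wait for the background warmup task to finish (`blocking=true`), or return
immediately (`blocking=false`). Returns `nothing` in either case.
"""
function warmup(; blocking::Bool=true)
    t = _warmup_task[]
    t === nothing && return nothing   # warmup disabled or single-threaded session
    blocking && wait(t)
    return nothing
end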

Preferences

Users can disable warmup via LocalPreferences.toml:

[Metal]
warmup = false
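
The PR text doesn't show how the preference is consumed; with Preferences.jl the conventional pattern is roughly this (a sketch, assuming the value is read at precompile time inside the Metal module so it resolves to Metal's UUID):

using Preferences

# Defaults to true; users opt out via LocalPreferences.toml as shown above.
const _warmup_enabled = @load_preference("warmup", true)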

Performance

Scenario | Before | After | Improvement
Explicit wait | 1.75s | 0.035s | 50x
Immediate (concurrent) | 1.75s | 0.20s | 9x
Typical workflow | 1.75s | 0.04-0.15s | 12-44x

Trade-offs

What does the user lose? Nothing meaningful:

Concern | Impact
Import time | Unchanged (~1.1s); warmup runs in the background and doesn't block
Memory | 4 bytes temporarily allocated, freed immediately
CPU | ~1.7s of single-threaded background work
Correctness | Unaffected
API | No breaking changes

The background CPU usage is practically unnoticeable on modern Apple Silicon Macs (8+ cores). Benchmarks show <2% overhead on concurrent CPU workloads—well within measurement noise. The compilation work would happen anyway on the user's first kernel; we're simply shifting it to run earlier in the background while the user's code is still setting up.

Users who need to measure cold-start compilation (debugging/profiling) can disable via preference.

Why This Matters

Misleading Benchmarks Lead to Wasted Debugging Time

Without warmup, users comparing CPU vs GPU performance get dramatically wrong conclusions:

Matrix multiply (4096×4096 Float32):
  CPU: 0.306s
  GPU (first call):  1.012s  ← User thinks GPU is 3x SLOWER
  GPU (second call): 0.019s  ← Actual: GPU is 16x FASTER

A user unaware of this one-time JIT cost might:

  • Conclude Metal.jl is slower than CPU and abandon it
  • Spend hours debugging a non-existent "performance bug"
  • File issues about inconsistent profiling results
  • Distrust their own benchmarks
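
Until a warmup like this is in place, the standard workaround is to trigger compilation manually before timing; a minimal sketch of that pattern:

using Metal

A = MtlArray(rand(Float32, 4096, 4096))
B = MtlArray(rand(Float32, 4096, 4096))

A * B                                  # first call pays the one-time JIT cost
Metal.synchronize()
@time (A * B; Metal.synchronize())     # now reflects steady-state GPU performance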

First Impressions for New Users

(This is especially relevant for computational scientists in biology, neuroscience, chemistry, and similar fields, who may not know or care about compilation mechanics even though they are exactly the audience Julia targets.)

When someone evaluates Metal.jl for the first time:

julia> using Metal
julia> a = MtlArray([1, 2, 3])
julia> @time a .+ 1   # 1.7s delay - "is this broken?"

This 2-second hang on a trivial operation creates a poor first impression, especially compared to frameworks like PyTorch or CUDA.jl where GPU operations feel instant. With async warmup, the experience becomes what users expect—responsive from the first interaction.
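
With this change, a user who wants the compilation cost fully paid before their first measurement can block on the warmup explicitly, using the API described above:

julia> using Metal

julia> Metal.warmup()      # waits for the background warmup to finish

julia> a = MtlArray([1, 2, 3]);

julia> @time a .+ 1;       # no multi-second JIT pause on first use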

Testing

All existing tests pass. New tests added:

  • Warmup task starts and completes without error
  • Metal.warmup() API works correctly
  • Kernel compilation is fast after warmup
  • Concurrent compilations don't deadlock

Community Concerns

Single-threaded REPL blocking

Concern: In single-threaded mode, Julia's cooperative scheduling means JIT compilation doesn't yield, potentially blocking the REPL during warmup.

Response: Metal.jl users are pursuing GPU computing on Apple Silicon, so it's reasonable to expect they have explored CPU parallelism first (starting Julia with -t auto or setting JULIA_NUM_THREADS), a step real end users in scientific computing typically take before reaching for the GPU.

Default to old behaviour: Warmup only runs when Threads.nthreads() > 1 (i.e., when Julia is started with -t auto or JULIA_NUM_THREADS > 1).

With a single thread, Julia's cooperative task runtime means the async task would run on, and block, the main thread during JIT compilation, hurting perceived REPL latency. To avoid this, Metal.jl's warmup is skipped entirely in single-threaded mode, so those sessions get exactly the same behaviour as before this PR.

This addresses @vchuravy's concern about REPL blocking while still providing the optimization for the common case (multi-threaded Julia for Metal.jl users on Apple Silicon).

The first GPU kernel in a Metal.jl session takes ~1.75s due to one-time
JIT compilation of GPUCompiler internals. This adds async background
warmup during package initialization to reduce this to 0.035-0.20s—a
9-50x improvement in perceived first-kernel latency.

Implementation:
- Start minimal kernel compilation in background during __init__()
- Add Metal.warmup() API for explicit synchronization
- Add "warmup" preference to disable if needed

Key findings from investigation:
- Overhead is JIT compilation, not memory page faults
- Size-independent: 1KB and 512MB arrays have same delay
- Concurrent compilations share initialization (lock serialization)
- User kernel benefits even if warmup hasn't completed
github-actions bot commented Dec 5, 2025

Your PR no longer requires formatting changes. Thank you for your contribution!

@vchuravy (Member) commented Dec 5, 2025

I think this is the wrong approach. A task started in the background can negatively impact perceived latency, for example by blocking the REPL.

There is https://github.com/JuliaGPU/GPUCompiler.jl/blob/e4a697f3b77f5c4ccb3a63354731c022648026d7/src/jlgen.jl#L681 to allow for precompilation of compiler jobs which would warm up the infrastructure and allow you to move this work to precompilation time.

@KaanKesginLW (Contributor, Author):
It's async + measurements provided in PR description

@vchuravy (Member) commented Dec 5, 2025

Julia uses a cooperative task runtime, so saying that something is async doesn't mean that much. If you launch single-threaded, the thread will be blocked.

@KaanKesginLW (Contributor, Author) commented Dec 5, 2025

Updated PR description to address these concerns.

src/warmup.jl Outdated
Comment on lines 8 to 9
export warmup

Review comment (Member):

This should fix the benchmark error.

Suggested change
export warmup

- Remove `export warmup` to avoid benchmark API change detection
  (warmup still accessible via Metal.warmup())
- Only run async warmup when Threads.nthreads() > 1 to address
  vchuravy's concern about blocking the REPL on single-threaded Julia
- Update docstring to reflect these changes
github-actions bot left a comment:

Metal Benchmarks

Benchmark suite | Current: d4db4a1 | Previous: 67d668c | Ratio
latency/precompile 24439509417 ns 24820843000 ns 0.98
latency/ttfp 2324493979 ns 2257593833 ns 1.03
latency/import 1433921333 ns 1431203750 ns 1.00
integration/metaldevrt 837666.5 ns 834875 ns 1.00
integration/byval/slices=1 1586208 ns 1525666.5 ns 1.04
integration/byval/slices=3 20564666.5 ns 8498958 ns 2.42
integration/byval/reference 1586374.5 ns 1538166 ns 1.03
integration/byval/slices=2 2743833 ns 2552562 ns 1.07
kernel/indexing 490834 ns 593833 ns 0.83
kernel/indexing_checked 495520.5 ns 575750 ns 0.86
kernel/launch 12750 ns 11250 ns 1.13
kernel/rand 522709 ns 557187.5 ns 0.94
array/construct 6375 ns 6000 ns 1.06
array/broadcast 542709 ns 591209 ns 0.92
array/random/randn/Float32 921188 ns 836917 ns 1.10
array/random/randn!/Float32 583834 ns 619542 ns 0.94
array/random/rand!/Int64 535542 ns 548834 ns 0.98
array/random/rand!/Float32 545083 ns 593333 ns 0.92
array/random/rand/Int64 955375 ns 735667 ns 1.30
array/random/rand/Float32 813687 ns 631792 ns 1.29
array/accumulate/Int64/1d 1313042 ns 1237125 ns 1.06
array/accumulate/Int64/dims=1 1876000 ns 1795625 ns 1.04
array/accumulate/Int64/dims=2 2239000 ns 2130458 ns 1.05
array/accumulate/Int64/dims=1L 12308208 ns 11609562.5 ns 1.06
array/accumulate/Int64/dims=2L 9569834 ns 9610834 ns 1.00
array/accumulate/Float32/1d 1087583 ns 1111187.5 ns 0.98
array/accumulate/Float32/dims=1 1635146 ns 1518146 ns 1.08
array/accumulate/Float32/dims=2 1999958 ns 1836167 ns 1.09
array/accumulate/Float32/dims=1L 10442708 ns 9757375 ns 1.07
array/accumulate/Float32/dims=2L 7382333 ns 7203562.5 ns 1.02
array/reductions/reduce/Int64/1d 1351917 ns 1498333 ns 0.90
array/reductions/reduce/Int64/dims=1 1124583 ns 1076542 ns 1.04
array/reductions/reduce/Int64/dims=2 1159958.5 ns 1129417 ns 1.03
array/reductions/reduce/Int64/dims=1L 2043479.5 ns 2002083.5 ns 1.02
array/reductions/reduce/Int64/dims=2L 3846792 ns 4214895.5 ns 0.91
array/reductions/reduce/Float32/1d 787417 ns 991375 ns 0.79
array/reductions/reduce/Float32/dims=1 799250 ns 827000 ns 0.97
array/reductions/reduce/Float32/dims=2 836875 ns 833917 ns 1.00
array/reductions/reduce/Float32/dims=1L 1325854.5 ns 1305125 ns 1.02
array/reductions/reduce/Float32/dims=2L 1817417 ns 1788375 ns 1.02
array/reductions/mapreduce/Int64/1d 1334604.5 ns 1549292 ns 0.86
array/reductions/mapreduce/Int64/dims=1 1119708 ns 1085333 ns 1.03
array/reductions/mapreduce/Int64/dims=2 1155875 ns 1201959 ns 0.96
array/reductions/mapreduce/Int64/dims=1L 2020417 ns 2019583 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 3650958.5 ns 3628521 ns 1.01
array/reductions/mapreduce/Float32/1d 793458 ns 1036542 ns 0.77
array/reductions/mapreduce/Float32/dims=1 802270.5 ns 819667 ns 0.98
array/reductions/mapreduce/Float32/dims=2 814333 ns 843917 ns 0.96
array/reductions/mapreduce/Float32/dims=1L 1354667 ns 1280500 ns 1.06
array/reductions/mapreduce/Float32/dims=2L 1840083 ns 1784500 ns 1.03
array/private/copyto!/gpu_to_gpu 573291 ns 635375 ns 0.90
array/private/copyto!/cpu_to_gpu 675145.5 ns 786625 ns 0.86
array/private/copyto!/gpu_to_cpu 732000 ns 773833 ns 0.95
array/private/iteration/findall/int 1578083 ns 1620458 ns 0.97
array/private/iteration/findall/bool 1468209 ns 1430125 ns 1.03
array/private/iteration/findfirst/int 2091125 ns 2024937.5 ns 1.03
array/private/iteration/findfirst/bool 2028916.5 ns 2010916 ns 1.01
array/private/iteration/scalar 3334625 ns 5600375 ns 0.60
array/private/iteration/logical 2675125 ns 2504521 ns 1.07
array/private/iteration/findmin/1d 2249020.5 ns 2209917 ns 1.02
array/private/iteration/findmin/2d 1536791.5 ns 1498584 ns 1.03
array/private/copy 868542 ns 558312.5 ns 1.56
array/shared/copyto!/gpu_to_gpu 84187.5 ns 82042 ns 1.03
array/shared/copyto!/cpu_to_gpu 83979.5 ns 79750 ns 1.05
array/shared/copyto!/gpu_to_cpu 84083 ns 82125 ns 1.02
array/shared/iteration/findall/int 1578396 ns 1600354 ns 0.99
array/shared/iteration/findall/bool 1479833 ns 1452458 ns 1.02
array/shared/iteration/findfirst/int 1699563 ns 1621520.5 ns 1.05
array/shared/iteration/findfirst/bool 1631667 ns 1607916.5 ns 1.01
array/shared/iteration/scalar 203708 ns 202916 ns 1.00
array/shared/iteration/logical 2309417 ns 2386416.5 ns 0.97
array/shared/iteration/findmin/1d 1882604.5 ns 1799396 ns 1.05
array/shared/iteration/findmin/2d 1552271.5 ns 1500416.5 ns 1.03
array/shared/copy 215291.5 ns 230791 ns 0.93
array/permutedims/4d 2497729.5 ns 2358000 ns 1.06
array/permutedims/2d 1205896 ns 1133208 ns 1.06
array/permutedims/3d 1798750 ns 1645604 ns 1.09
metal/synchronization/stream 19167 ns 18500 ns 1.04
metal/synchronization/context 19750 ns 19625 ns 1.01

This comment was automatically generated by workflow using github-action-benchmark.

The warmup task is intentionally skipped when Threads.nthreads() == 1
to avoid blocking the main thread. Updated tests to:
- Check thread count before testing task existence
- Test that _warmup_task[] === nothing on single thread
- Only run multi-threaded specific tests when nthreads > 1
- API tests (warmup() calls) work in both modes
Removed thread count checks and internal state inspection (_warmup_task[]).
Tests now verify:
- warmup() returns nothing regardless of configuration
- Multiple warmup calls are safe
- Kernels compile and execute correctly after warmup

This makes tests robust across all thread configurations without
branching on implementation details.
codecov bot commented Dec 7, 2025

Codecov Report

❌ Patch coverage is 21.73913% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.54%. Comparing base (239fa4d) to head (d4db4a1).
⚠️ Report is 5 commits behind head on main.

Files with missing lines | Patch % | Lines
src/warmup.jl | 21.05% | 15 Missing ⚠️
src/initialization.jl | 25.00% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #721      +/-   ##
==========================================
- Coverage   80.96%   80.54%   -0.42%     
==========================================
  Files          62       63       +1     
  Lines        2837     2858      +21     
==========================================
+ Hits         2297     2302       +5     
- Misses        540      556      +16     


# Only run with multiple threads - with a single thread, the async task would
# block the main thread due to Julia's cooperative task runtime.
return if functional() && _warmup_enabled && Threads.nthreads() > 1
_warmup_task[] = errormonitor(@async _warmup_compilation())
Review comment (Member):

Suggested change:
- _warmup_task[] = errormonitor(@async _warmup_compilation())
+ _warmup_task[] = errormonitor(Threads.@spawn _warmup_compilation())

@async is pinned to the same thread as the parent.

@async pins the task to the same thread as the parent, which would still
block thread 1 even with multiple threads available. Threads.@spawn
lets the scheduler run the warmup on another available thread.
@KaanKesginLW (Contributor, Author):
I overlooked that @async pins to the parent thread. Applied Threads.@spawn in d4db4a1.
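
For reference, a minimal illustration of the difference (not from the PR):

t1 = @async Threads.threadid()          # sticky: runs on the parent task's thread
t2 = Threads.@spawn Threads.threadid()  # may run on any thread in the default pool
fetch(t1), fetch(t2)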
