Add async background warmup to reduce first-kernel latency #721

KaanKesginLW · 2025-12-05T10:25:07Z

Summary

The first GPU kernel in a Metal.jl session takes ~1.75 seconds due to one-time JIT compilation of the GPU compilation pipeline (GPUCompiler, LLVM passes, etc.). This PR introduces async background warmup during package initialization to reduce this to 0.035-0.20 seconds—a 9-50x improvement in perceived first-kernel latency.

Problem

Users experience a jarring ~2 second delay on their first GPU operation:

using Metal
a = MtlArray(rand(Float32, 1024, 1024))
@time fill!(a, 1.0f0)  # 1.75s - unexpected!
@time fill!(a, 2.0f0)  # 0.001s - fast as expected

This causes:

Misleading benchmark results (first iteration 50x slower)
Poor first impressions for new users evaluating Metal.jl
Confusion ("is this a memory issue? a bug?")

Root Cause Analysis

The delay was previously attributed to memory page faults on large arrays. Investigation revealed this is incorrect—the actual cause is JIT compilation:

Evidence	Finding
1KB array	Same 1.75s delay as 512MB
Storage mode	No difference (Private vs Shared)
Compilation stages	check_method (0.2s) + LLVM IR gen (1.1s) + AIR (0.1s)

Solution

Start a minimal kernel compilation in the background during __init__() when multiple threads are available. By the time users run their first kernel, most or all initialization is complete.

Key Discovery

Concurrent compilations share the one-time initialization overhead:

Warmup kernel:  1.620s
User kernel:    0.196s  (concurrent, not 1.7s!)
Total wall:     1.808s

The user kernel benefits from shared initialization even when warmup hasn't completed, due to lock serialization in mtlfunction.

Changes

New Files

src/warmup.jl - Warmup kernel and public Metal.warmup() API
test/warmup.jl - Unit tests for warmup functionality

Modified Files

src/initialization.jl - Add warmup task startup in __init__()
src/Metal.jl - Include warmup module

API Additions

Metal.warmup(; blocking=true)  # Wait for warmup to complete
Metal.warmup(blocking=false)   # Return immediately

Note: warmup is not exported to avoid namespace pollution. Call via Metal.warmup().

Preferences

Users can disable warmup via LocalPreferences.toml:

[Metal]
warmup = false

Performance

Scenario	Before	After	Improvement
Explicit wait	1.75s	0.035s	50x
Immediate (concurrent)	1.75s	0.20s	9x
Typical workflow	1.75s	0.04-0.15s	12-44x

Trade-offs

What does the user lose? Nothing meaningful:

Concern	Impact
Import time	Unchanged (~1.1s) - warmup runs in background, doesn't block
Memory	4 bytes temporarily allocated, freed immediately
CPU	~1.7s of single-threaded background work
Correctness	Unaffected
API	No breaking changes

The background CPU usage is practically unnoticeable on modern Apple Silicon Macs (8+ cores). Benchmarks show <2% overhead on concurrent CPU workloads—well within measurement noise. The compilation work would happen anyway on the user's first kernel; we're simply shifting it to run earlier in the background while the user's code is still setting up.

Users who need to measure cold-start compilation (debugging/profiling) can disable via preference.

Why This Matters

Misleading Benchmarks Lead to Wasted Debugging Time

Without warmup, users comparing CPU vs GPU performance get dramatically wrong conclusions:

Matrix multiply (4096×4096 Float32):
  CPU: 0.306s
  GPU (first call):  1.012s  ← User thinks GPU is 3x SLOWER
  GPU (second call): 0.019s  ← Actual: GPU is 16x FASTER

A user unaware of this one-time JIT cost might:

Conclude Metal.jl is slower than CPU and abandon it
Spend hours debugging a non-existent "performance bug"
File issues about inconsistent profiling results
Distrust their own benchmarks

First Impressions for New Users

(Highly relevant for computational scientists with specializations in biology, neuroscience, chemistry, etc. who might not know or care about compilation mechanics despite being the target audience for Julia)

When someone evaluates Metal.jl for the first time:

julia> using Metal
julia> a = MtlArray([1, 2, 3])
julia> @time a .+ 1   # 1.7s delay - "is this broken?"

This 2-second hang on a trivial operation creates a poor first impression, especially compared to frameworks like PyTorch or CUDA.jl where GPU operations feel instant. With async warmup, the experience becomes what users expect—responsive from the first interaction.

Testing

All existing tests pass. New tests added:

Warmup task starts and completes without error
Metal.warmup() API works correctly
Kernel compilation is fast after warmup
Concurrent compilations don't deadlock

Community Concerns

Single-threaded REPL blocking

Concern: In single-threaded mode, Julia's cooperative scheduling means JIT compilation doesn't yield, potentially blocking the REPL during warmup.

Response: Metal.jl users are pursuing GPU computing on Apple Silicon. It's reasonable to expect they've explored CPU parallelism first (setting -t auto or JULIA_NUM_THREADS), which is typically a prerequisite step before GPU for real end users in scientific computing work.

Default to old behaviour: Warmup only runs when Threads.nthreads() > 1 (i.e., when Julia is started with -t auto or JULIA_NUM_THREADS > 1).

With a single thread, Julia's cooperative task runtime means an async task would block the main thread during JIT compilation, potentially hurting perceived REPL latency. To avoid this, Metal.jl warmup is skipped entirely in single-threaded mode—users get the same behaviour as before this PR (assuming this helps with perceived responsiveness for these niche users).

This addresses @vchuravy's concern about REPL blocking while still providing the optimization for the common case (multi-threaded Julia for Metal.jl users on Apple Silicon).

The first GPU kernel in a Metal.jl session takes ~1.75s due to one-time JIT compilation of GPUCompiler internals. This adds async background warmup during package initialization to reduce this to 0.035-0.20s—a 9-50x improvement in perceived first-kernel latency. Implementation: - Start minimal kernel compilation in background during __init__() - Add Metal.warmup() API for explicit synchronization - Add "warmup" preference to disable if needed Key findings from investigation: - Overhead is JIT compilation, not memory page faults - Size-independent: 1KB and 512MB arrays have same delay - Concurrent compilations share initialization (lock serialization) - User kernel benefits even if warmup hasn't completed

github-actions · 2025-12-05T10:26:05Z

Your PR no longer requires formatting changes. Thank you for your contribution!

vchuravy · 2025-12-05T12:20:22Z

I think this is the wrong approach. A task started in the background can negativly impact perceived latency, by blocking the REPL as an example.

There is https://github.com/JuliaGPU/GPUCompiler.jl/blob/e4a697f3b77f5c4ccb3a63354731c022648026d7/src/jlgen.jl#L681 to allow for precompilation of compiler jobs which would warm up the infrastructure and allow you to move this work to precompilation time.

KaanKesginLW · 2025-12-05T13:40:45Z

It's async + measurements provided in PR description

vchuravy · 2025-12-05T13:42:36Z

Julia uses a cooperative task runtime, so saying that something is async doesn't mean that much. If you launch single threaded, the thread will be blocked.

KaanKesginLW · 2025-12-05T13:45:02Z

Updated PR description to address these concerns.

christiangnrd · 2025-12-05T16:25:07Z

src/warmup.jl

+export warmup
+


This should fix the benchmark error.

Suggested change

export warmup

- Remove `export warmup` to avoid benchmark API change detection (warmup still accessible via Metal.warmup()) - Only run async warmup when Threads.nthreads() > 1 to address vchuravy's concern about blocking the REPL on single-threaded Julia - Update docstring to reflect these changes

github-actions

Metal Benchmarks

Details

Benchmark suite	Current: `d4db4a1`	Previous: `67d668c`	Ratio
`latency/precompile`	`24439509417` ns	`24820843000` ns	`0.98`
`latency/ttfp`	`2324493979` ns	`2257593833` ns	`1.03`
`latency/import`	`1433921333` ns	`1431203750` ns	`1.00`
`integration/metaldevrt`	`837666.5` ns	`834875` ns	`1.00`
`integration/byval/slices=1`	`1586208` ns	`1525666.5` ns	`1.04`
`integration/byval/slices=3`	`20564666.5` ns	`8498958` ns	`2.42`
`integration/byval/reference`	`1586374.5` ns	`1538166` ns	`1.03`
`integration/byval/slices=2`	`2743833` ns	`2552562` ns	`1.07`
`kernel/indexing`	`490834` ns	`593833` ns	`0.83`
`kernel/indexing_checked`	`495520.5` ns	`575750` ns	`0.86`
`kernel/launch`	`12750` ns	`11250` ns	`1.13`
`kernel/rand`	`522709` ns	`557187.5` ns	`0.94`
`array/construct`	`6375` ns	`6000` ns	`1.06`
`array/broadcast`	`542709` ns	`591209` ns	`0.92`
`array/random/randn/Float32`	`921188` ns	`836917` ns	`1.10`
`array/random/randn!/Float32`	`583834` ns	`619542` ns	`0.94`
`array/random/rand!/Int64`	`535542` ns	`548834` ns	`0.98`
`array/random/rand!/Float32`	`545083` ns	`593333` ns	`0.92`
`array/random/rand/Int64`	`955375` ns	`735667` ns	`1.30`
`array/random/rand/Float32`	`813687` ns	`631792` ns	`1.29`
`array/accumulate/Int64/1d`	`1313042` ns	`1237125` ns	`1.06`
`array/accumulate/Int64/dims=1`	`1876000` ns	`1795625` ns	`1.04`
`array/accumulate/Int64/dims=2`	`2239000` ns	`2130458` ns	`1.05`
`array/accumulate/Int64/dims=1L`	`12308208` ns	`11609562.5` ns	`1.06`
`array/accumulate/Int64/dims=2L`	`9569834` ns	`9610834` ns	`1.00`
`array/accumulate/Float32/1d`	`1087583` ns	`1111187.5` ns	`0.98`
`array/accumulate/Float32/dims=1`	`1635146` ns	`1518146` ns	`1.08`
`array/accumulate/Float32/dims=2`	`1999958` ns	`1836167` ns	`1.09`
`array/accumulate/Float32/dims=1L`	`10442708` ns	`9757375` ns	`1.07`
`array/accumulate/Float32/dims=2L`	`7382333` ns	`7203562.5` ns	`1.02`
`array/reductions/reduce/Int64/1d`	`1351917` ns	`1498333` ns	`0.90`
`array/reductions/reduce/Int64/dims=1`	`1124583` ns	`1076542` ns	`1.04`
`array/reductions/reduce/Int64/dims=2`	`1159958.5` ns	`1129417` ns	`1.03`
`array/reductions/reduce/Int64/dims=1L`	`2043479.5` ns	`2002083.5` ns	`1.02`
`array/reductions/reduce/Int64/dims=2L`	`3846792` ns	`4214895.5` ns	`0.91`
`array/reductions/reduce/Float32/1d`	`787417` ns	`991375` ns	`0.79`
`array/reductions/reduce/Float32/dims=1`	`799250` ns	`827000` ns	`0.97`
`array/reductions/reduce/Float32/dims=2`	`836875` ns	`833917` ns	`1.00`
`array/reductions/reduce/Float32/dims=1L`	`1325854.5` ns	`1305125` ns	`1.02`
`array/reductions/reduce/Float32/dims=2L`	`1817417` ns	`1788375` ns	`1.02`
`array/reductions/mapreduce/Int64/1d`	`1334604.5` ns	`1549292` ns	`0.86`
`array/reductions/mapreduce/Int64/dims=1`	`1119708` ns	`1085333` ns	`1.03`
`array/reductions/mapreduce/Int64/dims=2`	`1155875` ns	`1201959` ns	`0.96`
`array/reductions/mapreduce/Int64/dims=1L`	`2020417` ns	`2019583` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2L`	`3650958.5` ns	`3628521` ns	`1.01`
`array/reductions/mapreduce/Float32/1d`	`793458` ns	`1036542` ns	`0.77`
`array/reductions/mapreduce/Float32/dims=1`	`802270.5` ns	`819667` ns	`0.98`
`array/reductions/mapreduce/Float32/dims=2`	`814333` ns	`843917` ns	`0.96`
`array/reductions/mapreduce/Float32/dims=1L`	`1354667` ns	`1280500` ns	`1.06`
`array/reductions/mapreduce/Float32/dims=2L`	`1840083` ns	`1784500` ns	`1.03`
`array/private/copyto!/gpu_to_gpu`	`573291` ns	`635375` ns	`0.90`
`array/private/copyto!/cpu_to_gpu`	`675145.5` ns	`786625` ns	`0.86`
`array/private/copyto!/gpu_to_cpu`	`732000` ns	`773833` ns	`0.95`
`array/private/iteration/findall/int`	`1578083` ns	`1620458` ns	`0.97`
`array/private/iteration/findall/bool`	`1468209` ns	`1430125` ns	`1.03`
`array/private/iteration/findfirst/int`	`2091125` ns	`2024937.5` ns	`1.03`
`array/private/iteration/findfirst/bool`	`2028916.5` ns	`2010916` ns	`1.01`
`array/private/iteration/scalar`	`3334625` ns	`5600375` ns	`0.60`
`array/private/iteration/logical`	`2675125` ns	`2504521` ns	`1.07`
`array/private/iteration/findmin/1d`	`2249020.5` ns	`2209917` ns	`1.02`
`array/private/iteration/findmin/2d`	`1536791.5` ns	`1498584` ns	`1.03`
`array/private/copy`	`868542` ns	`558312.5` ns	`1.56`
`array/shared/copyto!/gpu_to_gpu`	`84187.5` ns	`82042` ns	`1.03`
`array/shared/copyto!/cpu_to_gpu`	`83979.5` ns	`79750` ns	`1.05`
`array/shared/copyto!/gpu_to_cpu`	`84083` ns	`82125` ns	`1.02`
`array/shared/iteration/findall/int`	`1578396` ns	`1600354` ns	`0.99`
`array/shared/iteration/findall/bool`	`1479833` ns	`1452458` ns	`1.02`
`array/shared/iteration/findfirst/int`	`1699563` ns	`1621520.5` ns	`1.05`
`array/shared/iteration/findfirst/bool`	`1631667` ns	`1607916.5` ns	`1.01`
`array/shared/iteration/scalar`	`203708` ns	`202916` ns	`1.00`
`array/shared/iteration/logical`	`2309417` ns	`2386416.5` ns	`0.97`
`array/shared/iteration/findmin/1d`	`1882604.5` ns	`1799396` ns	`1.05`
`array/shared/iteration/findmin/2d`	`1552271.5` ns	`1500416.5` ns	`1.03`
`array/shared/copy`	`215291.5` ns	`230791` ns	`0.93`
`array/permutedims/4d`	`2497729.5` ns	`2358000` ns	`1.06`
`array/permutedims/2d`	`1205896` ns	`1133208` ns	`1.06`
`array/permutedims/3d`	`1798750` ns	`1645604` ns	`1.09`
`metal/synchronization/stream`	`19167` ns	`18500` ns	`1.04`
`metal/synchronization/context`	`19750` ns	`19625` ns	`1.01`

This comment was automatically generated by workflow using github-action-benchmark.

The warmup task is intentionally skipped when Threads.nthreads() == 1 to avoid blocking the main thread. Updated tests to: - Check thread count before testing task existence - Test that _warmup_task[] === nothing on single thread - Only run multi-threaded specific tests when nthreads > 1 - API tests (warmup() calls) work in both modes

Removed thread count checks and internal state inspection (_warmup_task[]). Tests now verify: - warmup() returns nothing regardless of configuration - Multiple warmup calls are safe - Kernels compile and execute correctly after warmup This makes tests robust across all thread configurations without branching on implementation details.

codecov · 2025-12-07T12:55:16Z

Codecov Report

❌ Patch coverage is 21.73913% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.54%. Comparing base (239fa4d) to head (d4db4a1).
⚠️ Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
src/warmup.jl	21.05%	15 Missing ⚠️
src/initialization.jl	25.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #721      +/-   ##
==========================================
- Coverage   80.96%   80.54%   -0.42%     
==========================================
  Files          62       63       +1     
  Lines        2837     2858      +21     
==========================================
+ Hits         2297     2302       +5     
- Misses        540      556      +16

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

vchuravy · 2025-12-11T08:41:51Z

src/initialization.jl

+    # Only run with multiple threads - with a single thread, the async task would
+    # block the main thread due to Julia's cooperative task runtime.
+    return if functional() && _warmup_enabled && Threads.nthreads() > 1
+        _warmup_task[] = errormonitor(@async _warmup_compilation())


Suggested change

_warmup_task[] = errormonitor(@async _warmup_compilation())

_warmup_task[] = errormonitor(Threads.@spawn _warmup_compilation())

@async is pinned to the same thread as parent.

@Spawn

@async pins the task to the same thread as parent, which would still block thread 1 even with multiple threads available. Threads.@Spawn properly schedules the warmup on a worker thread.

KaanKesginLW · 2025-12-14T18:15:43Z

I overlooked that @async pins to the parent thread. Applied Threads.@spawn in d4db4a1.

Retrigger CI for benchmark failure investigation

4381232

christiangnrd reviewed Dec 5, 2025

View reviewed changes

KaanKesginLW added 2 commits December 5, 2025 17:31

Apply Runic formatting fixes

1889292

github-actions bot reviewed Dec 5, 2025

View reviewed changes

KaanKesginLW added 2 commits December 7, 2025 12:03

vchuravy reviewed Dec 11, 2025

View reviewed changes

Use Threads.@Spawn instead of @async for warmup task

d4db4a1

@async pins the task to the same thread as parent, which would still block thread 1 even with multiple threads available. Threads.@Spawn properly schedules the warmup on a worker thread.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add async background warmup to reduce first-kernel latency #721

Add async background warmup to reduce first-kernel latency #721

KaanKesginLW commented Dec 5, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Dec 5, 2025 •

edited

Loading

Uh oh!

vchuravy commented Dec 5, 2025

Uh oh!

KaanKesginLW commented Dec 5, 2025

Uh oh!

vchuravy commented Dec 5, 2025

Uh oh!

KaanKesginLW commented Dec 5, 2025 •

edited

Loading

Uh oh!

christiangnrd Dec 5, 2025

Uh oh!

github-actions bot left a comment •

edited

Loading

Uh oh!

codecov bot commented Dec 7, 2025 •

edited

Loading

Uh oh!

vchuravy Dec 11, 2025

Uh oh!

KaanKesginLW commented Dec 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	_warmup_task[] = errormonitor(@async _warmup_compilation())
	_warmup_task[] = errormonitor(Threads.@spawn _warmup_compilation())

Add async background warmup to reduce first-kernel latency #721

Are you sure you want to change the base?

Add async background warmup to reduce first-kernel latency #721

Conversation

KaanKesginLW commented Dec 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Root Cause Analysis

Solution

Key Discovery

Changes

New Files

Modified Files

API Additions

Preferences

Performance

Trade-offs

Why This Matters

Misleading Benchmarks Lead to Wasted Debugging Time

First Impressions for New Users

Testing

Community Concerns

Single-threaded REPL blocking

Uh oh!

github-actions bot commented Dec 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vchuravy commented Dec 5, 2025

Uh oh!

KaanKesginLW commented Dec 5, 2025

Uh oh!

vchuravy commented Dec 5, 2025

Uh oh!

KaanKesginLW commented Dec 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

christiangnrd Dec 5, 2025

Choose a reason for hiding this comment

Uh oh!

github-actions bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Metal Benchmarks

Uh oh!

codecov bot commented Dec 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

vchuravy Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

KaanKesginLW commented Dec 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

KaanKesginLW commented Dec 5, 2025 •

edited

Loading

github-actions bot commented Dec 5, 2025 •

edited

Loading

KaanKesginLW commented Dec 5, 2025 •

edited

Loading

github-actions bot left a comment •

edited

Loading

codecov bot commented Dec 7, 2025 •

edited

Loading