vLLM v0.12.0 Release Notes
Highlights
This release features 474 commits from 213 contributors (57 new)!
Breaking Changes: This release includes the PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including the xformers backend, and scheduled removals; please review the changelog carefully.
Major Features:
- EAGLE Speculative Decoding Improvements: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594); see the usage sketch after this list.
- Significant Performance Optimizations: 18.1% throughput improvement from batch invariant BMM (#29345), 2.2% throughput improvement from shared experts overlap (#28879).
- AMD ROCm Expansion: DeepSeek v3.2 + SparseMLA support (#26670), FP8 MLA decode (#28032), AITER attention backend (#28701).
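A minimal Python sketch of enabling the EAGLE path via the speculative_config argument; the target and draft checkpoint names are example choices, and field names should be verified against your installed version:

```python
from vllm import LLM, SamplingParams

# Sketch only: the checkpoint names below are example choices, not taken from
# this release; speculative_config is vLLM's standard way to enable EAGLE.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",           # target model
    speculative_config={
        "method": "eagle",                              # EAGLE-style drafting
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # draft model
        "num_speculative_tokens": 3,                    # drafted tokens per step
    },
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```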
Model Support
- New model families: PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).
- Format support: Gemma3 GGUF multimodal support (#27772).
- Multimodal enhancements: Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).
- Performance: QwenVL cos/sin cache optimization (#28798).
Engine Core
- GPU Model Runner V2 (Experimental) (#25266): Complete refactoring of the model execution pipeline:
  - No "reordering" or complex bookkeeping now that the persistent batch is removed
  - GPU-persistent block tables for better scalability with max_model_len and num_kv_groups
  - Triton-native sampler: no -1 temperature hack, efficient per-request seeds, memory-efficient prompt logprobs
  - Simplified DP and CUDA graph implementations
  - Efficient structured outputs support
- Prefill Context Parallel (PCP) (Preparatory) (#28718): Partitions the sequence dimension during prefill for improved long-sequence inference. Complements the existing Decode Context Parallel (DCP). See RFC #25749 for details.
- RLHF Support: Pause and resume generation for asynchronous RL training (#28037); see the sketch after this list.
- KV Cache Enhancements: Cross-layer KV blocks support (#27743), KV cache residency metrics (#27793).
- Audio support: Audio embeddings support in chat completions (#29059).
- Speculative Decoding: EAGLE multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594).
- Configuration: Flexible inputs_embeds_size separate from hidden_size (#29741), --fully-sharded-loras for fused_moe (#28761).
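For the RLHF item above, a hedged sketch of an asynchronous RL loop. The pause_generation()/resume_generation() method names are assumptions inferred from the PR title (#28037), and the AsyncLLM import path may differ by version; check the PR for the actual interface:

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.v1.engine.async_llm import AsyncLLM  # import path may differ by version


async def main() -> None:
    engine = AsyncLLM.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    params = SamplingParams(max_tokens=16)

    # Collect a rollout for the RL trainer.
    async for out in engine.generate("Rollout prompt", params, request_id="r0"):
        pass  # consume streamed tokens

    # Hypothetical API (names inferred from the PR title, #28037):
    # hold generation while weights are updated, then continue serving.
    await engine.pause_generation()
    # ... apply the trainer's weight update here ...
    await engine.resume_generation()


asyncio.run(main())
```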
Hardware & Performance
- NVIDIA Performance:
  - Batch invariant BMM optimization: 18.1% throughput improvement, 10.7% TTFT improvement on DeepSeek-V3.1 (#29345)
  - Shared Experts Overlap with FlashInfer DeepGEMM: 2.2% throughput improvement, 3.6% TTFT improvement at batch size 32 (#28879)
  - DeepGEMM N dim restriction reduced from 128 to 64 multiplier (#28687)
  - DeepEP low-latency with round-robin expert placement (#28449)
  - NVFP4 MoE CUTLASS support for SM120 (#29242)
  - H200 Fused MoE Config improvements (#28992)
- AMD ROCm:
  - DeepSeek v3.2 and SparseMLA support (#26670)
  - FP8 MLA decode support (#28032)
  - AITER sampling ops integration (#26084)
  - AITER triton attention backend (#28701)
  - Bitsandbytes quantization on AMD GPUs with warp size 32 (#27307)
  - Fastsafetensors support (#28225)
  - Sliding window support for AiterFlashAttentionBackend (#29234)
  - Whisper v1 with Aiter Unified/Flash Attention (#28376)
- CPU:
  - Attention: FlashAttention ViT support, now the default backend (#28763).
  - Long Context: Optimized gather_and_maybe_dequant_cache kernel for extremely long sequences (#28029).
  - Multi-NUMA: Enhanced NUMA functionality for systems with multiple NUMA nodes per socket (#25559).
- Docker: Image size reduced by ~200MB (#29060).
Quantization
- W4A8: Marlin kernel support (#24722).
- NVFP4:
- AWQ: Compressed-tensors AWQ support for Turing GPUs (#29732).
- LoRA: FusedMoE LoRA Triton kernel for MXFP4 (#29708).
- Online quantization: Moved to model.load_weights (#26327); see the sketch after this list.
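The load_weights refactor above is internal; the user-facing way to request online quantization is unchanged. A minimal sketch (the model name is an example choice, and fp8 is one supported online scheme):

```python
from vllm import LLM

# On-the-fly FP8 quantization of an unquantized checkpoint; the quantization
# now happens inside model.load_weights per #26327, but the call looks the same.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
print(llm.generate("Hello")[0].outputs[0].text)
```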
API & Frontend
- Responses API:
- Tool Calling:
- Whisper: verbose_json and timestamp features for transcription/translation (#24209).
- Sampling: Flat logprob control moved from an env var to SamplingParams (#28914).
- GGUF: Improved HuggingFace loading UX with repo_id:quant_type syntax (#29137); see the sketch after this list.
- Profiling: Iteration-level profiling for the Torch and CUDA profilers (#28987).
- Logs: Colorized log output (#29017).
- Optimization Levels: -O0, -O1, -O2, -O3 allow trading startup time for performance; more compilation flags will be added in future releases (#26847).
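A minimal sketch of the new GGUF loading syntax described above: pass "repo_id:quant_type" as the model name and vLLM resolves the matching .gguf file from the Hugging Face repo. The repo and quant tag below are example choices:

```python
from vllm import LLM

# "repo_id:quant_type" per #29137; if tokenizer resolution fails for a GGUF
# repo, passing tokenizer=<base model repo> may still be needed (assumption).
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct-GGUF:Q4_K_M")
print(llm.generate("Hello")[0].outputs[0].text)
```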
Dependencies
- PyTorch 2.9.0 with CUDA 12.9 (#24994) - Breaking change requiring environment updates.
- xgrammar: Updated to 0.1.27 (#28221).
- Transformers: Updated to 4.57.3 (#29418); preparation for v5 with rope_parameters (#28542).
- XPU: torch & IPEX 2.9 upgrade (#29307).
V0 Deprecation & Breaking Changes
Removed Parameters:
Deprecated:
Scheduled Removals (will be removed in a future release):
- ParallelConfig's direct child EPLB fields (#29324)
- guided_* config fields (#29326); see the migration sketch after this section
- override_pooler_config and disable_log_requests (#29402)
- CompilationConfig.use_inductor (#29323)
- Deprecated metrics (#29330)
Other Breaking Changes:
- PyTorch 2.9.0 upgrade requires a CUDA 12.9 environment
- Mistral format auto-detection for model loading (#28659)
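For the guided_* removals above, a hedged migration sketch toward the structured-outputs API; the StructuredOutputsParams import follows current vLLM docs but should be verified against your installed version:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams  # path per current docs

# Replaces the deprecated guided_json-style fields scheduled for removal (#29326).
schema = {"type": "object", "properties": {"answer": {"type": "string"}}}
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(
    max_tokens=64,
    structured_outputs=StructuredOutputsParams(json=schema),
)
print(llm.generate(["Reply in JSON: what is 2+2?"], params)[0].outputs[0].text)
```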
New Contributors
- @jesse996 made their first contribution in #28846
- @Nepherpitou made their first contribution in #28960
- @Samoed made their first contribution in #27329
- @j20120307 made their first contribution in #28999
- @vnadathur made their first contribution in #26468
- @zhyajie made their first contribution in #28942
- @IzzyPutterman made their first contribution in #28896
- @rjrock-amd made their first contribution in #28905
- @zq1997 made their first contribution in #27715
- @shengliangxu made their first contribution in #28076
- @prashanth058 made their first contribution in #28972
- @qgallouedec made their first contribution in #28820
- @zhanggzh made their first contribution in #19347
- @pandalee99 made their first contribution in #26628
- @dsuhinin made their first contribution in #29100
- @xli made their first contribution in #29124
- @jeremyteboul made their first contribution in #29059
- @soodoshll made their first contribution in #28875
- @bhagyashrigai made their first contribution in #28957
- @skaraban3807 made their first contribution in #25559
- @Victor49152 made their first contribution in #28892
- @rjrock made their first contribution in #29205
- @FlintyLemming made their first contribution in #29182
- @madskildegaard made their first contribution in #29175
- @nandan2003 made their first contribution in #29189
- @michaelact made their first contribution in #29173
- @yongming-qin made their first contribution in #28958
- @joshiemoore made their first contribution in #29249
- @lim4349 made their first contribution in #29068
- @apinge made their first contribution in #28376
- @gbyu-amd made their first contribution in #28032
- @kflu made their first contribution in #29364
- @Inokinoki made their first contribution in #29200
- @GOavi101 made their first contribution in #29313
- @sts07142 made their first contribution in #29137
- @ivanium made their first contribution in #29143
- @geodavic made their first contribution in #28795
- @Yejing-Lai made their first contribution in #29473
- @Adityayxt made their first contribution in #29491
- @guodongxiaren made their first contribution in #29620
- @askliar made their first contribution in #29426
- @scydas made their first contribution in #29589
- @EanWang211123 made their first contribution in #29594
- @qGentry made their first contribution in #29506
- @HappyAmazonian made their first contribution in #29335
- @rgommers made their first contribution in #29241
- @staugust made their first contribution in #28840
- @mertunsall made their first contribution in #29667
- @dublc made their first contribution in #29728
- @nwaughachukwuma made their first contribution in #29671
- @BowTen made their first contribution in #29731
- @omera-nv made their first contribution in #29004
- @zhangruoxu made their first contribution in #29568
- @KKKZOZ made their first contribution in #29783
- @FredericOdermatt made their first contribution in #29784
- @Abdennacer-Badaoui made their first contribution in #29782
- @knlnguyen1802 made their first contribution in #28525
- @finbarrtimbers made their first contribution in #29796
- @hholtmann made their first contribution in #29711
Full Changelog: v0.11.1...v0.12.0