vLLM v0.12.0 Release Notes
Highlights
This release features 474 commits from 213 contributors (57 new)!
Breaking Changes: This release includes the PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including the xformers backend, and scheduled removals; please review the changelog carefully.
Major Features:
- EAGLE Speculative Decoding Improvements: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594); see the usage sketch after this list.
- Significant Performance Optimizations: 18.1% throughput improvement from batch invariant BMM (#29345), 2.2% throughput improvement from shared experts overlap (#28879).
- AMD ROCm Expansion: DeepSeek v3.2 + SparseMLA support (#26670), FP8 MLA decode (#28032), AITER attention backend (#28701).
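A minimal Python sketch of enabling the EAGLE path via the speculative_config argument; the target and draft checkpoint names are example choices, and field names should be verified against your installed version:

```python
from vllm import LLM, SamplingParams

# Sketch only: the checkpoint names below are example choices, not taken from
# this release; speculative_config is vLLM's standard way to enable EAGLE.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",           # target model
    speculative_config={
        "method": "eagle",                              # EAGLE-style drafting
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # draft model
        "num_speculative_tokens": 3,                    # drafted tokens per step
    },
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```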
Model Support
- New model families: PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).
- Format support: Gemma3 GGUF multimodal support (#27772).
- Multimodal enhancements: Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).
- Performance: QwenVL cos/sin cache optimization (#28798).
Engine Core
- GPU Model Runner V2 (Experimental) (#25266): Complete refactoring of the model execution pipeline:
  - No "reordering" or complex bookkeeping now that the persistent batch is removed
  - GPU-persistent block tables for better scalability with max_model_len and num_kv_groups
  - Triton-native sampler: no -1 temperature hack, efficient per-request seeds, memory-efficient prompt logprobs
  - Simplified DP and CUDA graph implementations
  - Efficient structured outputs support
- Prefill Context Parallel (PCP) (Preparatory) (#28718): Partitions the sequence dimension during prefill for improved long-sequence inference. Complements the existing Decode Context Parallel (DCP). See RFC #25749 for details.
- RLHF Support: Pause and resume generation for asynchronous RL training (#28037); see the sketch after this list.
- KV Cache Enhancements: Cross-layer KV blocks support (#27743), KV cache residency metrics (#27793).
- Audio support: Audio embeddings support in chat completions (#29059).
- Speculative Decoding: EAGLE multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594).
- Configuration: Flexible inputs_embeds_size separate from hidden_size (#29741), --fully-sharded-loras for fused_moe (#28761).
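For the RLHF item above, a hedged sketch of an asynchronous RL loop. The pause_generation()/resume_generation() method names are assumptions inferred from the PR title (#28037), and the AsyncLLM import path may differ by version; check the PR for the actual interface:

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.v1.engine.async_llm import AsyncLLM  # import path may differ by version


async def main() -> None:
    engine = AsyncLLM.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    params = SamplingParams(max_tokens=16)

    # Collect a rollout for the RL trainer.
    async for out in engine.generate("Rollout prompt", params, request_id="r0"):
        pass  # consume streamed tokens

    # Hypothetical API (names inferred from the PR title, #28037):
    # hold generation while weights are updated, then continue serving.
    await engine.pause_generation()
    # ... apply the trainer's weight update here ...
    await engine.resume_generation()


asyncio.run(main())
```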
Hardware & Performance
- NVIDIA Performance:
  - Batch invariant BMM optimization: 18.1% throughput improvement, 10.7% TTFT improvement on DeepSeek-V3.1 (#29345)
  - Shared Experts Overlap with FlashInfer DeepGEMM: 2.2% throughput improvement, 3.6% TTFT improvement at batch size 32 (#28879)
  - DeepGEMM N dim restriction reduced from 128 to 64 multiplier (#28687)
  - DeepEP low-latency with round-robin expert placement (#28449)
  - NVFP4 MoE CUTLASS support for SM120 (#29242)
  - H200 Fused MoE Config improvements (#28992)
- AMD ROCm:
  - DeepSeek v3.2 and SparseMLA support (#26670)
  - FP8 MLA decode support (#28032)
  - AITER sampling ops integration (#26084)
  - AITER triton attention backend (#28701)
  - Bitsandbytes quantization on AMD GPUs with warp size 32 (#27307)
  - Fastsafetensors support (#28225)
  - Sliding window support for AiterFlashAttentionBackend (#29234)
  - Whisper v1 with Aiter Unified/Flash Attention (#28376)
- CPU:
  - Attention: FlashAttention ViT support, now the default backend (#28763).
  - Long Context: Optimized gather_and_maybe_dequant_cache kernel for extremely long sequences (#28029).
  - Multi-NUMA: Enhanced NUMA functionality for systems with multiple NUMA nodes per socket (#25559).
- Docker: Image size reduced by ~200MB (#29060).
Quantization
- W4A8: Marlin kernel support (#24722).
- NVFP4:
- AWQ: Compressed-tensors AWQ support for Turing GPUs (#29732).
- LoRA: FusedMoE LoRA Triton kernel for MXFP4 (#29708).
- Online quantization: Moved to model.load_weights (#26327); see the sketch after this list.
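The load_weights refactor above is internal; the user-facing way to request online quantization is unchanged. A minimal sketch (the model name is an example choice, and fp8 is one supported online scheme):

```python
from vllm import LLM

# On-the-fly FP8 quantization of an unquantized checkpoint; the quantization
# now happens inside model.load_weights per #26327, but the call looks the same.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
print(llm.generate("Hello")[0].outputs[0].text)
```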
API & Frontend
- Responses API:
- Tool Calling:
- Whisper: verbose_json and timestamp features for transcription/translation (#24209).
- Sampling: Flat logprob control moved from an env var to SamplingParams (#28914).
- GGUF: Improved HuggingFace loading UX with repo_id:quant_type syntax (#29137); see the sketch after this list.
- Profiling: Iteration-level profiling for the Torch and CUDA profilers (#28987).
- Logs: Colorized log output (#29017).
- Optimization Levels: -O0, -O1, -O2, -O3 allow trading startup time for performance; more compilation flags will be added in future releases (#26847).
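A minimal sketch of the new GGUF loading syntax described above: pass "repo_id:quant_type" as the model name and vLLM resolves the matching .gguf file from the Hugging Face repo. The repo and quant tag below are example choices:

```python
from vllm import LLM

# "repo_id:quant_type" per #29137; if tokenizer resolution fails for a GGUF
# repo, passing tokenizer=<base model repo> may still be needed (assumption).
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct-GGUF:Q4_K_M")
print(llm.generate("Hello")[0].outputs[0].text)
```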
Dependencies
- PyTorch 2.9.0 with CUDA 12.9 (#24994) - Breaking change requiring environment updates.
- xgrammar: Updated to 0.1.27 (#28221).
- Transformers: Updated to 4.57.3 (#29418); preparation for v5 with rope_parameters (#28542).
- XPU: torch & IPEX 2.9 upgrade (#29307).
V0 Deprecation & Breaking Changes
Removed Parameters:
Deprecated:
Scheduled Removals (will be removed in a future release):
- ParallelConfig's direct child EPLB fields (#29324)
- guided_* config fields (#29326); see the migration sketch after this section
- override_pooler_config and disable_log_requests (#29402)
- CompilationConfig.use_inductor (#29323)
- Deprecated metrics (#29330)
Other Breaking Changes:
- PyTorch 2.9.0 upgrade requires a CUDA 12.9 environment
- Mistral format auto-detection for model loading (#28659)
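For the guided_* removals above, a hedged migration sketch toward the structured-outputs API; the StructuredOutputsParams import follows current vLLM docs but should be verified against your installed version:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams  # path per current docs

# Replaces the deprecated guided_json-style fields scheduled for removal (#29326).
schema = {"type": "object", "properties": {"answer": {"type": "string"}}}
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(
    max_tokens=64,
    structured_outputs=StructuredOutputsParams(json=schema),
)
print(llm.generate(["Reply in JSON: what is 2+2?"], params)[0].outputs[0].text)
```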
New Contributors
- @jesse996 made their first contribution in #28846
- @Nepherpitou made their first contribution in #28960
- @Samoed made their first contribution in #27329
- @j20120307 made their first contribution in #28999
- @vnadathur made their first contribution in #26468
- @zhyajie made their first contribution in #28942
- @IzzyPutterman made their first contribution in #28896
- @rjrock-amd made their first contribution in #28905
- @zq1997 made their first contribution in #27715
- @shengliangxu made their first contribution in #28076
- @prashanth058 made their first contribution in #28972
- @qgallouedec made their first contribution in #28820
- @zhanggzh made their first contribution in #19347
- @pandalee99 made their first contribution in #26628
- @dsuhinin made their first contribution in #29100
- @xli made their first contribution in #29124
- @jeremyteboul made their first contribution in #29059
- @soodoshll made their first contribution in #28875
- @bhagyashrigai made their first contribution in #28957
- @skaraban3807 made their first contribution in #25559
- @Victor49152 made their first contribution in #28892
- @rjrock made their first contribution in #29205
- @FlintyLemming made their first contribution in #29182
- @madskildegaard made their first contribution in #29175
- @nandan2003 made their first contribution in #29189
- @michaelact made their first contribution in #29173
- @yongming-qin made their first contribution in #28958
- @joshiemoore made their first contribution in #29249
- @lim4349 made their first contribution in #29068
- @apinge made their first contribution in #28376
- @gbyu-amd made their first contribution in #28032
- @kflu made their first contribution in #29364
- @Inokinoki made their first contribution in #29200
- @GOavi101 made their first contribution in #29313
- @sts07142 made their first contribution in #29137
- @ivanium made their first contribution in #29143
- @geodavic made their first contribution in #28795
- @Yejing-Lai made their first contribution in #29473
- @Adityayxt made their first contribution in #29491
- @guodongxiaren made their first contribution in #29620
- @askliar made their first contribution in #29426
- @scydas made their first contribution in #29589
- @EanWang211123 made their first contribution in #29594
- @qGentry made their first contribution in #29506
- @HappyAmazonian made their first contribution in #29335
- @rgommers made their first contribution in #29241
- @staugust made their first contribution in #28840
- @mertunsall made their first contribution in #29667
- @dublc made their first contribution in #29728
- @nwaughachukwuma made their first contribution in #29671
- @BowTen made their first contribution in #29731
- @omera-nv made their first contribution in #29004
- @zhangruoxu made their first contribution in #29568
- @KKKZOZ made their first contribution in #29783
- @FredericOdermatt made their first contribution in #29784
- @Abdennacer-Badaoui made their first contribution in #29782
- @knlnguyen1802 made their first contribution in #28525
- @finbarrtimbers made their first contribution in #29796
- @hholtmann made their first contribution in #29711
Full Changelog: v0.11.1...v0.12.0