One-command installation of vLLM for NVIDIA DGX Spark systems with GB10 GPUs (Blackwell architecture, sm_121).
This repository provides a ready-to-run setup script, tested on DGX Spark, that handles all the complexities of building vLLM on the DGX Spark platform, including:
- CUDA 13.0 support with Blackwell-specific optimizations
- Critical fixes for SM100/SM120 MOE kernel compilation
- Triton 3.5.0 from main branch (required for sm_121a support)
- PyTorch 2.9.0 with CUDA 13.0 bindings
- All necessary build fixes and workarounds
One-command installation - installs to ./vllm-install in your current directory:
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash

Or specify a custom directory:
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash -s -- --install-dir ~/my/custom/path

Installation time: ~20-30 minutes (mostly compilation)

Alternatively, clone the repository and run the installer locally:
git clone https://github.com/eelbaz/dgx-spark-vllm-setup.git
cd dgx-spark-vllm-setup
./install.sh

Usage:

./install.sh [OPTIONS]
Options:
--install-dir DIR Installation directory (default: ./vllm-install)
--vllm-version TAG vLLM git tag/branch (default: v0.11.1rc3)
--python-version VER Python version (default: 3.12)
--skip-tests Skip post-installation tests
--help Show help message

Requirements:

- Hardware: NVIDIA DGX Spark with GB10 GPU (Blackwell sm_121)
- OS: Ubuntu 22.04+ (tested on Linux 6.11.0 ARM64)
- CUDA: 13.0 or later (driver 580.95.05+)
- Disk Space: ~50GB free
- RAM: 8GB+ recommended during build
Installed to ./vllm-install (or your custom directory):
- Python 3.12 virtual environment at .vllm/
- PyTorch 2.9.0+cu130 with full CUDA 13.0 support
- Triton 3.5.0+git from main branch (pre-release with Blackwell support)
- vLLM 0.11.1rc3+ with all Blackwell-specific patches
- Helper scripts for managing the vLLM server
- Environment activation script (vllm_env.sh)
All examples assume you're in the installation directory (default: ./vllm-install).
cd vllm-install
source vllm_env.sh

./vllm-serve.sh # Default: Qwen2.5-0.5B on port 8000
./vllm-serve.sh "facebook/opt-125m" 8001 # Custom model and port./vllm-status.sh./vllm-stop.sh# List models
curl http://localhost:8000/v1/models
# Generate completion
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Hello, how are you?",
"max_tokens": 50
}'

For offline inference from Python:

from vllm import LLM, SamplingParams
llm = LLM(
model="Qwen/Qwen2.5-0.5B-Instruct",
trust_remote_code=True,
gpu_memory_utilization=0.9
)
prompts = ["Tell me about DGX Spark"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)

This installer automatically applies the following critical fixes:
Issue: vLLM's MOE kernels for SM100/SM120 Blackwell architectures were incomplete
Fix: Added 12.0f and 12.1a to SCALED_MM_ARCHS in CMakeLists.txt
# CUDA 13.0+ path (line ~671)
# Before
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
# After
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
# Older CUDA path (line ~673)
# Before
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
# After
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;12.1a" "${CUDA_ARCHS}")Issue: Newer setuptools requires structured license format Fix: Convert license string to dict format in both vLLM and flashinfer-python
# Before
license = "Apache-2.0"
license-files = ["LICENSE"]
# After
license = {text = "Apache-2.0"}Applied to:
- vLLM's pyproject.toml
- flashinfer-python's pyproject.toml (patched during build)
Issue: vLLM's GPT-OSS MOE kernel implementation uses the deprecated Triton routing API
Fix: Update to the new Triton kernel API (topk and SparseMatrix)
Changes:
- Replace deprecated routing() with triton_topk()
- Replace deprecated routing_from_bitmatrix() with SparseMatrix()
- Add support for GatherIndx, ScatterIndx, and new ragged tensor metadata
Enables support for:
- Qwen3 models with MOE architecture
- gpt-oss models using Triton kernels
- Latest Triton kernel optimizations for Blackwell
Issue: The official Triton 3.5.0 release has bugs with sm_121a
Fix: Build Triton from the main branch with the latest Blackwell fixes
The installer sets these critical environment variables:
TORCH_CUDA_ARCH_LIST=12.1a # Blackwell sm_121
VLLM_USE_FLASHINFER_MXFP4_MOE=1 # Enable FlashInfer MOE optimization
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas # CUDA PTX assembler
TIKTOKEN_CACHE_DIR=$INSTALL_DIR/.tiktoken_cache # Cache tiktoken encodings locally

To set up a multi-node vLLM cluster:
- Run this installer on all nodes
- Follow CLUSTER.md for configuration
This is a known Triton editable-mode build issue. The installer works around this by:
- Building Triton in non-editable mode
- Or copying pre-built Triton from another node
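For reference, a non-editable build boils down to running pip directly against the Triton checkout (this mirrors step 4 of the manual installation below):

cd triton
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas python -m pip install --no-build-isolation .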
Symptom: ImportError: undefined symbol: _Z20cutlass_moe_mm_sm100
Solution: Ensure CMakeLists.txt fix is applied (done automatically by installer)
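A quick way to confirm the patch is present (illustrative; assumes the default layout with the vLLM checkout under the install directory):

grep -n "12.0f\|12.1a" vllm-install/vllm/CMakeLists.txt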
Symptom: Warning about GPU capability 12.1 vs PyTorch max 12.0
Status: Harmless warning - PyTorch 2.9.0+cu130 works correctly with GB10
Solution:
source vllm-install/vllm_env.sh
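# Optional extra check (not part of the original instructions): confirm the GPU is visible
# and reports compute capability (12, 1), i.e. sm_121
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_capability())"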
python -c "import vllm; print(vllm.__version__)"vllm-install/
├── .vllm/ # Python virtual environment
├── vllm/ # vLLM source (editable install)
├── triton/ # Triton source
├── vllm_env.sh # Environment activation script
├── vllm-serve.sh # Start server
├── vllm-stop.sh # Stop server
├── vllm-status.sh # Check status
└── vllm-server.log # Server logs
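For example, the server log can be followed from inside the installation directory:

tail -f vllm-server.log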
If you prefer to understand each step:
# 1. Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
# 2. Create installation directory and Python virtual environment
mkdir -p vllm-install && cd vllm-install
uv venv .vllm --python 3.12
source .vllm/bin/activate
# 3. Install PyTorch with CUDA 13.0
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
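# (Optional check, not in the original script) Confirm the CUDA 13.0 build of PyTorch imports cleanly
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"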
# 4. Clone and build Triton from main
git clone https://github.com/triton-lang/triton.git
cd triton
uv pip install pip cmake ninja pybind11
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas python -m pip install --no-build-isolation .
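# (Optional check, not in the original script) Confirm the freshly built Triton imports
python -c "import triton; print(triton.__version__)"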
# 5. Install additional dependencies
uv pip install xgrammar setuptools-scm apache-tvm-ffi==0.1.0b15 --prerelease=allow
# 6. Clone vLLM
cd ..
git clone --recursive https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.11.1rc3
# 7. Apply fixes (see scripts/apply-fixes.sh)
# 8. Build vLLM (see install.sh for full process)

Tested versions:

- vLLM: 0.11.1rc4.dev6+g66a168a19.d20251026
- PyTorch: 2.9.0+cu130
- Triton: 3.5.0+git4caa0328
- CUDA: 13.0
- Python: 3.12.3
- Target Architecture: sm_121 (Blackwell GB10)
Issues and pull requests welcome! This installer is maintained by the DGX Spark community.
MIT License - See LICENSE
Developed and tested on NVIDIA DGX Spark systems. Special thanks to the vLLM and Triton communities.