vLLM Setup for NVIDIA DGX Spark (Blackwell GB10)

One-command installation of vLLM for NVIDIA DGX Spark systems with GB10 GPUs (Blackwell architecture, sm_121).

This repository provides a setup script, tested on DGX Spark hardware, that handles the complexities of building vLLM on this platform, including:

  • CUDA 13.0 support with Blackwell-specific optimizations
  • Critical fixes for SM100/SM120 MOE kernel compilation
  • Triton 3.5.0 from main branch (required for sm_121a support)
  • PyTorch 2.9.0 with CUDA 13.0 bindings
  • All necessary build fixes and workarounds

Quick Start

One-command installation - installs to ./vllm-install in your current directory:

curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash

Or specify a custom directory:

curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash -s -- --install-dir ~/my/custom/path

Installation time: ~20-30 minutes (mostly compilation)

Alternative: Clone and Install

git clone https://github.com/eelbaz/dgx-spark-vllm-setup.git
cd dgx-spark-vllm-setup
./install.sh

Installation Options

./install.sh [OPTIONS]

Options:
  --install-dir DIR    Installation directory (default: ./vllm-install)
  --vllm-version TAG   vLLM git tag/branch (default: v0.11.1rc3)
  --python-version VER Python version (default: 3.12)
  --skip-tests         Skip post-installation tests
  --help               Show help message

System Requirements

  • Hardware: NVIDIA DGX Spark with GB10 GPU (Blackwell sm_121)
  • OS: Ubuntu 22.04+ (tested on Linux 6.11.0 ARM64)
  • CUDA: 13.0 or later (driver 580.95.05+)
  • Disk Space: ~50GB free
  • RAM: 8GB+ recommended during build
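
A quick pre-flight check along these lines can confirm the basics before running the installer (a sketch using standard tools; adjust to your setup):

# GPU name and driver version (expect a GB10 / Blackwell device, driver 580.95.05+)
nvidia-smi --query-gpu=name,driver_version --format=csv

# CUDA toolkit version (expect 13.0 or later)
nvcc --version

# Free disk space (~50GB needed) and memory
df -h .
free -h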

What Gets Installed

Installed to ./vllm-install (or your custom directory):

  • Python 3.12 virtual environment at .vllm/
  • PyTorch 2.9.0+cu130 with full CUDA 13.0 support
  • Triton 3.5.0+git from main branch (pre-release with Blackwell support)
  • vLLM 0.11.1rc3+ with all Blackwell-specific patches
  • Helper scripts for managing vLLM server
  • Environment activation script (vllm_env.sh)

Usage

All examples assume you're in the installation directory (default: ./vllm-install).

Activate Environment

cd vllm-install
source vllm_env.sh

Start vLLM Server

./vllm-serve.sh                                    # Default: Qwen2.5-0.5B on port 8000
./vllm-serve.sh "facebook/opt-125m" 8001          # Custom model and port

Check Server Status

./vllm-status.sh

Stop Server

./vllm-stop.sh

Test API

# List models
curl http://localhost:8000/v1/models

# Generate completion
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'
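
The OpenAI-compatible chat endpoint is also available for instruct/chat models; a sketch against the same default model:

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 50
  }'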

Python API

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)

prompts = ["Tell me about DGX Spark"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

print(outputs[0].outputs[0].text)

Critical Fixes Applied

This installer automatically applies the following critical fixes:

1. CMakeLists.txt SM100/SM120 MOE Kernel Fix

Issue: vLLM's MOE kernels for SM100/SM120 Blackwell architectures were incomplete.
Fix: Added 12.0f and 12.1a to SCALED_MM_ARCHS in CMakeLists.txt.

# CUDA 13.0+ path (line ~671)
# Before
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
# After
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")

# Older CUDA path (line ~673)
# Before
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
# After
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;12.1a" "${CUDA_ARCHS}")

2. pyproject.toml License Field Format

Issue: Newer setuptools requires a structured license format.
Fix: Convert the license string to dict format in both vLLM and flashinfer-python.

# Before
license = "Apache-2.0"
license-files = ["LICENSE"]

# After
license = {text = "Apache-2.0"}

Applied to:

  • vLLM's pyproject.toml
  • flashinfer-python's pyproject.toml (patched during build)
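
The equivalent manual edit is a one-line substitution in each pyproject.toml (a sketch; the installer patches both files automatically):

sed -i 's/^license = "Apache-2.0"$/license = {text = "Apache-2.0"}/' pyproject.toml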

3. GPT-OSS Triton MOE Kernels for Qwen3/gpt-oss Support

Issue: vLLM's GPT-OSS MOE kernel implementation uses the deprecated Triton routing API.
Fix: Update to the new Triton kernel API (topk and SparseMatrix).

Changes:

  • Replace deprecated routing() with triton_topk()
  • Replace deprecated routing_from_bitmatrix() with SparseMatrix()
  • Add support for GatherIndx, ScatterIndx, and new ragged tensor metadata

Enables support for:

  • Qwen3 models with MOE architecture
  • gpt-oss models using Triton kernels
  • Latest Triton kernel optimizations for Blackwell

4. Triton Main Branch Requirement

Issue: The official Triton 3.5.0 release has bugs with sm_121a.
Fix: Build Triton from the main branch with the latest Blackwell fixes.

Architecture-Specific Configuration

The installer sets these critical environment variables:

TORCH_CUDA_ARCH_LIST=12.1a                      # Blackwell sm_121
VLLM_USE_FLASHINFER_MXFP4_MOE=1                 # Enable FlashInfer MOE optimization
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas     # CUDA PTX assembler
TIKTOKEN_CACHE_DIR=$INSTALL_DIR/.tiktoken_cache # Cache tiktoken encodings locally
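
The installer sets these for you; if you launch vLLM outside the helper scripts, export the same values manually first (a sketch; the tiktoken cache path is an illustrative location):

export TORCH_CUDA_ARCH_LIST=12.1a
export VLLM_USE_FLASHINFER_MXFP4_MOE=1
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export TIKTOKEN_CACHE_DIR="$PWD/.tiktoken_cache"   # illustrative location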

Cluster Mode Setup

To set up a multi-node vLLM cluster:

  1. Run this installer on all nodes
  2. Follow CLUSTER.md for configuration

Troubleshooting

Build Fails with "TypeError: can only concatenate str (not 'NoneType') to str"

This is a known Triton editable-mode build issue. The installer works around this by:

  • Building Triton in non-editable mode
  • Or copying pre-built Triton from another node

Symbol Error: cutlass_moe_mm_sm100

Symptom: ImportError: undefined symbol: _Z20cutlass_moe_mm_sm100
Solution: Ensure the CMakeLists.txt fix is applied (done automatically by the installer)

PyTorch CUDA Capability Warning

Symptom: Warning about GPU capability 12.1 vs PyTorch max 12.0
Status: Harmless warning - PyTorch 2.9.0+cu130 works correctly with GB10
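
To see what PyTorch detects, a quick check (with the environment activated):

python -c "import torch; print(torch.cuda.get_device_capability())"   # expected: (12, 1) on GB10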

ImportError: No module named 'vllm'

Solution:

source vllm-install/vllm_env.sh
python -c "import vllm; print(vllm.__version__)"

File Structure

vllm-install/
├── .vllm/                  # Python virtual environment
├── vllm/                   # vLLM source (editable install)
├── triton/                 # Triton source
├── vllm_env.sh            # Environment activation script
├── vllm-serve.sh          # Start server
├── vllm-stop.sh           # Stop server
├── vllm-status.sh         # Check status
└── vllm-server.log        # Server logs

Manual Installation

If you prefer to understand each step:

# 1. Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# 2. Create installation directory and Python virtual environment
mkdir -p vllm-install && cd vllm-install
uv venv .vllm --python 3.12
source .vllm/bin/activate

# 3. Install PyTorch with CUDA 13.0
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

# 4. Clone and build Triton from main
git clone https://github.com/triton-lang/triton.git
cd triton
uv pip install pip cmake ninja pybind11
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas python -m pip install --no-build-isolation .

# 5. Install additional dependencies
uv pip install xgrammar setuptools-scm apache-tvm-ffi==0.1.0b15 --prerelease=allow

# 6. Clone vLLM
cd ..
git clone --recursive https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.11.1rc3

# 7. Apply fixes (see scripts/apply-fixes.sh)
# 8. Build vLLM (see install.sh for full process)
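
# 9. Optional sanity check (a quick sketch, not part of install.sh)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import vllm; print(vllm.__version__)"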

Version Information

  • vLLM: 0.11.1rc4.dev6+g66a168a19.d20251026
  • PyTorch: 2.9.0+cu130
  • Triton: 3.5.0+git4caa0328
  • CUDA: 13.0
  • Python: 3.12.3
  • Target Architecture: sm_121 (Blackwell GB10)

Contributing

Issues and pull requests welcome! This installer is maintained by the DGX Spark community.

License

MIT License - See LICENSE

Acknowledgments

Developed and tested on NVIDIA DGX Spark systems. Special thanks to the vLLM and Triton communities.
