Run 70B+ models on hardware that shouldn't be able to.
Rig is a distributed inference framework that splits large language models across multiple machines using pipeline parallelism: each machine holds a contiguous slice of the model's layers and forwards activations to the next over the network.
Got a MacBook, an old desktop with a GPU, and a work laptop? None of them can run Llama 70B alone, but together they can. Rig coordinates them into a single inference endpoint over your regular WiFi or LAN.
1. Build (pick your platform):
```bash
# Apple Silicon
cargo build --release -p rig-cli --features metal

# NVIDIA
cargo build --release -p rig-cli --features cuda

# CPU only
cargo build --release -p rig-cli
```

Optionally, alias for convenience: `alias rig=./target/release/rig`
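To confirm the build worked, check that the binary landed in cargo's default release output path. The top-level `--help` below is an assumption, mirroring the `demo --help` shown later:

```bash
# Cargo places release binaries under target/release by default.
ls -lh ./target/release/rig

# Assumed: a top-level --help, like the `rig demo --help` shown below.
./target/release/rig --help
```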
2. Download a model:
```bash
hf download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir models/tiny-llama
```

3. Try it:
```bash
./target/release/rig demo --model models/tiny-llama
```

That's it! This starts a local 2-worker cluster and opens an interactive chat.
Run `./target/release/rig demo --help` for options (workers, temperature, system prompt, etc.).
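The quick start uses TinyLlama so the demo stays small; downloading a larger model follows the same pattern. The repo ID below is only illustrative (Meta's Llama repos are gated on Hugging Face, so request access first):

```bash
# Illustrative only: substitute whichever large model you plan to split across machines.
hf download meta-llama/Llama-3.1-70B-Instruct --local-dir models/llama-70b
```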
To run across multiple machines, they must be able to reach each other (same WiFi, LAN, VPN, etc.).
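A quick way to sanity-check reachability before starting anything, using standard OS tools (nothing Rig-specific; `en0` is the usual macOS WiFi interface, but yours may differ):

```bash
# Find each machine's address on the shared network.
ipconfig getifaddr en0   # macOS (adjust the interface if needed)
hostname -I              # Linux

# From one machine, confirm the other answers.
ping -c 3 <other-machine-ip>
```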
Machine 1 — Start coordinator:
```bash
./scripts/wifi-cluster/cluster.sh coordinator
```

Machine 1 — Start worker:
```bash
MODEL_PATH=models/tiny-llama MODEL_NAME=tiny-llama ./scripts/wifi-cluster/cluster.sh worker
```

Machine 2 — Start worker (use the IP shown by the coordinator):
```bash
MODEL_PATH=models/tiny-llama MODEL_NAME=tiny-llama ./scripts/wifi-cluster/cluster.sh worker <coordinator-ip>
```

Create pipeline:
```bash
./scripts/wifi-cluster/cluster.sh pipeline
```

Generate:
```bash
./scripts/wifi-cluster/cluster.sh generate --chat "Explain quantum computing"
```
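If you restart workers often, a small wrapper saves retyping the environment variables. This is just a convenience sketch built from the commands above; the filename and defaults are arbitrary:

```bash
#!/usr/bin/env bash
# start-worker.sh — launch a Rig worker with the demo defaults.
# Usage: ./start-worker.sh                    (on the coordinator machine)
#        ./start-worker.sh <coordinator-ip>   (on any other machine)
set -euo pipefail

export MODEL_PATH="${MODEL_PATH:-models/tiny-llama}"
export MODEL_NAME="${MODEL_NAME:-tiny-llama}"

# Pass the coordinator IP through only if one was given.
./scripts/wifi-cluster/cluster.sh worker "$@"
```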
Requirements:

- Rust 1.85+ (`rustup update`)
- For model downloads: Hugging Face CLI
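A quick check that both requirements are satisfied (the `hf` command ships with the `huggingface_hub` Python package):

```bash
# Rust toolchain: Rig needs 1.85 or newer.
rustc --version
rustup update        # upgrade if the version is older

# Hugging Face CLI: used only for downloading models.
hf --help            # installed via: pip install -U huggingface_hub
```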
Under active development. Tested on Apple Silicon; CUDA should work but is untested.