A production-ready demonstration of deploying the Mistral 7B Instruct LLM using dstack across multiple cloud providers (RunPod and VastAI). This project showcases best practices for LLM hosting with minimal dependencies and maximum flexibility.
By the Toffee AI team.
This project demonstrates how to:
- Deploy a production-grade LLM (Mistral-7B) using dstack
- Support multiple cloud backends (RunPod, VastAI) with a single configuration
- Configure and optimize SGLang for efficient inference
- Automate deployments using environment variables and templates
Contents:

- Tested Environment
- Prerequisites
- Quick Start
- Configuration
- Project Structure
- Troubleshooting
- Advanced Usage
- Hands-On Exercises
- Learn More
- License
- Contributing
This project was initially developed and tested with:
- OS: Ubuntu 22.04 LTS
- Shell: Bash 5.1+
- Python: 3.11+
- Docker: 24.0+ (for local testing)
While the project should work on other Unix-like systems (macOS, other Linux distributions), the scripts and configurations have been validated on the environment above.
Install the dstack CLI following the official installation guide:
```bash
uv tool install 'dstack[all]' -U
```

You need a running dstack server with configured cloud backends. This demo supports both RunPod and VastAI.
- Copy the example server configuration:

```bash
mkdir -p ~/.dstack/server

# Back up the existing config if present:
cp ~/.dstack/server/config.yml ~/.dstack/server/backup.config.yml || true

# If there is no existing config, copy the example:
cp server/example.config.yaml ~/.dstack/server/config.yml

# Otherwise, just extend your existing config with the new project settings from server/example.config.yaml
```
- Edit `~/.dstack/server/config.yml` and replace the placeholder API keys:
  - RunPod API Key: get it from RunPod Settings
  - VastAI API Key: get it from VastAI Account
- Start the dstack server:

```bash
dstack server
```
Make sure the server config was applied:

```
[...] INFO Applying ~/.dstack/server/config.yml...
```

The server will start on http://127.0.0.1:3000.
For live deployments, consider hosting dstack server on a cloud platform (AWS ECS, GCP, etc.). See the dstack documentation for details.
Configure your dstack CLI to connect to your server:
```bash
dstack config \
    --project mistral-7b \
    --url http://127.0.0.1:3000 \
    --token <your_dstack_admin_token_that_you_set_in_server>
```

Confirm:

```
Set 'mistral-7b' as your default project? [y/n]: n
```

Note: if you elect not to set a default project, remember to always pass --project mistral-7b to dstack commands; we do that below for reproducibility. Alternatively, set it as the default and you will never need to specify the project again, but then you can only work with one project at a time.
The service configuration uses environment variables for all settings.
Start by copying the example environment file:
```bash
cp example.env .env
```

Edit .env to set the required environment variables and customize your deployment. See example.env for all available configuration options.
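For example, a minimal .env might look like the following; the values are illustrative, and the variable names are the ones documented in the Configuration tables below:

```bash
# Minimal illustrative .env (see example.env for the full list of options)
SERVICE_NAME=mistral-7b    # Name of the deployed service
PORT=8080                  # Server port
GPU=RTX4090:1              # GPU type and count
SPOT_POLICY=on-demand      # Instance type (spot/on-demand)
HF_TOKEN=your_token        # HuggingFace token, needed for gated models
```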
Now that you are done with the prerequisites, you can deploy the Mistral-7B service from your local machine.
The service configuration template (services/mistral-7b/dstack/template.service.yaml) uses environment variable
substitution. Use the provided script to render a deployable configuration:
```bash
# Navigate to the service directory
cd ./services/mistral-7b

# Render the configuration using your .env file
./scripts/render-config.bash --env-file ../../.env --output ./dstack/service.yaml
```

🚨 COST WARNING: GPU instances, especially high-end GPUs, can be expensive! Always stop your services when not in use to avoid unexpected charges. Hourly costs add up quickly if a service is left running.
Deploy the service using the configuration you have just rendered:
```bash
# Export all environment variables from .env; dstack needs them to populate the `env:` section:
set -a; source ../../.env; set +a

dstack apply -f ./dstack/service.yaml --project mistral-7b
```

After you confirm the plan, dstack will:
- Find available GPU capacity on RunPod or VastAI
- Provision a container with the specified resources
- Install dependencies (uv, SGLang, system packages)
- Download and load the Mistral-7B model
- Start the inference server
Monitor the deployment progress:

```bash
dstack ps --project mistral-7b --watch
```

Once deployed, you can try asking the LLM a question using the standard SGLang completions endpoint:
```bash
TOKEN=<your_dstack_admin_token_that_you_set_in_server>

curl -X POST http://127.0.0.1:3000/proxy/services/mistral-7b/mistral-7b/v1/completions \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Explain quantum computing in simple terms:",
    "max_tokens": 100
  }'
```
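SGLang also exposes an OpenAI-compatible chat completions endpoint. The sketch below assumes the same proxy path and TOKEN as above, with only the final path segment changed to /v1/chat/completions; adjust if your endpoint differs:

```bash
# Chat-style request against the same service (path assumed, not verified)
curl -X POST http://127.0.0.1:3000/proxy/services/mistral-7b/mistral-7b/v1/chat/completions \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 100
  }'
```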
Stop a specific service:

```bash
dstack stop mistral-7b --project mistral-7b
```

Stop all running services:
```bash
# List all running services
dstack ps --project mistral-7b

# Stop each service individually
dstack stop <service-name> --project mistral-7b
```

Destroy all resources (complete cleanup):
If you want to ensure all cloud resources are terminated:
```bash
# Stop all services in the mistral-7b project
for service in $(dstack ps --project mistral-7b --format json | jq -r '.[].name'); do
    echo "Stopping ${service}..."
    dstack stop "${service}" --project mistral-7b
done
```

Verify all resources are stopped:
```bash
dstack ps --project mistral-7b
```

The output should show no running services. If you see any stuck or failed services, you may need to terminate them manually through the RunPod or VastAI web console.
Important notes:
- `dstack stop` gracefully terminates the service and releases the cloud resources
- Stopped services do not incur compute costs
- Downloaded models and data are not preserved after stopping (will re-download on next deployment)
- For production workloads, consider setting up automatic shutdown schedules or cost alerts
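One lightweight way to set up an automatic shutdown schedule is a cron job on the machine that runs the dstack CLI. This is only a sketch: the schedule, the binary path, and the confirmation-skip flag (-y) are assumptions to verify against your own setup and dstack version.

```bash
# Hypothetical crontab entry (edit with `crontab -e`):
# stop the mistral-7b service every day at 20:00 local time.
0 20 * * * /usr/local/bin/dstack stop mistral-7b -y --project mistral-7b
```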
All configuration is done through environment variables, which can be:
- Set in your shell before running `dstack apply`
- Defined in a `.env` file
- Passed when rendering the `dstack/template.service.yaml` template
| Variable | Default | Description |
|---|---|---|
| `SERVICE_NAME` | `mistral-7b` | Name of the deployed service |
| `HOST` | `0.0.0.0` | Server host address |
| `PORT` | `8080` | Server port |
| Variable | Default | Description |
|---|---|---|
| `CPU` | `8..` | Minimum CPU cores |
| `MEMORY` | `12GB..` | Minimum RAM |
| `GPU` | `RTX4090:1` | GPU type and count |
| `DISK` | `50GB..` | Minimum disk space |
| `SPOT_POLICY` | `on-demand` | Instance type (spot/on-demand) |
| Variable | Default | Description |
|---|---|---|
| `UV_VERSION` | `0.8.18` | uv package manager version |
| `SGLANG_VERSION` | `0.5.2` | SGLang version |
| `DOCKER_IMAGE` | `runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04` | Base Docker image |
Edit your .env file or export the variable:
```bash
export PORT='8081'

cd services/mistral-7b
./scripts/render-config.bash --env-file ../../.env | dstack apply -f - --project mistral-7b
```

```
toffee-dstack-demo/
├── server/
│   └── example.config.yaml            # Example dstack server configuration
├── services/
│   └── mistral-7b/
│       ├── dstack/
│       │   └── template.service.yaml  # dstack service template (requires rendering)
│       └── scripts/
│           ├── setup.bash             # Dependency installation
│           ├── start.bash             # Service startup
│           └── render-config.bash     # Renders template with env vars
├── example.env                        # Example environment variables
├── .dstackignore                      # Files excluded from deployment
└── README.md                          # This file
```
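As a mental model for the rendering step, the template-to-config substitution can be done with something as simple as envsubst. The following is only an illustrative sketch of the idea, not the actual contents of scripts/render-config.bash:

```bash
#!/usr/bin/env bash
# Illustrative sketch: export variables from a .env file and substitute them into the template.
set -euo pipefail

ENV_FILE="${1:-.env}"
TEMPLATE="dstack/template.service.yaml"
OUTPUT="dstack/service.yaml"

set -a; source "${ENV_FILE}"; set +a     # export every variable defined in the .env file
envsubst < "${TEMPLATE}" > "${OUTPUT}"   # replace ${VAR} placeholders with the exported values
echo "Rendered ${TEMPLATE} -> ${OUTPUT}"
```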
If dstack cannot find GPU capacity:

- Change the GPU type in `.env`: `GPU="RTX4090:1"` or `GPU="A100:1"`
- Enable spot instances in `.env`: `SPOT_POLICY="spot"`
- Increase the retry duration in `.env`: `RETRY_DURATION="24h"`
- Check that your RunPod/VastAI API keys in the server config are valid

If the service runs out of GPU memory:

- Reduce the memory fraction: `export MEM_FRACTION_STATIC="0.7"`
- Reduce the context length: `export CONTEXT_LENGTH="1024"`
- Use a larger GPU: `export GPU="A100:1"`

If access to the model fails (gated model):

- Set a HuggingFace token (we tested with a Read-scope token): `export HF_TOKEN="your_token"`
- Accept the model license on HuggingFace
- Verify the model ID is correct
Check dstack logs:
```bash
dstack logs mistral-7b --project mistral-7b
```

Common issues:
- Missing HF_TOKEN for gated models
- Insufficient GPU memory
- Network connectivity issues
Follow the logs in real time:

```bash
dstack logs mistral-7b --follow --project mistral-7b
```

SGLang exposes Prometheus metrics at /metrics:
```bash
curl http://<service-endpoint>/metrics
```
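Metric names vary across SGLang versions, so one quick way to see what is actually exported is to list the metric names from the Prometheus text output (the filtering below is a generic sketch, not specific to SGLang):

```bash
# List the exported metric names (drop comment lines, keep the first field, de-duplicate)
curl -s http://<service-endpoint>/metrics | grep -v '^#' | cut -d' ' -f1 | sort -u | head -n 20
```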
Now that you have a working deployment, try these exercises to deepen your understanding of dstack and service configuration. Each exercise builds on the core concepts and helps you learn how to customize deployments for different scenarios.

Objective: Modify the service health check intervals to understand how dstack monitors service health.
Task:
- Open `services/mistral-7b/dstack/template.service.yaml`
- Locate the `probes` section (currently set to check every 10s)
- Modify the probe configuration:

```yaml
probes:
  - type: http
    url: /health
    interval: 30s   # Change from 10s to 30s
    timeout: 10s    # Change from 5s to 10s
```

- Re-render and redeploy the service
- Monitor the logs to see how the health check interval affects deployment
Questions to consider:
- How does increasing the interval affect deployment time?
- What happens if you set the timeout too low?
- When would you want more frequent health checks?
Objective: Learn how to adapt the deployment for different GPU availability and pricing.
Task:
- Edit your `.env` file and change the GPU type:

```bash
# Try different GPUs
GPU=RTX4090:1   # Consumer GPU (usually cheaper)
# GPU=L40:1     # Alternative datacenter GPU
# GPU=A100:1    # High-end option
```

- Adjust the memory fraction if needed (smaller GPUs may need lower values):

```bash
MEM_FRACTION_STATIC=0.7  # For GPUs with less memory
```

- Re-render, export the envs, and redeploy
- Compare performance and costs between GPU types
Questions to consider:
- Which GPU provides the best price/performance ratio?
- How does GPU memory affect model loading and inference?
- What happens if you try to deploy on a GPU with insufficient memory?
Objective: Learn how dstack handles scaling by incrementing replicas and observing rolling deployments.
Task:
- Start with a single replica deployment (default: `REPLICAS=1`)
- Update your `.env` file to scale up: `REPLICAS=3`
- Re-render the configuration:

```bash
cd services/mistral-7b
./scripts/render-config.bash --env-file ../../.env --output ./dstack/service.yaml
```

- Apply the updated configuration:

```bash
set -a; source ../../.env; set +a
dstack apply -f ./dstack/service.yaml --project mistral-7b
```

- Watch the rolling deployment in real time:

```bash
watch -n 5 'dstack ps --project mistral-7b'
```

- Observe the status changes as new replicas spin up: `provisioning` → `building` → `running`
- Once all replicas are running, scale down to 1 and observe the termination process
Questions to consider:
- How does dstack handle the rollout of new replicas?
- What happens to existing replicas during scaling?
- How long does it take for a new replica to become `running`?
- When would you need multiple replicas in production?
Objective: Understand the relationship between context length and memory usage.
Task:
- Modify the context length in `.env`:

```bash
# Try different context lengths
CONTEXT_LENGTH=1024    # Shorter context, less memory
# CONTEXT_LENGTH=4096  # Longer context, more memory
# CONTEXT_LENGTH=8192  # Maximum context (may require more GPU memory)
```

- Adjust the max prefill tokens accordingly:

```bash
MAX_PREFILL_TOKENS=32768  # Typically 2-4x the context length
```

- Re-render and redeploy
- Test with prompts of varying lengths (see the sketch after this list)
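To probe the limits, you can send a deliberately long prompt and watch the logs for context-length or memory errors. This is a rough sketch that reuses the proxy endpoint and TOKEN from the Quick Start section; adjust the repetition count to taste:

```bash
# Build an artificially long prompt (~400 repetitions) and send it to the completions endpoint
LONG_PROMPT=$(printf 'Tell me about context windows. %.0s' {1..400})
curl -s -X POST http://127.0.0.1:3000/proxy/services/mistral-7b/mistral-7b/v1/completions \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"mistralai/Mistral-7B-Instruct-v0.2\", \"prompt\": \"${LONG_PROMPT}\", \"max_tokens\": 50}"
```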
Questions to consider:
- How does context length affect memory usage?
- What's the trade-off between context length and throughput?
- When would you need longer context windows?
Objective: Tune SGLang performance parameters for your workload.
Task:
- Modify the performance settings in `.env`:

```bash
# Disable torch compile for faster startup (but slower inference)
ENABLE_TORCH_COMPILE=false

# Change the scheduling policy
SCHEDULE_POLICY=fcfs  # First-come-first-served instead of LPM

# Adjust conservativeness (0.0 = aggressive, 1.0 = conservative)
SCHEDULE_CONSERVATIVENESS=0.5
```
- Re-render and redeploy
- Benchmark inference latency and throughput (see the sketch after this list)
- Compare with default settings
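A very rough way to compare settings is to time a handful of identical requests. This reuses the proxy endpoint and TOKEN from the Quick Start section and is only a sanity check, not a substitute for a proper load-testing tool:

```bash
# Time 5 sequential completion requests and report the total wall-clock time
time for i in $(seq 1 5); do
  curl -s -o /dev/null -X POST http://127.0.0.1:3000/proxy/services/mistral-7b/mistral-7b/v1/completions \
    -H "Authorization: Bearer ${TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "prompt": "Hello", "max_tokens": 64}'
done
```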
Questions to consider:
- How do these parameters affect cold start time?
- What's the trade-off between startup speed and inference performance?
- Which scheduling policy works best for your use case?
Objective: Create separate dev and prod configurations.
Task:
- Create `.env.dev` and `.env.prod` files:

```bash
# .env.dev
SERVICE_NAME=mistral-7b-dev
SPOT_POLICY=spot
GPU=RTX4090:1
REPLICAS=1

# .env.prod
SERVICE_NAME=mistral-7b-prod
SPOT_POLICY=on-demand
GPU=RTX4090:1
REPLICAS=2
```

- Deploy to both environments:

```bash
# Dev deployment
./scripts/render-config.bash --env-file ../../.env.dev --output dev.yaml
set -a; source ../../.env.dev; set +a
dstack apply -f dev.yaml --project mistral-7b

# Prod deployment
./scripts/render-config.bash --env-file ../../.env.prod --output prod.yaml
set -a; source ../../.env.prod; set +a
dstack apply -f prod.yaml --project mistral-7b
```
- Manage both deployments independently
Questions to consider:
- How do you organize multiple environment configs?
- What should differ between dev and prod?
- How do you prevent accidentally deploying to the wrong environment?
- dstack Documentation
- SGLang Documentation
- Mistral Model Card
- RunPod Documentation
- VastAI Documentation
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! We appreciate bug reports, feature suggestions, documentation improvements, and code contributions.
Please read our Contributing Guide for details on:
- How to report bugs and suggest enhancements
- Development setup and workflow
- Style guidelines and best practices
- Testing your changes
- Submitting pull requests
For quick contributions, feel free to submit a Pull Request directly.