A production-ready demonstration of deploying the Mistral 7B Instruct LLM using dstack across multiple cloud providers (RunPod and VastAI). This project showcases best practices for LLM hosting with minimal dependencies and maximum flexibility.
By the Toffee AI team.
This project demonstrates how to:
- Deploy a production-grade LLM (Mistral-7B) using dstack
- Support multiple cloud backends (RunPod, VastAI) with a single configuration
- Configure and optimize SGLang for efficient inference
- Automate deployments using environment variables and templates
Contents:

- Tested Environment
- Prerequisites
- Quick Start
- Configuration
- Project Structure
- Troubleshooting
- Advanced Usage
- Hands-On Exercises
- Learn More
- License
- Contributing
This project was initially developed and tested with:
- OS: Ubuntu 22.04 LTS
- Shell: Bash 5.1+
- Python: 3.11+
- Docker: 24.0+ (for local testing)
While the project should work on other Unix-like systems (macOS, other Linux distributions), the scripts and configurations have been validated on the environment above.
Install the dstack CLI following the official installation guide:
```bash
uv tool install 'dstack[all]' -U
```

You need a running dstack server with configured cloud backends. This demo supports both RunPod and VastAI.
- Copy the example server configuration:

```bash
mkdir -p ~/.dstack/server

# Back up the existing config if present:
cp ~/.dstack/server/config.yml ~/.dstack/server/backup.config.yml || true

# If there is no existing config, copy the example:
cp server/example.config.yaml ~/.dstack/server/config.yml

# Otherwise, just extend your existing config with the new project settings from server/example.config.yaml
```
- Edit `~/.dstack/server/config.yml` and replace the placeholder API keys:
  - RunPod API Key: get it from RunPod Settings
  - VastAI API Key: get it from VastAI Account
- Start the dstack server:

```bash
dstack server
```
Make sure the server config was applied:

```
[...] INFO Applying ~/.dstack/server/config.yml...
```

The server will start on http://127.0.0.1:3000.
For live deployments, consider hosting dstack server on a cloud platform (AWS ECS, GCP, etc.). See the dstack documentation for details.
Configure your dstack CLI to connect to your server:
```bash
dstack config \
    --project mistral-7b \
    --url http://127.0.0.1:3000 \
    --token <your_dstack_admin_token_that_you_set_in_server>
```

Confirm:

```
Set 'mistral-7b' as your default project? [y/n]: n
```

Note: if you elect not to set a default project, remember to always pass --project mistral-7b to dstack commands; we do that below for reproducibility. Alternatively, set it as the default and you will never need to specify the project again, but then you can only work with one project at a time.
The service configuration uses environment variables for all settings.
Start by copying the example environment file:
```bash
cp example.env .env
```

Edit .env to set the required environment variables and customize your deployment. See example.env for all available configuration options.
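For example, a minimal .env might look like the following; the values are illustrative, and the variable names are the ones documented in the Configuration tables below:

```bash
# Minimal illustrative .env (see example.env for the full list of options)
SERVICE_NAME=mistral-7b    # Name of the deployed service
PORT=8080                  # Server port
GPU=RTX4090:1              # GPU type and count
SPOT_POLICY=on-demand      # Instance type (spot/on-demand)
HF_TOKEN=your_token        # HuggingFace token, needed for gated models
```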
Now that you are done with the prerequisites, you can deploy the Mistral-7B service from your local machine.
The service configuration template (services/mistral-7b/dstack/template.service.yaml) uses environment variable
substitution. Use the provided script to render a deployable configuration:
```bash
# Navigate to the service directory
cd ./services/mistral-7b

# Render the configuration using your .env file
./scripts/render-config.bash --env-file ../../.env --output ./dstack/service.yaml
```

🚨 COST WARNING: GPU instances, especially high-end GPUs, can be expensive! Always stop your services when not in use to avoid unexpected charges. Hourly costs add up quickly if a service is left running.
Deploy the service using the configuration you have just rendered:
```bash
# Export all environment variables from .env; dstack needs them to populate the `env:` section:
set -a; source ../../.env; set +a

dstack apply -f ./dstack/service.yaml --project mistral-7b
```

After you confirm the plan, dstack will:
- Find available GPU capacity on RunPod or VastAI
- Provision a container with the specified resources
- Install dependencies (uv, SGLang, system packages)
- Download and load the Mistral-7B model
- Start the inference server
Monitor the deployment progress:

```bash
dstack ps --project mistral-7b --watch
```

Once deployed, you can try asking the LLM a question using the standard SGLang completions endpoint:
```bash
TOKEN=<your_dstack_admin_token_that_you_set_in_server>

curl -X POST http://127.0.0.1:3000/proxy/services/mistral-7b/mistral-7b/v1/completions \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Explain quantum computing in simple terms:",
    "max_tokens": 100
  }'
```
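SGLang also exposes an OpenAI-compatible chat completions endpoint. The sketch below assumes the same proxy path and TOKEN as above, with only the final path segment changed to /v1/chat/completions; adjust if your endpoint differs:

```bash
# Chat-style request against the same service (path assumed, not verified)
curl -X POST http://127.0.0.1:3000/proxy/services/mistral-7b/mistral-7b/v1/chat/completions \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 100
  }'
```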
Stop a specific service:

```bash
dstack stop mistral-7b --project mistral-7b
```

Stop all running services:
```bash
# List all running services
dstack ps --project mistral-7b

# Stop each service individually
dstack stop <service-name> --project mistral-7b
```

Destroy all resources (complete cleanup):
If you want to ensure all cloud resources are terminated:
```bash
# Stop all services in the mistral-7b project
for service in $(dstack ps --project mistral-7b --format json | jq -r '.[].name'); do
    echo "Stopping ${service}..."
    dstack stop "${service}" --project mistral-7b
done
```

Verify all resources are stopped:
```bash
dstack ps --project mistral-7b
```

The output should show no running services. If you see any stuck or failed services, you may need to terminate them manually through the RunPod or VastAI web console.
Important notes:
- `dstack stop` gracefully terminates the service and releases the cloud resources
- Stopped services do not incur compute costs
- Downloaded models and data are not preserved after stopping (will re-download on next deployment)
- For production workloads, consider setting up automatic shutdown schedules or cost alerts
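One lightweight way to set up an automatic shutdown schedule is a cron job on the machine that runs the dstack CLI. This is only a sketch: the schedule, the binary path, and the confirmation-skip flag (-y) are assumptions to verify against your own setup and dstack version.

```bash
# Hypothetical crontab entry (edit with `crontab -e`):
# stop the mistral-7b service every day at 20:00 local time.
0 20 * * * /usr/local/bin/dstack stop mistral-7b -y --project mistral-7b
```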
All configuration is done through environment variables, which can be:
- Set in your shell before running `dstack apply`
- Defined in a `.env` file
- Passed when rendering the `dstack/template.service.yaml` template
| Variable | Default | Description |
|---|---|---|
| `SERVICE_NAME` | `mistral-7b` | Name of the deployed service |
| `HOST` | `0.0.0.0` | Server host address |
| `PORT` | `8080` | Server port |
| Variable | Default | Description |
|---|---|---|
| `CPU` | `8..` | Minimum CPU cores |
| `MEMORY` | `12GB..` | Minimum RAM |
| `GPU` | `RTX4090:1` | GPU type and count |
| `DISK` | `50GB..` | Minimum disk space |
| `SPOT_POLICY` | `on-demand` | Instance type (spot/on-demand) |
| Variable | Default | Description |
|---|---|---|
| `UV_VERSION` | `0.8.18` | uv package manager version |
| `SGLANG_VERSION` | `0.5.2` | SGLang version |
| `DOCKER_IMAGE` | `runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04` | Base Docker image |
Edit your .env file or export the variable:
```bash
export PORT='8081'

cd services/mistral-7b
./scripts/render-config.bash --env-file ../../.env | dstack apply -f - --project mistral-7b
```

```
toffee-dstack-demo/
├── server/
│   └── example.config.yaml            # Example dstack server configuration
├── services/
│   └── mistral-7b/
│       ├── dstack/
│       │   └── template.service.yaml  # dstack service template (requires rendering)
│       └── scripts/
│           ├── setup.bash             # Dependency installation
│           ├── start.bash             # Service startup
│           └── render-config.bash     # Renders template with env vars
├── example.env                        # Example environment variables
├── .dstackignore                      # Files excluded from deployment
└── README.md                          # This file
```
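As a mental model for the rendering step, the template-to-config substitution can be done with something as simple as envsubst. The following is only an illustrative sketch of the idea, not the actual contents of scripts/render-config.bash:

```bash
#!/usr/bin/env bash
# Illustrative sketch: export variables from a .env file and substitute them into the template.
set -euo pipefail

ENV_FILE="${1:-.env}"
TEMPLATE="dstack/template.service.yaml"
OUTPUT="dstack/service.yaml"

set -a; source "${ENV_FILE}"; set +a     # export every variable defined in the .env file
envsubst < "${TEMPLATE}" > "${OUTPUT}"   # replace ${VAR} placeholders with the exported values
echo "Rendered ${TEMPLATE} -> ${OUTPUT}"
```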
If dstack cannot find GPU capacity:

- Change the GPU type in `.env`: `GPU="RTX4090:1"` or `GPU="A100:1"`
- Enable spot instances in `.env`: `SPOT_POLICY="spot"`
- Increase the retry duration in `.env`: `RETRY_DURATION="24h"`
- Check that your RunPod/VastAI API keys in the server config are valid

If the service runs out of GPU memory:

- Reduce the memory fraction: `export MEM_FRACTION_STATIC="0.7"`
- Reduce the context length: `export CONTEXT_LENGTH="1024"`
- Use a larger GPU: `export GPU="A100:1"`

If access to the model fails (gated model):

- Set a HuggingFace token (we tested with a Read-scope token): `export HF_TOKEN="your_token"`
- Accept the model license on HuggingFace
- Verify the model ID is correct
Check dstack logs:
```bash
dstack logs mistral-7b --project mistral-7b
```

Common issues:
- Missing HF_TOKEN for gated models
- Insufficient GPU memory
- Network connectivity issues
Follow the logs in real time:

```bash
dstack logs mistral-7b --follow --project mistral-7b
```

SGLang exposes Prometheus metrics at /metrics:
```bash
curl http://<service-endpoint>/metrics
```
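Metric names vary across SGLang versions, so one quick way to see what is actually exported is to list the metric names from the Prometheus text output (the filtering below is a generic sketch, not specific to SGLang):

```bash
# List the exported metric names (drop comment lines, keep the first field, de-duplicate)
curl -s http://<service-endpoint>/metrics | grep -v '^#' | cut -d' ' -f1 | sort -u | head -n 20
```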
Now that you have a working deployment, try these exercises to deepen your understanding of dstack and service configuration. Each exercise builds on the core concepts and helps you learn how to customize deployments for different scenarios.

Objective: Modify the service health check intervals to understand how dstack monitors service health.
Task:
- Open `services/mistral-7b/dstack/template.service.yaml`
- Locate the `probes` section (currently set to check every 10s)
- Modify the probe configuration:

```yaml
probes:
  - type: http
    url: /health
    interval: 30s   # Change from 10s to 30s
    timeout: 10s    # Change from 5s to 10s
```

- Re-render and redeploy the service
- Monitor the logs to see how the health check interval affects deployment
Questions to consider:
- How does increasing the interval affect deployment time?
- What happens if you set the timeout too low?
- When would you want more frequent health checks?
Objective: Learn how to adapt the deployment for different GPU availability and pricing.
Task:
- Edit your `.env` file and change the GPU type:

```bash
# Try different GPUs
GPU=RTX4090:1   # Consumer GPU (usually cheaper)
# GPU=L40:1     # Alternative datacenter GPU
# GPU=A100:1    # High-end option
```

- Adjust the memory fraction if needed (smaller GPUs may need lower values):

```bash
MEM_FRACTION_STATIC=0.7  # For GPUs with less memory
```

- Re-render, export the envs, and redeploy
- Compare performance and costs between GPU types
Questions to consider:
- Which GPU provides the best price/performance ratio?
- How does GPU memory affect model loading and inference?
- What happens if you try to deploy on a GPU with insufficient memory?
Objective: Learn how dstack handles scaling by incrementing replicas and observing rolling deployments.
Task:
- Start with a single replica deployment (default: `REPLICAS=1`)
- Update your `.env` file to scale up: `REPLICAS=3`
- Re-render the configuration:

```bash
cd services/mistral-7b
./scripts/render-config.bash --env-file ../../.env --output ./dstack/service.yaml
```

- Apply the updated configuration:

```bash
set -a; source ../../.env; set +a
dstack apply -f ./dstack/service.yaml --project mistral-7b
```

- Watch the rolling deployment in real time:

```bash
watch -n 5 'dstack ps --project mistral-7b'
```

- Observe the status changes as new replicas spin up: `provisioning` → `building` → `running`
- Once all replicas are running, scale down to 1 and observe the termination process
Questions to consider:
- How does dstack handle the rollout of new replicas?
- What happens to existing replicas during scaling?
- How long does it take for a new replica to become `running`?
- When would you need multiple replicas in production?
Objective: Understand the relationship between context length and memory usage.
Task:
- Modify the context length in `.env`:

```bash
# Try different context lengths
CONTEXT_LENGTH=1024    # Shorter context, less memory
# CONTEXT_LENGTH=4096  # Longer context, more memory
# CONTEXT_LENGTH=8192  # Maximum context (may require more GPU memory)
```

- Adjust the max prefill tokens accordingly:

```bash
MAX_PREFILL_TOKENS=32768  # Typically 2-4x the context length
```

- Re-render and redeploy
- Test with prompts of varying lengths (see the sketch after this list)
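To probe the limits, you can send a deliberately long prompt and watch the logs for context-length or memory errors. This is a rough sketch that reuses the proxy endpoint and TOKEN from the Quick Start section; adjust the repetition count to taste:

```bash
# Build an artificially long prompt (~400 repetitions) and send it to the completions endpoint
LONG_PROMPT=$(printf 'Tell me about context windows. %.0s' {1..400})
curl -s -X POST http://127.0.0.1:3000/proxy/services/mistral-7b/mistral-7b/v1/completions \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"mistralai/Mistral-7B-Instruct-v0.2\", \"prompt\": \"${LONG_PROMPT}\", \"max_tokens\": 50}"
```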
Questions to consider:
- How does context length affect memory usage?
- What's the trade-off between context length and throughput?
- When would you need longer context windows?
Objective: Tune SGLang performance parameters for your workload.
Task:
- Modify the performance settings in `.env`:

```bash
# Disable torch compile for faster startup (but slower inference)
ENABLE_TORCH_COMPILE=false

# Change the scheduling policy
SCHEDULE_POLICY=fcfs  # First-come-first-served instead of LPM

# Adjust conservativeness (0.0 = aggressive, 1.0 = conservative)
SCHEDULE_CONSERVATIVENESS=0.5
```
- Re-render and redeploy
- Benchmark inference latency and throughput (see the sketch after this list)
- Compare with default settings
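A very rough way to compare settings is to time a handful of identical requests. This reuses the proxy endpoint and TOKEN from the Quick Start section and is only a sanity check, not a substitute for a proper load-testing tool:

```bash
# Time 5 sequential completion requests and report the total wall-clock time
time for i in $(seq 1 5); do
  curl -s -o /dev/null -X POST http://127.0.0.1:3000/proxy/services/mistral-7b/mistral-7b/v1/completions \
    -H "Authorization: Bearer ${TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "prompt": "Hello", "max_tokens": 64}'
done
```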
Questions to consider:
- How do these parameters affect cold start time?
- What's the trade-off between startup speed and inference performance?
- Which scheduling policy works best for your use case?
Objective: Create separate dev and prod configurations.
Task:
- Create `.env.dev` and `.env.prod` files:

```bash
# .env.dev
SERVICE_NAME=mistral-7b-dev
SPOT_POLICY=spot
GPU=RTX4090:1
REPLICAS=1

# .env.prod
SERVICE_NAME=mistral-7b-prod
SPOT_POLICY=on-demand
GPU=RTX4090:1
REPLICAS=2
```

- Deploy to both environments:

```bash
# Dev deployment
./scripts/render-config.bash --env-file ../../.env.dev --output dev.yaml
set -a; source ../../.env.dev; set +a
dstack apply -f dev.yaml --project mistral-7b

# Prod deployment
./scripts/render-config.bash --env-file ../../.env.prod --output prod.yaml
set -a; source ../../.env.prod; set +a
dstack apply -f prod.yaml --project mistral-7b
```
- Manage both deployments independently
Questions to consider:
- How do you organize multiple environment configs?
- What should differ between dev and prod?
- How do you prevent accidentally deploying to the wrong environment?
- dstack Documentation
- SGLang Documentation
- Mistral Model Card
- RunPod Documentation
- VastAI Documentation
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! We appreciate bug reports, feature suggestions, documentation improvements, and code contributions.
Please read our Contributing Guide for details on:
- How to report bugs and suggest enhancements
- Development setup and workflow
- Style guidelines and best practices
- Testing your changes
- Submitting pull requests
For quick contributions, feel free to submit a Pull Request directly.