The HUD SDK is an open-source Python toolkit for building, evaluating, and training AI agents. Use a unified API for any model provider, wrap your code as MCP environments, run A/B evals at scale, and train with reinforcement learning.
To learn more, check out our Documentation and API Reference.
```bash
pip install hud-python
```

Get your API key at hud.ai and set it:

```bash
export HUD_API_KEY=your-key-here
```

For CLI tools (`hud init`, `hud dev`, etc.):

```bash
uv tool install hud-python --python 3.12
```
Use Claude, GPT, Gemini, or Grok through one OpenAI-compatible endpoint:
```python
from openai import AsyncOpenAI
import os

client = AsyncOpenAI(
    base_url="https://inference.hud.ai",
    api_key=os.environ["HUD_API_KEY"],
)

response = await client.chat.completions.create(
    model="claude-sonnet-4-5",  # or gpt-4o, gemini-2.5-pro (https://hud.ai/models)
    messages=[{"role": "user", "content": "Hello!"}],
)
```

Every call is traced at hud.ai. → Docs
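Top-level `await` only works inside an async context such as a notebook. From a plain script, wrap the call in a coroutine and run it with `asyncio.run`; a minimal sketch reusing the example above:

```python
import asyncio
import os

from openai import AsyncOpenAI


async def main() -> None:
    # Same OpenAI-compatible client, pointed at HUD's inference endpoint
    client = AsyncOpenAI(
        base_url="https://inference.hud.ai",
        api_key=os.environ["HUD_API_KEY"],
    )
    response = await client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)


if __name__ == "__main__":
    asyncio.run(main())
```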
Turn your code into tools agents can call. Define how to evaluate them:
```python
from hud import Environment

env = Environment("my-env")

@env.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

@env.scenario("solve-math")
async def solve_math(problem: str, answer: int):
    response = yield problem  # Prompt
    yield 1.0 if str(answer) in response else 0.0  # Reward

async with env("solve-math", problem="What is 2+2?", answer=4) as ctx:
    # Your agent logic here - call tools, get response
    result = await ctx.call_tool("add", a=2, b=2)
    await ctx.submit(f"The answer is {result}")

print(ctx.reward)  # 1.0
```

The agent runs between the yields. First yield sends the prompt, second yield scores the result. → Docs · Templates
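The reward is just a float, so a scenario can hand out partial credit or any other score. A hedged sketch on the same `env` (the `sum-list` scenario and its scoring rule are invented here for illustration, not part of the template above):

```python
@env.scenario("sum-list")
async def sum_list(numbers: list[int]):
    # First yield: the prompt shown to the agent
    response = yield f"What is the sum of {numbers}? Use the add tool."

    # Second yield: the reward - full credit for the right total,
    # partial credit if the agent at least mentioned the add tool
    expected = sum(numbers)
    if str(expected) in response:
        yield 1.0
    elif "add" in response.lower():
        yield 0.25
    else:
        yield 0.0
```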
Test different models. Repeat runs to see the distribution:
```python
from openai import AsyncOpenAI
import os

client = AsyncOpenAI(
    base_url="https://inference.hud.ai",
    api_key=os.environ["HUD_API_KEY"],
)

# Using the env from above
async with env(
    "solve-math",
    problem="What is 2+2?",
    answer=4,
    variants={"model": ["gpt-4o", "claude-sonnet-4-5"]},
    group=5,
) as ctx:
    response = await client.chat.completions.create(
        model=ctx.variants["model"],
        messages=[{"role": "user", "content": ctx.prompt}],
        tools=ctx.tools,  # Environment tools available to the model
    )
    await ctx.submit(response.choices[0].message.content)
```

Variants test configurations. Groups repeat for distribution. Results stream to hud.ai. → Docs
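Because `tools=ctx.tools` is passed, the model may answer with tool calls instead of final text. A hedged sketch of the loop that would go inside the `async with` block above, executing each call with `ctx.call_tool` before submitting (this assumes `ctx.tools` uses the OpenAI tool-schema format; check the docs to confirm):

```python
import json

messages = [{"role": "user", "content": ctx.prompt}]

while True:
    response = await client.chat.completions.create(
        model=ctx.variants["model"],
        messages=messages,
        tools=ctx.tools,
    )
    message = response.choices[0].message

    if not message.tool_calls:
        # No tool calls left: treat the text as the final answer
        await ctx.submit(message.content)
        break

    # Execute each requested tool in the environment and feed the results back
    messages.append(message)
    for call in message.tool_calls:
        result = await ctx.call_tool(call.function.name, **json.loads(call.function.arguments))
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": str(result),
        })
```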
Push to GitHub, connect on hud.ai, run at scale:
```bash
hud init    # Scaffold environment
git push    # Push to GitHub
# Connect on hud.ai → New → Environment

hud eval my-eval --model gpt-4o --group-size 100
# Or create and run tasks on the platform
```

Every run generates training data. Use it to fine-tune or run RL. → Docs
- 📖 Documentation
- ⌨️ CLI Reference
- 🏆 Leaderboards
- 🌐 Environment Templates
- 🤖 Supported Models
- 💬 Discord
Building agents at scale? We work with teams on custom environments, benchmarks, and training.
📅 Book a call · 📧 founders@hud.ai
We welcome contributions! See CONTRIBUTING.md.
Key areas: Agents · Tools · Environments
```bibtex
@software{hud2025agentevalplatform,
  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Govind Pimpale and Dylan Bowman and Jaideep and Nguyen Nhat Minh},
  title = {HUD: An Evaluation and RL Environments Platform for Agents},
  date = {2025-04},
  url = {https://github.com/hud-evals/hud-python},
  langid = {en}
}
```

MIT License · LICENSE

