HUD

The HUD SDK is an open-source Python toolkit for building, evaluating, and training AI agents. Use a unified API for any model provider, wrap your code as MCP environments, run A/B evals at scale, and train with reinforcement learning.

To learn more, check out our Documentation and API Reference.


Install

pip install hud-python

Get your API key at hud.ai and set it:

export HUD_API_KEY=your-key-here

For CLI tools (hud init, hud dev, etc.): uv tool install hud-python --python 3.12

[Demo: agent running on SheetBench]

Usage

Unified Model API

Use Claude, GPT, Gemini, or Grok through one OpenAI-compatible endpoint:

from openai import AsyncOpenAI
import os

client = AsyncOpenAI(
    base_url="https://inference.hud.ai",
    api_key=os.environ["HUD_API_KEY"]
)

response = await client.chat.completions.create(
    model="claude-sonnet-4-5",  # or gpt-4o, gemini-2.5-pro (https://hud.ai/models)
    messages=[{"role": "user", "content": "Hello!"}]
)

Every call is traced at hud.ai. → Docs
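
Because the endpoint is OpenAI-compatible, switching providers is just a change of the model string. A minimal sketch reusing the client defined above; the model names mirror the comment in the example, and hud.ai/models lists what is currently available:

# Compare providers by swapping only the model name.
# Assumes the client defined above; model list from https://hud.ai/models.
for model in ["claude-sonnet-4-5", "gpt-4o", "gemini-2.5-pro"]:
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize MCP in one sentence."}]
    )
    print(model, "->", response.choices[0].message.content)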

Environments

Turn your code into tools agents can call. Define how to evaluate them:

from hud import Environment

env = Environment("my-env")

@env.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

@env.scenario("solve-math")
async def solve_math(problem: str, answer: int):
    response = yield problem                    # Prompt
    yield 1.0 if str(answer) in response else 0.0  # Reward

async with env("solve-math", problem="What is 2+2?", answer=4) as ctx:
    # Your agent logic here - call tools, get response
    result = await ctx.call_tool("add", a=2, b=2)
    await ctx.submit(f"The answer is {result}")

print(ctx.reward)  # 1.0

The agent runs between the yields. First yield sends the prompt, second yield scores the result. → Docs · Templates
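
To drive the scenario with a model instead of hard-coded calls, the context also exposes the prompt and tool schemas (ctx.prompt and ctx.tools, as used in the A/B example below). A minimal single-turn sketch, assuming the client and env defined above and that ctx.tools is in OpenAI tool-schema format:

import json

# Single-turn agent loop: send the scenario prompt, execute any tool calls
# through the environment, then submit an answer for scoring.
async with env("solve-math", problem="What is 2+2?", answer=4) as ctx:
    response = await client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=[{"role": "user", "content": ctx.prompt}],
        tools=ctx.tools
    )
    msg = response.choices[0].message
    answer_text = msg.content or ""
    for tc in msg.tool_calls or []:
        result = await ctx.call_tool(tc.function.name, **json.loads(tc.function.arguments))
        answer_text = f"The answer is {result}"
    await ctx.submit(answer_text)

print(ctx.reward)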

A/B Evals

Test different models. Repeat runs to see the distribution:

from openai import AsyncOpenAI
import os

client = AsyncOpenAI(
    base_url="https://inference.hud.ai",
    api_key=os.environ["HUD_API_KEY"]
)

# Using the env from above
async with env("solve-math", problem="What is 2+2?", answer=4, variants={"model": ["gpt-4o", "claude-sonnet-4-5"]}, group=5) as ctx:
    response = await client.chat.completions.create(
        model=ctx.variants["model"],
        messages=[{"role": "user", "content": ctx.prompt}],
        tools=ctx.tools  # Environment tools available to the model
    )
    await ctx.submit(response.choices[0].message.content)

Variants test configurations. Groups repeat for distribution. Results stream to hud.ai. → Docs
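
Variants are not limited to the model name; the same pattern can sweep any configuration you care about. A hypothetical sketch, assuming variants accepts multiple keys (not confirmed here), with each combination repeated group times:

# Hypothetical sweep over model and sampling temperature.
# Assumes variants accepts multiple keys; each combination runs `group` times.
async with env(
    "solve-math",
    problem="What is 2+2?",
    answer=4,
    variants={
        "model": ["gpt-4o", "claude-sonnet-4-5"],
        "temperature": [0.0, 1.0],
    },
    group=5,
) as ctx:
    response = await client.chat.completions.create(
        model=ctx.variants["model"],
        temperature=ctx.variants["temperature"],
        messages=[{"role": "user", "content": ctx.prompt}],
        tools=ctx.tools
    )
    await ctx.submit(response.choices[0].message.content)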

Deploy & Train

Push to GitHub, connect on hud.ai, run at scale:

hud init                  # Scaffold environment
git push                  # Push to GitHub
# Connect on hud.ai → New → Environment
hud eval my-eval --model gpt-4o --group-size 100
# Or create and run tasks on the platform

Every run generates training data. Use it to fine-tune or run RL. → Docs


Enterprise

Building agents at scale? We work with teams on custom environments, benchmarks, and training.

📅 Book a call · 📧 founders@hud.ai

Contributing

We welcome contributions! See CONTRIBUTING.md.

Key areas: Agents · Tools · Environments

Citation

@software{hud2025agentevalplatform,
  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Govind Pimpale and Dylan Bowman and Jaideep and Nguyen Nhat Minh},
  title  = {HUD: An Evaluation and RL Environments Platform for Agents},
  date   = {2025-04},
  url    = {https://github.com/hud-evals/hud-python},
  langid = {en}
}

MIT License · LICENSE