diff --git a/README.md b/README.md index 85f049d..d8ddd31 100644 --- a/README.md +++ b/README.md @@ -1,38 +1,18 @@ # pytest-codingagents -A pytest plugin for testing coding agents via their native SDKs. +**Combatting cargo cult programming in Agent Instructions, Skills, and Custom Agents for GitHub Copilot and other coding agents since 2026.** -You give a coding agent a task. Did it pick the right tools? Did it produce working code? Did it follow your instructions? **pytest-codingagents lets you answer these questions with automated tests.** +Everyone's copying instruction files from blog posts, pasting "you are a senior engineer" into agent configs, and adding skills they found on Reddit. But does any of it actually work? Are your instructions making your coding agent better — or just longer? Is that skill helping, or is the agent ignoring it entirely? -## Why? +**You don't know, because you're not testing it.** -You're rolling out GitHub Copilot to your team. But which model works best for your codebase? Do your custom instructions improve quality? Does the agent use your MCP servers correctly? Can it operate your CLI tools? Do your custom agents and skills actually help? +pytest-codingagents is a pytest plugin that runs your actual coding agent configuration against real tasks — then uses AI analysis to tell you **why** things failed and **what to fix**. -You can't answer these questions by trying things manually. You need **repeatable, automated tests** that evaluate: - -- **Instructions** — Do your custom instructions produce the desired behavior? -- **MCP Servers** — Can the agent discover and use your custom tools? -- **CLI Tools** — Can the agent operate command-line interfaces correctly? -- **Custom Agents** — Do your sub-agents handle delegated tasks? -- **Skills** — Does domain knowledge improve agent performance? -- **Models** — Which model works best for your use case and budget? - -## Supported Agents - -| Agent | SDK | Status | -|-------|-----|--------| -| GitHub Copilot | `github-copilot-sdk` | ✅ Implemented | - -## Quick Start - -```bash -uv add pytest-codingagents -``` +Currently supports **GitHub Copilot** via [copilot-sdk](https://www.npmjs.com/package/github-copilot-sdk). More agents (Claude Code, etc.) coming soon. ```python from pytest_codingagents import CopilotAgent - async def test_create_file(copilot_run, tmp_path): agent = CopilotAgent( instructions="Create files as requested.", @@ -41,387 +21,45 @@ async def test_create_file(copilot_run, tmp_path): result = await copilot_run(agent, "Create hello.py with print('hello')") assert result.success assert result.tool_was_called("create_file") - assert (tmp_path / "hello.py").exists() ``` -## Authentication - -The plugin authenticates with GitHub Copilot in this order: - -1. **`GITHUB_TOKEN` env var** — ideal for CI (set via `gh auth token` or a GitHub Actions secret) -2. 
**Logged-in user** via `gh` CLI / OAuth — works automatically for local development +## Install ```bash -# CI: set the token in your workflow -export GITHUB_TOKEN=$(gh auth token) - -# Local: just make sure you're logged into gh CLI -gh auth status -``` - -## Features - -### Native SDK Integration - -Tests run against the real Copilot CLI — no mocks, no wrappers: - -```python -async def test_review(copilot_run, tmp_path): - agent = CopilotAgent( - name="reviewer", - model="claude-sonnet-4", - instructions="You are a Python code reviewer.", - working_directory=str(tmp_path), - max_turns=10, - ) - result = await copilot_run(agent, "Review main.py for bugs") - assert result.success -``` - -### Rich Results - -Every execution returns a `CopilotResult` with full observability: - -```python -result = await copilot_run(agent, "Create a fibonacci function") - -# Tool tracking -assert result.tool_was_called("create_file") -assert result.tool_call_count("create_file") == 1 -assert len(result.all_tool_calls) > 0 -print(result.tool_names_called) # {"create_file", "read_file"} - -# Final response -print(result.final_response) - -# Token usage & cost -print(result.total_tokens) # input + output -print(result.total_cost_usd) # aggregated across all turns - -# Reasoning traces (when available) -print(result.reasoning_traces) - -# Subagent invocations -print(result.subagent_invocations) - -# Model actually used -print(result.model_used) - -# Duration -print(f"{result.duration_ms:.0f}ms") -``` - -### Model Comparison - -```python -import pytest -from pytest_codingagents import CopilotAgent - -MODELS = ["claude-sonnet-4", "gpt-4.1"] - -@pytest.mark.parametrize("model", MODELS) -async def test_fibonacci(copilot_run, tmp_path, model): - agent = CopilotAgent( - name=f"model-{model}", - model=model, - instructions="Create files as requested.", - working_directory=str(tmp_path), - ) - result = await copilot_run(agent, "Create fibonacci.py") - assert result.success -``` - -### Instruction Testing - -```python -import pytest -from pytest_codingagents import CopilotAgent - -@pytest.mark.parametrize("style,instructions", [ - ("concise", "Write minimal code, no comments."), - ("verbose", "Write well-documented code with docstrings."), -]) -async def test_coding_style(copilot_run, tmp_path, style, instructions): - agent = CopilotAgent( - name=f"style-{style}", - instructions=instructions, - working_directory=str(tmp_path), - ) - result = await copilot_run(agent, "Create calculator.py with add/subtract") - assert result.success -``` - -### MCP Server Testing - -Attach MCP servers to test how Copilot uses your custom tools: - -```python -async def test_with_mcp_server(copilot_run, tmp_path): - agent = CopilotAgent( - instructions="Use the database tools to answer questions.", - working_directory=str(tmp_path), - mcp_servers={ - "my-db-server": { - "command": "python", - "args": ["-m", "my_db_mcp_server"], - } - }, - ) - result = await copilot_run(agent, "List all users in the database") - assert result.success - assert result.tool_was_called("list_users") -``` - -### Skill Testing - -Provide domain knowledge via skill directories and test whether it improves agent behavior: - -```python -async def test_with_coding_standards(copilot_run, tmp_path): - # Create a skill with coding standards - skill_dir = tmp_path / "skills" - skill_dir.mkdir() - (skill_dir / "coding-standards.md").write_text( - "# Coding Standards\n\n" - "All Python functions MUST have type hints and docstrings.\n" - ) - - agent = CopilotAgent( - 
name="with-skill", - instructions="Follow all coding standards from your skills.", - working_directory=str(tmp_path), - skill_directories=[str(skill_dir)], - ) - result = await copilot_run(agent, "Create math_utils.py with add and multiply") - assert result.success - content = (tmp_path / "math_utils.py").read_text() - assert "def add" in content -``` - -Compare with and without skills to measure their impact: - -```python -@pytest.mark.parametrize("use_skill", [True, False]) -async def test_skill_impact(copilot_run, tmp_path, use_skill): - skill_dirs = [str(tmp_path / "skills")] if use_skill else [] - agent = CopilotAgent( - name=f"skill-{'on' if use_skill else 'off'}", - skill_directories=skill_dirs, - working_directory=str(tmp_path), - ) - result = await copilot_run(agent, "Create a utility module") - assert result.success -``` - -### CLI Tool Testing - -Test that the agent can operate command-line tools: - -```python -async def test_git_operations(copilot_run, tmp_path): - agent = CopilotAgent( - name="git-operator", - instructions="Use git commands as requested.", - working_directory=str(tmp_path), - ) - result = await copilot_run( - agent, - "Initialize a git repo, create a .gitignore for Python, and make an initial commit.", - ) - assert result.success - assert (tmp_path / ".git").is_dir() +uv add pytest-codingagents ``` -### Tool Control +Authenticate via `GITHUB_TOKEN` env var (CI) or `gh auth status` (local). -Restrict which tools the agent can use: +## What You Can Test -```python -# Only allow specific tools -agent = CopilotAgent( - instructions="Create files only.", - allowed_tools=["create_file", "read_file"], -) +| Capability | What it proves | Guide | +|---|---|---| +| **Instructions** | Your custom instructions actually produce the desired behavior — not just vibes | [Getting Started](https://sbroenne.github.io/pytest-codingagents/getting-started/) | +| **Skills** | That domain knowledge file is helping, not being ignored | [Skill Testing](https://sbroenne.github.io/pytest-codingagents/how-to/skills/) | +| **Models** | Which model works best for your use case and budget | [Model Comparison](https://sbroenne.github.io/pytest-codingagents/getting-started/model-comparison/) | +| **Custom Agents** | Your custom agent configurations actually work as intended | [Getting Started](https://sbroenne.github.io/pytest-codingagents/getting-started/) | +| **MCP Servers** | The agent discovers and uses your custom tools | [MCP Server Testing](https://sbroenne.github.io/pytest-codingagents/how-to/mcp-servers/) | +| **CLI Tools** | The agent operates command-line interfaces correctly | [CLI Tool Testing](https://sbroenne.github.io/pytest-codingagents/how-to/cli-tools/) | -# Block specific tools -agent = CopilotAgent( - instructions="Review code without modifying it.", - excluded_tools=["create_file", "replace_string_in_file"], -) -``` - -### pytest-aitest Integration +## AI Analysis > **See it in action:** [Basic Report](https://sbroenne.github.io/pytest-codingagents/demo/basic-report.html) · [Model Comparison](https://sbroenne.github.io/pytest-codingagents/demo/model-comparison-report.html) · [Instruction Testing](https://sbroenne.github.io/pytest-codingagents/demo/instruction-testing-report.html) -Install with the `aitest` extra to get HTML reports with AI analysis: - -```bash -uv add "pytest-codingagents[aitest]" -``` - -Results automatically integrate with pytest-aitest's reporting pipeline. 
The `copilot_run` fixture stashes results in a format compatible with pytest-aitest, giving you leaderboards, failure analysis, Mermaid diagrams, and AI insights — for free. - -To generate reports with AI-powered analysis, pass `--aitest-summary-model` and `--aitest-html`: - -```bash -# Run tests and generate an HTML report with AI insights -uv run pytest tests/ -m copilot \ - --aitest-html=report.html \ - --aitest-summary-model=azure/gpt-5.2-chat -``` - -Or configure in `pyproject.toml`: - -```toml -[tool.pytest.ini_options] -addopts = """ - --aitest-html=aitest-reports/report.html - --aitest-summary-model=azure/gpt-5.2-chat -""" -``` - -## Configuration - -### CopilotAgent Fields - -| Field | Type | Default | Description | -|-------|------|---------|-------------| -| `name` | `str` | `"copilot"` | Agent identifier for reports | -| `model` | `str \| None` | `None` | Model to use (e.g., `claude-sonnet-4`) | -| `reasoning_effort` | `Literal["low", "medium", "high", "xhigh"] \| None` | `None` | Reasoning effort level | -| `instructions` | `str \| None` | `None` | Instructions for the agent (maps to SDK `system_message.content`) | -| `system_message_mode` | `Literal["append", "replace"]` | `"append"` | `"append"` adds to Copilot's built-in system message; `"replace"` overrides it entirely (removes SDK guardrails) | -| `working_directory` | `str \| None` | `None` | Working directory for file operations | -| `allowed_tools` | `list[str] \| None` | `None` | Allowlist of tools (None = all) | -| `excluded_tools` | `list[str] \| None` | `None` | Blocklist of tools | -| `max_turns` | `int` | `25` | Maximum conversation turns (informational — enforced via `timeout_s`, not in SDK) | -| `timeout_s` | `float` | `300.0` | Timeout in seconds | -| `auto_confirm` | `bool` | `True` | Auto-approve tool permissions | -| `mcp_servers` | `dict[str, Any]` | `{}` | MCP server configurations | -| `custom_agents` | `list[dict[str, Any]]` | `[]` | Custom sub-agent configurations | -| `skill_directories` | `list[str]` | `[]` | Paths to skill directories | -| `disabled_skills` | `list[str]` | `[]` | Skills to disable | -| `extra_config` | `dict[str, Any]` | `{}` | SDK passthrough for unmapped fields | - -### CopilotResult - -#### Fields - -| Field | Type | Description | -|-------|------|-------------| -| `success` | `bool` | Whether execution completed without errors | -| `error` | `str \| None` | Error message if failed | -| `model_used` | `str \| None` | Model actually used by Copilot | -| `duration_ms` | `float` | Execution duration in milliseconds | -| `turns` | `list[Turn]` | All conversation turns | -| `usage` | `list[UsageInfo]` | Per-turn token usage and cost | -| `reasoning_traces` | `list[str]` | Captured reasoning traces | -| `subagent_invocations` | `list[SubagentInvocation]` | Subagent delegations | -| `permission_requested` | `bool` | Whether any permissions were requested | -| `permissions` | `list[dict]` | Permission requests made during execution | -| `raw_events` | `list[Any]` | All raw SDK events (for debugging) | - -#### Properties - -| Property | Type | Description | -|----------|------|-------------| -| `final_response` | `str \| None` | Last assistant message | -| `all_responses` | `list[str]` | All assistant messages | -| `all_tool_calls` | `list[ToolCall]` | All tool calls across all turns | -| `tool_names_called` | `set[str]` | Set of tool names used | -| `total_tokens` | `int` | Total tokens (input + output) | -| `total_input_tokens` | `int` | Total input tokens | -| `total_output_tokens` | 
`int` | Total output tokens | -| `total_cost_usd` | `float` | Total cost in USD | -| `cost_usd` | `float` | Alias for `total_cost_usd` (pytest-aitest compat) | -| `token_usage` | `dict[str, int]` | Token counts dict (pytest-aitest compat) | - -#### Methods - -| Method | Return Type | Description | -|--------|-------------|-------------| -| `tool_was_called(name)` | `bool` | Check if a specific tool was called | -| `tool_call_count(name)` | `int` | Count calls to a specific tool | -| `tool_calls_for(name)` | `list[ToolCall]` | All calls to a specific tool | - -### Turn - -Represents a single conversation turn (assistant response + tool calls). - -| Field | Type | Description | -|-------|------|-------------| -| `role` | `str` | Turn role (`"assistant"`, `"user"`, `"tool"`) | -| `content` | `str` | Text content | -| `tool_calls` | `list[ToolCall]` | Tool calls made in this turn | - -### ToolCall +Every test run produces an HTML report with AI-powered insights: -Represents a single tool invocation. - -| Field | Type | Description | -|-------|------|-------------| -| `name` | `str` | Tool name (e.g., `"create_file"`) | -| `arguments` | `dict[str, Any] \| str` | Arguments passed to the tool | -| `result` | `str \| None` | Tool result (if captured) | -| `error` | `str \| None` | Error message (if tool failed) | -| `duration_ms` | `float \| None` | Tool execution duration | -| `tool_call_id` | `str \| None` | SDK tool call identifier | - -### UsageInfo - -Token usage and cost for a single LLM call. - -| Field | Type | Description | -|-------|------|-------------| -| `model` | `str` | Model used for this call | -| `input_tokens` | `int` | Input tokens | -| `output_tokens` | `int` | Output tokens | -| `cache_read_tokens` | `int` | Cached input tokens | -| `cost_usd` | `float` | Cost in USD (computed from litellm pricing) | -| `duration_ms` | `float` | Duration of the LLM call | - -### SubagentInvocation - -Represents an observed sub-agent delegation. - -| Field | Type | Description | -|-------|------|-------------| -| `name` | `str` | Name of the sub-agent | -| `status` | `str` | Status (`"selected"`, `"started"`, `"completed"`, `"failed"`) | -| `duration_ms` | `float \| None` | Duration of the sub-agent call | - -## Hooks - -### `pytest_aitest_analysis_prompt` - -When used with pytest-aitest, this plugin implements the `pytest_aitest_analysis_prompt` hook to inject Copilot-specific context into AI analysis. The hook provides: - -- **Coding-agent framing** — the AI analyzer understands it's evaluating models, instructions, and tools (not MCP servers) -- **Dynamic pricing table** — model pricing data is pulled live from litellm's `model_cost` database, so cost analysis stays current without manual updates - -This happens automatically — no configuration needed. 
Just install both plugins: +- **Diagnoses failures** — root cause analysis with suggested fixes +- **Compares models** — leaderboards ranked by pass rate and cost +- **Evaluates instructions** — which instructions produce better results +- **Recommends improvements** — actionable changes to tools, prompts, and skills ```bash -uv add "pytest-codingagents[aitest]" +uv run pytest tests/ --aitest-html=report.html --aitest-summary-model=azure/gpt-5.2-chat ``` -## Development - -```bash -# Install in development mode -uv sync --all-extras - -# Run unit tests (fast, no Copilot needed) -uv run pytest tests/unit/ -v +## Documentation -# Run integration tests (requires Copilot credentials) -uv run pytest tests/ -v -m copilot - -# Lint & type check -uv run ruff check src tests -uv run pyright src -``` +Full docs at **[sbroenne.github.io/pytest-codingagents](https://sbroenne.github.io/pytest-codingagents/)** — API reference, how-to guides, and demo reports. ## License diff --git a/docs/demo/index.md b/docs/demo/index.md index f8b2847..b0ab2fc 100644 --- a/docs/demo/index.md +++ b/docs/demo/index.md @@ -8,7 +8,7 @@ Live HTML reports generated by pytest-codingagents with [pytest-aitest](https:// |--------|-------------| | [Basic Report](basic-report.html) | Core file operations — create modules, refactor code | | [Model Comparison](model-comparison-report.html) | Same tasks across different models (GPT-5.2 vs Claude Opus 4.5) | -| [Instruction Testing](instruction-testing-report.html) | How different system prompts affect agent behavior | +| [Instruction Testing](instruction-testing-report.html) | How different instructions affect agent behavior | ## How These Are Generated diff --git a/docs/getting-started/index.md b/docs/getting-started/index.md index 4785d6e..d1a2585 100644 --- a/docs/getting-started/index.md +++ b/docs/getting-started/index.md @@ -40,4 +40,4 @@ async def test_hello_world(copilot_run, tmp_path): ## What's Next - [Model Comparison](model-comparison.md) — Compare different models -- [Instruction Testing](instruction-testing.md) — Test different system prompts +- [Instruction Testing](instruction-testing.md) — Test different instructions diff --git a/docs/getting-started/instruction-testing.md b/docs/getting-started/instruction-testing.md index 3647a1e..3515d9a 100644 --- a/docs/getting-started/instruction-testing.md +++ b/docs/getting-started/instruction-testing.md @@ -1,6 +1,6 @@ # Instruction Testing -Test how different system prompts affect agent behavior. +Test how different instructions affect agent behavior. ## Example @@ -31,5 +31,5 @@ async def test_coding_style(copilot_run, tmp_path, style, instructions): ## What To Look For - **Do instructions change behavior?** Compare output files across styles. -- **Token efficiency** — Verbose prompts may cost more but produce better results. +- **Token efficiency** — Verbose instructions may cost more but produce better results. - **Tool patterns** — Does TDD-style actually write tests first? diff --git a/docs/how-to/aitest-integration.md b/docs/how-to/aitest-integration.md index 19efcca..80d534d 100644 --- a/docs/how-to/aitest-integration.md +++ b/docs/how-to/aitest-integration.md @@ -1,18 +1,12 @@ # pytest-aitest Integration -Get HTML reports with AI-powered analysis by integrating with [pytest-aitest](https://github.com/sbroenne/pytest-aitest). +HTML reports with AI-powered analysis are included automatically — [pytest-aitest](https://github.com/sbroenne/pytest-aitest) is a core dependency. 
> **See example reports:** [Basic Report](../demo/basic-report.html) · [Model Comparison](../demo/model-comparison-report.html) · [Instruction Testing](../demo/instruction-testing-report.html) -## Installation - -```bash -uv add "pytest-codingagents[aitest]" -``` - ## How It Works -When `pytest-aitest` is installed, `CopilotResult` automatically bridges to `AgentResult`, enabling: +When tests run, `CopilotResult` automatically bridges to `AgentResult`, enabling: - **HTML reports** with test results, tool call details, and Mermaid sequence diagrams - **AI analysis** with failure root causes and improvement suggestions tailored for coding agents @@ -27,4 +21,23 @@ Use pytest-aitest's standard CLI options: uv run pytest tests/ --aitest-html=report.html --aitest-summary-model=azure/gpt-5-mini ``` +Or configure in `pyproject.toml`: + +```toml +[tool.pytest.ini_options] +addopts = """ + --aitest-html=aitest-reports/report.html + --aitest-summary-model=azure/gpt-5.2-chat +""" +``` + No code changes needed — the integration is automatic via the plugin system. + +## Analysis Prompt Hook + +The plugin implements the `pytest_aitest_analysis_prompt` hook to inject Copilot-specific context into AI analysis: + +- **Coding-agent framing** — the AI analyzer understands it's evaluating models, instructions, and tools (not MCP servers) +- **Dynamic pricing table** — model pricing data is pulled live from litellm's `model_cost` database, so cost analysis stays current without manual updates + +This happens automatically — no configuration needed. diff --git a/docs/how-to/cli-tools.md b/docs/how-to/cli-tools.md new file mode 100644 index 0000000..5d9970b --- /dev/null +++ b/docs/how-to/cli-tools.md @@ -0,0 +1,96 @@ +# CLI Tool Testing + +Test that GitHub Copilot can operate command-line tools correctly. + +## Basic Usage + +Give the agent a task that requires CLI tools and verify the outcome: + +```python +from pytest_codingagents import CopilotAgent + + +async def test_git_init(copilot_run, tmp_path): + agent = CopilotAgent( + instructions="Use git commands as requested.", + working_directory=str(tmp_path), + ) + result = await copilot_run( + agent, + "Initialize a git repo, create a .gitignore for Python, and make an initial commit.", + ) + assert result.success + assert (tmp_path / ".git").is_dir() + assert (tmp_path / ".gitignore").exists() +``` + +## Verifying File Output + +Check that CLI operations produce the expected files: + +```python +async def test_project_scaffold(copilot_run, tmp_path): + agent = CopilotAgent( + instructions="Create project structures as requested.", + working_directory=str(tmp_path), + ) + result = await copilot_run( + agent, + "Create a Python package called 'mylib' with __init__.py, " + "a pyproject.toml using hatchling, and a tests/ directory.", + ) + assert result.success + assert (tmp_path / "src" / "mylib" / "__init__.py").exists() or ( + tmp_path / "mylib" / "__init__.py" + ).exists() + assert (tmp_path / "pyproject.toml").exists() +``` + +## Testing Complex Workflows + +Chain multiple CLI operations into a single task: + +```python +async def test_git_workflow(copilot_run, tmp_path): + agent = CopilotAgent( + instructions="Perform git operations as requested. 
Use git commands directly.", + working_directory=str(tmp_path), + ) + result = await copilot_run( + agent, + "Initialize a git repo, create hello.py with print('hello'), " + "add it, commit with message 'initial', then create a 'feature' branch.", + ) + assert result.success + assert (tmp_path / "hello.py").exists() +``` + +## Comparing Instructions for CLI Tasks + +Test which instructions produce better CLI usage: + +```python +import pytest +from pytest_codingagents import CopilotAgent + + +@pytest.mark.parametrize( + "style,instructions", + [ + ("minimal", "Execute commands as requested."), + ("guided", "You are a DevOps assistant. Use standard CLI tools. " + "Always verify operations succeed before proceeding."), + ], +) +async def test_cli_instructions(copilot_run, tmp_path, style, instructions): + agent = CopilotAgent( + name=f"cli-{style}", + instructions=instructions, + working_directory=str(tmp_path), + ) + result = await copilot_run( + agent, + "Create a Python virtual environment and install requests", + ) + assert result.success +``` diff --git a/docs/how-to/index.md b/docs/how-to/index.md index 2bcf470..e17e7c4 100644 --- a/docs/how-to/index.md +++ b/docs/how-to/index.md @@ -2,5 +2,9 @@ Practical guides for common tasks. -- [pytest-aitest Integration](aitest-integration.md) — Get HTML reports with AI analysis +- [MCP Server Testing](mcp-servers.md) — Test that the agent uses your custom tools +- [Skill Testing](skills.md) — Measure the impact of domain knowledge +- [CLI Tool Testing](cli-tools.md) — Verify the agent operates CLI tools correctly +- [Tool Control](tool-control.md) — Restrict tools with allowlists and blocklists +- [pytest-aitest Integration](aitest-integration.md) — HTML reports with AI analysis - [CI/CD Integration](ci-cd.md) — Run tests in GitHub Actions diff --git a/docs/how-to/mcp-servers.md b/docs/how-to/mcp-servers.md new file mode 100644 index 0000000..ae650be --- /dev/null +++ b/docs/how-to/mcp-servers.md @@ -0,0 +1,128 @@ +# MCP Server Testing + +Test that GitHub Copilot can discover and use your MCP server tools correctly. 
+ +## Basic Usage + +Attach an MCP server to a `CopilotAgent` and verify the agent calls the right tools: + +```python +from pytest_codingagents import CopilotAgent + + +async def test_database_query(copilot_run, tmp_path): + agent = CopilotAgent( + instructions="Use the database tools to answer questions.", + working_directory=str(tmp_path), + mcp_servers={ + "my-db-server": { + "command": "python", + "args": ["-m", "my_db_mcp_server"], + } + }, + ) + result = await copilot_run(agent, "List all users in the database") + assert result.success + assert result.tool_was_called("list_users") +``` + +## Multiple Servers + +Attach multiple MCP servers to test interactions between tools: + +```python +async def test_multi_server(copilot_run, tmp_path): + agent = CopilotAgent( + instructions="Use the available tools to complete tasks.", + working_directory=str(tmp_path), + mcp_servers={ + "database": { + "command": "python", + "args": ["-m", "db_server"], + }, + "notifications": { + "command": "node", + "args": ["notification_server.js"], + }, + }, + ) + result = await copilot_run( + agent, + "Find users who signed up today and send them a welcome notification", + ) + assert result.success + assert result.tool_was_called("query_users") + assert result.tool_was_called("send_notification") +``` + +## A/B Server Comparison + +Compare two versions of the same MCP server to validate improvements: + +```python +import pytest +from pytest_codingagents import CopilotAgent + +SERVER_VERSIONS = { + "v1": {"command": "python", "args": ["-m", "my_server_v1"]}, + "v2": {"command": "python", "args": ["-m", "my_server_v2"]}, +} + + +@pytest.mark.parametrize("version", SERVER_VERSIONS.keys()) +async def test_server_version(copilot_run, tmp_path, version): + agent = CopilotAgent( + name=f"server-{version}", + instructions="Use the available tools to answer questions.", + working_directory=str(tmp_path), + mcp_servers={"my-server": SERVER_VERSIONS[version]}, + ) + result = await copilot_run(agent, "What's the current inventory count?") + assert result.success + assert result.tool_was_called("get_inventory") +``` + +The AI analysis report will compare pass rates and tool usage across server versions, highlighting which performs better. + +## Verifying Tool Arguments + +Check not just that a tool was called, but how it was called: + +```python +async def test_correct_arguments(copilot_run, tmp_path): + agent = CopilotAgent( + instructions="Use database tools to query data.", + working_directory=str(tmp_path), + mcp_servers={ + "db": {"command": "python", "args": ["-m", "db_server"]}, + }, + ) + result = await copilot_run(agent, "Find users named Alice") + assert result.success + + # Check specific tool calls + calls = result.tool_calls_for("query_users") + assert len(calls) >= 1 + assert "Alice" in str(calls[0].arguments) +``` + +## Environment Variables + +Pass environment variables to your MCP server process: + +```python +agent = CopilotAgent( + instructions="Use the API tools.", + working_directory=str(tmp_path), + mcp_servers={ + "api-server": { + "command": "python", + "args": ["-m", "api_server"], + "env": { + "API_KEY": "test-key", + "DATABASE_URL": "sqlite:///test.db", + }, + } + }, +) +``` diff --git a/docs/how-to/skills.md b/docs/how-to/skills.md new file mode 100644 index 0000000..fd0d5e2 --- /dev/null +++ b/docs/how-to/skills.md @@ -0,0 +1,133 @@ +# Skill Testing + +Test whether domain knowledge injected via skill directories improves agent behavior. + +## What Are Skills? 
+ +Skills are directories containing markdown files with domain-specific knowledge — coding standards, API conventions, architectural guidelines, etc. When attached to a `CopilotAgent`, these files are loaded into the agent's context. + +## Basic Usage + +Create a skill directory and test that the agent follows its guidance: + +```python +from pytest_codingagents import CopilotAgent + + +async def test_coding_standards(copilot_run, tmp_path): + # Create a skill with coding standards + skill_dir = tmp_path / "skills" + skill_dir.mkdir() + (skill_dir / "coding-standards.md").write_text( + "# Coding Standards\n\n" + "All Python functions MUST have type hints and docstrings.\n" + "Use snake_case for all function names.\n" + ) + + agent = CopilotAgent( + name="with-standards", + instructions="Follow all coding standards from your skills.", + working_directory=str(tmp_path), + skill_directories=[str(skill_dir)], + ) + result = await copilot_run(agent, "Create math_utils.py with add and multiply") + assert result.success + content = (tmp_path / "math_utils.py").read_text() + assert "def add" in content +``` + +## Measuring Skill Impact + +Parametrize tests with and without skills to measure whether they improve results: + +```python +import pytest +from pytest_codingagents import CopilotAgent + + +@pytest.mark.parametrize("use_skill", [True, False], ids=["with-skill", "without-skill"]) +async def test_skill_impact(copilot_run, tmp_path, use_skill): + # Set up skill directory + skill_dir = tmp_path / "skills" + skill_dir.mkdir() + (skill_dir / "standards.md").write_text( + "All functions MUST have type hints and docstrings." + ) + + skill_dirs = [str(skill_dir)] if use_skill else [] + agent = CopilotAgent( + name=f"skill-{'on' if use_skill else 'off'}", + instructions="Create clean Python code.", + working_directory=str(tmp_path), + skill_directories=skill_dirs, + ) + result = await copilot_run(agent, "Create a utility module with 3 helper functions") + assert result.success +``` + +The AI analysis report highlights differences in behavior and quality between skill-on and skill-off runs. + +## Multiple Skill Directories + +Load skills from several directories: + +```python +agent = CopilotAgent( + instructions="Follow all project guidelines.", + working_directory=str(tmp_path), + skill_directories=[ + "skills/coding-standards", + "skills/api-conventions", + "skills/testing-patterns", + ], +) +``` + +## Disabling Specific Skills + +Selectively disable skills to isolate their effects: + +```python +agent = CopilotAgent( + instructions="Follow project guidelines.", + working_directory=str(tmp_path), + skill_directories=["skills/"], + disabled_skills=["deprecated-patterns"], +) +``` + +## Comparing Skill Versions + +Test different versions of the same skill to find the most effective phrasing: + +```python +import pytest +from pytest_codingagents import CopilotAgent + +SKILL_VERSIONS = { + "concise": "Use type hints. Use docstrings. 
Use snake_case.", + "detailed": ( + "# Coding Standards\n\n" + "## Type Hints\n" + "Every function parameter and return value MUST have a type hint.\n\n" + "## Docstrings\n" + "Every public function MUST have a Google-style docstring.\n" + ), +} + + +@pytest.mark.parametrize("style", SKILL_VERSIONS.keys()) +async def test_skill_phrasing(copilot_run, tmp_path, style): + skill_dir = tmp_path / "skills" + skill_dir.mkdir() + (skill_dir / "standards.md").write_text(SKILL_VERSIONS[style]) + + agent = CopilotAgent( + name=f"skill-{style}", + instructions="Follow the coding standards.", + working_directory=str(tmp_path), + skill_directories=[str(skill_dir)], + ) + result = await copilot_run(agent, "Create a data processing module") + assert result.success +``` diff --git a/docs/how-to/tool-control.md b/docs/how-to/tool-control.md new file mode 100644 index 0000000..d68e4a6 --- /dev/null +++ b/docs/how-to/tool-control.md @@ -0,0 +1,85 @@ +# Tool Control + +Restrict which tools the agent can use with allowlists and blocklists. + +## Allowlist (Whitelist) + +Only permit specific tools — the agent cannot use anything else: + +```python +from pytest_codingagents import CopilotAgent + + +async def test_read_only(copilot_run, tmp_path): + agent = CopilotAgent( + instructions="Answer questions about the codebase.", + working_directory=str(tmp_path), + allowed_tools=["read_file", "grep_search", "list_dir"], + ) + result = await copilot_run(agent, "What files are in this directory?") + assert result.success + # Agent can only read — no file creation or modification +``` + +## Blocklist (Blacklist) + +Block specific tools while allowing everything else: + +```python +async def test_no_write(copilot_run, tmp_path): + agent = CopilotAgent( + instructions="Review code without modifying it.", + working_directory=str(tmp_path), + excluded_tools=["create_file", "replace_string_in_file", "run_in_terminal"], + ) + result = await copilot_run(agent, "Review this project for potential bugs") + assert result.success +``` + +## Comparing Tool Restrictions + +Test whether restricting tools changes agent behavior: + +```python +import pytest +from pytest_codingagents import CopilotAgent + + +@pytest.mark.parametrize( + "mode,allowed,excluded", + [ + ("unrestricted", None, None), + ("read-only", ["read_file", "grep_search", "list_dir"], None), + ("no-terminal", None, ["run_in_terminal"]), + ], +) +async def test_tool_restrictions(copilot_run, tmp_path, mode, allowed, excluded): + agent = CopilotAgent( + name=f"tools-{mode}", + instructions="Complete the task using available tools.", + working_directory=str(tmp_path), + allowed_tools=allowed, + excluded_tools=excluded, + ) + result = await copilot_run(agent, "Create a hello.py file") + # In read-only mode, this should fail (no create_file tool) + if mode == "read-only": + assert not result.tool_was_called("create_file") + else: + assert result.success +``` + +## Use Cases + +| Scenario | Configuration | +|---|---| +| Code review agent | `allowed_tools=["read_file", "grep_search", "list_dir"]` | +| File creation only | `allowed_tools=["create_file", "read_file"]` | +| No terminal access | `excluded_tools=["run_in_terminal"]` | +| No file modification | `excluded_tools=["create_file", "replace_string_in_file"]` | + +## Notes + +- `allowed_tools` and `excluded_tools` are mutually exclusive — use one or the other +- When `allowed_tools` is `None` (default), all tools are available +- Tool names correspond to the Copilot SDK tool names (e.g., `create_file`, `read_file`, 
`run_in_terminal`) diff --git a/docs/index.md b/docs/index.md index ce4c795..c8c5ed7 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,32 +1,10 @@ # pytest-codingagents -A pytest plugin for testing coding agents via their native SDKs. - -You give a coding agent a task. Did it pick the right tools? Did it produce working code? Did it follow your instructions? **pytest-codingagents lets you answer these questions with automated tests.** - -## Why? - -You're rolling out GitHub Copilot to your team. But which model works best for your codebase? Do your custom instructions improve quality? Does the agent use your MCP servers correctly? Can it operate your CLI tools? Do your custom agents and skills actually help? - -You can't answer these questions by trying things manually. You need **repeatable, automated tests** that evaluate: - -- **Instructions** — Do your system prompts produce the desired behavior? -- **MCP Servers** — Can the agent discover and use your custom tools? -- **CLI Tools** — Can the agent operate command-line interfaces correctly? -- **Custom Agents** — Do your sub-agents handle delegated tasks? -- **Skills** — Does domain knowledge improve agent performance? -- **Models** — Which model works best for your use case and budget? - -## Quick Start - -```bash -uv add pytest-codingagents -``` +Automated testing for GitHub Copilot configurations. Test your instructions, MCP servers, skills, and models — then get AI analysis that tells you **why** things failed and **what to fix**. ```python from pytest_codingagents import CopilotAgent - async def test_create_file(copilot_run, tmp_path): agent = CopilotAgent( instructions="Create files as requested.", @@ -35,18 +13,45 @@ async def test_create_file(copilot_run, tmp_path): result = await copilot_run(agent, "Create hello.py with print('hello')") assert result.success assert result.tool_was_called("create_file") - assert (tmp_path / "hello.py").exists() ``` -## Supported Agents +## Install + +```bash +uv add pytest-codingagents +``` + +Authenticate via `GITHUB_TOKEN` env var (CI) or `gh auth status` (local). 
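+
+For example, a minimal setup for each option (a sketch assuming the `gh` CLI is installed and you are signed in to GitHub):
+
+```bash
+# CI: expose a token to the test run
+export GITHUB_TOKEN=$(gh auth token)
+
+# Local development: confirm you are logged in
+gh auth status
+```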
+ +## What You Can Test + +| Capability | What it proves | Guide | +|---|---|---| +| **Instructions** | Custom instructions produce the desired behavior | [Getting Started](getting-started/index.md) | +| **Models** | Which model works best for your use case and budget | [Model Comparison](getting-started/model-comparison.md) | +| **MCP Servers** | The agent discovers and uses your custom tools | [MCP Server Testing](how-to/mcp-servers.md) | +| **Skills** | Domain knowledge improves agent performance | [Skill Testing](how-to/skills.md) | +| **CLI Tools** | The agent operates command-line interfaces correctly | [CLI Tool Testing](how-to/cli-tools.md) | +| **Tool Control** | Allowlists and blocklists restrict tool usage | [Tool Control](how-to/tool-control.md) | -| Agent | SDK | Status | -|-------|-----|--------| -| GitHub Copilot | `github-copilot-sdk` | :white_check_mark: Implemented | +## AI Analysis + +> **See it in action:** [Basic Report](demo/basic-report.html) · [Model Comparison](demo/model-comparison-report.html) · [Instruction Testing](demo/instruction-testing-report.html) + +Every test run produces an HTML report with AI-powered insights: + +- **Diagnoses failures** — root cause analysis with suggested fixes +- **Compares models** — leaderboards ranked by pass rate and cost +- **Evaluates instructions** — which instructions produce better results +- **Recommends improvements** — actionable changes to tools, prompts, and skills + +```bash +uv run pytest tests/ --aitest-html=report.html --aitest-summary-model=azure/gpt-5.2-chat +``` ## Next Steps - [Getting Started](getting-started/index.md) — Install and write your first test +- [How-To Guides](how-to/index.md) — MCP servers, skills, CLI tools, and more - [Demo Reports](demo/index.md) — See real HTML reports with AI analysis - [API Reference](reference/api.md) — Full API documentation -- [Contributing](contributing/index.md) — How to contribute diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md index 454a82b..d103819 100644 --- a/docs/reference/configuration.md +++ b/docs/reference/configuration.md @@ -6,7 +6,7 @@ |-------|------|---------|-------------| | `name` | `str` | `"copilot"` | Agent identifier for reports | | `model` | `str \| None` | `None` | Model to use (e.g., `claude-sonnet-4`) | -| `instructions` | `str \| None` | `None` | System prompt / instructions | +| `instructions` | `str \| None` | `None` | Instructions for the agent | | `system_message_mode` | `Literal["append", "replace"]` | `"append"` | `"append"` adds to Copilot's built-in system message; `"replace"` overrides it | | `working_directory` | `str \| None` | `None` | Working directory for file operations | | `max_turns` | `int` | `25` | Maximum conversation turns (informational — enforced via `timeout_s`, not in SDK) | diff --git a/mkdocs.yml b/mkdocs.yml index 364a59d..8f014e1 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -76,6 +76,10 @@ nav: - Instruction Testing: getting-started/instruction-testing.md - How-To Guides: - Overview: how-to/index.md + - MCP Server Testing: how-to/mcp-servers.md + - Skill Testing: how-to/skills.md + - CLI Tool Testing: how-to/cli-tools.md + - Tool Control: how-to/tool-control.md - pytest-aitest Integration: how-to/aitest-integration.md - CI/CD Integration: how-to/ci-cd.md - Reference: diff --git a/pyproject.toml b/pyproject.toml index b6aec85..82c0151 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -28,6 +28,7 @@ classifiers = [ dependencies = [ "pytest>=9.0", "github-copilot-sdk>=0.1.23", + 
"pytest-aitest>=0.2.9", ] [project.optional-dependencies] @@ -37,9 +38,6 @@ dev = [ "pyright>=1.1.408", "pre-commit>=4.5", ] -aitest = [ - "pytest-aitest>=0.2.9", -] docs = [ "mkdocs>=1.6", "mkdocs-material>=9.7", @@ -92,5 +90,5 @@ markers = [ [dependency-groups] dev = [ - "pytest-codingagents[dev,aitest,docs]", + "pytest-codingagents[dev,docs]", ] diff --git a/src/pytest_codingagents/copilot/agent.py b/src/pytest_codingagents/copilot/agent.py index 63a9b06..df7b09d 100644 --- a/src/pytest_codingagents/copilot/agent.py +++ b/src/pytest_codingagents/copilot/agent.py @@ -68,7 +68,7 @@ class CopilotAgent: # MCP servers to attach to the session mcp_servers: dict[str, Any] = field(default_factory=dict) - # Custom sub-agents (SDK CustomAgentConfig: name, prompt, description, + # Custom agents (SDK CustomAgentConfig: name, prompt, description, # display_name, tools, mcp_servers, infer) custom_agents: list[dict[str, Any]] = field(default_factory=list) diff --git a/src/pytest_codingagents/copilot/result.py b/src/pytest_codingagents/copilot/result.py index 7900386..b8d0f5b 100644 --- a/src/pytest_codingagents/copilot/result.py +++ b/src/pytest_codingagents/copilot/result.py @@ -25,7 +25,7 @@ def __repr__(self) -> str: @dataclass(slots=True) class SubagentInvocation: - """A subagent delegation observed during execution.""" + """A subagent invocation observed during execution.""" name: str status: str # "selected", "started", "completed", "failed" @@ -90,7 +90,7 @@ class CopilotResult: # Reasoning traces (from assistant.reasoning events) reasoning_traces: list[str] = field(default_factory=list) - # Subagent delegations + # Subagent invocations subagent_invocations: list[SubagentInvocation] = field(default_factory=list) # Permission requests (True if any were requested) diff --git a/src/pytest_codingagents/prompts/coding_agent_analysis.md b/src/pytest_codingagents/prompts/coding_agent_analysis.md index 7f22e34..9b959de 100644 --- a/src/pytest_codingagents/prompts/coding_agent_analysis.md +++ b/src/pytest_codingagents/prompts/coding_agent_analysis.md @@ -8,17 +8,17 @@ You are analyzing test results for **pytest-codingagents**, a framework that tes A **CopilotAgent** is a test configuration consisting of: - **Model**: The LLM backing the agent (e.g., `claude-sonnet-4`, `gpt-4.1`) -- **Instructions**: System prompt that configures agent behavior +- **Instructions**: Instructions that configure agent behavior - **Skills**: Optional domain knowledge directories - **MCP Servers**: Custom tool servers the agent can use -- **Custom Agents**: Sub-agents that handle delegated tasks +- **Custom Agents**: Specialized agent configurations that perform specific tasks - **Tool Control**: Allowed/excluded tools to constrain behavior **What we test** (testing dimensions): -- **Instructions** — Do system prompts produce the desired behavior? +- **Instructions** — Do instructions produce the desired behavior? - **MCP Servers** — Can the agent discover and use custom tools? - **CLI Tools** — Can the agent operate command-line interfaces? -- **Custom Agents** — Do sub-agents handle delegated tasks? +- **Custom Agents** — Do custom agents perform their intended tasks correctly? - **Skills** — Does domain knowledge improve performance? - **Models** — Which model works best for the use case and budget? diff --git a/tests/README.md b/tests/README.md index 8368811..1757bb1 100644 --- a/tests/README.md +++ b/tests/README.md @@ -10,9 +10,9 @@ against real coding tasks. Every test uses the actual Copilot API — no mocks. 
| `test_basic.py` | File creation, code quality, refactoring | `gpt-5.2`, `claude-opus-4.5` | 4 | | `test_models.py` | Model comparison on identical tasks | `gpt-5.2`, `claude-opus-4.5` | 4 | | `test_matrix.py` | Model × Instructions grid | `gpt-5.2`, `claude-opus-4.5` | 4 | -| `test_instructions.py` | System prompt style comparison, constraints | default | 4 | +| `test_instructions.py` | Instruction style comparison, constraints | default | 4 | | `test_cli_tools.py` | Terminal commands, git, tool exclusion | default | 4 | -| `test_custom_agents.py` | Custom agent delegation, tool restrictions, subagent tracking | default | 3 | +| `test_custom_agents.py` | Custom agent configs, tool restrictions, subagent tracking | default | 3 | | `test_events.py` | Reasoning traces, permissions, usage, events | default | 6 | | `test_skills.py` | Skill directories, disabled skills | default | 3 | @@ -62,7 +62,7 @@ Cross-product of 2 models × 2 instruction styles (minimal, detailed). - **`test_create_utility`** — Create `string_utils.py` with `reverse`, `capitalize_words`, `count_vowels`. Every combination must produce working code. -### test_instructions.py — System Prompt Testing +### test_instructions.py — Instruction Testing Compares instruction variants and constraints. @@ -80,10 +80,10 @@ Verifies agents can operate terminal commands. - **`test_cli_tool_output_captured`** — Run command, process output, report findings. - **`test_no_terminal_when_excluded`** — `excluded_tools=["run_in_terminal"]` prevents usage. -### test_custom_agents.py — Custom Agent Delegation +### test_custom_agents.py — Custom Agent Testing Uses the SDK's `CustomAgentConfig` (fields: `name`, `prompt`, `description`, -`display_name`, `tools`, `mcp_servers`, `infer`). Delegation is non-deterministic, +`display_name`, `tools`, `mcp_servers`, `infer`). Invocation is non-deterministic, so tests focus on verifiable outcomes rather than asserting invocation counts. - **`test_custom_agent_code_and_tests`** — Defines a "test-writer" custom agent diff --git a/tests/test_custom_agents.py b/tests/test_custom_agents.py index bc9a249..cf18f2f 100644 --- a/tests/test_custom_agents.py +++ b/tests/test_custom_agents.py @@ -1,18 +1,18 @@ -"""Custom agent / subagent tests. +"""Custom agent tests. -Tests Copilot's custom agent routing and subagent invocation. +Tests Copilot's custom agent routing and invocation. -Custom agents (SDK ``CustomAgentConfig``) are specialized sub-agents defined -in the session config. The main agent can delegate tasks to them. +Custom agents (SDK ``CustomAgentConfig``) are specialized agents defined +in the session config. The main agent can invoke them for specific tasks. Key fields: name (str): Unique agent name (required) - prompt (str): The agent's system prompt (required) - description (str): What the agent does — helps model decide when to delegate + prompt (str): The agent's instructions (required) + description (str): What the agent does — helps model decide when to invoke it tools (list[str]): Tools the agent can use (optional) mcp_servers (dict): MCP servers specific to this agent (optional) -Note: Delegation is non-deterministic — the LLM decides whether to invoke the +Note: Invocation is non-deterministic — the LLM decides whether to invoke the custom agent. Tests focus on verifiable outcomes rather than asserting subagent invocation counts. 
""" @@ -26,7 +26,7 @@ @pytest.mark.copilot class TestCustomAgents: - """Test custom agent configurations and delegation.""" + """Test custom agent configurations.""" async def test_custom_agent_code_and_tests(self, copilot_run, tmp_path): """Custom test-writer agent produces tests alongside code. @@ -39,7 +39,7 @@ async def test_custom_agent_code_and_tests(self, copilot_run, tmp_path): name="with-test-writer", instructions=( "You are a senior developer. When you create code, " - "always have tests written for it. Delegate test writing " + "always have tests written for it. Use the test-writer " "to the test-writer agent when available." ), working_directory=str(tmp_path), @@ -78,7 +78,7 @@ async def test_custom_agent_with_restricted_tools(self, copilot_run, tmp_path): """ agent = CopilotAgent( name="with-docs-writer", - instructions="You are a project lead. Create code and delegate documentation tasks.", + instructions="You are a project lead. Create code and have documentation written.", working_directory=str(tmp_path), custom_agents=[ { @@ -105,13 +105,13 @@ async def test_custom_agent_with_restricted_tools(self, copilot_run, tmp_path): async def test_subagent_invocations_captured(self, copilot_run, tmp_path): """Subagent events are captured in result.subagent_invocations. - Even if the model doesn't delegate, the field should be present - and correctly typed. If it does delegate, we should see entries. + Even if the model doesn't use the custom agent, the field should be present + and correctly typed. If it does, we should see entries. """ agent = CopilotAgent( - name="delegation-test", + name="custom-agent-test", instructions=( - "You manage a team. Always delegate code review to the " + "You manage a team. Always have the reviewer agent check " "reviewer agent before finalizing." ), working_directory=str(tmp_path), @@ -134,7 +134,7 @@ async def test_subagent_invocations_captured(self, copilot_run, tmp_path): sort_files = list(tmp_path.rglob("sort.py")) assert len(sort_files) > 0, "sort.py was not created" - # subagent_invocations is always a list (may be empty if model didn't delegate) + # subagent_invocations is always a list (may be empty if model didn't use custom agent) assert isinstance(result.subagent_invocations, list) # If any subagent was invoked, verify the structure diff --git a/tests/test_instructions.py b/tests/test_instructions.py index 247d6df..a37ea84 100644 --- a/tests/test_instructions.py +++ b/tests/test_instructions.py @@ -1,4 +1,4 @@ -"""System prompt / instructions testing. +"""Instruction testing. Tests that different instructions produce different agent behaviors. Use pytest.mark.parametrize to compare instruction variants.