Merged
414 changes: 26 additions & 388 deletions README.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/demo/index.md
@@ -8,7 +8,7 @@ Live HTML reports generated by pytest-codingagents with [pytest-aitest](https://
|--------|-------------|
| [Basic Report](basic-report.html) | Core file operations — create modules, refactor code |
| [Model Comparison](model-comparison-report.html) | Same tasks across different models (GPT-5.2 vs Claude Opus 4.5) |
- | [Instruction Testing](instruction-testing-report.html) | How different system prompts affect agent behavior |
+ | [Instruction Testing](instruction-testing-report.html) | How different instructions affect agent behavior |

## How These Are Generated

2 changes: 1 addition & 1 deletion docs/getting-started/index.md
@@ -40,4 +40,4 @@ async def test_hello_world(copilot_run, tmp_path):
## What's Next

- [Model Comparison](model-comparison.md) — Compare different models
- - [Instruction Testing](instruction-testing.md) — Test different system prompts
+ - [Instruction Testing](instruction-testing.md) — Test different instructions
4 changes: 2 additions & 2 deletions docs/getting-started/instruction-testing.md
@@ -1,6 +1,6 @@
# Instruction Testing

- Test how different system prompts affect agent behavior.
+ Test how different instructions affect agent behavior.

## Example

@@ -31,5 +31,5 @@ async def test_coding_style(copilot_run, tmp_path, style, instructions):
## What To Look For

- **Do instructions change behavior?** Compare output files across styles.
- - **Token efficiency** — Verbose prompts may cost more but produce better results.
+ - **Token efficiency** — Verbose instructions may cost more but produce better results.
- **Tool patterns** — Does TDD-style actually write tests first?
29 changes: 21 additions & 8 deletions docs/how-to/aitest-integration.md
@@ -1,18 +1,12 @@
# pytest-aitest Integration

- Get HTML reports with AI-powered analysis by integrating with [pytest-aitest](https://github.com/sbroenne/pytest-aitest).
+ HTML reports with AI-powered analysis are included automatically — [pytest-aitest](https://github.com/sbroenne/pytest-aitest) is a core dependency.

> **See example reports:** [Basic Report](../demo/basic-report.html) · [Model Comparison](../demo/model-comparison-report.html) · [Instruction Testing](../demo/instruction-testing-report.html)

- ## Installation
-
- ```bash
- uv add "pytest-codingagents[aitest]"
- ```

## How It Works

- When `pytest-aitest` is installed, `CopilotResult` automatically bridges to `AgentResult`, enabling:
+ When tests run, `CopilotResult` automatically bridges to `AgentResult`, enabling:

- **HTML reports** with test results, tool call details, and Mermaid sequence diagrams
- **AI analysis** with failure root causes and improvement suggestions tailored for coding agents
@@ -27,4 +21,23 @@ Use pytest-aitest's standard CLI options:
uv run pytest tests/ --aitest-html=report.html --aitest-summary-model=azure/gpt-5-mini
```

Or configure in `pyproject.toml`:

```toml
[tool.pytest.ini_options]
addopts = """
--aitest-html=aitest-reports/report.html
--aitest-summary-model=azure/gpt-5.2-chat
"""
```

No code changes needed — the integration is automatic via the plugin system.

## Analysis Prompt Hook

The plugin implements the `pytest_aitest_analysis_prompt` hook to inject Copilot-specific context into AI analysis:

- **Coding-agent framing** — the AI analyzer understands it's evaluating models, instructions, and tools (not MCP servers)
- **Dynamic pricing table** — model pricing data is pulled live from litellm's `model_cost` database, so cost analysis stays current without manual updates

This happens automatically — no configuration needed.
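
The pricing-table part can be sketched in isolation. Below is a hypothetical rendering of a litellm-style `model_cost` mapping as a Markdown table suitable for an analysis prompt; the function name and table layout are illustrative, and the plugin's actual hook code may differ:

```python
def pricing_table(model_cost: dict) -> str:
    """Render a litellm-style cost mapping as a Markdown table.

    Each value is expected to carry per-token costs, as entries in
    litellm's `model_cost` do; models without pricing are skipped.
    """
    rows = [
        "| Model | Input $/1M tok | Output $/1M tok |",
        "|-------|----------------|-----------------|",
    ]
    for model, info in sorted(model_cost.items()):
        cin = info.get("input_cost_per_token")
        cout = info.get("output_cost_per_token")
        if cin is None or cout is None:
            continue  # no pricing data for this model
        rows.append(f"| {model} | {cin * 1e6:.2f} | {cout * 1e6:.2f} |")
    return "\n".join(rows)
```

Because the table is built from the live cost mapping at analysis time, new models and price changes show up without touching the prompt text.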
96 changes: 96 additions & 0 deletions docs/how-to/cli-tools.md
@@ -0,0 +1,96 @@
# CLI Tool Testing

Test that GitHub Copilot can operate command-line tools correctly.

## Basic Usage

Give the agent a task that requires CLI tools and verify the outcome:

```python
from pytest_codingagents import CopilotAgent


async def test_git_init(copilot_run, tmp_path):
agent = CopilotAgent(
instructions="Use git commands as requested.",
working_directory=str(tmp_path),
)
result = await copilot_run(
agent,
"Initialize a git repo, create a .gitignore for Python, and make an initial commit.",
)
assert result.success
assert (tmp_path / ".git").is_dir()
assert (tmp_path / ".gitignore").exists()
```

## Verifying File Output

Check that CLI operations produce the expected files:

```python
async def test_project_scaffold(copilot_run, tmp_path):
agent = CopilotAgent(
instructions="Create project structures as requested.",
working_directory=str(tmp_path),
)
result = await copilot_run(
agent,
"Create a Python package called 'mylib' with __init__.py, "
"a pyproject.toml using hatchling, and a tests/ directory.",
)
assert result.success
assert (tmp_path / "src" / "mylib" / "__init__.py").exists() or (
tmp_path / "mylib" / "__init__.py"
).exists()
assert (tmp_path / "pyproject.toml").exists()
```

## Testing Complex Workflows

Chain multiple CLI operations into a single task:

```python
async def test_git_workflow(copilot_run, tmp_path):
agent = CopilotAgent(
instructions="Perform git operations as requested. Use git commands directly.",
working_directory=str(tmp_path),
)
result = await copilot_run(
agent,
"Initialize a git repo, create hello.py with print('hello'), "
"add it, commit with message 'initial', then create a 'feature' branch.",
)
assert result.success
assert (tmp_path / "hello.py").exists()
```
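
File existence checks only go so far for git workflows. Git state can be inspected directly with a small subprocess helper (a sketch, assuming `git` 2.13 or later is on PATH; this is not part of the plugin API):

```python
import subprocess


def git_branches(repo_path) -> list[str]:
    """Return the local branch names of the repository at `repo_path`."""
    out = subprocess.run(
        ["git", "branch", "--format=%(refname:short)"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    # one branch name per line; a repo with no commits has no branches
    return out.stdout.split()
```

After the workflow task above, `assert "feature" in git_branches(tmp_path)` confirms the branch was actually created, not just that the files exist.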

## Comparing Instructions for CLI Tasks

Test which instructions produce better CLI usage:

```python
import pytest
from pytest_codingagents import CopilotAgent


@pytest.mark.parametrize(
"style,instructions",
[
("minimal", "Execute commands as requested."),
("guided", "You are a DevOps assistant. Use standard CLI tools. "
"Always verify operations succeed before proceeding."),
],
)
async def test_cli_instructions(copilot_run, tmp_path, style, instructions):
agent = CopilotAgent(
name=f"cli-{style}",
instructions=instructions,
working_directory=str(tmp_path),
)
result = await copilot_run(
agent,
"Create a Python virtual environment and install requests",
)
assert result.success
```
6 changes: 5 additions & 1 deletion docs/how-to/index.md
@@ -2,5 +2,9 @@

Practical guides for common tasks.

- - [pytest-aitest Integration](aitest-integration.md) — Get HTML reports with AI analysis
- [MCP Server Testing](mcp-servers.md) — Test that the agent uses your custom tools
- [Skill Testing](skills.md) — Measure the impact of domain knowledge
- [CLI Tool Testing](cli-tools.md) — Verify the agent operates CLI tools correctly
- [Tool Control](tool-control.md) — Restrict tools with allowlists and blocklists
+ - [pytest-aitest Integration](aitest-integration.md) — HTML reports with AI analysis
- [CI/CD Integration](ci-cd.md) — Run tests in GitHub Actions
128 changes: 128 additions & 0 deletions docs/how-to/mcp-servers.md
@@ -0,0 +1,128 @@
# MCP Server Testing

Test that GitHub Copilot can discover and use your MCP server tools correctly.

## Basic Usage

Attach an MCP server to a `CopilotAgent` and verify the agent calls the right tools:

```python
from pytest_codingagents import CopilotAgent


async def test_database_query(copilot_run, tmp_path):
agent = CopilotAgent(
instructions="Use the database tools to answer questions.",
working_directory=str(tmp_path),
mcp_servers={
"my-db-server": {
"command": "python",
"args": ["-m", "my_db_mcp_server"],
}
},
)
result = await copilot_run(agent, "List all users in the database")
assert result.success
assert result.tool_was_called("list_users")
```

## Multiple Servers

Attach multiple MCP servers to test interactions between tools:

```python
async def test_multi_server(copilot_run, tmp_path):
agent = CopilotAgent(
instructions="Use the available tools to complete tasks.",
working_directory=str(tmp_path),
mcp_servers={
"database": {
"command": "python",
"args": ["-m", "db_server"],
},
"notifications": {
"command": "node",
"args": ["notification_server.js"],
},
},
)
result = await copilot_run(
agent,
"Find users who signed up today and send them a welcome notification",
)
assert result.success
assert result.tool_was_called("query_users")
assert result.tool_was_called("send_notification")
```

## A/B Server Comparison

Compare two versions of the same MCP server to validate improvements:

```python
import pytest
from pytest_codingagents import CopilotAgent

SERVER_VERSIONS = {
"v1": {"command": "python", "args": ["-m", "my_server_v1"]},
"v2": {"command": "python", "args": ["-m", "my_server_v2"]},
}


@pytest.mark.parametrize("version", SERVER_VERSIONS.keys())
async def test_server_version(copilot_run, tmp_path, version):
agent = CopilotAgent(
name=f"server-{version}",
instructions="Use the available tools to answer questions.",
working_directory=str(tmp_path),
mcp_servers={"my-server": SERVER_VERSIONS[version]},
)
result = await copilot_run(agent, "What's the current inventory count?")
assert result.success
assert result.tool_was_called("get_inventory")
```

The AI analysis report will compare pass rates and tool usage across server versions, highlighting which performs better.

## Verifying Tool Arguments

Check not just that a tool was called, but how it was called:

```python
async def test_correct_arguments(copilot_run, tmp_path):
agent = CopilotAgent(
instructions="Use database tools to query data.",
working_directory=str(tmp_path),
mcp_servers={
"db": {"command": "python", "args": ["-m", "db_server"]},
},
)
result = await copilot_run(agent, "Find users named Alice")
assert result.success

# Check specific tool calls
calls = result.tool_calls_for("query_users")
assert len(calls) >= 1
assert "Alice" in str(calls[0].arguments)
```

## Environment Variables

Pass environment variables to your MCP server process:

```python
agent = CopilotAgent(
instructions="Use the API tools.",
working_directory=str(tmp_path),
mcp_servers={
"api-server": {
"command": "python",
"args": ["-m", "api_server"],
"env": {
"API_KEY": "test-key",
"DATABASE_URL": "sqlite:///test.db",
},
}
},
)
```