feat: initial release of pytest-codingagents #1

Merged
sbroenne merged 11 commits into main from initial-release
Feb 11, 2026

Conversation

@sbroenne

Summary

A pytest plugin for testing coding agents via their native SDKs. Tests run against the real Copilot CLI — no mocks, no wrappers.

What's Included

Core Plugin

  • `CopilotAgent` dataclass — model, instructions, tools, skills, custom agents, MCP servers
  • `CopilotResult` — full observability: tool calls, token usage, cost, reasoning traces, subagent invocations
  • `copilot_run` fixture — execute prompts and capture structured results
  • `EventMapper` — maps all 38 SDK event types to structured data
  • Cost computation via litellm pricing (the SDK's cost field is unreliable)
  • Auto-confirm permissions for deterministic testing

pytest-aitest Integration

  • Automatic bridging of `CopilotResult` → `AgentResult` for HTML reports
  • Custom `pytest_aitest_analysis_prompt` hook with coding-agent-specific framing
  • Dynamic pricing table injected from litellm's `model_cost` data
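The dynamic pricing table can be sketched as a small function over a litellm-style `model_cost` mapping. The two-entry dict and its prices below are stand-ins for illustration; the real hook reads litellm's `model_cost` at runtime:

```python
# Stand-in for litellm.model_cost (keys match litellm's per-token cost fields;
# the numbers here are invented for illustration).
model_cost = {
    "gpt-5.2": {"input_cost_per_token": 2e-06, "output_cost_per_token": 8e-06},
    "claude-opus-4.5": {"input_cost_per_token": 5e-06, "output_cost_per_token": 2.5e-05},
}


def pricing_table(costs: dict[str, dict[str, float]]) -> str:
    """Render per-million-token prices as a markdown table."""
    rows = ["| Model | $/1M input | $/1M output |", "|---|---|---|"]
    for name, c in sorted(costs.items()):
        rows.append(
            f"| {name} | {c['input_cost_per_token'] * 1e6:.2f} "
            f"| {c['output_cost_per_token'] * 1e6:.2f} |"
        )
    return "\n".join(rows)


print(pricing_table(model_cost))
```

A table like this is what would replace a `{{PRICING_TABLE}}` placeholder in the analysis prompt; how the plugin actually formats it is not shown in this PR.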

Integration Tests (32 tests across 8 files)

File                      What It Tests
`test_basic.py`           File creation, code quality, refactoring
`test_models.py`          Model comparison (GPT-5.2 vs Claude Opus 4.5)
`test_matrix.py`          Model × Instructions grid
`test_instructions.py`    System prompt variants and constraints
`test_cli_tools.py`       Terminal commands, git operations
`test_custom_agents.py`   Custom agent delegation
`test_events.py`          Reasoning traces, permissions, usage tracking
`test_skills.py`          Skill directories, disabled skills

Unit Tests (37 tests)

Pure logic tests for agent config, event mapping, result properties, and plugin hooks.

Documentation

  • Full README with examples for every feature
  • mkdocs site with Getting Started, How-To, Reference, Contributing
  • 3 demo HTML reports linked from docs

Tooling

  • `scripts/run_all.py` — per-file report generation
  • Pre-commit hooks: ruff lint/format + pyright type checking
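Per-file report generation amounts to one pytest invocation per test file. A sketch in the spirit of `scripts/run_all.py` (the actual script's flags and report layout are not shown in this PR, so the `--html` option and paths here are assumptions):

```python
import sys
from pathlib import Path


def build_commands(tests_dir: Path, reports_dir: Path) -> list[list[str]]:
    """One pytest invocation per test file, each writing its own report."""
    cmds = []
    for test_file in sorted(tests_dir.glob("test_*.py")):
        report = reports_dir / f"{test_file.stem}.html"
        cmds.append([sys.executable, "-m", "pytest", str(test_file), f"--html={report}"])
    return cmds


if __name__ == "__main__":
    # Dry run: print the commands; the real script would execute each one.
    for cmd in build_commands(Path("tests"), Path("reports")):
        print(" ".join(cmd))
```

Running one pytest process per file is what keeps each HTML report scoped to a single test module, which is why `addopts` was removed from `pyproject.toml`.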

Key Design Decisions

  • timeout_s = 300s — coding agent tasks need more time than simple tool calls
  • Relaxed assertions — accept `powershell` or `run_in_terminal`, use `rglob` for file search
  • No azure-identity dep — not used in the codebase
  • `addopts` removed from pyproject.toml — per-file reports via `run_all.py` instead
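The relaxed-assertion idea can be shown concretely: accept any tool from an equivalence set, and locate created files with `rglob` rather than a fixed path. A self-contained sketch (`used_terminal` and the tool-name list are illustrative stand-ins for checks against a `CopilotResult`):

```python
import tempfile
from pathlib import Path

# Equivalent terminal tools: the agent may pick either name.
TERMINAL_TOOLS = {"powershell", "run_in_terminal"}


def used_terminal(result_tool_names: list[str]) -> bool:
    """True if any tool call in the result hit a terminal tool."""
    return any(name in TERMINAL_TOOLS for name in result_tool_names)


# Either terminal tool satisfies the assertion.
assert used_terminal(["read_file", "powershell"])
assert used_terminal(["run_in_terminal"])

# rglob finds the created file wherever the agent chose to put it.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "src" / "nested"
    target.mkdir(parents=True)
    (target / "hello.py").write_text("print('hi')\n")
    assert len(list(Path(tmp).rglob("hello.py"))) == 1

print("relaxed assertions hold")
```

Asserting on equivalence sets rather than exact tool names keeps tests stable across models that make different but equally valid choices.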

Stefan Broenner added 11 commits February 11, 2026 08:44
- Move agent creation inside test function (tmp_path scope)
- Fix config table: name defaults to 'copilot', mcp_servers is dict,
  custom_agents/skill_directories/disabled_skills are lists with defaults
- Add missing fields: system_message_mode, extra_config
- Add reasoning_effort 'xhigh' option
- Add Authentication section (GITHUB_TOKEN + gh CLI)
- Add CopilotResult properties reference table
- Add MCP server and tool control examples
- Remove ... placeholder code, use complete examples
- Fix same bug in docs/index.md
The old framing ('the agent is the test harness, not the thing being
tested') was copied from pytest-aitest. That's the opposite of this
project's purpose: pytest-codingagents evaluates coding agents — which
model works best, do instructions improve quality, are MCP tools used
correctly.
- Use snake_case keys in build_session_config() (SDK TypedDict, not camelCase)
- Handle ToolRequest objects in events.py (not plain dicts)
- Fix permission handler signature and return value
- Replace bogus SDK cost field with litellm model_cost computation
- Add azure-identity dependency for Azure AD authentication
- Add _lookup_model_cost() with name normalization (dot-to-dash for Claude)
- Update unit tests for new cost computation
- Add coding_agent_analysis.md prompt template for AI insights
- Implement pytest_aitest_analysis_prompt hook in plugin.py
- Build pricing table dynamically from litellm model_cost at runtime
- Replace hardcoded tier table with {{PRICING_TABLE}} placeholder
- Add test_plugin.py unit tests for hook and pricing table
- Parametrize tests across gpt-5.2 and claude-opus-4.5
- Add conftest.py with MODELS list and copilot marker
- Add CLI tools and skills integration test stubs
- Fix CopilotResult fields table (usage not token_usage, cost_usd is property)
- Fix Turn fields (remove phantom reasoning field)
- Fix ToolCall fields (add error, duration_ms, tool_call_id; fix arguments type)
- Fix UsageInfo fields (add cache_read_tokens, cost_usd, duration_ms)
- Fix SubagentInvocation fields (name/status/duration_ms, not agent_name/prompt/result)
- Add system_message_mode to configuration reference
- Add prompts/ directory to contributing project structure
- Update hooks section to describe dynamic pricing
- Update aitest integration doc with dynamic pricing bullet
- reference/result.md: add missing SubagentInvocation class
- reference/configuration.md: reasoning_effort type Literal, not str
- README.md: raw_events type list[Any], not list[dict]
… reports

Core changes:
- Increase default timeout_s from 120s to 300s
- Relax test assertions: accept powershell/run_in_terminal, use rglob
- Rewrite coding_agent_analysis.md prompt for visual impact (tables, scorecards)
- Remove unused azure-identity dependency
- Remove addopts from pyproject.toml (per-file reports via run_all.py)

Docs & demo:
- Add demo reports (basic, model-comparison, instruction-testing)
- Link demo reports from README, docs index, aitest integration page
- Add Demo Reports to mkdocs nav
- Update all docs to reflect timeout_s=300.0
- Add run_all.py to contributing docs

Tooling:
- Add scripts/run_all.py for per-file report generation
- Update tests/README.md with accurate test descriptions
@github-actions

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None

@sbroenne sbroenne merged commit 169b318 into main Feb 11, 2026
9 checks passed
@sbroenne sbroenne deleted the initial-release branch February 11, 2026 23:26
