This repository contains the research framework and validation tools for studying how API documentation quality affects Large Language Model (LLM) code generation success. The research identifies a "documentation sweet spot" phenomenon: moderate-quality documentation yields better LLM performance than comprehensive documentation.
**Counter-intuitive discovery:** LLMs generate more functional code when given moderate-quality API documentation (3.0-4.0/5.0) than when given excellent documentation (5.0/5.0).
- 58% better performance with average-quality documentation than with excellent documentation in controlled experiments
- Consistent pattern across OpenAI GPT-4, Claude Sonnet, and Gemini Pro
- An "over-engineering effect" in which comprehensive documentation leads to complex but failure-prone implementations
- **Stripe API** (Excellent Documentation)
  - Complex payment processing API
  - Comprehensive interactive documentation
  - Extensive code examples and error handling
- **GitHub API** (Good Documentation)
  - GraphQL API for repository management
  - Well-structured schema documentation
  - Good examples with some gaps
- **OpenWeatherMap API** (Average Documentation)
  - Weather data API with API key authentication
  - Decent endpoint documentation
  - Limited advanced usage patterns
- **JSONPlaceholder API** (Basic Documentation)
  - Simple REST API for testing
  - Minimal documentation
  - Basic endpoint descriptions only
- **Cat Facts API** (Poor Documentation)
  - Very simple GET requests
  - Minimal documentation
  - No examples or error handling
- Completeness (25%): Endpoint coverage, parameter documentation, response schemas
- Clarity (20%): Language clarity, organization, terminology consistency
- Examples (20%): Code examples, language support, real-world use cases
- Error Handling (15%): Error codes, troubleshooting guides
- Authentication (10%): Auth instructions, security practices
- Code Quality (10%): Best practices, production-ready examples
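A minimal sketch of how these weighted criteria could combine into a single 0-5 quality score is shown below. The function name and structure are illustrative; the repository's actual scoring logic lives in `evaluation/documentation_quality_metrics.py` and may differ.

```python
# Illustrative only: weights mirror the criteria listed above.
WEIGHTS = {
    "completeness": 0.25,
    "clarity": 0.20,
    "examples": 0.20,
    "error_handling": 0.15,
    "authentication": 0.10,
    "code_quality": 0.10,
}


def overall_quality(scores: dict) -> float:
    """Combine per-criterion scores (each on a 0-5 scale) into a weighted 0-5 total."""
    return sum(weight * scores.get(criterion, 0.0)
               for criterion, weight in WEIGHTS.items())


# Example: strong on completeness, weak on examples.
print(overall_quality({
    "completeness": 4.5, "clarity": 4.0, "examples": 2.0,
    "error_handling": 3.0, "authentication": 4.0, "code_quality": 3.5,
}))  # ≈ 3.53 on the 0-5 scale
```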
- Authentication Setup: Implement API authentication
- Data Retrieval: Basic and parameterized GET requests
- CRUD Operations: Create, update, delete resources
- Error Handling: Rate limiting, validation errors
- Edge Cases: Large datasets, network failures
- Integration: Multi-step workflows
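The sketch below shows one way such a task could be declared. It is a hypothetical illustration; the actual definitions live in `tasks/standardized_tasks.py` and may use a different structure.

```python
# Hypothetical task declaration; field names are illustrative.
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    category: str              # e.g. "authentication", "data_retrieval", "crud"
    description: str
    success_criteria: list[str] = field(default_factory=list)


AUTH_SETUP = Task(
    task_id="auth_setup",
    category="authentication",
    description="Authenticate against the API using the documented mechanism.",
    success_criteria=[
        "request includes valid credentials",
        "handles 401 responses gracefully",
    ],
)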
```
api-doc-quality-tests/
├── apis/                      # API selection and analysis
│   └── api_selection_analysis.md
├── controls/                  # Experiment execution
│   └── experiment_execution.py
├── documentation/             # Documentation variants
│   └── variants/
├── evaluation/                # Analysis frameworks
│   ├── documentation_quality_metrics.py
│   ├── llm_testing_infrastructure.py
│   ├── code_validation_system.py
│   ├── data_analysis_framework.py
│   └── results_analysis.py
├── tasks/                     # Standardized task definitions
│   └── standardized_tasks.py
└── README.md
```
Install the Python dependencies:

```bash
pip install pandas numpy matplotlib seaborn scipy scikit-learn
pip install anthropic openai google-generativeai  # For LLM APIs
```

Create a `.env` file with your API keys:

```
ANTHROPIC_API_KEY=your_claude_api_key
OPENAI_API_KEY=your_openai_api_key
GOOGLE_API_KEY=your_gemini_api_key
```
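The experiment scripts are expected to read these keys from the environment. A minimal, hypothetical loading pattern is shown below; it uses `python-dotenv`, which is not in the dependency list above and is only one way to do this (the repository's own loading code may differ).

```python
# Hypothetical sketch of picking up the keys from .env; not the repository's code.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

anthropic_key = os.environ["ANTHROPIC_API_KEY"]
openai_key = os.environ["OPENAI_API_KEY"]
google_key = os.environ["GOOGLE_API_KEY"]
```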
- Execute the complete experiment:

  ```bash
  cd controls
  python experiment_execution.py
  ```

- Analyze results:

  ```bash
  cd evaluation
  python results_analysis.py --results-dir ../experiment_results
  ```

- View results:
  - Check `experiment_results/insights_report.md` for detailed findings
  - View visualizations in `experiment_results/visualizations/`
  - Access raw data in JSON format for further analysis
- Correlation coefficients between documentation metrics and code quality
- Statistical significance testing (p-values)
- Provider-specific performance comparisons
- API-specific success rates
- Correlation heatmaps
- Scatter plots showing documentation vs code quality relationships
- Box plots of success rates by documentation quality quartiles
- Provider performance comparisons
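As an illustration of the heatmap step, the sketch below builds the plot with seaborn. The column names and numeric values are placeholders for readability, not results from the study; the framework's real plotting code is in `evaluation/data_analysis_framework.py` and `evaluation/results_analysis.py`.

```python
# Illustrative correlation heatmap; placeholder values, not experimental data.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

results = pd.DataFrame({
    "doc_quality_score": [4.8, 4.1, 3.4, 2.2, 1.5],
    "code_quality_score": [3.1, 3.9, 4.2, 2.8, 2.0],
    "success_rate": [0.55, 0.72, 0.81, 0.50, 0.35],
})

sns.heatmap(results.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Documentation metrics vs. code generation outcomes")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")
```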
- Specific documentation features that most impact LLM performance
- Recommendations for documentation improvement priorities
- Provider-specific strengths and weaknesses
- Evidence-based best practices for API documentation
- Automated scoring based on predefined criteria
- Weighted metrics reflecting real-world importance
- Consistent evaluation across all APIs
- Identical prompts across all LLM providers
- Standardized task requirements
- Controlled testing environment
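The snippet below illustrates the "identical prompt" idea: the same template is rendered for every provider so that only the documentation variant changes. The wording is hypothetical; the study's actual prompts live in `evaluation/llm_testing_infrastructure.py`.

```python
# Hypothetical prompt template; only the documentation and task vary per run.
PROMPT_TEMPLATE = """You are given the following API documentation:

{documentation}

Task: {task_description}

Write complete, runnable Python code that accomplishes the task.
Return only the code."""


def build_prompt(documentation: str, task_description: str) -> str:
    """Render the same prompt for every provider so only the inputs differ."""
    return PROMPT_TEMPLATE.format(
        documentation=documentation,
        task_description=task_description,
    )
```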
- Syntax Validation: Python syntax correctness
- Functionality: Meeting task requirements
- Best Practices: Coding standards compliance
- Security: Secure coding practices
- Completeness: Implementation thoroughness
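A minimal sketch of the syntax-validation step using only the standard library is shown below; the repository's full checks are in `evaluation/code_validation_system.py`.

```python
# Syntax check: does the generated source parse as Python?
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the generated code parses as Python, False otherwise."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


print(is_valid_python("def f(x):\n    return x + 1"))  # True
print(is_valid_python("def f(x) return x"))            # False
```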
- Pearson correlation analysis
- Significance testing at p < 0.05
- Provider and API-specific breakdowns
- Regression analysis for predictive insights
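The correlation and significance test can be expressed with `scipy.stats.pearsonr`, as sketched below with placeholder values (not the study's data); the real analysis lives in `evaluation/data_analysis_framework.py`.

```python
# Pearson correlation with significance test; placeholder inputs.
from scipy.stats import pearsonr

doc_quality = [4.8, 4.1, 3.4, 2.2, 1.5]        # illustrative documentation scores
success_rate = [0.55, 0.72, 0.81, 0.50, 0.35]  # illustrative success rates

r, p_value = pearsonr(doc_quality, success_rate)
print(f"r = {r:.2f}, p = {p_value:.3f}, significant = {p_value < 0.05}")
```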
- **Primary:** How does documentation quality correlate with LLM code generation success?
- **Secondary:**
  - Which documentation aspects most impact LLM performance?
  - Do different LLM providers show varying sensitivity to documentation quality?
  - What documentation quality threshold ensures reliable code generation?
  - How do complex vs. simple APIs respond to documentation improvements?
- Evidence-based documentation improvement priorities
- ROI justification for documentation investments
- Specific guidelines for LLM-friendly documentation
- Better understanding of documentation quality impact
- Improved code generation success rates
- More efficient API integration workflows
- Empirical data on LLM performance factors
- Methodology for similar studies
- Baseline metrics for future research
- Adding a new API:
  - Update `experiment_config.json` with the new API's details (a hypothetical entry is sketched after this list)
  - Define the expected documentation quality level
  - Run the experiment with the expanded API set
- Adding a new LLM provider:
  - Implement the provider in `llm_testing_infrastructure.py`
  - Add the provider configuration
  - Update the analysis framework for the new provider
- Adding new tasks:
  - Define the tasks in `standardized_tasks.py`
  - Specify requirements and success criteria
  - Update the validation system for the new task types
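The sketch below shows what registering a new API entry might look like. The field names are placeholders, since the actual schema of `experiment_config.json` is defined by the repository and may differ.

```python
# Hypothetical: append a new API entry to experiment_config.json.
import json

new_api = {
    "name": "Example REST API",            # placeholder, not a study subject
    "base_url": "https://api.example.com/v1",
    "auth_type": "api_key",
    "expected_doc_quality": "good",        # expected documentation quality level
}

with open("experiment_config.json") as f:
    config = json.load(f)

config.setdefault("apis", []).append(new_api)

with open("experiment_config.json", "w") as f:
    json.dump(config, f, indent=2)
```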
- Fork the repository
- Create a feature branch
- Implement improvements or extensions
- Add tests and documentation
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- API providers for public documentation access
- LLM providers for research-friendly APIs
- Open source community for analysis tools
For questions, suggestions, or collaboration opportunities, please open an issue or contact the research team.
This experiment aims to bridge the gap between documentation quality and AI-assisted development, providing actionable insights for the developer community.