Docsmitr

Autonomous Codebase Documentation Engine

Transform any GitHub repository into production-ready technical documentation in minutes, not hours.



Overview

Docsmitr is an AI-powered documentation generation system that analyzes codebases at the AST level and produces comprehensive, interconnected documentation deployed as a static Docusaurus site.

Unlike simple code-to-text generators, Docsmitr implements a multi-stage agentic pipeline with:

  • 6-language AST parsing via Tree-sitter for accurate structural analysis
  • Cross-file dependency graphs to understand module relationships
  • Token-aware chunking for large file handling without context loss
  • Parallel LLM orchestration with semaphore-based concurrency control
  • Enterprise-grade output with Confluence-inspired theming

Architecture

flowchart TB
    subgraph Input["Input Layer"]
        GH[GitHub URL]
        ZIP[ZIP Upload]
    end
    
    subgraph API["API Gateway"]
        FastAPI[FastAPI Server]
        WS[WebSocket Progress]
    end
    
    subgraph Queue["Task Queue"]
        Redis[(Redis)]
        Celery[Celery Workers]
    end
    
    subgraph AI["AI Pipeline"]
        TS[Tree-sitter Parser]
        DG[Dependency Graph]
        LG[LangGraph Workflow]
        LLM[LLM Provider]
    end
    
    subgraph Output["Output Layer"]
        Docusaurus[Docusaurus Builder]
        S3[(AWS S3)]
    end
    
    subgraph Storage["Persistence"]
        PG[(PostgreSQL)]
    end
    
    GH --> FastAPI
    ZIP --> FastAPI
    FastAPI --> Redis
    FastAPI <--> WS
    Redis --> Celery
    Celery --> TS
    TS --> DG
    DG --> LG
    LG <--> LLM
    LG --> Docusaurus
    Docusaurus --> S3
    FastAPI <--> PG
    
    style AI fill:#e1f5fe
    style Queue fill:#fff3e0
    style Output fill:#e8f5e9

Core Pipeline

sequenceDiagram
    participant User
    participant API as FastAPI
    participant Queue as Celery/Redis
    participant Parser as Tree-sitter
    participant Graph as Dependency Graph
    participant Agent as LangGraph Agent
    participant LLM as LLM Provider
    participant Builder as Docusaurus
    participant S3 as AWS S3
    
    User->>API: POST /api/jobs {github_url}
    API->>Queue: Enqueue job
    API-->>User: {job_id, status: pending}
    
    Queue->>Parser: Clone & parse repository
    Parser->>Parser: Extract AST (functions, classes, imports)
    Parser->>Graph: Build dependency graph
    
    Graph->>Agent: CodebaseAnalysis + dependency context
    
    loop Parallel file processing
        Agent->>Agent: Chunk large files (>500 lines)
        Agent->>LLM: Generate documentation (semaphore-limited)
        LLM-->>Agent: Markdown content
    end
    
    Agent->>Builder: Scaffold Docusaurus site
    Builder->>Builder: npm run build
    Builder->>S3: Upload static assets
    
    S3-->>User: Documentation site URL

Technical Highlights

Multi-Language AST Parsing

Tree-sitter provides language-agnostic structural analysis with O(n) parsing complexity. Six languages are supported: Python, JavaScript, TypeScript, Java, Go, and Rust. For each, the parser extracts functions, classes, imports, and the corresponding source text.
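
For a sense of what the parser stage produces, here is a minimal sketch using the py-tree-sitter bindings (an illustration, not Docsmitr's actual parser; the bindings' constructor API varies slightly between versions):

import tree_sitter_python
from tree_sitter import Language, Parser

# Build a Python parser (tree-sitter-python ships the compiled grammar)
PY_LANGUAGE = Language(tree_sitter_python.language())
parser = Parser(PY_LANGUAGE)

source = b"class Greeter:\n    def hello(self):\n        return 'hi'\n"
tree = parser.parse(source)

# Walk top-level nodes and report definitions, as an extractor would
for node in tree.root_node.children:
    if node.type in ("class_definition", "function_definition"):
        name = node.child_by_field_name("name")
        print(node.type, name.text.decode(), node.start_point[0])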

Concurrency Model

import asyncio

# Semaphore-based parallel LLM calls
semaphore = asyncio.Semaphore(MAX_CONCURRENT_LLM_CALLS)

async def _generate_single_file_doc(file, llm, semaphore, codebase):
    async with semaphore:  # Rate-limited concurrency
        if estimate_tokens(file.source_code) > MAX_TOKENS:
            chunks = _chunk_file_content(file)  # Smart boundary detection
            return await _generate_chunk_doc(chunks, llm)
        return await llm.ainvoke(prompt)

Key optimizations:

  • AsyncIO semaphore prevents LLM rate limit exhaustion
  • Token estimation (~4 chars/token) for intelligent chunking
  • Logical boundary detection (class/function splits) preserves context
  • Parallel chunk processing with result aggregation
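
A minimal sketch of those heuristics, assuming the ~4 chars/token estimate and top-level def/class boundaries (illustrative, not the project's exact implementation):

MAX_TOKENS = 8_000  # illustrative per-call token budget

def estimate_tokens(text: str) -> int:
    # ~4 characters per token, per the heuristic above
    return len(text) // 4

def chunk_on_boundaries(source: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split only at top-level def/class lines; a chunk may overshoot by one segment."""
    chunks: list[str] = []
    current: list[str] = []
    for line in source.splitlines(keepends=True):
        at_boundary = line.startswith(("def ", "class "))
        if at_boundary and current and estimate_tokens("".join(current)) > max_tokens:
            chunks.append("".join(current))  # flush at a logical boundary
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks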

Tier-Based Model Routing

# Intelligent model selection based on file complexity
def get_model_tier_for_file(line_count: int, class_count: int, function_count: int) -> ModelTier:
    complexity_score = line_count + (class_count * 50) + (function_count * 10)
    
    if complexity_score > 1000:
        return ModelTier.LARGE_CONTEXT  # Claude for complex files
    elif complexity_score > 300:
        return ModelTier.COMPLEX         # Nova Pro for medium files
    else:
        return ModelTier.SIMPLE          # Nova Lite for simple files

Multi-tier architecture:

Tier           Model       Use Case              Context Window
SIMPLE         Nova Lite   Small files, configs  128K
COMPLEX        Nova Pro    Multi-class files     300K
LARGE_CONTEXT  Claude 3.5  Large codebases       200K
SUMMARY        Nova Lite   Folder READMEs        128K

Circuit Breaker & Graceful Degradation

import time
from dataclasses import dataclass

# Production-grade fault tolerance
@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    reset_timeout: float = 60.0
    _state: str = "closed"  # closed -> open -> half-open
    _last_failure_time: float = 0.0

    def can_proceed(self) -> bool:
        if self._state == "open":
            if time.time() - self._last_failure_time > self.reset_timeout:
                self._state = "half-open"  # Allow test request
                return True
            return False
        return True

Graceful degradation strategy:

  1. Tier fallback: LARGE_CONTEXT → COMPLEX → SIMPLE
  2. Circuit breaker: Opens after 5 consecutive failures
  3. Placeholder docs: AST-based documentation when LLM unavailable
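
Putting the three strategies together, a sketch of the degradation loop (the per-tier breaker registry, record_failure(), and the placeholder renderer are assumptions about the surrounding code):

# Fallback order from the list above; None means no cheaper tier is left
FALLBACK_CHAIN = {
    ModelTier.LARGE_CONTEXT: ModelTier.COMPLEX,
    ModelTier.COMPLEX: ModelTier.SIMPLE,
    ModelTier.SIMPLE: None,
}

async def invoke_with_degradation(tier, prompt, breakers, llms):
    while tier is not None:
        breaker = breakers[tier]          # one CircuitBreaker per tier (assumed)
        if breaker.can_proceed():
            try:
                return await llms[tier].ainvoke(prompt)
            except Exception:
                breaker.record_failure()  # hypothetical counterpart to can_proceed()
        tier = FALLBACK_CHAIN[tier]       # step down one tier
    return make_placeholder_doc(prompt)   # AST-based placeholder (hypothetical)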

Cross-File Dependency Graph

# Module resolution with import tracking
def build_dependency_graph(files: list[FileAnalysis]) -> None:
    module_to_file = {get_module_name(f.file_path): f for f in files}
    
    for file in files:
        for imp in file.imports:
            if imp.module in module_to_file:
                file.imports_from.append(module_to_file[imp.module])
                module_to_file[imp.module].imported_by.append(file)

Enables:

  • "Related Files" context in LLM prompts
  • Bidirectional import/export tracking
  • Module-level documentation coherence
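
The get_module_name helper above is elided; a plausible version for a src-layout Python project (an assumption, not the project's implementation):

# Hypothetical: "src/docsmitr/parsers/python.py" -> "docsmitr.parsers.python"
def get_module_name(file_path: str) -> str:
    dotted = file_path.removesuffix(".py").replace("/", ".")
    return dotted.removeprefix("src.")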

Externalized Prompt System

# prompts/file_documentation.yaml - Versioned, configurable prompts
version: "1.0"
file_prompt:
  base: |
    You are a senior technical writer creating documentation for:
    {filename} ({language})
    
  language_additions:
    python: "Include type hints and docstring standards."
    javascript: "Document async patterns and event handlers."

Benefits:

  • Version-controlled prompt evolution
  • Language-specific customization without code changes
  • Hot-reload capability for prompt tuning
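
Loading such a file takes only a few lines with PyYAML; this loader is a sketch, not Docsmitr's actual code:

import yaml

def load_file_prompt(path: str, filename: str, language: str) -> str:
    # Read the versioned prompt file and render the base template
    with open(path) as fh:
        cfg = yaml.safe_load(fh)
    prompt = cfg["file_prompt"]["base"].format(filename=filename, language=language)
    # Append the language-specific addition, if one exists
    return prompt + cfg["file_prompt"].get("language_additions", {}).get(language, "")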

Intelligent File Categorization

from enum import Enum

class FileCategory(str, Enum):
    CODE = "code"           # Parsed by Tree-sitter
    CONFIG = "config"       # YAML, JSON, TOML, ENV, Dockerfile
    DOCS = "docs"           # Markdown, RST - passthrough
    DATA = "data"           # CSV, SQL, fixtures
    ASSET = "asset"         # Images, fonts - skip
    BUILD = "build"         # CI/CD, Makefile

Smart handling:

  • Markdown passthrough: Existing README.md files preserved as intro page
  • Config documentation: Custom prompts for infrastructure files
  • Asset detection: Binary files automatically excluded
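
A minimal extension-based categorizer in the spirit of the enum (the mapping below is illustrative, not the project's full table):

from pathlib import Path

EXTENSION_MAP = {
    ".py": FileCategory.CODE, ".ts": FileCategory.CODE, ".go": FileCategory.CODE,
    ".yaml": FileCategory.CONFIG, ".json": FileCategory.CONFIG, ".toml": FileCategory.CONFIG,
    ".md": FileCategory.DOCS, ".rst": FileCategory.DOCS,
    ".csv": FileCategory.DATA, ".sql": FileCategory.DATA,
}

def categorize(path: str) -> FileCategory:
    name = Path(path).name
    if name == "Dockerfile":
        return FileCategory.CONFIG  # per the enum comment above
    if name == "Makefile":
        return FileCategory.BUILD
    # Unknown extensions fall through to ASSET and are skipped
    return EXTENSION_MAP.get(Path(path).suffix.lower(), FileCategory.ASSET)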

LangGraph Workflow

stateDiagram-v2
    [*] --> generate_files: CodebaseAnalysis
    generate_files --> generate_folders: file_docs
    generate_folders --> assemble: folder_docs
    assemble --> [*]: output_files
    
    state generate_files {
        [*] --> tier_routing
        tier_routing --> invoke_llm: with fallback
        invoke_llm --> circuit_check
        circuit_check --> placeholder: if open
        circuit_check --> success: if closed
    }
    
    note right of generate_files
        • Tier-based model routing
        • Circuit breaker protection
        • Graceful placeholder fallback
    end note
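
In code, the top-level flow reduces to three nodes; a minimal LangGraph wiring sketch with stubbed node bodies (an assumption, not the project's actual workflow):

from typing import TypedDict
from langgraph.graph import StateGraph, END

class DocState(TypedDict, total=False):
    codebase: object             # CodebaseAnalysis from the parser stage
    file_docs: dict[str, str]
    folder_docs: dict[str, str]
    output_files: dict[str, str]

def generate_files(state: DocState) -> DocState:
    return {"file_docs": {}}     # stub: tier routing + circuit breaker live here

def generate_folders(state: DocState) -> DocState:
    return {"folder_docs": {}}   # stub: folder-level READMEs

def assemble(state: DocState) -> DocState:
    return {"output_files": {}}  # stub: final Docusaurus layout

graph = StateGraph(DocState)
graph.add_node("generate_files", generate_files)
graph.add_node("generate_folders", generate_folders)
graph.add_node("assemble", assemble)
graph.set_entry_point("generate_files")
graph.add_edge("generate_files", "generate_folders")
graph.add_edge("generate_folders", "assemble")
graph.add_edge("assemble", END)
workflow = graph.compile()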

Tech Stack

Component         Technology                           Purpose
API               FastAPI + Pydantic                   REST endpoints, WebSocket progress
Queue             Celery + Redis                       Distributed task processing
Database          PostgreSQL (Neon)                    Job persistence, status tracking
Parsing           Tree-sitter                          Multi-language AST extraction
AI Orchestration  LangGraph                            Stateful agent workflow
LLM               Multi-provider (Bedrock/OpenRouter)  Documentation generation
Static Site       Docusaurus 3.0                       Enterprise-themed output
Storage           AWS S3                               Deployed documentation hosting
Observability     LangSmith                            Agent tracing and debugging

Quick Start

Prerequisites

  • Python 3.12+ with uv
  • Docker and Docker Compose
  • AWS credentials (for S3) or OpenRouter API key

Local Development

# Clone repository
git clone https://github.com/Asirwad/docsmitr.git
cd docsmitr

# Start infrastructure
docker-compose up -d  # Redis + PostgreSQL

# Backend setup
cd backend
cp .env.example .env.development
uv sync
uv run alembic upgrade head

# Start services (separate terminals)
uv run uvicorn docsmitr.main:app --reload --port 8000
uv run celery -A docsmitr.worker.celery_app worker --loglevel=info -P solo

Generate Documentation

# Create job
curl -X POST http://localhost:8000/api/jobs \
  -H "Content-Type: application/json" \
  -d '{"github_url": "https://github.com/owner/repo"}'

# Check status
curl http://localhost:8000/api/jobs/{job_id}
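
The same flow from Python, polling until the job settles (a sketch; the job_id and status field names follow the responses shown above):

import time

import requests

def wait_for_docs(base_url: str, github_url: str, interval: float = 5.0) -> dict:
    # Create the job, then poll the status endpoint until it finishes
    job = requests.post(f"{base_url}/api/jobs", json={"github_url": github_url}).json()
    while True:
        status = requests.get(f"{base_url}/api/jobs/{job['job_id']}").json()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)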

Project Structure

docsmitr/
├── backend/
│   └── src/docsmitr/
│       ├── api/              # REST + WebSocket endpoints
│       ├── agent/            # LangGraph workflow + LLM providers
│       ├── parsers/          # Tree-sitter AST extraction
│       ├── services/         # Docusaurus, S3, repository cloning
│       ├── worker/           # Celery task definitions
│       └── core/             # Config, logging, dependencies
├── frontend/                 # Next.js application (coming soon)
└── docker-compose.yml        # Local dev infrastructure

Configuration

Variable          Description                  Default
LLM_PROVIDER      bedrock or openrouter        bedrock
AWS_REGION        AWS region for Bedrock       us-east-1
BEDROCK_MODEL     Bedrock model ID             us.amazon.nova-lite-v1:0
OPENROUTER_MODEL  OpenRouter model             deepseek/deepseek-r1-0528:free
S3_BUCKET_NAME    Documentation output bucket
REDIS_URL         Celery broker URL            redis://localhost:6379/0
DATABASE_URL      PostgreSQL connection
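
For example, a minimal .env.development under these defaults (the bucket and database values are placeholders):

LLM_PROVIDER=bedrock
AWS_REGION=us-east-1
BEDROCK_MODEL=us.amazon.nova-lite-v1:0
S3_BUCKET_NAME=docsmitr-output
REDIS_URL=redis://localhost:6379/0
DATABASE_URL=postgresql://user:pass@localhost:5432/docsmitr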

Roadmap

  • V1: Core pipeline (parsing, generation, deployment)
  • V2: Enhanced prompts, chunking, cross-file context, UI theming
  • V2.1: Tier-based model routing, circuit breakers, graceful degradation
  • V2.2: YAML prompt system, file categorization, README-powered intros
  • V3: Task cancellation, job deletion with S3 cleanup
  • V4: Frontend dashboard with real-time progress
  • V5: Custom branding, enterprise SSO, team workspaces

Contributing

Contributions are welcome. Please open an issue to discuss proposed changes before submitting a PR.


License

This project is licensed under the Apache License 2.0 with Attribution Requirement.

See LICENSE for details.


Built by Asirwad

If Docsmitr saves you time, consider starring the repo.