Docsmitr

Autonomous Codebase Documentation Engine

Transform any GitHub repository into production-ready technical documentation in minutes, not hours.



Overview

Docsmitr is an AI-powered documentation generation system that analyzes codebases at the AST level and produces comprehensive, interconnected documentation deployed as a static Docusaurus site.

Unlike simple code-to-text generators, Docsmitr implements a multi-stage agentic pipeline with:

  • 6-language AST parsing via Tree-sitter for accurate structural analysis
  • Cross-file dependency graphs to understand module relationships
  • Token-aware chunking for large file handling without context loss
  • Parallel LLM orchestration with semaphore-based concurrency control
  • Enterprise-grade output with Confluence-inspired theming

Architecture

flowchart TB
    subgraph Input["Input Layer"]
        GH[GitHub URL]
        ZIP[ZIP Upload]
    end
    
    subgraph API["API Gateway"]
        FastAPI[FastAPI Server]
        WS[WebSocket Progress]
    end
    
    subgraph Queue["Task Queue"]
        Redis[(Redis)]
        Celery[Celery Workers]
    end
    
    subgraph AI["AI Pipeline"]
        TS[Tree-sitter Parser]
        DG[Dependency Graph]
        LG[LangGraph Workflow]
        LLM[LLM Provider]
    end
    
    subgraph Output["Output Layer"]
        Docusaurus[Docusaurus Builder]
        S3[(AWS S3)]
    end
    
    subgraph Storage["Persistence"]
        PG[(PostgreSQL)]
    end
    
    GH --> FastAPI
    ZIP --> FastAPI
    FastAPI --> Redis
    FastAPI <--> WS
    Redis --> Celery
    Celery --> TS
    TS --> DG
    DG --> LG
    LG <--> LLM
    LG --> Docusaurus
    Docusaurus --> S3
    FastAPI <--> PG
    
    style AI fill:#e1f5fe
    style Queue fill:#fff3e0
    style Output fill:#e8f5e9

Core Pipeline

sequenceDiagram
    participant User
    participant API as FastAPI
    participant Queue as Celery/Redis
    participant Parser as Tree-sitter
    participant Graph as Dependency Graph
    participant Agent as LangGraph Agent
    participant LLM as LLM Provider
    participant Builder as Docusaurus
    participant S3 as AWS S3
    
    User->>API: POST /api/jobs {github_url}
    API->>Queue: Enqueue job
    API-->>User: {job_id, status: pending}
    
    Queue->>Parser: Clone & parse repository
    Parser->>Parser: Extract AST (functions, classes, imports)
    Parser->>Graph: Build dependency graph
    
    Graph->>Agent: CodebaseAnalysis + dependency context
    
    loop Parallel file processing
        Agent->>Agent: Chunk large files (>500 lines)
        Agent->>LLM: Generate documentation (semaphore-limited)
        LLM-->>Agent: Markdown content
    end
    
    Agent->>Builder: Scaffold Docusaurus site
    Builder->>Builder: npm run build
    Builder->>S3: Upload static assets
    
    S3-->>User: Documentation site URL

Technical Highlights

Multi-Language AST Parsing

Tree-sitter provides language-agnostic structural analysis with O(n) parsing complexity. Six languages are supported: Python, JavaScript, TypeScript, Java, Go, and Rust. For each, the parser extracts functions, classes, imports, and the corresponding source text.
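
For a sense of what the parser stage produces, here is a minimal sketch using the py-tree-sitter bindings (an illustration, not Docsmitr's actual parser; the bindings' constructor API varies slightly between versions):

import tree_sitter_python
from tree_sitter import Language, Parser

# Build a Python parser (tree-sitter-python ships the compiled grammar)
PY_LANGUAGE = Language(tree_sitter_python.language())
parser = Parser(PY_LANGUAGE)

source = b"class Greeter:\n    def hello(self):\n        return 'hi'\n"
tree = parser.parse(source)

# Walk top-level nodes and report definitions, as an extractor would
for node in tree.root_node.children:
    if node.type in ("class_definition", "function_definition"):
        name = node.child_by_field_name("name")
        print(node.type, name.text.decode(), node.start_point[0])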

Concurrency Model

import asyncio

# Semaphore-based parallel LLM calls
semaphore = asyncio.Semaphore(MAX_CONCURRENT_LLM_CALLS)

async def _generate_single_file_doc(file, llm, semaphore, codebase):
    async with semaphore:  # Rate-limited concurrency
        if estimate_tokens(file.source_code) > MAX_TOKENS:
            chunks = _chunk_file_content(file)  # Smart boundary detection
            return await _generate_chunk_doc(chunks, llm)
        return await llm.ainvoke(prompt)

Key optimizations:

  • AsyncIO semaphore prevents LLM rate limit exhaustion
  • Token estimation (~4 chars/token) for intelligent chunking
  • Logical boundary detection (class/function splits) preserves context
  • Parallel chunk processing with result aggregation
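
A minimal sketch of those heuristics, assuming the ~4 chars/token estimate and top-level def/class boundaries (illustrative, not the project's exact implementation):

MAX_TOKENS = 8_000  # illustrative per-call token budget

def estimate_tokens(text: str) -> int:
    # ~4 characters per token, per the heuristic above
    return len(text) // 4

def chunk_on_boundaries(source: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split only at top-level def/class lines; a chunk may overshoot by one segment."""
    chunks: list[str] = []
    current: list[str] = []
    for line in source.splitlines(keepends=True):
        at_boundary = line.startswith(("def ", "class "))
        if at_boundary and current and estimate_tokens("".join(current)) > max_tokens:
            chunks.append("".join(current))  # flush at a logical boundary
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks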

Tier-Based Model Routing

# Intelligent model selection based on file complexity
def get_model_tier_for_file(line_count: int, class_count: int, function_count: int) -> ModelTier:
    complexity_score = line_count + (class_count * 50) + (function_count * 10)
    
    if complexity_score > 1000:
        return ModelTier.LARGE_CONTEXT  # Claude for complex files
    elif complexity_score > 300:
        return ModelTier.COMPLEX         # Nova Pro for medium files
    else:
        return ModelTier.SIMPLE          # Nova Lite for simple files

Multi-tier architecture:

Tier           Model       Use Case              Context Window
SIMPLE         Nova Lite   Small files, configs  128K
COMPLEX        Nova Pro    Multi-class files     300K
LARGE_CONTEXT  Claude 3.5  Large codebases       200K
SUMMARY        Nova Lite   Folder READMEs        128K

Circuit Breaker & Graceful Degradation

import time
from dataclasses import dataclass

# Production-grade fault tolerance
@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    reset_timeout: float = 60.0
    _state: str = "closed"  # closed -> open -> half-open
    _last_failure_time: float = 0.0

    def can_proceed(self) -> bool:
        if self._state == "open":
            if time.time() - self._last_failure_time > self.reset_timeout:
                self._state = "half-open"  # Allow test request
                return True
            return False
        return True

Graceful degradation strategy:

  1. Tier fallback: LARGE_CONTEXT → COMPLEX → SIMPLE
  2. Circuit breaker: Opens after 5 consecutive failures
  3. Placeholder docs: AST-based documentation when LLM unavailable
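
Putting the three strategies together, a sketch of the degradation loop (the per-tier breaker registry, record_failure(), and the placeholder renderer are assumptions about the surrounding code):

# Fallback order from the list above; None means no cheaper tier is left
FALLBACK_CHAIN = {
    ModelTier.LARGE_CONTEXT: ModelTier.COMPLEX,
    ModelTier.COMPLEX: ModelTier.SIMPLE,
    ModelTier.SIMPLE: None,
}

async def invoke_with_degradation(tier, prompt, breakers, llms):
    while tier is not None:
        breaker = breakers[tier]          # one CircuitBreaker per tier (assumed)
        if breaker.can_proceed():
            try:
                return await llms[tier].ainvoke(prompt)
            except Exception:
                breaker.record_failure()  # hypothetical counterpart to can_proceed()
        tier = FALLBACK_CHAIN[tier]       # step down one tier
    return make_placeholder_doc(prompt)   # AST-based placeholder (hypothetical)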

Cross-File Dependency Graph

# Module resolution with import tracking
def build_dependency_graph(files: list[FileAnalysis]) -> None:
    module_to_file = {get_module_name(f.file_path): f for f in files}
    
    for file in files:
        for imp in file.imports:
            if imp.module in module_to_file:
                file.imports_from.append(module_to_file[imp.module])
                module_to_file[imp.module].imported_by.append(file)

Enables:

  • "Related Files" context in LLM prompts
  • Bidirectional import/export tracking
  • Module-level documentation coherence
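
The get_module_name helper above is elided; a plausible version for a src-layout Python project (an assumption, not the project's implementation):

# Hypothetical: "src/docsmitr/parsers/python.py" -> "docsmitr.parsers.python"
def get_module_name(file_path: str) -> str:
    dotted = file_path.removesuffix(".py").replace("/", ".")
    return dotted.removeprefix("src.")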

Externalized Prompt System

# prompts/file_documentation.yaml - Versioned, configurable prompts
version: "1.0"
file_prompt:
  base: |
    You are a senior technical writer creating documentation for:
    {filename} ({language})
    
  language_additions:
    python: "Include type hints and docstring standards."
    javascript: "Document async patterns and event handlers."

Benefits:

  • Version-controlled prompt evolution
  • Language-specific customization without code changes
  • Hot-reload capability for prompt tuning
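
Loading such a file takes only a few lines with PyYAML; this loader is a sketch, not Docsmitr's actual code:

import yaml

def load_file_prompt(path: str, filename: str, language: str) -> str:
    # Read the versioned prompt file and render the base template
    with open(path) as fh:
        cfg = yaml.safe_load(fh)
    prompt = cfg["file_prompt"]["base"].format(filename=filename, language=language)
    # Append the language-specific addition, if one exists
    return prompt + cfg["file_prompt"].get("language_additions", {}).get(language, "")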

Intelligent File Categorization

from enum import Enum

class FileCategory(str, Enum):
    CODE = "code"           # Parsed by Tree-sitter
    CONFIG = "config"       # YAML, JSON, TOML, ENV, Dockerfile
    DOCS = "docs"           # Markdown, RST - passthrough
    DATA = "data"           # CSV, SQL, fixtures
    ASSET = "asset"         # Images, fonts - skip
    BUILD = "build"         # CI/CD, Makefile

Smart handling:

  • Markdown passthrough: Existing README.md files preserved as intro page
  • Config documentation: Custom prompts for infrastructure files
  • Asset detection: Binary files automatically excluded
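
A minimal extension-based categorizer in the spirit of the enum (the mapping below is illustrative, not the project's full table):

from pathlib import Path

EXTENSION_MAP = {
    ".py": FileCategory.CODE, ".ts": FileCategory.CODE, ".go": FileCategory.CODE,
    ".yaml": FileCategory.CONFIG, ".json": FileCategory.CONFIG, ".toml": FileCategory.CONFIG,
    ".md": FileCategory.DOCS, ".rst": FileCategory.DOCS,
    ".csv": FileCategory.DATA, ".sql": FileCategory.DATA,
}

def categorize(path: str) -> FileCategory:
    name = Path(path).name
    if name == "Dockerfile":
        return FileCategory.CONFIG  # per the enum comment above
    if name == "Makefile":
        return FileCategory.BUILD
    # Unknown extensions fall through to ASSET and are skipped
    return EXTENSION_MAP.get(Path(path).suffix.lower(), FileCategory.ASSET)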

LangGraph Workflow

stateDiagram-v2
    [*] --> generate_files: CodebaseAnalysis
    generate_files --> generate_folders: file_docs
    generate_folders --> assemble: folder_docs
    assemble --> [*]: output_files
    
    state generate_files {
        [*] --> tier_routing
        tier_routing --> invoke_llm: with fallback
        invoke_llm --> circuit_check
        circuit_check --> placeholder: if open
        circuit_check --> success: if closed
    }
    
    note right of generate_files
        • Tier-based model routing
        • Circuit breaker protection
        • Graceful placeholder fallback
    end note
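
In code, the top-level flow reduces to three nodes; a minimal LangGraph wiring sketch with stubbed node bodies (an assumption, not the project's actual workflow):

from typing import TypedDict
from langgraph.graph import StateGraph, END

class DocState(TypedDict, total=False):
    codebase: object             # CodebaseAnalysis from the parser stage
    file_docs: dict[str, str]
    folder_docs: dict[str, str]
    output_files: dict[str, str]

def generate_files(state: DocState) -> DocState:
    return {"file_docs": {}}     # stub: tier routing + circuit breaker live here

def generate_folders(state: DocState) -> DocState:
    return {"folder_docs": {}}   # stub: folder-level READMEs

def assemble(state: DocState) -> DocState:
    return {"output_files": {}}  # stub: final Docusaurus layout

graph = StateGraph(DocState)
graph.add_node("generate_files", generate_files)
graph.add_node("generate_folders", generate_folders)
graph.add_node("assemble", assemble)
graph.set_entry_point("generate_files")
graph.add_edge("generate_files", "generate_folders")
graph.add_edge("generate_folders", "assemble")
graph.add_edge("assemble", END)
workflow = graph.compile()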

Tech Stack

Component         Technology                           Purpose
API               FastAPI + Pydantic                   REST endpoints, WebSocket progress
Queue             Celery + Redis                       Distributed task processing
Database          PostgreSQL (Neon)                    Job persistence, status tracking
Parsing           Tree-sitter                          Multi-language AST extraction
AI Orchestration  LangGraph                            Stateful agent workflow
LLM               Multi-provider (Bedrock/OpenRouter)  Documentation generation
Static Site       Docusaurus 3.0                       Enterprise-themed output
Storage           AWS S3                               Deployed documentation hosting
Observability     LangSmith                            Agent tracing and debugging

Quick Start

Prerequisites

  • Python 3.12+ with uv
  • Docker and Docker Compose
  • AWS credentials (for S3) or OpenRouter API key

Local Development

# Clone repository
git clone https://github.com/Asirwad/docsmitr.git
cd docsmitr

# Start infrastructure
docker-compose up -d  # Redis + PostgreSQL

# Backend setup
cd backend
cp .env.example .env.development
uv sync
uv run alembic upgrade head

# Start services (separate terminals)
uv run uvicorn docsmitr.main:app --reload --port 8000
uv run celery -A docsmitr.worker.celery_app worker --loglevel=info -P solo

Generate Documentation

# Create job
curl -X POST http://localhost:8000/api/jobs \
  -H "Content-Type: application/json" \
  -d '{"github_url": "https://github.com/owner/repo"}'

# Check status
curl http://localhost:8000/api/jobs/{job_id}
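
The same flow from Python, polling until the job settles (a sketch; the job_id and status field names follow the responses shown above):

import time

import requests

def wait_for_docs(base_url: str, github_url: str, interval: float = 5.0) -> dict:
    # Create the job, then poll the status endpoint until it finishes
    job = requests.post(f"{base_url}/api/jobs", json={"github_url": github_url}).json()
    while True:
        status = requests.get(f"{base_url}/api/jobs/{job['job_id']}").json()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)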

Project Structure

docsmitr/
├── backend/
│   └── src/docsmitr/
│       ├── api/              # REST + WebSocket endpoints
│       ├── agent/            # LangGraph workflow + LLM providers
│       ├── parsers/          # Tree-sitter AST extraction
│       ├── services/         # Docusaurus, S3, repository cloning
│       ├── worker/           # Celery task definitions
│       └── core/             # Config, logging, dependencies
├── frontend/                 # Next.js application (coming soon)
└── docker-compose.yml        # Local dev infrastructure

Configuration

Variable          Description                  Default
LLM_PROVIDER      bedrock or openrouter        bedrock
AWS_REGION        AWS region for Bedrock       us-east-1
BEDROCK_MODEL     Bedrock model ID             us.amazon.nova-lite-v1:0
OPENROUTER_MODEL  OpenRouter model             deepseek/deepseek-r1-0528:free
S3_BUCKET_NAME    Documentation output bucket
REDIS_URL         Celery broker URL            redis://localhost:6379/0
DATABASE_URL      PostgreSQL connection
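
For example, a minimal .env.development under these defaults (the bucket and database values are placeholders):

LLM_PROVIDER=bedrock
AWS_REGION=us-east-1
BEDROCK_MODEL=us.amazon.nova-lite-v1:0
S3_BUCKET_NAME=docsmitr-output
REDIS_URL=redis://localhost:6379/0
DATABASE_URL=postgresql://user:pass@localhost:5432/docsmitr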

Roadmap

  • V1: Core pipeline (parsing, generation, deployment)
  • V2: Enhanced prompts, chunking, cross-file context, UI theming
  • V2.1: Tier-based model routing, circuit breakers, graceful degradation
  • V2.2: YAML prompt system, file categorization, README-powered intros
  • V3: Task cancellation, job deletion with S3 cleanup
  • V4: Frontend dashboard with real-time progress
  • V5: Custom branding, enterprise SSO, team workspaces

Contributing

Contributions are welcome. Please open an issue to discuss proposed changes before submitting a PR.


License

This project is licensed under the Apache License 2.0 with Attribution Requirement.

See LICENSE for details.


Built by Asirwad

If Docsmitr saves you time, consider starring the repo.