Transform any GitHub repository into production-ready technical documentation in minutes, not hours.
Docsmitr is an AI-powered documentation generation system that analyzes codebases at the AST level and produces comprehensive, interconnected documentation deployed as a static Docusaurus site.
Unlike simple code-to-text generators, Docsmitr implements a multi-stage agentic pipeline with:
- 6-language AST parsing via Tree-sitter for accurate structural analysis
- Cross-file dependency graphs to understand module relationships
- Token-aware chunking for large file handling without context loss
- Parallel LLM orchestration with semaphore-based concurrency control
- Enterprise-grade output with Confluence-inspired theming
```mermaid
flowchart TB
    subgraph Input["Input Layer"]
        GH[GitHub URL]
        ZIP[ZIP Upload]
    end
    subgraph API["API Gateway"]
        FastAPI[FastAPI Server]
        WS[WebSocket Progress]
    end
    subgraph Queue["Task Queue"]
        Redis[(Redis)]
        Celery[Celery Workers]
    end
    subgraph AI["AI Pipeline"]
        TS[Tree-sitter Parser]
        DG[Dependency Graph]
        LG[LangGraph Workflow]
        LLM[LLM Provider]
    end
    subgraph Output["Output Layer"]
        Docusaurus[Docusaurus Builder]
        S3[(AWS S3)]
    end
    subgraph Storage["Persistence"]
        PG[(PostgreSQL)]
    end
    GH --> FastAPI
    ZIP --> FastAPI
    FastAPI --> Redis
    FastAPI <--> WS
    Redis --> Celery
    Celery --> TS
    TS --> DG
    DG --> LG
    LG <--> LLM
    LG --> Docusaurus
    Docusaurus --> S3
    FastAPI <--> PG
    style AI fill:#e1f5fe
    style Queue fill:#fff3e0
    style Output fill:#e8f5e9
```
```mermaid
sequenceDiagram
    participant User
    participant API as FastAPI
    participant Queue as Celery/Redis
    participant Parser as Tree-sitter
    participant Graph as Dependency Graph
    participant Agent as LangGraph Agent
    participant LLM as LLM Provider
    participant Builder as Docusaurus
    participant S3 as AWS S3
    User->>API: POST /api/jobs {github_url}
    API->>Queue: Enqueue job
    API-->>User: {job_id, status: pending}
    Queue->>Parser: Clone & parse repository
    Parser->>Parser: Extract AST (functions, classes, imports)
    Parser->>Graph: Build dependency graph
    Graph->>Agent: CodebaseAnalysis + dependency context
    loop Parallel file processing
        Agent->>Agent: Chunk large files (>500 lines)
        Agent->>LLM: Generate documentation (semaphore-limited)
        LLM-->>Agent: Markdown content
    end
    Agent->>Builder: Scaffold Docusaurus site
    Builder->>Builder: npm run build
    Builder->>S3: Upload static assets
    S3-->>User: Documentation site URL
```
Tree-sitter provides language-agnostic structural analysis with O(n) parsing complexity:
| Language | Functions | Classes | Imports | Source Extraction |
|---|---|---|---|---|
| Python | ✅ | ✅ | ✅ | ✅ |
| JavaScript | ✅ | ✅ | ✅ | ✅ |
| TypeScript | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
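To illustrate the kind of structural summary this stage produces, here is a minimal sketch using Python's stdlib `ast` module as a stand-in for Tree-sitter (the real parser works across all six languages via Tree-sitter grammars; `extract_structure` and its output shape are illustrative, not the project's actual API):

```python
import ast

def extract_structure(source: str) -> dict:
    """Illustrative stand-in for Tree-sitter extraction: collect
    function names, class names, and imported modules."""
    tree = ast.parse(source)
    functions, classes, imports = [], [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(node.name)
        elif isinstance(node, ast.ClassDef):
            classes.append(node.name)
        elif isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.append(node.module or "")
    return {"functions": functions, "classes": classes, "imports": imports}

sample = "import os\nclass Greeter:\n    def hello(self):\n        return 'hi'\n"
print(extract_structure(sample))
# → {'functions': ['hello'], 'classes': ['Greeter'], 'imports': ['os']}
```

Tree-sitter yields the same kind of structural facts, but from a concrete syntax tree that works uniformly across languages and tolerates partial parse errors.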
```python
# Semaphore-based parallel LLM calls
semaphore = asyncio.Semaphore(MAX_CONCURRENT_LLM_CALLS)

async def _generate_single_file_doc(file, llm, semaphore, codebase):
    async with semaphore:  # Rate-limited concurrency
        if estimate_tokens(file.source_code) > MAX_TOKENS:
            chunks = _chunk_file_content(file)  # Smart boundary detection
            return await _generate_chunk_doc(chunks, llm)
        return await llm.ainvoke(prompt)
```

Key optimizations:
- AsyncIO semaphore prevents LLM rate limit exhaustion
- Token estimation (~4 chars/token) for intelligent chunking
- Logical boundary detection (class/function splits) preserves context
- Parallel chunk processing with result aggregation
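The token estimation and boundary-aware chunking described above can be sketched as follows (a self-contained approximation: `MAX_TOKENS` and the top-level `def`/`class` boundary rule are assumptions, not the project's exact heuristics):

```python
MAX_TOKENS = 3000  # assumed per-call budget, not the project's actual value

def estimate_tokens(text: str) -> int:
    # Heuristic from the docs above: roughly 4 characters per token
    return len(text) // 4

def chunk_at_boundaries(source: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Sketch of logical-boundary chunking: only split at top-level
    def/class lines, so no unit is cut in half mid-body."""
    chunks, current = [], ""
    for line in source.splitlines(keepends=True):
        boundary = line.startswith(("def ", "class "))
        if boundary and current and estimate_tokens(current + line) > max_tokens:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

# Two ~900-token functions exceed an 800-token budget, so they split apart
two_funcs = ("def f1():\n" + "    pass\n" * 400) + ("def f2():\n" + "    pass\n" * 400)
print(len(chunk_at_boundaries(two_funcs, max_tokens=800)))  # → 2
```

Because splits only happen at declaration boundaries, each chunk hands the LLM a complete function or class rather than an arbitrary window of lines.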
```python
# Intelligent model selection based on file complexity
def get_model_tier_for_file(line_count: int, class_count: int, function_count: int) -> ModelTier:
    complexity_score = line_count + (class_count * 50) + (function_count * 10)
    if complexity_score > 1000:
        return ModelTier.LARGE_CONTEXT  # Claude for complex files
    elif complexity_score > 300:
        return ModelTier.COMPLEX        # Nova Pro for medium files
    else:
        return ModelTier.SIMPLE         # Nova Lite for simple files
```

Multi-tier architecture:
| Tier | Model | Use Case | Context Window |
|---|---|---|---|
| SIMPLE | Nova Lite | Small files, configs | 128K |
| COMPLEX | Nova Pro | Multi-class files | 300K |
| LARGE_CONTEXT | Claude 3.5 | Large codebases | 200K |
| SUMMARY | Nova Lite | Folder READMEs | 128K |
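The routing above can be exercised end-to-end with a small self-contained example (the `ModelTier` enum here is a sketch covering the tiers the scoring function returns; the thresholds and weights match the snippet above):

```python
from enum import Enum

class ModelTier(Enum):  # sketch of the tiers from the table above
    SIMPLE = "simple"
    COMPLEX = "complex"
    LARGE_CONTEXT = "large_context"

def get_model_tier_for_file(line_count: int, class_count: int, function_count: int) -> ModelTier:
    # Classes weigh 50 points, functions 10, lines 1 each
    complexity_score = line_count + (class_count * 50) + (function_count * 10)
    if complexity_score > 1000:
        return ModelTier.LARGE_CONTEXT
    elif complexity_score > 300:
        return ModelTier.COMPLEX
    return ModelTier.SIMPLE

# A 200-line file with 2 classes and 8 functions scores 200 + 100 + 80 = 380
print(get_model_tier_for_file(200, 2, 8).name)  # → COMPLEX
```

The weighting means a class-heavy file escalates to a stronger model well before a long but flat script does.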
```python
# Production-grade fault tolerance
import time
from dataclasses import dataclass

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    reset_timeout: float = 60.0
    _state: str = "closed"          # closed | open | half-open
    _last_failure_time: float = 0.0

    def can_proceed(self) -> bool:
        if self._state == "open":
            if time.time() - self._last_failure_time > self.reset_timeout:
                self._state = "half-open"  # Allow test request
                return True
            return False
        return True
```

Graceful degradation strategy:
- Tier fallback: LARGE_CONTEXT → COMPLEX → SIMPLE
- Circuit breaker: Opens after 5 consecutive failures
- Placeholder docs: AST-based documentation when LLM unavailable
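A breaker like the one above also needs the failure-recording side. This sketch adds hypothetical `record_failure`/`record_success` methods (assumed complements to `can_proceed`, not the project's exact implementation) to show how the breaker opens after five consecutive failures:

```python
import time
from dataclasses import dataclass

@dataclass
class CircuitBreaker:
    # Hypothetical sketch of the recording side of the breaker above
    failure_threshold: int = 5
    reset_timeout: float = 60.0
    _failures: int = 0
    _state: str = "closed"
    _last_failure_time: float = 0.0

    def record_failure(self) -> None:
        self._failures += 1
        self._last_failure_time = time.time()
        if self._failures >= self.failure_threshold:
            self._state = "open"  # Stop calling the LLM until the timeout elapses

    def record_success(self) -> None:
        self._failures = 0     # Any success resets the consecutive-failure count
        self._state = "closed"

breaker = CircuitBreaker()
for _ in range(5):
    breaker.record_failure()
print(breaker._state)  # → open
```

While the breaker is open, the pipeline falls through the tier fallback chain and ultimately emits AST-based placeholder docs instead of failing the job.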
```python
# Module resolution with import tracking
def build_dependency_graph(files: list[FileAnalysis]) -> None:
    module_to_file = {get_module_name(f.file_path): f for f in files}
    for file in files:
        for imp in file.imports:
            if imp.module in module_to_file:
                file.imports_from.append(module_to_file[imp.module])
                module_to_file[imp.module].imported_by.append(file)
```

Enables:
- "Related Files" context in LLM prompts
- Bidirectional import/export tracking
- Module-level documentation coherence
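A toy run of this graph construction, with a minimal stand-in `FileAnalysis` model (imports simplified to plain module-name strings, and `get_module_name` reduced to a filename-based heuristic; both are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class FileAnalysis:  # minimal stand-in for the real analysis model
    file_path: str
    imports: list = field(default_factory=list)       # imported module names
    imports_from: list = field(default_factory=list)  # resolved FileAnalysis refs
    imported_by: list = field(default_factory=list)

def get_module_name(path: str) -> str:
    # Simplified resolution: "src/utils.py" -> "utils"
    return path.rsplit("/", 1)[-1].removesuffix(".py")

def build_dependency_graph(files: list) -> None:
    module_to_file = {get_module_name(f.file_path): f for f in files}
    for file in files:
        for imp in file.imports:
            if imp in module_to_file:
                file.imports_from.append(module_to_file[imp])
                module_to_file[imp].imported_by.append(file)

utils = FileAnalysis("src/utils.py")
main = FileAnalysis("src/main.py", imports=["utils"])
build_dependency_graph([utils, main])
print([f.file_path for f in utils.imported_by])  # → ['src/main.py']
```

The bidirectional links are what let the prompt for `utils.py` mention that `main.py` depends on it, and vice versa.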
```yaml
# prompts/file_documentation.yaml - Versioned, configurable prompts
version: "1.0"
file_prompt:
  base: |
    You are a senior technical writer creating documentation for:
    (unknown) ({language})
  language_additions:
    python: "Include type hints and docstring standards."
    javascript: "Document async patterns and event handlers."
```

Benefits:
- Version-controlled prompt evolution
- Language-specific customization without code changes
- Hot-reload capability for prompt tuning
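At runtime, a config like this is composed into a per-file prompt. The sketch below works from the already-parsed dict form (the `{file_path}` placeholder and `build_file_prompt` helper are assumptions mirroring the YAML's shape, not the project's actual loader):

```python
# Parsed form of a prompt file like the one above (e.g. yaml.safe_load output)
prompt_config = {
    "version": "1.0",
    "file_prompt": {
        "base": "You are a senior technical writer creating documentation for:\n{file_path} ({language})",
        "language_additions": {
            "python": "Include type hints and docstring standards.",
            "javascript": "Document async patterns and event handlers.",
        },
    },
}

def build_file_prompt(file_path: str, language: str) -> str:
    """Hypothetical sketch: fill the base template, then append any
    language-specific addition -- no application code changes needed."""
    fp = prompt_config["file_prompt"]
    prompt = fp["base"].format(file_path=file_path, language=language)
    addition = fp["language_additions"].get(language)
    return f"{prompt}\n{addition}" if addition else prompt

print(build_file_prompt("src/utils.py", "python"))
```

Adding support for a new language becomes a one-line YAML edit rather than a code change.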
```python
class FileCategory(str, Enum):
    CODE = "code"      # Parsed by Tree-sitter
    CONFIG = "config"  # YAML, JSON, TOML, ENV, Dockerfile
    DOCS = "docs"      # Markdown, RST - passthrough
    DATA = "data"      # CSV, SQL, fixtures
    ASSET = "asset"    # Images, fonts - skip
    BUILD = "build"    # CI/CD, Makefile
```

Smart handling:
- Markdown passthrough: Existing README.md files preserved as intro page
- Config documentation: Custom prompts for infrastructure files
- Asset detection: Binary files automatically excluded
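Categorization of this kind is typically driven by extension and filename tables. A minimal sketch (the extension map, special-name table, and asset-by-default fallback are assumptions; the real categorizer may cover more types and use different defaults):

```python
from enum import Enum
from pathlib import PurePath

class FileCategory(str, Enum):
    CODE = "code"
    CONFIG = "config"
    DOCS = "docs"
    DATA = "data"
    ASSET = "asset"
    BUILD = "build"

# Hypothetical lookup tables for illustration only
EXTENSION_MAP = {
    ".py": FileCategory.CODE,
    ".ts": FileCategory.CODE,
    ".yaml": FileCategory.CONFIG,
    ".toml": FileCategory.CONFIG,
    ".md": FileCategory.DOCS,
    ".csv": FileCategory.DATA,
    ".png": FileCategory.ASSET,
}
SPECIAL_NAMES = {"Dockerfile": FileCategory.CONFIG, "Makefile": FileCategory.BUILD}

def categorize(path: str) -> FileCategory:
    p = PurePath(path)
    if p.name in SPECIAL_NAMES:  # extensionless build/config files
        return SPECIAL_NAMES[p.name]
    # Unknown extensions treated as assets (skipped) in this sketch
    return EXTENSION_MAP.get(p.suffix.lower(), FileCategory.ASSET)

print(categorize("README.md").value)  # → docs
```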
```mermaid
stateDiagram-v2
    [*] --> generate_files: CodebaseAnalysis
    generate_files --> generate_folders: file_docs
    generate_folders --> assemble: folder_docs
    assemble --> [*]: output_files
    state generate_files {
        [*] --> tier_routing
        tier_routing --> invoke_llm: with fallback
        invoke_llm --> circuit_check
        circuit_check --> placeholder: if open
        circuit_check --> success: if closed
    }
    note right of generate_files
        • Tier-based model routing
        • Circuit breaker protection
        • Graceful placeholder fallback
    end note
```
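The three top-level nodes can be sketched as a plain-Python state pipeline (LangGraph adds checkpointing, streaming, and conditional edges on top; node names mirror the diagram, and the doc-generation bodies are placeholders, not the real LLM calls):

```python
def generate_files(state: dict) -> dict:
    # Placeholder for the tier-routed, circuit-protected LLM node
    docs = {f: f"# Docs for {f}" for f in state["codebase_files"]}
    return {**state, "file_docs": docs}

def generate_folders(state: dict) -> dict:
    # Summarize each folder that contains documented files
    folders = {f.rsplit("/", 1)[0] for f in state["file_docs"]}
    return {**state, "folder_docs": {d: f"# Overview of {d}" for d in folders}}

def assemble(state: dict) -> dict:
    # Merge file and folder docs into the final output set
    output = {**state["file_docs"], **state["folder_docs"]}
    return {**state, "output_files": output}

state = {"codebase_files": ["src/a.py", "src/b.py"]}
for node in (generate_files, generate_folders, assemble):
    state = node(state)
print(sorted(state["output_files"]))  # → ['src', 'src/a.py', 'src/b.py']
```

Each node reads and extends a shared state dict, which is the same accumulate-and-pass-forward pattern the LangGraph workflow follows.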
| Component | Technology | Purpose |
|---|---|---|
| API | FastAPI + Pydantic | REST endpoints, WebSocket progress |
| Queue | Celery + Redis | Distributed task processing |
| Database | PostgreSQL (Neon) | Job persistence, status tracking |
| Parsing | Tree-sitter | Multi-language AST extraction |
| AI Orchestration | LangGraph | Stateful agent workflow |
| LLM | Multi-provider (Bedrock/OpenRouter) | Documentation generation |
| Static Site | Docusaurus 3.0 | Enterprise-themed output |
| Storage | AWS S3 | Deployed documentation hosting |
| Observability | LangSmith | Agent tracing and debugging |
- Python 3.12+ with uv
- Docker and Docker Compose
- AWS credentials (for S3) or OpenRouter API key
```bash
# Clone repository
git clone https://github.com/Asirwad/docsmitr.git
cd docsmitr

# Start infrastructure
docker-compose up -d  # Redis + PostgreSQL

# Backend setup
cd backend
cp .env.example .env.development
uv sync
uv run alembic upgrade head

# Start services (separate terminals)
uv run uvicorn docsmitr.main:app --reload --port 8000
uv run celery -A docsmitr.worker.celery_app worker --loglevel=info -P solo
```

```bash
# Create job
curl -X POST http://localhost:8000/api/jobs \
  -H "Content-Type: application/json" \
  -d '{"github_url": "https://github.com/owner/repo"}'

# Check status
curl http://localhost:8000/api/jobs/{job_id}
```

```text
docsmitr/
├── backend/
│   └── src/docsmitr/
│       ├── api/        # REST + WebSocket endpoints
│       ├── agent/      # LangGraph workflow + LLM providers
│       ├── parsers/    # Tree-sitter AST extraction
│       ├── services/   # Docusaurus, S3, repository cloning
│       ├── worker/     # Celery task definitions
│       └── core/       # Config, logging, dependencies
├── frontend/           # Next.js application (coming soon)
└── docker-compose.yml  # Local dev infrastructure
```
| Variable | Description | Default |
|---|---|---|
| `LLM_PROVIDER` | `bedrock` or `openrouter` | `bedrock` |
| `AWS_REGION` | AWS region for Bedrock | `us-east-1` |
| `BEDROCK_MODEL` | Bedrock model ID | `us.amazon.nova-lite-v1:0` |
| `OPENROUTER_MODEL` | OpenRouter model | `deepseek/deepseek-r1-0528:free` |
| `S3_BUCKET_NAME` | Documentation output bucket | — |
| `REDIS_URL` | Celery broker URL | `redis://localhost:6379/0` |
| `DATABASE_URL` | PostgreSQL connection | — |
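The defaults-versus-required split in the table can be read as a small settings sketch using only the stdlib (`get_setting` and the `DEFAULTS` table are illustrative; the project may load configuration differently):

```python
import os

# Defaults from the table above; S3_BUCKET_NAME and DATABASE_URL have none
DEFAULTS = {
    "LLM_PROVIDER": "bedrock",
    "AWS_REGION": "us-east-1",
    "BEDROCK_MODEL": "us.amazon.nova-lite-v1:0",
    "REDIS_URL": "redis://localhost:6379/0",
}

def get_setting(name: str) -> str:
    """Hypothetical helper: environment first, then table default,
    else fail fast for required variables."""
    value = os.environ.get(name, DEFAULTS.get(name))
    if value is None:
        raise RuntimeError(f"{name} must be set")
    return value

print(get_setting("AWS_REGION"))  # us-east-1 unless overridden in the environment
```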
- V1: Core pipeline (parsing, generation, deployment)
- V2: Enhanced prompts, chunking, cross-file context, UI theming
- V2.1: Tier-based model routing, circuit breakers, graceful degradation
- V2.2: YAML prompt system, file categorization, README-powered intros
- V3: Task cancellation, job deletion with S3 cleanup
- V4: Frontend dashboard with real-time progress
- V5: Custom branding, enterprise SSO, team workspaces
Contributions are welcome. Please open an issue to discuss proposed changes before submitting a PR.
This project is licensed under the Apache License 2.0 with Attribution Requirement.
See LICENSE for details.
Built by Asirwad
If Docsmitr saves you time, consider starring the repo.