This project provides a starter Flask web application to help evaluate document data extraction providers for formats such as images and PDFs. The development environment relies on Docker and GNU Make.
- Docker
- GNU Make
Build and run the application image directly:
```bash
make docker-run
```

Alternatively, start the development stack with Docker Compose:

```bash
make compose-up
```

Stop the Docker Compose stack with:

```bash
make compose-down
```

The application exposes a single homepage with Bootstrap styling, ready for expansion with additional routes and features. Once running, visit http://localhost:8000 to access the app.
Set the `FLASK_SECRET_KEY` environment variable to override the default development secret key used for session management.
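A minimal sketch of that pattern (the fallback value here is illustrative, not the app's actual default):

```python
import os

from flask import Flask

app = Flask(__name__)
# Fall back to a development-only key when FLASK_SECRET_KEY is unset.
app.secret_key = os.environ.get("FLASK_SECRET_KEY", "dev-only-insecure-key")
```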
Configure logging output via the `LOG_LEVEL` environment variable:

- `DEBUG`: Detailed diagnostic information
- `INFO`: General informational messages (default)
- `WARNING`: Warning messages
- `ERROR`: Error messages
- `CRITICAL`: Critical errors

Example: `LOG_LEVEL=DEBUG`
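For example, `LOG_LEVEL` can be wired into Python's standard logging like this (a sketch, not necessarily how this app does it):

```python
import logging
import os

# Map the LOG_LEVEL name (e.g., "DEBUG") to a logging constant; default to INFO.
level = getattr(logging, os.environ.get("LOG_LEVEL", "INFO").upper(), logging.INFO)
logging.basicConfig(level=level, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
```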
The application uses MongoDB to store documents and experiments. Connection settings:
- `MONGODB_URI`: MongoDB connection string (default: `mongodb://localhost:27017/`)
- `MONGODB_DATABASE`: Database name (default: `document_data_extration`)
These are automatically configured in Docker Compose.
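Connecting with these settings via `pymongo` might look like this (a sketch using the documented defaults; collection names are illustrative):

```python
import os

from pymongo import MongoClient

client = MongoClient(os.environ.get("MONGODB_URI", "mongodb://localhost:27017/"))
db = client[os.environ.get("MONGODB_DATABASE", "document_data_extration")]  # default name as documented
documents = db["documents"]  # collection name is illustrative
```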
The application supports multiple document parsing providers for evaluation and comparison.
**AWS Textract (`aws_textract`)**
- Traditional OCR-based document parsing
- Requires AWS credentials and S3 bucket configuration
- Asynchronous processing with job polling
- Best for forms and tables
- Processes all pages of multi-page PDFs
- Configuration:
  `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_TEXTRACT_BUCKET`
**Azure AI Document Intelligence (`azure_doc_intelligence`)**
- Microsoft's OCR and document analysis service
- Supports multiple prebuilt models (document, layout, invoice, receipt, custom)
- Synchronous processing with built-in polling
- Handles images and multi-page PDFs natively
- Extracts key-value pairs with confidence scores
- Most cost-effective OCR option ($0.0015/page with 500 pages/month free tier)
- Configuration:
  `AZURE_DOC_INTELLIGENCE_ENDPOINT`, `AZURE_DOC_INTELLIGENCE_KEY`, optional `AZURE_DOC_INTELLIGENCE_MODEL`
**OpenAI Vision (`openai_vision`)**
- LLM-based document understanding using GPT-4 Vision API
- Simpler setup (API key only, no S3 required)
- Synchronous processing (single API call)
- ~3x more cost-effective than AWS Textract
- Better semantic understanding of document content
- PDFs: First page only (Vision API limitation)
- Configuration:
  `OPENAI_API_KEY`, `OPENAI_MODEL` (optional), `OPENAI_MAX_TOKENS` (optional)
**OpenAI Responses (`openai_responses`)**
- Uses OpenAI Files + Responses APIs with structured JSON output
- Uploads the original PDF (up to 4 pages) via the Files API for analysis
- Rejects PDFs with more than 4 pages to control cost/latency
- Shares prompts and taxonomy normalization with other LLM providers
- Useful for evaluating multi-page PDF performance under the Responses API
- Configuration:
  `OPENAI_API_KEY`, `OPENAI_MODEL` (optional, `gpt-4o`/`gpt-4o-mini`), `OPENAI_MAX_TOKENS` (optional)
**Google Document AI (`google_docai`)**
- Google Cloud OCR/understanding platform
- Handles scanned images and multi-page PDFs natively
- Extracts form fields and entities with confidence scores
- Requires GCP project, processor ID, and service account credentials
- Configuration:
  `GOOGLE_DOC_AI_PROJECT_ID`, `GOOGLE_DOC_AI_PROCESSOR_ID`, optional `GOOGLE_DOC_AI_LOCATION`, `GOOGLE_APPLICATION_CREDENTIALS`
**Anthropic Claude (`anthropic_claude`)**
- LLM-based document understanding using Claude 3.5 Sonnet
- API key only setup (no S3 required)
- Synchronous processing (single API call)
- Excellent instruction following and JSON output
- 200K token context window (vs 128K for GPT-4o)
- Competitive pricing between GPT-4o-mini and AWS Textract
- PDFs: First page only (Vision API limitation)
- Configuration:
  `ANTHROPIC_API_KEY`, `ANTHROPIC_MODEL` (optional), `ANTHROPIC_MAX_TOKENS` (optional)
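This README does not show the provider abstraction itself; purely as a hypothetical sketch, providers of this kind typically share an interface along these lines (all names illustrative):

```python
from abc import ABC, abstractmethod


class DocumentParsingProvider(ABC):
    """Hypothetical common interface; the project's actual classes may differ."""

    name: str  # e.g., "aws_textract", "openai_vision"

    @abstractmethod
    def parse(self, file_bytes: bytes, content_type: str) -> dict:
        """Return extracted fields as a dict, later normalized via the taxonomy."""
```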
Example environment configuration (e.g., in a `.env` file):

```bash
# AWS Textract
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_TEXTRACT_BUCKET=your-s3-bucket-name
AWS_DEFAULT_REGION=us-east-1

# Azure AI Document Intelligence
AZURE_DOC_INTELLIGENCE_ENDPOINT=https://YOUR_RESOURCE.cognitiveservices.azure.com/
AZURE_DOC_INTELLIGENCE_KEY=your_azure_api_key
AZURE_DOC_INTELLIGENCE_MODEL=prebuilt-document  # Optional, default: prebuilt-document

# OpenAI
OPENAI_API_KEY=your_openai_api_key
OPENAI_MODEL=gpt-4o-mini  # Optional, default: gpt-4o-mini
OPENAI_MAX_TOKENS=1500  # Optional, default: 1500

# Google Document AI
GOOGLE_DOC_AI_PROJECT_ID=your_project_id
GOOGLE_DOC_AI_PROCESSOR_ID=your_processor_id
GOOGLE_DOC_AI_LOCATION=us  # Optional, default: us
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account.json

# Anthropic Claude
ANTHROPIC_API_KEY=your_anthropic_api_key
ANTHROPIC_MODEL=claude-3-5-sonnet-20240620  # Optional, default: claude-3-5-sonnet-20240620
ANTHROPIC_MAX_TOKENS=2048  # Optional, default: 2048
ANTHROPIC_FALLBACK_MODELS=claude-3-sonnet-20240229,claude-3-haiku-20240307  # Optional, comma-separated fallback list
```

Available Claude Models:

- `claude-3-5-sonnet-20240620` (default) - Best balance of speed, quality, and cost
- `claude-3-opus-20240229` - Highest quality, most expensive
- `claude-3-sonnet-20240229` - Fast and capable
- `claude-3-haiku-20240307` - Fastest, most economical
Different providers handle multi-page PDFs differently:
| Provider | PDF Processing | Approach |
|---|---|---|
| AWS Textract | All pages | Native PDF support, OCR-based processing |
| Azure Doc Intelligence | All pages | Native PDF support with Azure Document Intelligence |
| OpenAI Vision | First page only | Converts PDF to image, Vision API |
| OpenAI Responses | Up to 4 pages | Files + Responses API with single PDF upload |
| Google Doc AI | All pages | Google Document AI processor |
| Anthropic Claude | First page only | Converts PDF to image, Vision API |
Why first page only for Vision API providers?
- Vision APIs are designed for single image analysis
- Multi-page support would require N separate API calls (N × cost)
- Most identity documents (passports, licenses, birth certificates) are single page
For multi-page documents: Use AWS Textract, Azure Document Intelligence, or Google Document AI (all pages), or OpenAI Responses (up to 4 pages).
Comparing Approaches:
- OpenAI Vision vs. OpenAI PDF: same Chat Completions API, different input; Vision sends a base64-encoded image (first page), while PDF sends a base64-encoded PDF (all pages)
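As a sketch of the 4-page guard described for the OpenAI Responses provider (assuming the `pypdf` library is available; the helper name is hypothetical):

```python
from io import BytesIO

from pypdf import PdfReader  # assumed dependency; any PDF library with page counts works

MAX_RESPONSES_PAGES = 4  # limit described for the OpenAI Responses provider


def check_page_limit(pdf_bytes: bytes) -> int:
    """Return the page count, or raise if the PDF exceeds the provider limit."""
    page_count = len(PdfReader(BytesIO(pdf_bytes)).pages)
    if page_count > MAX_RESPONSES_PAGES:
        raise ValueError(
            f"PDF has {page_count} pages; the Responses provider accepts at most "
            f"{MAX_RESPONSES_PAGES} to control cost and latency."
        )
    return page_count
```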
The application uses a standardized field taxonomy to normalize extracted data across different document formats and providers.
Taxonomy Location:
- NEW: Taxonomy files now reside in `app/services/taxonomy/` (modular, document-type-specific organization)
- Multiple taxonomy categories: `common`, `identity`, `birth_cert`, `drivers_license`, `marriage_certificate`, `form_1040`, `form_w2`
- Each category has its own `.taxonomy.yml` (field definitions) and `.mapping.yml` (name/value variations)
Features:
- Dynamic Prompt Generation: LLM prompts automatically generated from taxonomy definitions (single source of truth)
- Multi-language Support: Handles field names in English, Spanish, French, and Portuguese
- Fuzzy Field Matching: Handles OCR errors and typos (85% similarity threshold, library-agnostic implementation); see the sketch after this list
- Separator Handling: Recognizes `/`, `-`, and `|` as field separators (e.g., "Sex/Sexo/Genre")
- Date Normalization: Automatically parses various date formats and converts to ISO format (YYYY-MM-DD)
  - Supports: Text formats (`05 FEB 1965`), US format (`02/05/1965`), European format (`05/02/1965`), compact format (`19650205`)
  - Applied to fields with `format: YYYY-MM-DD` in taxonomy (e.g., `date_of_birth`, `issue_date`, `expiration_date`)
- Value Normalization: Converts variants (e.g., "M" → "male", "BRN" → "brown")
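The fuzzy matching and date normalization described above can be sketched with only the standard library (an illustration of the technique, not the project's actual implementation):

```python
from datetime import datetime
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # the 85% threshold described above

# Formats mirroring the examples above: "05 FEB 1965", "02/05/1965", "19650205".
# US vs. European slash dates are ambiguous; a real implementation would need
# locale hints to choose between %m/%d/%Y and %d/%m/%Y.
DATE_FORMATS = ["%d %b %Y", "%m/%d/%Y", "%Y%m%d"]


def fuzzy_match(candidate: str, canonical: str) -> bool:
    """True if two field names are at least 85% similar (tolerates OCR errors)."""
    ratio = SequenceMatcher(None, candidate.lower(), canonical.lower()).ratio()
    return ratio >= SIMILARITY_THRESHOLD


def normalize_date(value: str) -> str | None:
    """Convert a raw date string to ISO YYYY-MM-DD, or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


assert fuzzy_match("Date of Brth", "date of birth")  # survives a dropped letter
assert normalize_date("05 FEB 1965") == "1965-02-05"
```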
The application automatically calculates comprehensive metrics for each experiment to enable objective provider comparison.
Latency:
- Processing time in milliseconds from experiment start to completion
- Includes provider-specific processing (API calls, job polling, etc.)
Cost:
- Estimated processing cost in USD based on provider pricing models (as of 2024)
- AWS Textract: $0.05 per page
- Azure Document Intelligence: $0.0015 per page (prebuilt), $0.004 per page (custom), 500 pages/month free
- OpenAI gpt-4o-mini: ~$0.00054 per document (token-based)
- Anthropic Claude 3.5 Sonnet: ~$0.0126 per document (token-based)
- Google Document AI: ~$0.10 per page (Form Parser pricing)
Accuracy:
- Percentage of correctly extracted fields vs. expected values
- Counts: matched fields, mismatched fields, missing fields
- Compares normalized/canonical data for consistency
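A minimal sketch of that accuracy comparison (the dict shapes are illustrative, not the project's exact schema):

```python
def field_accuracy(expected: dict[str, str], extracted: dict[str, str]) -> dict:
    """Compare normalized extracted fields against expected values."""
    matched, mismatched, missing = [], [], []
    for field, want in expected.items():
        got = extracted.get(field)
        if got is None:
            missing.append(field)
        elif got == want:
            matched.append(field)
        else:
            mismatched.append(field)
    total = len(expected)
    return {
        "accuracy_pct": 100.0 * len(matched) / total if total else 0.0,
        "matched": matched,
        "mismatched": mismatched,
        "missing": missing,
    }
```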
- `MetricsCalculator` service orchestrates all metric calculations
- Provider-specific cost calculators in the `app/providers/cost/` module
- Metrics stored in the experiment `metrics` field for historical tracking
- Automatic calculation after experiment completion
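As a sanity check on the gpt-4o-mini figure above, assuming 2024 list prices of $0.15 per 1M input tokens and $0.60 per 1M output tokens (an assumption about the pricing basis, not stated in this project):

```python
# Hypothetical token counts for a one-page document; prices are 2024 gpt-4o-mini
# list prices ($/1M tokens) and may change.
input_tokens, output_tokens = 3_000, 150
cost = input_tokens * 0.15 / 1_000_000 + output_tokens * 0.60 / 1_000_000
print(f"${cost:.5f}")  # $0.00054
```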
Populate MongoDB with initial data from YAML files via the web interface:
- Start the application: `make compose-up`
- Navigate to the Documents page: `http://localhost:8000/documents`
http://localhost:8000/documents - Upload documents and define expected extraction fields
- Create experiments to test providers
Taxonomy Customization:
- Edit taxonomy files in `app/services/taxonomy/` to customize field definitions
- Add new document types by creating `<doctype>.taxonomy.yml` and `<doctype>.mapping.yml` files
- No code changes needed; files are auto-discovered and loaded
- See `app/services/taxonomy/README.md` for detailed format documentation
The application includes a test suite covering taxonomy functionality, field mapping, and prompt generation.
Run all tests:
```bash
make test
```

Run only unit tests (fast):

```bash
make test-unit
```

Run only integration tests:

```bash
make test-integration
```

Run taxonomy-specific tests:

```bash
make test-taxonomy
```

Run tests with coverage report:

```bash
make test-coverage
```

Test suite layout:

```
tests/
├── conftest.py                        # Pytest configuration and fixtures
├── unit/services/                     # Unit tests (50 tests)
│   ├── test_taxonomy_service.py       # TaxonomyService tests
│   └── test_prompt_generator.py       # PromptGeneratorService tests
└── integration/                       # Integration tests (13 tests)
    └── test_taxonomy_integration.py   # End-to-end workflow tests
```
Tests follow pytest conventions and use shared fixtures from `conftest.py`.
Example unit test:
```python
# tests/unit/services/test_my_service.py
import pytest


class TestMyService:
    def test_my_function(self, taxonomy_service):
        # Arrange
        data = {'field': 'value'}

        # Act
        result = taxonomy_service.map_extracted_data(data, 'passport')

        # Assert
        assert result is not None
```

Run a specific test:

```bash
docker compose exec web pytest tests/unit/services/test_taxonomy_service.py::TestFieldMapping::test_map_passport_data -v
```

Available fixtures:
- `taxonomy_service` - Fresh TaxonomyService instance
- `prompt_generator_service` - Fresh PromptGeneratorService instance
- `sample_passport_data` - Sample passport test data
- `sample_drivers_license_data` - Sample DL test data
- `sample_birth_certificate_data` - Sample birth cert test data
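Such fixtures might be defined along these lines in `conftest.py` (import path and constructors are assumptions, not taken from the project):

```python
# tests/conftest.py (illustrative sketch; real import paths may differ)
import pytest

from app.services.taxonomy_service import TaxonomyService  # assumed module path


@pytest.fixture
def taxonomy_service():
    """Provide a fresh TaxonomyService instance per test."""
    return TaxonomyService()


@pytest.fixture
def sample_passport_data():
    """Minimal raw extraction payload shaped like a passport document."""
    return {"Surname": "DOE", "Given Names": "JANE", "Date of Birth": "05 FEB 1965"}
```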
For more details, see `tests/README.md` and `TEST_SUITE_SUMMARY.md`.