Document Data Extraction Benchmark

This project provides a starter Flask web application to help evaluate document data extraction providers for formats such as images and PDFs. The development environment relies on Docker and GNU Make.

Requirements

  • Docker
  • GNU Make

Getting Started

Build and run the application image directly:

make docker-run

Alternatively, start the development stack with Docker Compose:

make compose-up

Stop the Docker Compose stack with:

make compose-down

The application exposes a single homepage with Bootstrap styling, ready for expansion with additional routes and features. Once running, visit http://localhost:8000 to access the app.

Configuration

Set the FLASK_SECRET_KEY environment variable to override the default development secret key used for session management.
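
A minimal sketch of how this override typically works in a Flask app; the fallback value below is illustrative, not necessarily the app's actual default:

# Sketch only: reading FLASK_SECRET_KEY from the environment.
import os
from flask import Flask

app = Flask(__name__)
app.secret_key = os.environ.get("FLASK_SECRET_KEY", "dev-only-insecure-key")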

Logging

Configure logging output via the LOG_LEVEL environment variable:

  • DEBUG: Detailed diagnostic information
  • INFO: General informational messages (default)
  • WARNING: Warning messages
  • ERROR: Error messages
  • CRITICAL: Critical errors

Example: LOG_LEVEL=DEBUG
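
As a rough illustration, mapping LOG_LEVEL onto Python's standard logging module might look like this (the app's actual setup may differ):

# Sketch only: configure the root logger from the LOG_LEVEL variable.
import logging
import os

level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))
logging.getLogger(__name__).info("Logging configured at %s", level_name)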

MongoDB Configuration

The application uses MongoDB to store documents and experiments. Connection settings:

  • MONGODB_URI: MongoDB connection string (default: mongodb://localhost:27017/)
  • MONGODB_DATABASE: Database name (default: document_data_extration)

These are automatically configured in Docker Compose.
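
A minimal connection sketch using pymongo with the variables above (the collection name is illustrative; defaults mirror this README):

# Sketch only: open a MongoDB connection from the environment settings.
import os
from pymongo import MongoClient

client = MongoClient(os.environ.get("MONGODB_URI", "mongodb://localhost:27017/"))
db = client[os.environ.get("MONGODB_DATABASE", "document_data_extration")]
documents = db["documents"]  # collection name is illustrative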

Document Data Extraction Providers

The application supports multiple document parsing providers for evaluation and comparison.

Available Providers

AWS Textract (aws_textract)

  • Traditional OCR-based document parsing
  • Requires AWS credentials and S3 bucket configuration
  • Asynchronous processing with job polling
  • Best for forms and tables
  • Processes all pages of multi-page PDFs
  • Configuration: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_TEXTRACT_BUCKET
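
As a rough sketch of the asynchronous flow above, a boto3 client starts a job against the S3 object and polls until it completes; bucket and object names are placeholders, and the repo's actual wrapper may differ:

# Sketch only: start a Textract job and poll for the result.
import time
import boto3

textract = boto3.client("textract")  # credentials/region come from the env vars above

job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "your-s3-bucket-name", "Name": "document.pdf"}}
)

while True:
    result = textract.get_document_text_detection(JobId=job["JobId"])
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(2)  # poll until the job finishes

blocks = result.get("Blocks", [])  # OCR output blocks (pages, lines, words)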

Azure AI Document Intelligence (azure_doc_intelligence)

  • Microsoft's OCR and document analysis service
  • Supports multiple prebuilt models (document, layout, invoice, receipt, custom)
  • Synchronous processing with built-in polling
  • Handles images and multi-page PDFs natively
  • Extracts key-value pairs with confidence scores
  • Most cost-effective OCR option ($0.0015/page with 500 pages/month free tier)
  • Configuration: AZURE_DOC_INTELLIGENCE_ENDPOINT, AZURE_DOC_INTELLIGENCE_KEY, optional AZURE_DOC_INTELLIGENCE_MODEL
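
A rough sketch using the azure-ai-formrecognizer SDK as one possible client (the newer azure-ai-documentintelligence package has a similar shape); the file name is illustrative:

# Sketch only: analyze a document with a prebuilt model and read
# key-value pairs with confidence scores, as described above.
import os
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint=os.environ["AZURE_DOC_INTELLIGENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_DOC_INTELLIGENCE_KEY"]),
)

with open("document.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        os.environ.get("AZURE_DOC_INTELLIGENCE_MODEL", "prebuilt-document"), document=f
    )
result = poller.result()  # the SDK polls the service until analysis completes

for pair in result.key_value_pairs:
    if pair.key and pair.value:
        print(pair.key.content, "=", pair.value.content, f"({pair.confidence:.2f})")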

OpenAI Vision (openai_vision)

  • LLM-based document understanding using GPT-4 Vision API
  • Simpler setup (API key only, no S3 required)
  • Synchronous processing (single API call)
  • ~3x more cost-effective than AWS Textract
  • Better semantic understanding of document content
  • PDFs: First page only (Vision API limitation)
  • Configuration: OPENAI_API_KEY, OPENAI_MODEL (optional), OPENAI_MAX_TOKENS (optional)
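
A minimal sketch of the single-call pattern described above, using the official openai Python SDK with a base64-encoded image; the file name and prompt text are illustrative:

# Sketch only: one Chat Completions call with an inline image.
import base64
import os
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("passport.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"),
    max_tokens=int(os.environ.get("OPENAI_MAX_TOKENS", "1500")),
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the document fields as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)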

OpenAI Responses (openai_responses)

  • Uses OpenAI Files + Responses APIs with structured JSON output
  • Uploads the original PDF (up to 4 pages) via the Files API for analysis
  • Rejects PDFs with more than 4 pages to control cost/latency
  • Shares prompts and taxonomy normalization with other LLM providers
  • Useful for evaluating multi-page PDF performance under the Responses API
  • Configuration: OPENAI_API_KEY, OPENAI_MODEL (optional, gpt-4o/gpt-4o-mini), OPENAI_MAX_TOKENS (optional)
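
A minimal sketch of the Files + Responses flow, assuming OpenAI's documented pattern for PDF inputs; the prompt is illustrative and the 4-page guard is omitted:

# Sketch only: upload the PDF once, then reference it in a Responses call.
# (The provider rejects PDFs with more than 4 pages before uploading.)
import os
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

uploaded = client.files.create(file=open("document.pdf", "rb"), purpose="user_data")

response = client.responses.create(
    model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"),
    input=[{
        "role": "user",
        "content": [
            {"type": "input_file", "file_id": uploaded.id},
            {"type": "input_text", "text": "Extract the document fields as JSON."},
        ],
    }],
)
print(response.output_text)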

Google Document AI (google_docai)

  • Google Cloud OCR/understanding platform
  • Handles scanned images and multi-page PDFs natively
  • Extracts form fields and entities with confidence scores
  • Requires GCP project, processor ID, and service account credentials
  • Configuration: GOOGLE_DOC_AI_PROJECT_ID, GOOGLE_DOC_AI_PROCESSOR_ID, optional GOOGLE_DOC_AI_LOCATION, GOOGLE_APPLICATION_CREDENTIALS
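
A rough sketch of a synchronous request using the google-cloud-documentai client, with processor configuration taken from the variables above (the file name is illustrative):

# Sketch only: send raw PDF bytes to the configured Document AI processor.
import os
from google.cloud import documentai  # GOOGLE_APPLICATION_CREDENTIALS supplies auth

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(
    os.environ["GOOGLE_DOC_AI_PROJECT_ID"],
    os.environ.get("GOOGLE_DOC_AI_LOCATION", "us"),
    os.environ["GOOGLE_DOC_AI_PROCESSOR_ID"],
)

with open("document.pdf", "rb") as f:
    raw = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(request=documentai.ProcessRequest(name=name, raw_document=raw))
print(result.document.text[:200])  # OCR text; form fields/entities are also on result.document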

Anthropic Claude (anthropic_claude)

  • LLM-based document understanding using Claude 3.5 Sonnet
  • API key only setup (no S3 required)
  • Synchronous processing (single API call)
  • Excellent instruction following and JSON output
  • 200K token context window (vs 128K for GPT-4o)
  • Priced between GPT-4o-mini and AWS Textract
  • PDFs: First page only (Vision API limitation)
  • Configuration: ANTHROPIC_API_KEY, ANTHROPIC_MODEL (optional), ANTHROPIC_MAX_TOKENS (optional)

Provider Configuration

AWS Textract

AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_TEXTRACT_BUCKET=your-s3-bucket-name
AWS_DEFAULT_REGION=us-east-1

Azure AI Document Intelligence

AZURE_DOC_INTELLIGENCE_ENDPOINT=https://YOUR_RESOURCE.cognitiveservices.azure.com/
AZURE_DOC_INTELLIGENCE_KEY=your_azure_api_key
AZURE_DOC_INTELLIGENCE_MODEL=prebuilt-document  # Optional, default: prebuilt-document

OpenAI Vision

OPENAI_API_KEY=your_openai_api_key
OPENAI_MODEL=gpt-4o-mini          # Optional, default: gpt-4o-mini
OPENAI_MAX_TOKENS=1500            # Optional, default: 1500

Google Document AI

GOOGLE_DOC_AI_PROJECT_ID=your_project_id
GOOGLE_DOC_AI_PROCESSOR_ID=your_processor_id
GOOGLE_DOC_AI_LOCATION=us           # Optional, default: us
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account.json

Anthropic Claude

ANTHROPIC_API_KEY=your_anthropic_api_key
ANTHROPIC_MODEL=claude-3-5-sonnet-20240620  # Optional, default: claude-3-5-sonnet-20240620
ANTHROPIC_MAX_TOKENS=2048                   # Optional, default: 2048
ANTHROPIC_FALLBACK_MODELS=claude-3-sonnet-20240229,claude-3-haiku-20240307  # Optional, comma-separated fallback list

Available Claude Models:

  • claude-3-5-sonnet-20240620 (default) - Best balance of speed, quality, and cost
  • claude-3-opus-20240229 - Highest quality, most expensive
  • claude-3-sonnet-20240229 - Fast and capable
  • claude-3-haiku-20240307 - Fastest, most economical
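
A minimal sketch of how the fallback list could be consumed, trying the configured model first and then each fallback in order; the loop and error handling are simplified assumptions, not the repo's actual logic:

# Sketch only: fall back to the next model when a request fails.
import os
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

models = [os.environ.get("ANTHROPIC_MODEL", "claude-3-5-sonnet-20240620")]
models += [m for m in os.environ.get("ANTHROPIC_FALLBACK_MODELS", "").split(",") if m]

for model in models:
    try:
        message = client.messages.create(
            model=model,
            max_tokens=int(os.environ.get("ANTHROPIC_MAX_TOKENS", "2048")),
            messages=[{"role": "user", "content": "Extract the document fields as JSON."}],
        )
        break  # success: stop at the first model that responds
    except anthropic.APIError:
        continue  # try the next fallback model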

PDF Processing Behavior

Different providers handle multi-page PDFs differently:

| Provider | PDF Processing | Approach |
|----------|----------------|----------|
| AWS Textract | All pages | Native PDF support, OCR-based processing |
| Azure Doc Intelligence | All pages | Native PDF support with Azure Document Intelligence |
| OpenAI Vision | First page only | Converts PDF to image, Vision API |
| OpenAI Responses | Up to 4 pages | Files + Responses API with single PDF upload |
| Google Doc AI | All pages | Google Document AI processor |
| Anthropic Claude | First page only | Converts PDF to image, Vision API |

Why first page only for Vision API providers?

  • Vision APIs are designed for single image analysis
  • Multi-page support would require N separate API calls (N × cost)
  • Most identity documents (passports, licenses, birth certificates) are single page

For multi-page documents: Use AWS Textract, Azure Document Intelligence, or Google Document AI (all pages), or OpenAI Responses (up to 4 pages).
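
For illustration, rendering only the first page before a Vision call could be done with the pdf2image library; this is one possible implementation, not necessarily the one the app uses:

# Sketch only: convert just the first PDF page to an image (requires poppler).
from pdf2image import convert_from_path

pages = convert_from_path("document.pdf", first_page=1, last_page=1)
pages[0].save("first_page.png", "PNG")  # this image is what the Vision API sees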

Comparing Approaches:

  • OpenAI Vision vs OpenAI Responses: both send the document to an OpenAI model, but with different inputs - Vision sends a base64-encoded image of the first page via the Chat Completions API, while Responses uploads the original PDF (up to 4 pages) via the Files API

Field Taxonomy and Data Normalization

The application uses a standardized field taxonomy to normalize extracted data across different document formats and providers.

Taxonomy Location:

  • NEW: Taxonomy files now reside in app/services/taxonomy/ (modular, document-type-specific organization)
  • Multiple taxonomy categories: common, identity, birth_cert, drivers_license, marriage_certificate, form_1040, form_w2
  • Each category has its own .taxonomy.yml (field definitions) and .mapping.yml (name/value variations)

Features:

  • Dynamic Prompt Generation: LLM prompts automatically generated from taxonomy definitions (single source of truth)
  • Multi-language Support: Handles field names in English, Spanish, French, and Portuguese
  • Fuzzy Field Matching: Handles OCR errors and typos (85% similarity threshold, library-agnostic implementation)
  • Separator Handling: Recognizes /, -, and | as field separators (e.g., "Sex/Sexo/Genre")
  • Date Normalization: Automatically parses various date formats and converts to ISO format (YYYY-MM-DD)
    • Supports: Text formats (05 FEB 1965), US format (02/05/1965), European format (05/02/1965), compact format (19650205)
    • Applied to fields with format: YYYY-MM-DD in taxonomy (e.g., date_of_birth, issue_date, expiration_date)
  • Value Normalization: Converts variants (e.g., "M" → "male", "BRN" → "brown")
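
A minimal, self-contained sketch of the fuzzy matching, date normalization, and value normalization behaviors listed above, using only the standard library; the real implementation lives in app/services/taxonomy/:

# Sketch only: thresholds, formats, and mappings mirror this README.
from datetime import datetime
from difflib import SequenceMatcher

def fuzzy_match(name: str, canonical: str, threshold: float = 0.85) -> bool:
    """Tolerate OCR errors and typos via a similarity ratio."""
    return SequenceMatcher(None, name.lower(), canonical.lower()).ratio() >= threshold

def normalize_date(value: str) -> str | None:
    """Try the formats listed above; ambiguous dates resolve to the first match."""
    for fmt in ("%d %b %Y", "%m/%d/%Y", "%d/%m/%Y", "%Y%m%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

VALUE_VARIANTS = {"M": "male", "BRN": "brown"}  # examples from this README

print(fuzzy_match("Date of Brth", "Date of Birth"))  # True despite the typo
print(normalize_date("05 FEB 1965"))                 # 1965-02-05
print(VALUE_VARIANTS.get("M", "M"))                  # male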

Performance Metrics and Provider Comparison

The application automatically calculates comprehensive metrics for each experiment to enable objective provider comparison.

Metrics Collected

Latency:

  • Processing time in milliseconds from experiment start to completion
  • Includes provider-specific processing (API calls, job polling, etc.)

Cost:

  • Estimated processing cost in USD based on provider pricing models (as of 2024)
  • AWS Textract: $0.05 per page
  • Azure Document Intelligence: $0.0015 per page (prebuilt), $0.004 per page (custom), 500 pages/month free
  • OpenAI gpt-4o-mini: ~$0.00054 per document (token-based)
  • Anthropic Claude 3.5 Sonnet: ~$0.0126 per document (token-based)
  • Google Document AI: ~$0.10 per page (Form Parser pricing)

Accuracy:

  • Percentage of correctly extracted fields vs. expected values
  • Counts: matched fields, mismatched fields, missing fields
  • Compares normalized/canonical data for consistency
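
One straightforward reading of this metric, comparing normalized extracted values against normalized expected values, is sketched below (field names are illustrative):

# Sketch only: count matched/mismatched/missing fields and a percentage.
def accuracy(extracted: dict, expected: dict) -> dict:
    matched = mismatched = missing = 0
    for field, want in expected.items():
        if field not in extracted:
            missing += 1
        elif extracted[field] == want:
            matched += 1
        else:
            mismatched += 1
    total = len(expected)
    return {
        "matched": matched,
        "mismatched": mismatched,
        "missing": missing,
        "accuracy_pct": 100.0 * matched / total if total else 0.0,
    }

print(accuracy({"sex": "male"}, {"sex": "male", "date_of_birth": "1965-02-05"}))
# {'matched': 1, 'mismatched': 0, 'missing': 1, 'accuracy_pct': 50.0}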

Metrics Architecture

  • MetricsCalculator service orchestrates all metric calculations
  • Provider-specific cost calculators in app/providers/cost/ module
  • Metrics stored in experiment metrics field for historical tracking
  • Automatic calculation after experiment completion
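
For illustration, a per-page calculator in this module might take the following shape; the class and method names are hypothetical, while the rates come from the pricing list above:

# Sketch only: estimate cost for page-priced providers (rates as of 2024).
class PerPageCostCalculator:
    RATES_USD = {
        "aws_textract": 0.05,
        "azure_doc_intelligence": 0.0015,  # prebuilt models
        "google_docai": 0.10,              # Form Parser pricing
    }

    def estimate(self, provider: str, pages: int) -> float:
        """Return the estimated cost in USD for the given provider and page count."""
        return self.RATES_USD[provider] * pages

print(PerPageCostCalculator().estimate("aws_textract", 3))  # 0.15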

Seeding the Database

Populate MongoDB with initial data from YAML files via the web interface:

  1. Start the application: make compose-up
  2. Navigate to the Documents page: http://localhost:8000/documents
  3. Upload documents and define expected extraction fields
  4. Create experiments to test providers

Taxonomy Customization:

  • Edit taxonomy files in app/services/taxonomy/ to customize field definitions
  • Add new document types by creating <doctype>.taxonomy.yml and <doctype>.mapping.yml files
  • No code changes needed - files are auto-discovered and loaded
  • See app/services/taxonomy/README.md for detailed format documentation

Testing

The application includes a test suite covering taxonomy functionality, field mapping, and prompt generation.

Running Tests

Run all tests:

make test

Run only unit tests (fast):

make test-unit

Run only integration tests:

make test-integration

Run taxonomy-specific tests:

make test-taxonomy

Run tests with coverage report:

make test-coverage

Test Structure

tests/
├── conftest.py                      # Pytest configuration and fixtures
├── unit/services/                   # Unit tests (50 tests)
│   ├── test_taxonomy_service.py     # TaxonomyService tests
│   └── test_prompt_generator.py     # PromptGeneratorService tests
└── integration/                     # Integration tests (13 tests)
    └── test_taxonomy_integration.py # End-to-end workflow tests

Writing New Tests

Tests follow pytest conventions and use shared fixtures from conftest.py.

Example unit test:

# tests/unit/services/test_my_service.py
import pytest

class TestMyService:
    def test_my_function(self, taxonomy_service):
        # Arrange
        data = {'field': 'value'}

        # Act
        result = taxonomy_service.map_extracted_data(data, 'passport')

        # Assert
        assert result is not None

Run specific test:

docker compose exec web pytest tests/unit/services/test_taxonomy_service.py::TestFieldMapping::test_map_passport_data -v

Available fixtures:

  • taxonomy_service - Fresh TaxonomyService instance
  • prompt_generator_service - Fresh PromptGeneratorService instance
  • sample_passport_data - Sample passport test data
  • sample_drivers_license_data - Sample DL test data
  • sample_birth_certificate_data - Sample birth cert test data

For more details, see tests/README.md and TEST_SUITE_SUMMARY.md.
