Document Data Extraction Benchmark

This project provides a starter Flask web application to help evaluate document data extraction providers for formats such as images and PDFs. The development environment relies on Docker and GNU Make.

Requirements

  • Docker
  • GNU Make

Getting Started

Build and run the application image directly:

make docker-run

Alternatively, start the development stack with Docker Compose:

make compose-up

Stop the Docker Compose stack with:

make compose-down

The application exposes a single homepage with Bootstrap styling, ready for expansion with additional routes and features. Once running, visit http://localhost:8000 to access the app.

Configuration

Set the FLASK_SECRET_KEY environment variable to override the default development secret key used for session management.
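
A minimal sketch of how this override typically works in a Flask app; the fallback value below is illustrative, not necessarily the app's actual default:

# Sketch only: reading FLASK_SECRET_KEY from the environment.
import os
from flask import Flask

app = Flask(__name__)
app.secret_key = os.environ.get("FLASK_SECRET_KEY", "dev-only-insecure-key")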

Logging

Configure logging output via the LOG_LEVEL environment variable:

  • DEBUG: Detailed diagnostic information
  • INFO: General informational messages (default)
  • WARNING: Warning messages
  • ERROR: Error messages
  • CRITICAL: Critical errors

Example: LOG_LEVEL=DEBUG
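
As a rough illustration, mapping LOG_LEVEL onto Python's standard logging module might look like this (the app's actual setup may differ):

# Sketch only: configure the root logger from the LOG_LEVEL variable.
import logging
import os

level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))
logging.getLogger(__name__).info("Logging configured at %s", level_name)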

MongoDB Configuration

The application uses MongoDB to store documents and experiments. Connection settings:

  • MONGODB_URI: MongoDB connection string (default: mongodb://localhost:27017/)
  • MONGODB_DATABASE: Database name (default: document_data_extration)

These are automatically configured in Docker Compose.
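
A minimal connection sketch using pymongo with the variables above (the collection name is illustrative; defaults mirror this README):

# Sketch only: open a MongoDB connection from the environment settings.
import os
from pymongo import MongoClient

client = MongoClient(os.environ.get("MONGODB_URI", "mongodb://localhost:27017/"))
db = client[os.environ.get("MONGODB_DATABASE", "document_data_extration")]
documents = db["documents"]  # collection name is illustrative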

Document Data Extraction Providers

The application supports multiple document parsing providers for evaluation and comparison.

Available Providers

AWS Textract (aws_textract)

  • Traditional OCR-based document parsing
  • Requires AWS credentials and S3 bucket configuration
  • Asynchronous processing with job polling
  • Best for forms and tables
  • Processes all pages of multi-page PDFs
  • Configuration: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_TEXTRACT_BUCKET
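
As a rough sketch of the asynchronous flow above, a boto3 client starts a job against the S3 object and polls until it completes; bucket and object names are placeholders, and the repo's actual wrapper may differ:

# Sketch only: start a Textract job and poll for the result.
import time
import boto3

textract = boto3.client("textract")  # credentials/region come from the env vars above

job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "your-s3-bucket-name", "Name": "document.pdf"}}
)

while True:
    result = textract.get_document_text_detection(JobId=job["JobId"])
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(2)  # poll until the job finishes

blocks = result.get("Blocks", [])  # OCR output blocks (pages, lines, words)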

Azure AI Document Intelligence (azure_doc_intelligence)

  • Microsoft's OCR and document analysis service
  • Supports multiple prebuilt models (document, layout, invoice, receipt, custom)
  • Synchronous processing with built-in polling
  • Handles images and multi-page PDFs natively
  • Extracts key-value pairs with confidence scores
  • Most cost-effective OCR option ($0.0015/page with 500 pages/month free tier)
  • Configuration: AZURE_DOC_INTELLIGENCE_ENDPOINT, AZURE_DOC_INTELLIGENCE_KEY, optional AZURE_DOC_INTELLIGENCE_MODEL
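
A rough sketch using the azure-ai-formrecognizer SDK as one possible client (the newer azure-ai-documentintelligence package has a similar shape); the file name is illustrative:

# Sketch only: analyze a document with a prebuilt model and read
# key-value pairs with confidence scores, as described above.
import os
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint=os.environ["AZURE_DOC_INTELLIGENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_DOC_INTELLIGENCE_KEY"]),
)

with open("document.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        os.environ.get("AZURE_DOC_INTELLIGENCE_MODEL", "prebuilt-document"), document=f
    )
result = poller.result()  # the SDK polls the service until analysis completes

for pair in result.key_value_pairs:
    if pair.key and pair.value:
        print(pair.key.content, "=", pair.value.content, f"({pair.confidence:.2f})")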

OpenAI Vision (openai_vision)

  • LLM-based document understanding using GPT-4 Vision API
  • Simpler setup (API key only, no S3 required)
  • Synchronous processing (single API call)
  • ~3x more cost-effective than AWS Textract
  • Better semantic understanding of document content
  • PDFs: First page only (Vision API limitation)
  • Configuration: OPENAI_API_KEY, OPENAI_MODEL (optional), OPENAI_MAX_TOKENS (optional)
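
A minimal sketch of the single-call pattern described above, using the official openai Python SDK with a base64-encoded image; the file name and prompt text are illustrative:

# Sketch only: one Chat Completions call with an inline image.
import base64
import os
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("passport.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"),
    max_tokens=int(os.environ.get("OPENAI_MAX_TOKENS", "1500")),
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the document fields as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)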

OpenAI Responses (openai_responses)

  • Uses OpenAI Files + Responses APIs with structured JSON output
  • Uploads the original PDF (up to 4 pages) via the Files API for analysis
  • Rejects PDFs with more than 4 pages to control cost/latency
  • Shares prompts and taxonomy normalization with other LLM providers
  • Useful for evaluating multi-page PDF performance under the Responses API
  • Configuration: OPENAI_API_KEY, OPENAI_MODEL (optional, gpt-4o/gpt-4o-mini), OPENAI_MAX_TOKENS (optional)
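
A minimal sketch of the Files + Responses flow, assuming OpenAI's documented pattern for PDF inputs; the prompt is illustrative and the 4-page guard is omitted:

# Sketch only: upload the PDF once, then reference it in a Responses call.
# (The provider rejects PDFs with more than 4 pages before uploading.)
import os
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

uploaded = client.files.create(file=open("document.pdf", "rb"), purpose="user_data")

response = client.responses.create(
    model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"),
    input=[{
        "role": "user",
        "content": [
            {"type": "input_file", "file_id": uploaded.id},
            {"type": "input_text", "text": "Extract the document fields as JSON."},
        ],
    }],
)
print(response.output_text)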

Google Document AI (google_docai)

  • Google Cloud OCR/understanding platform
  • Handles scanned images and multi-page PDFs natively
  • Extracts form fields and entities with confidence scores
  • Requires GCP project, processor ID, and service account credentials
  • Configuration: GOOGLE_DOC_AI_PROJECT_ID, GOOGLE_DOC_AI_PROCESSOR_ID, optional GOOGLE_DOC_AI_LOCATION, GOOGLE_APPLICATION_CREDENTIALS
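
A rough sketch of a synchronous request using the google-cloud-documentai client, with processor configuration taken from the variables above (the file name is illustrative):

# Sketch only: send raw PDF bytes to the configured Document AI processor.
import os
from google.cloud import documentai  # GOOGLE_APPLICATION_CREDENTIALS supplies auth

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(
    os.environ["GOOGLE_DOC_AI_PROJECT_ID"],
    os.environ.get("GOOGLE_DOC_AI_LOCATION", "us"),
    os.environ["GOOGLE_DOC_AI_PROCESSOR_ID"],
)

with open("document.pdf", "rb") as f:
    raw = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(request=documentai.ProcessRequest(name=name, raw_document=raw))
print(result.document.text[:200])  # OCR text; form fields/entities are also on result.document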

Anthropic Claude (anthropic_claude)

  • LLM-based document understanding using Claude 3.5 Sonnet
  • API key only setup (no S3 required)
  • Synchronous processing (single API call)
  • Excellent instruction following and JSON output
  • 200K token context window (vs 128K for GPT-4o)
  • Priced between GPT-4o-mini and AWS Textract
  • PDFs: First page only (Vision API limitation)
  • Configuration: ANTHROPIC_API_KEY, ANTHROPIC_MODEL (optional), ANTHROPIC_MAX_TOKENS (optional)

Provider Configuration

AWS Textract

AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_TEXTRACT_BUCKET=your-s3-bucket-name
AWS_DEFAULT_REGION=us-east-1

Azure AI Document Intelligence

AZURE_DOC_INTELLIGENCE_ENDPOINT=https://YOUR_RESOURCE.cognitiveservices.azure.com/
AZURE_DOC_INTELLIGENCE_KEY=your_azure_api_key
AZURE_DOC_INTELLIGENCE_MODEL=prebuilt-document  # Optional, default: prebuilt-document

OpenAI Vision

OPENAI_API_KEY=your_openai_api_key
OPENAI_MODEL=gpt-4o-mini          # Optional, default: gpt-4o-mini
OPENAI_MAX_TOKENS=1500            # Optional, default: 1500

Google Document AI

GOOGLE_DOC_AI_PROJECT_ID=your_project_id
GOOGLE_DOC_AI_PROCESSOR_ID=your_processor_id
GOOGLE_DOC_AI_LOCATION=us           # Optional, default: us
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account.json

Anthropic Claude

ANTHROPIC_API_KEY=your_anthropic_api_key
ANTHROPIC_MODEL=claude-3-5-sonnet-20240620  # Optional, default: claude-3-5-sonnet-20240620
ANTHROPIC_MAX_TOKENS=2048                   # Optional, default: 2048
ANTHROPIC_FALLBACK_MODELS=claude-3-sonnet-20240229,claude-3-haiku-20240307  # Optional, comma-separated fallback list

Available Claude Models:

  • claude-3-5-sonnet-20240620 (default) - Best balance of speed, quality, and cost
  • claude-3-opus-20240229 - Highest quality, most expensive
  • claude-3-sonnet-20240229 - Fast and capable
  • claude-3-haiku-20240307 - Fastest, most economical
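
A minimal sketch of how the fallback list could be consumed, trying the configured model first and then each fallback in order; the loop and error handling are simplified assumptions, not the repo's actual logic:

# Sketch only: fall back to the next model when a request fails.
import os
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

models = [os.environ.get("ANTHROPIC_MODEL", "claude-3-5-sonnet-20240620")]
models += [m for m in os.environ.get("ANTHROPIC_FALLBACK_MODELS", "").split(",") if m]

for model in models:
    try:
        message = client.messages.create(
            model=model,
            max_tokens=int(os.environ.get("ANTHROPIC_MAX_TOKENS", "2048")),
            messages=[{"role": "user", "content": "Extract the document fields as JSON."}],
        )
        break  # success: stop at the first model that responds
    except anthropic.APIError:
        continue  # try the next fallback model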

PDF Processing Behavior

Different providers handle multi-page PDFs differently:

| Provider | PDF Processing | Approach |
|----------|----------------|----------|
| AWS Textract | All pages | Native PDF support, OCR-based processing |
| Azure Doc Intelligence | All pages | Native PDF support with Azure Document Intelligence |
| OpenAI Vision | First page only | Converts PDF to image, Vision API |
| OpenAI Responses | Up to 4 pages | Files + Responses API with single PDF upload |
| Google Doc AI | All pages | Google Document AI processor |
| Anthropic Claude | First page only | Converts PDF to image, Vision API |

Why first page only for Vision API providers?

  • Vision APIs are designed for single image analysis
  • Multi-page support would require N separate API calls (N × cost)
  • Most identity documents (passports, licenses, birth certificates) are single page

For multi-page documents: Use AWS Textract, Azure Document Intelligence, or Google Document AI (all pages), or OpenAI Responses (up to 4 pages).
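
For illustration, rendering only the first page before a Vision call could be done with the pdf2image library; this is one possible implementation, not necessarily the one the app uses:

# Sketch only: convert just the first PDF page to an image (requires poppler).
from pdf2image import convert_from_path

pages = convert_from_path("document.pdf", first_page=1, last_page=1)
pages[0].save("first_page.png", "PNG")  # this image is what the Vision API sees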

Comparing Approaches:

  • OpenAI Vision vs OpenAI Responses: both send the document to an OpenAI model, but with different inputs - Vision sends a base64-encoded image of the first page via the Chat Completions API, while Responses uploads the original PDF (up to 4 pages) via the Files API

Field Taxonomy and Data Normalization

The application uses a standardized field taxonomy to normalize extracted data across different document formats and providers.

Taxonomy Location:

  • NEW: Taxonomy files now reside in app/services/taxonomy/ (modular, document-type-specific organization)
  • Multiple taxonomy categories: common, identity, birth_cert, drivers_license, marriage_certificate, form_1040, form_w2
  • Each category has its own .taxonomy.yml (field definitions) and .mapping.yml (name/value variations)

Features:

  • Dynamic Prompt Generation: LLM prompts automatically generated from taxonomy definitions (single source of truth)
  • Multi-language Support: Handles field names in English, Spanish, French, and Portuguese
  • Fuzzy Field Matching: Handles OCR errors and typos (85% similarity threshold, library-agnostic implementation)
  • Separator Handling: Recognizes /, -, and | as field separators (e.g., "Sex/Sexo/Genre")
  • Date Normalization: Automatically parses various date formats and converts to ISO format (YYYY-MM-DD)
    • Supports: Text formats (05 FEB 1965), US format (02/05/1965), European format (05/02/1965), compact format (19650205)
    • Applied to fields with format: YYYY-MM-DD in taxonomy (e.g., date_of_birth, issue_date, expiration_date)
  • Value Normalization: Converts variants (e.g., "M" → "male", "BRN" → "brown")
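
A minimal, self-contained sketch of the fuzzy matching, date normalization, and value normalization behaviors listed above, using only the standard library; the real implementation lives in app/services/taxonomy/:

# Sketch only: thresholds, formats, and mappings mirror this README.
from datetime import datetime
from difflib import SequenceMatcher

def fuzzy_match(name: str, canonical: str, threshold: float = 0.85) -> bool:
    """Tolerate OCR errors and typos via a similarity ratio."""
    return SequenceMatcher(None, name.lower(), canonical.lower()).ratio() >= threshold

def normalize_date(value: str) -> str | None:
    """Try the formats listed above; ambiguous dates resolve to the first match."""
    for fmt in ("%d %b %Y", "%m/%d/%Y", "%d/%m/%Y", "%Y%m%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

VALUE_VARIANTS = {"M": "male", "BRN": "brown"}  # examples from this README

print(fuzzy_match("Date of Brth", "Date of Birth"))  # True despite the typo
print(normalize_date("05 FEB 1965"))                 # 1965-02-05
print(VALUE_VARIANTS.get("M", "M"))                  # male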

Performance Metrics and Provider Comparison

The application automatically calculates comprehensive metrics for each experiment to enable objective provider comparison.

Metrics Collected

Latency:

  • Processing time in milliseconds from experiment start to completion
  • Includes provider-specific processing (API calls, job polling, etc.)

Cost:

  • Estimated processing cost in USD based on provider pricing models (as of 2024)
  • AWS Textract: $0.05 per page
  • Azure Document Intelligence: $0.0015 per page (prebuilt), $0.004 per page (custom), 500 pages/month free
  • OpenAI gpt-4o-mini: ~$0.00054 per document (token-based)
  • Anthropic Claude 3.5 Sonnet: ~$0.0126 per document (token-based)
  • Google Document AI: ~$0.10 per page (Form Parser pricing)

Accuracy:

  • Percentage of correctly extracted fields vs. expected values
  • Counts: matched fields, mismatched fields, missing fields
  • Compares normalized/canonical data for consistency
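
One straightforward reading of this metric, comparing normalized extracted values against normalized expected values, is sketched below (field names are illustrative):

# Sketch only: count matched/mismatched/missing fields and a percentage.
def accuracy(extracted: dict, expected: dict) -> dict:
    matched = mismatched = missing = 0
    for field, want in expected.items():
        if field not in extracted:
            missing += 1
        elif extracted[field] == want:
            matched += 1
        else:
            mismatched += 1
    total = len(expected)
    return {
        "matched": matched,
        "mismatched": mismatched,
        "missing": missing,
        "accuracy_pct": 100.0 * matched / total if total else 0.0,
    }

print(accuracy({"sex": "male"}, {"sex": "male", "date_of_birth": "1965-02-05"}))
# {'matched': 1, 'mismatched': 0, 'missing': 1, 'accuracy_pct': 50.0}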

Metrics Architecture

  • MetricsCalculator service orchestrates all metric calculations
  • Provider-specific cost calculators in app/providers/cost/ module
  • Metrics stored in experiment metrics field for historical tracking
  • Automatic calculation after experiment completion
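
For illustration, a per-page calculator in this module might take the following shape; the class and method names are hypothetical, while the rates come from the pricing list above:

# Sketch only: estimate cost for page-priced providers (rates as of 2024).
class PerPageCostCalculator:
    RATES_USD = {
        "aws_textract": 0.05,
        "azure_doc_intelligence": 0.0015,  # prebuilt models
        "google_docai": 0.10,              # Form Parser pricing
    }

    def estimate(self, provider: str, pages: int) -> float:
        """Return the estimated cost in USD for the given provider and page count."""
        return self.RATES_USD[provider] * pages

print(PerPageCostCalculator().estimate("aws_textract", 3))  # 0.15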

Seeding the Database

Populate MongoDB with initial data from YAML files via the web interface:

  1. Start the application: make compose-up
  2. Navigate to the Documents page: http://localhost:8000/documents
  3. Upload documents and define expected extraction fields
  4. Create experiments to test providers

Taxonomy Customization:

  • Edit taxonomy files in app/services/taxonomy/ to customize field definitions
  • Add new document types by creating <doctype>.taxonomy.yml and <doctype>.mapping.yml files
  • No code changes needed - files are auto-discovered and loaded
  • See app/services/taxonomy/README.md for detailed format documentation

Testing

The application includes a test suite covering taxonomy functionality, field mapping, and prompt generation.

Running Tests

Run all tests:

make test

Run only unit tests (fast):

make test-unit

Run only integration tests:

make test-integration

Run taxonomy-specific tests:

make test-taxonomy

Run tests with coverage report:

make test-coverage

Test Structure

tests/
├── conftest.py                      # Pytest configuration and fixtures
├── unit/services/                   # Unit tests (50 tests)
│   ├── test_taxonomy_service.py     # TaxonomyService tests
│   └── test_prompt_generator.py     # PromptGeneratorService tests
└── integration/                     # Integration tests (13 tests)
    └── test_taxonomy_integration.py # End-to-end workflow tests

Writing New Tests

Tests follow pytest conventions and use shared fixtures from conftest.py.

Example unit test:

# tests/unit/services/test_my_service.py
import pytest

class TestMyService:
    def test_my_function(self, taxonomy_service):
        # Arrange
        data = {'field': 'value'}

        # Act
        result = taxonomy_service.map_extracted_data(data, 'passport')

        # Assert
        assert result is not None

Run specific test:

docker compose exec web pytest tests/unit/services/test_taxonomy_service.py::TestFieldMapping::test_map_passport_data -v

Available fixtures:

  • taxonomy_service - Fresh TaxonomyService instance
  • prompt_generator_service - Fresh PromptGeneratorService instance
  • sample_passport_data - Sample passport test data
  • sample_drivers_license_data - Sample DL test data
  • sample_birth_certificate_data - Sample birth cert test data

For more details, see tests/README.md and TEST_SUITE_SUMMARY.md.
