Add CAPI PDF parser for structured questionnaire documents (issue #57) #124

OscarAR46 · 2025-12-06T17:55:09Z

Description

Adds a regex-based parser for CAPI PDF questionnaires. These structured documents from large research studies (e.g., the Millennium Cohort Study, used here as the test case) have a specific format that the existing DeBERTa parser doesn't handle well.

Changes:

New file: 'src/harmony/parsing/capi_parser.py' - Contains 'is_capi_format()', 'extract_capi_questions()', and 'convert_capi_to_instruments()'
Modified: 'src/harmony/parsing/pdf_parser.py' - Added CAPI detection after Tika extraction, routes to new parser when detected
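
For readers who want a picture of the routing step, here is a minimal sketch of "detect, then dispatch". It is not the code in this PR: `looks_like_capi`, `route_text`, the `CODE_LINE` pattern, and the `min_codes` threshold are all hypothetical names and values chosen for illustration, and the real `is_capi_format()` may use different heuristics.

```python
import re
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: a CAPI variable code is a short all-caps token alone on a line.
CODE_LINE = re.compile(r"^[A-Z][A-Z0-9]{2,7}$")


def looks_like_capi(text: str, min_codes: int = 5) -> bool:
    """Guess whether extracted text is CAPI-style (illustrative heuristic only)."""
    hits = sum(1 for line in text.splitlines() if CODE_LINE.match(line.strip()))
    return hits >= min_codes


def route_text(
    text: str,
    capi_parser: Callable[[str], T],
    default_parser: Callable[[str], T],
) -> T:
    """Send already-extracted text to the CAPI parser if it looks like CAPI."""
    if looks_like_capi(text):
        return capi_parser(text)
    return default_parser(text)
```

Passing the two parsers in as callables keeps the sketch independent of how the text was extracted, which matches the point made under Testing.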

Approach (based on guidance from issue #57):

CAPI variable codes (e.g., PREL, PJOB) appear on their own line with question text on subsequent lines
Filters out table of contents entries (lines with '...'), interviewer instructions (all caps), and feed-forward metadata
Skips common non-question codes (CARD, NOTE, ENDIF, etc.)
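
The approach above can be sketched roughly as follows. This is an illustrative reconstruction from the description, not the contents of `capi_parser.py`: the function name `extract_questions`, the `CODE_LINE` pattern, and the helper names are assumptions, `SKIP_CODES` lists only the three codes named above, and feed-forward metadata filtering is left out because the description does not say how it is recognised.

```python
import re
from typing import List, Optional, Tuple

# Assumption: variable codes are 3-8 character all-caps tokens on their own line.
CODE_LINE = re.compile(r"^[A-Z][A-Z0-9]{2,7}$")
# Only the codes named in the description; the real skip list is longer ("etc.").
SKIP_CODES = {"CARD", "NOTE", "ENDIF"}


def is_toc_line(line: str) -> bool:
    """Table-of-contents entries contain dot leaders."""
    return "..." in line


def is_instruction(line: str) -> bool:
    """Treat lines whose letters are all upper case as interviewer instructions."""
    letters = [c for c in line if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def extract_questions(text: str) -> List[Tuple[str, str]]:
    """Return (code, question text) pairs from CAPI-style text."""
    questions: List[Tuple[str, str]] = []
    code: Optional[str] = None
    parts: List[str] = []

    def flush() -> None:
        nonlocal code, parts
        if code and parts:
            questions.append((code, " ".join(parts)))
        code, parts = None, []

    for raw in text.splitlines():
        line = raw.strip()
        if not line or is_toc_line(line):
            continue
        if CODE_LINE.match(line):
            flush()  # a new code ends the previous question
            code = None if line in SKIP_CODES else line
            continue
        if is_instruction(line):
            continue
        if code:  # text after a skipped code is dropped
            parts.append(line)
    flush()
    return questions
```

One limitation of this sketch: a one-word all-caps instruction of eight characters or fewer would be mistaken for a variable code, so a real parser needs a stricter test or a longer skip list.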

No new dependencies introduced.

Fixes #57

Type of change

New feature (non-breaking change that adds functionality) and bug fix

Testing

Tested locally with MCS2_CAPI_Questionnaire_Documentation_June_2006_v1-2.pdf (linked in issue #57):

CAPI format detection: Successfully identified as CAPI format
Question extraction: Extracted 300+ questions from 280k characters of text
Integration: Import test passed for both 'capi_parser.py' and modified 'pdf_parser.py'

Note: The Tika server had startup issues on Windows, so I used pdfplumber for text extraction during local testing. The CAPI parsing logic itself is independent of the text extraction method.
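
For anyone reproducing the local test without Tika, the text extraction might look something like the sketch below. The helper names `join_page_texts` and `pdf_to_text` are hypothetical; pdfplumber is a third-party package used here only for local testing and is not a dependency added by this PR.

```python
from typing import Any, Iterable


def join_page_texts(pages: Iterable[Any]) -> str:
    """Concatenate text from page-like objects that expose extract_text()."""
    # extract_text() can return None for pages with no text layer.
    return "\n".join((page.extract_text() or "") for page in pages)


def pdf_to_text(path: str) -> str:
    """Extract plain text from a PDF using pdfplumber (local testing only)."""
    import pdfplumber  # imported lazily so the module loads without it installed

    with pdfplumber.open(path) as pdf:
        return join_page_texts(pdf.pages)
```

The resulting string can then be passed to the CAPI detection and extraction functions in place of Tika's output.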

Test Configuration

Library version: harmony installed via pip install -e .
OS: Windows 11
Python: 3.11

Checklist

Note on unchecked items: No formal unit tests or documentation added in this PR - happy to add if requested. The feature was tested manually as described above.


OscarAR46 · 2025-12-06T18:10:16Z

By the way, I tested the integration by calling "convert_pdf_to_instruments()" directly with the MCS2 PDF. It detected the CAPI format and created an instrument with 298 questions.

However, the Tika server had startup issues on my Windows machine (I'm not sure whether this affects Windows generally or just my computer, but it seemed relevant to mention), so I pre-extracted the text content for testing. The CAPI detection and parsing logic works correctly at the actual 'pdf_parser.py' integration point.

OscarAR46 · 2025-12-17T11:33:07Z

Hi @woodthom2, just leaving a note here to check that everything is okay with the above. No rush whatsoever, of course; I'm happy to look into anything you're not happy with!

Add CAPI PDF parser for structured questionnaire documents (issue har…

8f15007

…monydata#57)

woodthom2 merged commit 8e1cebe into harmonydata:main Dec 17, 2025
1 check passed
