@OscarAR46
Contributor

Description

Adds a regex-based parser for CAPI (computer-assisted personal interviewing) PDF questionnaires. These structured documents from large research studies (e.g., the Millennium Cohort Study, used here as the test case) have a specific format that the existing DeBERTa parser doesn't handle well.

Changes:

  • New file: 'src/harmony/parsing/capi_parser.py' - Contains 'is_capi_format()', 'extract_capi_questions()', and 'convert_capi_to_instruments()'
  • Modified: 'src/harmony/parsing/pdf_parser.py' - Added CAPI detection after Tika extraction, routes to new parser when detected
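
As a rough illustration of the detect-then-route step described above (not the code in 'pdf_parser.py'), here is a minimal sketch. The function names, the code-line pattern, and the 'min_code_lines' threshold are all hypothetical; the actual 'is_capi_format()' heuristic may differ.

```python
import re

# Assumption: a CAPI variable code is 2-8 uppercase letters/digits,
# starting with a letter, alone on its own line.
CODE_LINE = re.compile(r"^[A-Z][A-Z0-9]{1,7}$")


def looks_like_capi(text, min_code_lines=5):
    """Hypothetical detector: count lines that look like standalone codes."""
    count = sum(1 for line in text.splitlines() if CODE_LINE.match(line.strip()))
    return count >= min_code_lines


def route_extracted_text(text, capi_parser, default_parser):
    """Send already-extracted text to the CAPI parser only when detected."""
    parser = capi_parser if looks_like_capi(text) else default_parser
    return parser(text)
```

Because the router only sees extracted text, it does not care whether that text came from Tika or from another extractor.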

Approach (based on guidance from issue #57):

  • CAPI variable codes (e.g., PREL, PJOB) appear on their own line with question text on subsequent lines
  • Filters out table of contents entries (lines with '...'), interviewer instructions (all caps), and feed-forward metadata
  • Skips common non-question codes (CARD, NOTE, ENDIF, etc.)
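
The filtering rules above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the implementation in 'capi_parser.py': the function name, the code-line pattern, and the short skip list are hypothetical, the sample text is made up, and feed-forward metadata filtering is left out because its format is not described here.

```python
import re

# Assumption: a variable code is 2-8 uppercase letters/digits on its own line.
CODE_LINE = re.compile(r"^[A-Z][A-Z0-9]{1,7}$")

# Assumption: a partial skip list; the real list is longer ("etc." in the PR).
NON_QUESTION_CODES = {"CARD", "NOTE", "ENDIF"}


def sketch_extract_questions(text):
    """Return (code, question_text) pairs from CAPI-style text."""
    questions = []
    current_code = None
    current_lines = []

    def flush():
        # Store the question collected so far, then reset the state.
        nonlocal current_code, current_lines
        if current_code is not None and current_lines:
            questions.append((current_code, " ".join(current_lines)))
        current_code, current_lines = None, []

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "..." in line:
            continue  # table of contents entry
        if CODE_LINE.match(line):
            flush()
            if line not in NON_QUESTION_CODES:
                current_code = line  # start collecting a new question
            continue
        if line.isupper():
            continue  # all-caps interviewer instruction
        if current_code is not None:
            current_lines.append(line)  # question text on subsequent lines
    flush()
    return questions


# Made-up sample text, for illustration only.
sample = (
    "PREL ........ 12\n"
    "PREL\n"
    "What is your relationship to the child?\n"
    "INTERVIEWER: CODE ONE ONLY\n"
    "CARD\n"
    "Show card A\n"
    "PJOB\n"
    "Do you currently have\n"
    "a paid job?\n"
)
print(sketch_extract_questions(sample))
```

One known limitation of this sketch: a one-word all-caps instruction that is not in the skip list would be mistaken for a variable code, which is why a skip list is needed at all.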

No new dependencies are introduced.

Fixes #57

Type of change

  • New feature (non-breaking change that adds functionality; also resolves issue #57)

Testing

Tested locally with MCS2_CAPI_Questionnaire_Documentation_June_2006_v1-2.pdf (linked in issue #57):

  • CAPI format detection: Successfully identified as CAPI format
  • Question extraction: Extracted 300+ questions from 280k characters of text
  • Integration: Import test passed for both 'capi_parser.py' and modified 'pdf_parser.py'

Note: The Tika server had startup issues on Windows, so I used pdfplumber for local text extraction testing. The CAPI parsing logic itself is independent of the text extraction method.

Test Configuration

  • Library version: harmony installed via pip install -e .
  • OS: Windows 11
  • Python: 3.11

Checklist

  • My PR is for one issue, rather than for multiple unrelated fixes.
  • My code follows the style guidelines of this project. I have applied a linter (PyCharm code formatter) to make my whitespace consistent with the rest of the project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works (I have the test file saved locally in case it needs to be added).
  • New and existing unit tests pass locally with my changes.
  • Any dependent changes have been merged and published in downstream modules.
  • I have checked my code and corrected any misspellings.
  • The Harmony API is not broken by my change to the Harmony Python library.
  • I add third-party dependencies only when necessary. If I changed the requirements, I changed them in 'requirements.txt', 'pyproject.toml' and also in the 'requirements.txt' in the API repo.
  • If I introduced a new feature, I documented it

Note on unchecked items: No formal unit tests or documentation were added in this PR; I'm happy to add them if requested. The feature was tested manually as described above.

@OscarAR46
Contributor Author

By the way, I tested the integration by calling 'convert_pdf_to_instruments()' directly with the MCS2 PDF. It successfully detected the CAPI format and created an instrument with 298 questions.

However, the Tika server had startup issues on my Windows machine (I'm not sure whether this is a general Windows issue or specific to my computer, but it seemed relevant to mention), so I pre-extracted the text content for testing. The CAPI detection and parsing logic works correctly with the actual 'pdf_parser.py' integration point.

@OscarAR46
Contributor Author

Hi @woodthom2, just leaving a note here to check that everything is okay with the above. No rush whatsoever, of course; I'm happy to look into anything you're not happy with!

@woodthom2 woodthom2 merged commit 8e1cebe into harmonydata:main Dec 17, 2025
1 check passed

Development

Successfully merging this pull request may close these issues.

Make separate parsing code for CAPI formatted PDF files