Skip to content

Conversation

@slobodan-ilic
Copy link
Contributor

@slobodan-ilic slobodan-ilic commented Oct 13, 2025

File Handle Support for SAV Files

Summary

Adds support for reading SAV files directly from file-like objects (e.g., zip archives, BytesIO) without extracting to disk.

Use Case

Read large SAV files from zip archives without temporary file extraction:

import zipfile
import pyreadstat

with zipfile.ZipFile('data.zip', 'r') as zf:
with zf.open('survey.sav') as f:
df, meta = pyreadstat.read_sav(f) # No extraction needed!

Implementation

  • Leverages ReadStat's built-in I/O handler system
  • Custom Cython handlers bridge Python file objects to C callbacks
  • Unified parser function handles both paths and file objects
  • Zero breaking changes - existing code works unchanged

Testing

  • All existing tests pass (92 tests)
  • New integration test validates zip archive reading
  • Data integrity verified (identical output to path-based reading)

Changes

  • pyreadstat.read_sav() now accepts file-like objects with read() and seek() methods
  • +132 net lines (after refactoring to eliminate duplication)
  • Extensible to other formats (DTA, SAS7BDAT, etc.)

@slobodan-ilic slobodan-ilic changed the title [Issue-308 support for file handles [Issue-308]: Add support for reading directly from file handles Oct 13, 2025
@slobodan-ilic slobodan-ilic force-pushed the ISS-308-support-for-file-handles branch 3 times, most recently from 1c9535f to c2060eb Compare October 13, 2025 17:58
@ofajardo
Copy link
Collaborator

hi, thanks a lot for the PR! It is really great!

Is this ready or WIP?

I read your PR quickly, and I have a couple of questions, a couple of asks and one suggestion:

questions:

  • Why is the _read_fileobj implemented only for sav? Would it benefit also all the other reading functions?
  • in read_sav, if a file object you route it to _read_fileobj, which skips apply_value_formats ... why? I think that is still needed.

suggestions:

  • Would it be possible to just use the regular run_conversion instead of having a separate _read_fileobj? It looks to me as if run_conversion could be used out of the box with the changes you did. Maybe it is something you did for testing purposes but now can be discarded? This would solve my previous questions.

asks:

  • Could you add in the README file in the section Usage -> more reading options, a new sub-section describing briefly this new feature, and a couple of examples on how it could be used? I am thinking on your zip usecase ,and also what is described in read_sas7bdat when user_missing=True not populating meta.missing_user_values #279 if it applies (reading from a remote file)
  • The test_read_sav_from_zip_file_handle, could you put it in test_narwhalified.py and make sure it works for both pandas and polars?

@slobodan-ilic slobodan-ilic marked this pull request as draft October 14, 2025 14:28
@slobodan-ilic
Copy link
Contributor Author

Hey @ofajardo , thanks for your comment. I think all of your observations are valid. I should've created the PR as a draft (I just converted it to one just now). I first wanna make sure our team can use it for what they need, and if it proves to be usable, then I'll get back to the PR and make it much tighter, and also address all of your concerns.

@ofajardo
Copy link
Collaborator

Awesome! Thanks!

@slobodan-ilic slobodan-ilic force-pushed the ISS-308-support-for-file-handles branch 2 times, most recently from af8d838 to 01bf38f Compare January 14, 2026 09:49
- Add custom I/O handlers in _readstat_parser.pyx to bridge Python file objects to ReadStat C library
- Modify run_conversion() to detect file-like objects and use custom handlers
- Works for all read functions (SAV, SAS7BDAT, DTA, XPORT, POR)
- Add test in test_basic.py and test_narwhalified.py (pandas/polars compatible)
- Add README documentation with zip archive example

Addresses maintainer feedback:
- Uses run_conversion() directly instead of separate wrapper
- Preserves apply_value_formats and all other options
- File handle support works for all formats, not just SAV
@slobodan-ilic slobodan-ilic force-pushed the ISS-308-support-for-file-handles branch from 01bf38f to d9e7200 Compare January 14, 2026 09:50
Test demonstrates reading SAV files from remote sources via HTTP, showing real-world usage of file handle support. Uses built-in http.server - no external dependencies needed.
@slobodan-ilic
Copy link
Contributor Author

@ofajardo We finally tested this in the scope of our software, and it does what we want it to. I've updated the PR addressing all of your concerns. Do you think it's cool to put it from draft to an actual PR?

@ofajardo ofajardo changed the base branch from master to dev January 16, 2026 08:08
@ofajardo
Copy link
Collaborator

sure please go ahead. I am a bit busy these days, but will try to review soon.

@ofajardo ofajardo changed the base branch from dev to master January 16, 2026 09:00
@ofajardo ofajardo changed the base branch from master to dev January 16, 2026 09:01
@ofajardo
Copy link
Collaborator

ofajardo commented Jan 16, 2026

I changed the base to the dev branch, as I would merge into there.

I had a quick look at the changes and they look great! I am a bit confused though, because I only see 3 recent commits from a couple of days ago, and I do not see any of the changes we discussed before, for example, I do not see any change at all to pyreadstat.pyx ...

or maybe you just discarded the previous commits as those changes were not needed anymore?

BTW I see that the signature of run_readstat_parser has changed in the pyx file but in the pxd file has not been updated.

@ofajardo ofajardo marked this pull request as ready for review January 16, 2026 13:06
@ofajardo
Copy link
Collaborator

OK, I have reviewed now and everything looks good. I assume you just got rid of the previous commits and changes.
I also compiled and run tests, everything went smoothly.
I used to have all functions in the pyx listed in the pxd, and I see that you did not include the new ones. Just reading again about it, I see that there is no efficiency gain on having not cimported functions from another module listed in the pxd, so I guess it can stay as it is, as I also if the signature of a function changes in the pyx then you have to change it twice also in the pxd for no real performance gain.
So, please just confirm everything is also good from your side, and I will merge and then do a release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants