Conversation

@RohanDisa (Contributor)

This PR experiments with returning a PyArrow Dataset from read_xarray instead of a RecordBatchReader.

The goal is to explore whether registering chunked InMemoryDatasets enables lazier behavior when integrating with Arrow/DataFusion, in light of the ongoing discussion around dataset registration and laziness.
Issue: #93
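
As a sketch of the shape of the change (not the PR's actual code; pa_batches_from_xarray is a hypothetical stand-in for however read_xarray currently produces record batches):

```python
import pyarrow as pa
import pyarrow.dataset as ds

def read_xarray(xr_ds) -> ds.Dataset:
    # Hypothetical helper: yields an Arrow schema plus an iterable of
    # RecordBatches, one per chunk of the xarray dataset.
    schema, batches = pa_batches_from_xarray(xr_ds)
    # Before: return pa.RecordBatchReader.from_batches(schema, batches)
    # After: wrap the chunked batches in an InMemoryDataset, which query
    # engines such as DataFusion can register like any other pyarrow Dataset.
    return ds.InMemoryDataset(list(batches), schema=schema)
```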

@alxmrs (Owner) commented Dec 22, 2025

In my side experiments, I found that the default version of pyarrow the project uses doesn't allow passing a RecordBatchReader into an InMemoryDataset. I believe this capability is essential for achieving laziness in pure Python.

Thus, I recommend we add pyarrow as an explicit dependency of the project, and further, that we raise the project's minimum versions so that the default pyarrow version supports the above property. The current pyarrow release, 22.0.0, would fit the bill.
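
A minimal repro of the property in question (hedged: the version boundary here is taken from this comment, not verified independently):

```python
import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("x", pa.int64())])

def gen_batches():
    for i in range(3):
        yield pa.record_batch([pa.array([i])], schema=schema)

reader = pa.RecordBatchReader.from_batches(schema, gen_batches())

# On the project's current default pyarrow this constructor call rejects the
# reader; on pyarrow 22.0.0 it should succeed, which is what makes a lazy,
# pure-Python path possible.
dataset = ds.InMemoryDataset(reader, schema=schema)
```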

@alxmrs (Owner) commented Dec 22, 2025

TODO:

  • Fix broken tests (I think the test code is wrong)
  • Write new tests to ensure that the new code is actually lazily evaluated (one possible shape sketched below).
  • Update documentation
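
One possible shape for such a laziness test: back the RecordBatchReader with a generator that records a side effect whenever a batch is pulled, then assert that construction alone pulls nothing. All names below are illustrative, not the project's actual test code:

```python
import pyarrow as pa
import pyarrow.dataset as ds

def test_dataset_is_lazy():
    schema = pa.schema([("x", pa.int64())])
    pulled = []

    def gen_batches():
        for i in range(3):
            pulled.append(i)  # side effect marks real evaluation
            yield pa.record_batch([pa.array([i])], schema=schema)

    reader = pa.RecordBatchReader.from_batches(schema, gen_batches())
    dataset = ds.InMemoryDataset(reader, schema=schema)  # needs pyarrow >= 22.0.0

    # Constructing the dataset must not consume the reader...
    assert pulled == []
    # ...while scanning it does.
    dataset.to_table()
    assert pulled == [0, 1, 2]
```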

@alxmrs (Owner) commented Dec 22, 2025

Ideally, we could combine the two helper functions into one. I don't think they need to be in the closure of this function either.

@alxmrs (Owner) commented Dec 22, 2025

I tested the current implementation in a Colab notebook, and it is not lazy. I think we need to subtly change how the record batch reader is used to get it to be lazy.
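
For what it's worth, a guess at the kind of subtle change this points at: anything that drains the reader while building the result (read_all(), or iterating its batches into a list) makes it eager; the reader has to reach the dataset unconsumed. A hedged sketch:

```python
import pyarrow as pa
import pyarrow.dataset as ds

def wrap_eagerly(reader: pa.RecordBatchReader) -> ds.Dataset:
    # Eager: read_all() pulls every batch into memory right here.
    return ds.InMemoryDataset(reader.read_all())

def wrap_lazily(reader: pa.RecordBatchReader) -> ds.Dataset:
    # Lazy (per the earlier comment, this requires pyarrow >= 22.0.0): hand
    # over the reader itself so batches are produced only at scan time.
    return ds.InMemoryDataset(reader, schema=reader.schema)
```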
