Conversation

@RohanDisa (Contributor)

This PR experiments with returning a PyArrow Dataset from read_xarray instead of a RecordBatchReader.

The goal is to explore whether registering chunked InMemoryDatasets enables lazier behavior when integrating with Arrow/DataFusion, in light of the ongoing discussion around dataset registration and laziness.
Issue: #93
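
As a sketch of the shape of the change (not the PR's actual code; pa_batches_from_xarray is a hypothetical stand-in for however read_xarray currently produces record batches):

```python
import pyarrow as pa
import pyarrow.dataset as ds

def read_xarray(xr_ds) -> ds.Dataset:
    # Hypothetical helper: yields an Arrow schema plus an iterable of
    # RecordBatches, one per chunk of the xarray dataset.
    schema, batches = pa_batches_from_xarray(xr_ds)
    # Before: return pa.RecordBatchReader.from_batches(schema, batches)
    # After: wrap the chunked batches in an InMemoryDataset, which query
    # engines such as DataFusion can register like any other pyarrow Dataset.
    return ds.InMemoryDataset(list(batches), schema=schema)
```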

@alxmrs (Owner) commented Dec 22, 2025

In my side experiments, I found that the default version of pyarrow the project uses doesn't allow passing a RecordBatchReader into an InMemoryDataset. I believe this capability is essential for achieving laziness in pure Python.

Thus, I recommend we add pyarrow as an explicit dependency of the project, and further, that we raise the project's minimum versions so that the default pyarrow version supports the above property. The current pyarrow release, 22.0.0, would fit the bill.
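
A minimal repro of the property in question (hedged: the version boundary here is taken from this comment, not verified independently):

```python
import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("x", pa.int64())])

def gen_batches():
    for i in range(3):
        yield pa.record_batch([pa.array([i])], schema=schema)

reader = pa.RecordBatchReader.from_batches(schema, gen_batches())

# On the project's current default pyarrow this constructor call rejects the
# reader; on pyarrow 22.0.0 it should succeed, which is what makes a lazy,
# pure-Python path possible.
dataset = ds.InMemoryDataset(reader, schema=schema)
```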

@alxmrs (Owner) commented Dec 22, 2025

TODO:

  • Fix broken tests (I think the test code is wrong)
  • Write new tests to ensure that the new code is actually lazily evaluated (one possible shape sketched below).
  • Update documentation
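
One possible shape for such a laziness test: back the RecordBatchReader with a generator that records a side effect whenever a batch is pulled, then assert that construction alone pulls nothing. All names below are illustrative, not the project's actual test code:

```python
import pyarrow as pa
import pyarrow.dataset as ds

def test_dataset_is_lazy():
    schema = pa.schema([("x", pa.int64())])
    pulled = []

    def gen_batches():
        for i in range(3):
            pulled.append(i)  # side effect marks real evaluation
            yield pa.record_batch([pa.array([i])], schema=schema)

    reader = pa.RecordBatchReader.from_batches(schema, gen_batches())
    dataset = ds.InMemoryDataset(reader, schema=schema)  # needs pyarrow >= 22.0.0

    # Constructing the dataset must not consume the reader...
    assert pulled == []
    # ...while scanning it does.
    dataset.to_table()
    assert pulled == [0, 1, 2]
```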

@alxmrs (Owner) commented Dec 22, 2025

Ideally, we could combine the two helper functions into one. I don't think they need to be in the closure of this function either.

@alxmrs (Owner) commented Dec 22, 2025

I tested the current implementation in a Colab notebook, and it is not lazy. I think we need to subtly change how the record batch reader is used to get it to be lazy.
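
For what it's worth, a guess at the kind of subtle change this points at: anything that drains the reader while building the result (read_all(), or iterating its batches into a list) makes it eager; the reader has to reach the dataset unconsumed. A hedged sketch:

```python
import pyarrow as pa
import pyarrow.dataset as ds

def wrap_eagerly(reader: pa.RecordBatchReader) -> ds.Dataset:
    # Eager: read_all() pulls every batch into memory right here.
    return ds.InMemoryDataset(reader.read_all())

def wrap_lazily(reader: pa.RecordBatchReader) -> ds.Dataset:
    # Lazy (per the earlier comment, this requires pyarrow >= 22.0.0): hand
    # over the reader itself so batches are produced only at scan time.
    return ds.InMemoryDataset(reader, schema=reader.schema)
```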
