Skip to content

Conversation

@Illviljan
Copy link
Contributor

@Illviljan Illviljan commented Dec 6, 2025

  • Closes cumsum drops index coordinates #6528 ?
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

The non-flox version reduces chunksizes significantly:

x = xr.DataArray([1, 1, 1, 1, 1], name="x").chunk()
grp_idx = xr.DataArray([-1, 0, 0, -1, 1])
with xr.set_options(use_flox=False):
    print(x.groupby(grp_idx).cumsum())
<xarray.DataArray 'x' (dim_0: 5)> Size: 40B
dask.array<getitem, shape=(5,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>
Dimensions without coordinates: dim_0

With flox the chunksize is retained:

x = xr.DataArray([1, 1, 1, 1, 1], name="x").chunk()
grp_idx = xr.DataArray([-1, 0, 0, -1, 1])
with xr.set_options(use_flox=True):
    print(x.groupby(grp_idx).cumsum())
<xarray.DataArray 'x' (dim_0: 5)> Size: 40B
dask.array<_finalize_scan, shape=(5,), dtype=int64, chunksize=(5,), chunktype=numpy.ndarray>
Dimensions without coordinates: dim_0


# return result

actual = apply_ufunc(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is the way. eventually I'd like the apply_ufunc for reductions to live in Xarray too. So feel free to move that over if it helps. We could put it in flox_compat.py

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
keep_attrs: bool | None = None,
**kwargs: Any,
) -> Dataset:
raise NotImplementedError()
Copy link
Contributor Author

@Illviljan Illviljan Dec 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made these changes manually now.
I'm not getting pytest-accept to correctly fix the docstrings in _aggregations.py, it's for example not indenting correctly. I'm not sure if this is just a Windows 10 thing.

@Illviljan Illviljan marked this pull request as ready for review December 7, 2025 12:35
else:
# TODO: Remove drop_vars when GH6528 is fixed
# when Dataset.cumsum propagates indexes, and the group variable?
assert_identical(expected.drop_vars(["x", "group_id"]), actual)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping this until min_version of flox is 0.10.5 at least.
Coordinates and docstrings might differ between using flox or not now though. Is that ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not ok imo. I think it might be fixed by simply propagating coordinates in the non-flox branch of the templated code. Might be easy

@Illviljan
Copy link
Contributor Author

Tests are passing now! But there's a lot of deactivated options left and there was quite a bit of extra code in _flox_reduce xarray_reduce, not sure how much of it was to deal with reduction operations only.

wrapper,
obj,
*codes,
# input_core_dims=input_core_dims,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this because we just want the full array forwarded

kwargs=dict(
func=func,
skipna=skipna,
expected_groups=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be the same as _flox_reduce. This is an important optimization.

Comment on lines +1255 to +1257
dtype=None,
method=None,
engine=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These we should grab from kwargs and forward just like _flox_reduce

# exclude_dims=set(dim_tuple),
# output_core_dims=[output_core_dims],
dask="allowed",
# dask_gufunc_kwargs=dict(
Copy link
Contributor

@dcherian dcherian Dec 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please delete.

Comment on lines +1239 to +1241
# for xarray's test_groupby_duplicate_coordinate_labels
# exclude_dims=set(dim_tuple),
# output_core_dims=[output_core_dims],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please delete

("cumprod", [1.0, 2.0, 6.0, 6.0, 2.0, 2.0]),
],
)
def test_resample_cumsum(method: str, expected_array: list[float]) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not use flox for the resample case. It could break catastrophically if we don't support method="blockwise" or "cohorts" in flox.

assert_identical(expected, actual)


def test_groupby_cumsum() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing here is pretty light. We'll want at least two more cases:

  1. grouping by a single nD variable
  2. grouping by multiple variables.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cumsum drops index coordinates

2 participants