Skip to content

Conversation

@christianstassen
Copy link
Contributor

This introduces a notebook showing how to create a pipeline for BARRA-R2 and ERA5.

I also added a bunch of new Transforms, which introduce arithmetics (add, subtract, multiply and divide) and a new Operation to round valid time steps.

The transforms are part of the notebook, I thought it might be helpful for other to see how to create their own but, I also added them to the PET source code.

All changes have been tested by me in the notebook, by pytest, and I have run the black tool.

@coveralls
Copy link

Pull Request Test Coverage Report for Build 18769161211

Details

  • 35 of 125 (28.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 61.023%

Changes Missing Coverage Covered Lines Changed/Added Lines %
packages/pipeline/src/pyearthtools/pipeline/operations/xarray/round_time.py 8 26 30.77%
packages/data/src/pyearthtools/data/transforms/arithmetic.py 26 98 26.53%
Totals Coverage Status
Change from base Build 18765016601: -0.3%
Covered Lines: 9592
Relevant Lines: 15297

💛 - Coveralls

@nikeethr
Copy link
Collaborator

nikeethr commented Jan 6, 2026

Thanks for your work, appreciate it. I have a design comment/question:

With the simplistic operations subtract, add etc. any reason why they need to be defined so explicitly as transforms? Just for my own understanding.

My guess is that this puts them within the category of "pipeline operations", so that they can be represented in some form - e.g. used for example to derive declarative versions of the pipeline. There may be other uses, but this is the main one I can think of. It's totally fine to do that, but there is a "but":

It seems troublesome to have to do this for every operation - I think there is a better way to code this sort of pattern e.g. via higher order functions. From a functional standpoint, it seems the datasets are dissected into individual arrays (using keys retrieved) with the operands applied to them and this seems to be common for all (most?) of the operations.

If, on the other hand, the reason for the transforms is for "consistency" just for consistency sake - and not for what I described above (or some other justifiable purpose), then using xarray directly seems more sensible to me. Especially if what is done is already a wrapper around something xarray is capable of. More specific/complicated transforms like temporal manipulation/rounding, can be an exception.

\cc @tennlee, since this is more of a design comment, that may also apply to pre-existing code.

@christianstassen
Copy link
Contributor Author

Thanks @nikeethr!
Some of what you describe above is above my head :) Probably best for @tennlee to comment on this.
I added those transforms for 'physical' reasons. Those being that the units of some variables in ERA5 do not match those in BARRA2. For example, precipitation in ERA5 is accumulated in meters, whereas in BARRA2 it is a flux and in kg m-2 s-1. To convert between the two I need to multiply the field in ERA5 by 1000. Some other variables have more complicated conversions (e.g., the radiation fields).
Rather than post-processing ERA5 and storing a copy of it, I thought it be easier to implement this in the pipeline.

Happy to hear other thoughts in this :)

@nikeethr
Copy link
Collaborator

nikeethr commented Jan 7, 2026

Hey @christianstassen, thanks for your explanation. I understand what you’re trying to do.

To elaborate - I’m thinking about maintenance and abstraction. Not saying you need to fix anything. It’s more an unavoidable juncture when it comes to any codebase that deals with pipelines. You have the operators and the structure around the operator conveying “meta” information.

For basic binary operations I’d expect the code bloat/overhead to be minimal, and there to be some pre-existing convenience wrapper that helps construct the pattern for each operation, so that it doesn’t have to be repeated. If we decide to change the backend that does the computations (or even bump a version), or the wrapping code itself - we’d have to apply any fixes to each class - making it an error prone and cumbersome exercise.

I’d say there’s nothing actually to fix in this case, but more something to take note of or think about if we see the same pattern repeating.

(It’s just that it’s quite apparent in this PR since they are “simple” operators - so it’s easier for me to raise the point to the maintainers. Also this would not even be in my radar for new or lightweight codebase, but this one is particularly big already)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants