
Conversation

@m-albert (Collaborator) commented May 30, 2021

This PR suggests a dask-image implementation of ndimage.map_coordinates which covers the case in which the provided coordinates fit into memory (update after #237 (comment)) and the input array is a (potentially large) dask array. See #235.

Implementation details:

  • for every coordinate, the corresponding chunk of the input array and the minimally required input slice are determined
  • coordinates are grouped by their associated chunk
  • for each group, a task is defined that applies ndimage.map_coordinates to the minimal required input array slice
  • output intensities are rearranged to match the input order
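The grouping step can be sketched roughly as follows (a toy example with made-up array sizes and coordinates, not the PR's actual code):

```python
from collections import defaultdict

import numpy as np
import dask.array as da

# Hypothetical chunked image and a few sample coordinates
image = da.zeros((100, 100), chunks=(50, 50))
coords = np.array([[10.5, 80.2, 30.0],
                   [5.0, 60.1, 99.0]])  # shape (ndim, npoints)

# Chunk start offsets along each axis, derived from the chunk sizes
offsets = [np.cumsum((0,) + chunks[:-1]) for chunks in image.chunks]

# For each coordinate, the index of the chunk it falls into along each axis
chunk_ids = np.array([np.searchsorted(offs, c, side="right") - 1
                      for offs, c in zip(offsets, coords)])

# Group coordinate positions by chunk location
groups = defaultdict(list)
for i, key in enumerate(map(tuple, chunk_ids.T)):
    groups[key].append(i)

# groups now maps chunk index -> list of coordinate positions,
# so one map_coordinates task can be defined per chunk
```

Each group can then be processed by a single task that slices out only the chunk (plus interpolation margin) it needs.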

Some tests are included; the plan is to add more following feedback on the implementation choices.

m-albert added 2 commits May 30, 2021 23:50
Implementation: pre-map coordinates onto the chunks of 'image' and
execute `ndimage.map_coordinates` for every chunk.
@m-albert (Collaborator Author)

A good test would be to see how it performs for your use case @jni.

@GenevieveBuckley (Collaborator)

A good test would be to see how it performs for your use case @jni.

Yes, this would be very interesting to see

@GenevieveBuckley (Collaborator) commented Feb 9, 2022

First, my apologies - I keep sitting down to look at this, and then deciding I need to come back later when I feel I can give it enough attention. Then the same thing happens again later.

Tests

I like the structure of the test validation, comparing the results from scipy.ndimage.map_coordinates with the new Dask implementation. I'd prefer each test case to have an explicit assert statement at the end; as written, it's not immediately clear what the two actual test functions are checking.

numpy specific code

There is a lot of numpy-specific code here. That might cause a separate headache later on, given our goal to support other, non-numpy array libraries (e.g. cupy). I don't think that's a reason not to do it, especially since there are groups who would use a map_coordinates feature (Juan, possibly Adam Ginsburg). But I would prefer if we could minimise or eliminate calls to np.array/np.asarray. Again, this doesn't have to be your headache now.

Readability

If you could add comments to the main code function, indicating which parts handle each of the four steps you outline in the first comment on this PR thread, I think that would improve readability.

Also on readability, the docstring will need to be fleshed out a bit (but perhaps a bit later, this might be putting the cart before the horse right now).

... I still don't have thoughtful comments on your actual implementation (mostly I haven't fully understood it yet). Will try to sit down and give it another go soon-ish, but no promises!

@astrofrog

Just wanted to say that I would find this feature very useful for making https://github.com/astropy/reproject more compatible with dask inputs!

`ndimage.map_coordinates` for every chunk.
"""

coordinates = np.asarray(coordinates)


This currently assumes that coordinates is a small enough array that it will fit in memory, but what if both image and coordinates are large arrays and would both not fit in memory? Or is it unlikely that this would be the case?

@m-albert (Collaborator Author) commented Oct 9, 2022

Hey @GenevieveBuckley, first of all I really appreciate you having a look at this PR and giving your feedback and comments! Second of all, apologies back for letting this sit for quite a while. I had worked on modifications some months back and am only pushing them now.

Tests
I like the structure of the test validation, comparing the results from scipy.ndimage.map_coordinates with the new Dask implementation. I'd prefer each test case have an explicit assert statement at the end of it, the two actual test functions are not immediately clear with what they're checking.

Regarding the assert statements, it's probably not super clear that they're performed within validate_map_coordinates_general, which is called from within each test function. In test_map_coordinates_large_input, I added a comment to clarify that the test consists of finishing before a timeout. I'm not really sure, though, whether it makes sense to test large inputs in this way (or at all?): 'large' is relative, and one would probably want to avoid heavy lifting during testing.

numpy specific code
There is lots of numpy specific code here. That might cause a separate headache later on, with our goal to implement support for other, non-numpy array libraries (eg: cupy). I don't think that's a reason not to do it, especially if there are groups who would use a map_coordinates feature (Juan, possibly Adam Ginsburg). But I would prefer if we could minimise or eliminate calls to np.array/np.asarray. Again, this doesn't have to be your headache now.

That's a very good point to keep in mind. I eliminated some explicit conversions to numpy arrays, but we'd probably need to rewrite some things when implementing more general support.

Readability
If you could add comments to the main code function, indicating which parts handle each of the four steps you outline in the first comment on this PR thread, I think that would improve readability.

Great idea, I added comments explaining the concrete steps that are performed in the code. That should hopefully make things clearer.

Also on readability, the docstring will need to be fleshed out a bit (but perhaps a bit later, this might be putting the cart before the horse right now).

Definitely, I'm open to suggestions. Actually, would there currently be a way to use @utils._update_wrapper (which copies the original docstring) and add some additional comments, e.g. to give additional details about the dask-image wrapping? If not, that might be useful.

@m-albert (Collaborator Author) commented Oct 9, 2022

your actual implementation (mostly I haven't fully understood it yet)

Hope the comments and explanations in the code make it clearer. I'm also linking a part of the previous discussion we had here: #235 (comment)

@m-albert (Collaborator Author) commented Oct 9, 2022

Hey @astrofrog, thanks for your comment :) Yes, you're right that this implementation doesn't work with chunked coordinate arrays. Allowing both the image and coordinate arrays to be dask arrays makes the problem much more complex, because until the coordinates are present in memory, it's not known which chunks of the image array they need to be mapped onto. See this discussion #235.

I'd tend to think that while it's not necessarily safe to assume the coordinate array is small in all use cases, it's reasonable to assume it is at least smaller than the image array. In the opposite case, it would probably be more efficient to arrange the coordinates on a regular grid or in some other structured way. This also holds for most (all?) of my use cases of map_coordinates.

What would be your thoughts on this?

@astrofrog

For my use case (https://github.com/astropy/reproject), the coordinates might have a similar number of elements as the input array - we are essentially transforming images to different coordinate systems which can involve arbitrary distortions etc. We are currently able to deal with large arrays of coordinates in map_coordinates by just calling map_coordinates several times for different chunks of coordinates and then putting the results together at the end. Since we've already coded this up, it's not critical to me that map_coordinates be able to deal with large arrays of coordinates, but just wanted to bring it up in case it might matter to others, and in case it turns out to be simple here.
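The strategy described here — calling map_coordinates on slices of the coordinate array and stitching the results back together — might look roughly like this (a hypothetical helper sketched with plain scipy, not reproject's actual code):

```python
import numpy as np
from scipy import ndimage

def map_coordinates_in_chunks(image, coordinates, chunk_len=1_000_000, **kwargs):
    """Apply scipy's map_coordinates to slices of a large coordinate
    array and concatenate the results. Hypothetical illustration only."""
    npoints = coordinates.shape[1]
    pieces = []
    for start in range(0, npoints, chunk_len):
        sl = slice(start, start + chunk_len)
        # Each slice of coordinates is interpolated independently
        pieces.append(ndimage.map_coordinates(image, coordinates[:, sl], **kwargs))
    return np.concatenate(pieces)
```

Since each slice is independent, the per-slice calls could also be dispatched as separate tasks rather than run in a loop.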

@m-albert (Collaborator Author)

Interesting. In the case of regular coordinates I was thinking that something like scipy.interpolate.RegularGridInterpolator might be more appropriate than scipy.ndimage.map_coordinates, but of course that's not the case for nonlinear transformations. And here coordinate arrays can be arbitrarily large.

What's also true, though, is that, as you mention, it's relatively straightforward to split up the coordinate array and call map_coordinates several times. So maybe it makes sense to apply the current approach separately to each chunk of the coordinate array.

I guess this could in principle be done using the dask.distributed functionality of launching tasks from within tasks. For a less brute-force approach, we could also look at whether the problem can be expressed as a shuffling one, as suggested by @jakirkham in #235 (comment).

@jni (Contributor) commented Oct 19, 2022

What's also true though is that as you mention it's relatively straight forward to split up the coordinate array and call map_coordinates several times. Therefore maybe it makes sense to use the current approach applied separately to each chunk of the coordinate array.

This makes a lot of sense to me. I can't see why it wouldn't work.

@m-albert (Collaborator Author)

Thanks for your comment @jni! I'll try to implement this in the next week(s) - extending this PR's map_coordinates implementation by applying it to each chunk of the coordinate array.

@m-albert (Collaborator Author) commented Jan 8, 2023

Happy New Year!

We are currently able to deal with large arrays of coordinates in map_coordinates by just calling map_coordinates several times for different chunks of coordinates and then putting the results together at the end.

What's also true though is that as you mention it's relatively straight forward to split up the coordinate array and call map_coordinates several times. Therefore maybe it makes sense to use the current approach applied separately to each chunk of the coordinate array.

Finally I pushed an implementation of this, which also allows the coordinates to be given as a (potentially large) dask array.

I faced a little difficulty worth mentioning here. Following the approach we discussed above, when both input and coordinates are dask arrays, two rounds of compute are required. For each chunk of the coordinate array

  1. the coordinate values are computed
  2. and are then mapped onto the input array using the discussed approach.

To achieve this I used da.map_blocks in combination with a little hack. Normally, the function passed to map_blocks(func, ...) receives precomputed data chunks. In this case, however, only the coordinate chunk should be precomputed, while the input array should remain a dask array. I found the following way to avoid triggering the computation:

output = da.map_blocks(
    _map_single_coordinates_array_chunk,
    delayed(da.Array)(input.dask, input.name, input.chunks, input.dtype),
    coordinates,
    ...)

(Edit: I posted a question regarding this on the dask discourse forum).

Apart from that, I adapted the tests, and the implementation is consistent with the scipy output for all combinations of numpy and dask arrays as inputs (without prefiltering). What the tests don't assess, though, is performance.

@astrofrog Would you be able to test dask_image.ndinterp.map_coordinates with your use case of large coordinate arrays?

@m-albert (Collaborator Author) commented Jan 9, 2023

Actually, would there currently be a way to use @utils._update_wrapper (that copies the original docstring), and add some additional comments to e.g. give additional details about the dask-image wrapping? If not, that might be useful.

Just adding here that I came across dask.utils.derived_from(original_klass, version=None, ua_args=None, skipblocks=0, inconsistencies=None), which could be useful both for automatically copying original docstrings and for adding comments about inconsistencies between dask-image and the original functions.
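For illustration, a minimal sketch of how derived_from could be used here (the placeholder function below is not the PR's implementation):

```python
import scipy.ndimage
from dask.utils import derived_from

# `derived_from` looks up the function of the same name on the given
# module/class, copies its docstring, and prepends a provenance note.
@derived_from(scipy.ndimage)
def map_coordinates(input, coordinates):
    return None  # placeholder body for illustration
```

After decoration, the function's docstring is scipy's, prefixed with a note stating it was copied from scipy.ndimage, and parameters that differ between the versions get flagged.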

@GenevieveBuckley (Collaborator)

Happy new year!

I quite like your little hack, it seems like a pretty elegant sidestep around the problem to me. I commented in more detail on the discourse thread.

@jni (Contributor) commented Jan 10, 2023

Just dropping in to say that I am in awe. 😍 Happy New Year indeed! 😃

As a side note, the RTD action failure appears to be unrelated to this PR, probably to do with some sphinx/jinja2 update...

  File "/home/docs/checkouts/readthedocs.org/user_builds/dask-image/conda/237/lib/python3.7/site-packages/sphinx/util/rst.py", line 22, in <module>
    from jinja2 import environmentfilter
ImportError: cannot import name 'environmentfilter' from 'jinja2' (/home/docs/checkouts/readthedocs.org/user_builds/dask-image/conda/237/lib/python3.7/site-packages/jinja2/__init__.py)

@m-albert (Collaborator Author)

Thanks for having a closer look at the hack and your comments on the dask discourse @GenevieveBuckley :) So far it seems to be an okay solution.

As a side note the RTD action failure appears to be unrelated to this PR, probably to do with some sphinx/jinja2 update...

You're right, pinning jinja2<3.1 seems to fix the problem: readthedocs/readthedocs.org#9038.

@astrofrog

@m-albert - this functionality is a blocker for some of the work that we want to do in reproject, so I am considering whether to vendor this implementation in reproject for now, or whether you think this might be merged in the near future? No worries either way but just wanted to check before proceeding.

@jni (Contributor) commented Sep 30, 2025

fwiw I'm happy for this to be merged as-is. Thanks @m-albert!

@m-albert (Collaborator Author)

@astrofrog Thanks for the ping here!

I think we should go ahead with merging this PR and making ndinterp.map_coordinates available given that

  • it provides new functionality
  • its availability could create interest and further input into implementation details

On my to do list was:

clarifying the compatibility of this implementation with distributed

I experimented with the distributed scheduler and cannot reproduce the memory explosions I observed earlier. For larger-than-memory input and output arrays, I think the only sensible scheduler to use in combination with ndinterp.map_coordinates is the single-threaded one. Things take time but actually run through! I added a note regarding this to the docstring, so I'd say this point can be considered checked.
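For reference, the single-threaded scheduler can be selected per call or via the config (a generic dask sketch, not specific to map_coordinates):

```python
import dask
import dask.array as da

x = da.ones((1000, 1000), chunks=(250, 250))

# Option 1: force the single-threaded scheduler for one computation
total = x.sum().compute(scheduler="single-threaded")

# Option 2: set it for everything computed inside the block
with dask.config.set(scheduler="single-threaded"):
    total_again = x.sum().compute()
```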

collecting more feedback on the implementation

This would basically mean having somebody review this PR 🤗.

  1. @jni already commented here, thank you 🚀
  2. It'd be amazing if either John or Genevieve could perform a quick review here and comment whether they'd be in favor of merging this!

@jakirkham @GenevieveBuckley What do you think? Thank you for your input 🙏🙏

@astrofrog

@jakirkham @GenevieveBuckley - just to check, is this something you would have time to review? This would be very useful for the reproject package, and I'd rather avoid vendoring the code here if at all possible. Thanks!

"affine_transform",
"rotate",
"spline_filter",
"spline_filter1d",


Should this include map_coordinates?

@m-albert (Collaborator Author)


Good point. For this function to support GPUs we'd need to additionally use the dispatched function in the main function.

Unless you're interested in it, I suggest leaving this to a future PR so as not to complicate things further.


@m-albert - just to be clear, I just mean adding map_coordinates to __all__ to expose it in the public API for the package

@m-albert (Collaborator Author)


Oh right, I got confused by the "dispatch" shown in the diff 😂

I just mean adding map_coordinates to __all__ to expose it in the public API for the package

Yes, good catch! Would you like to add it here? Otherwise I'll get to it in the next days.


I'm happy for you to do it

@GenevieveBuckley (Collaborator)

fyi Juan sent me a text message a couple of days ago to tell me to take a look at this again, thanks Juan!

@GenevieveBuckley (Collaborator)

Very impressive work @m-albert! Really excellent.

I have:

  • added map_coordinates to the public API list (suggested by astrofrog)
  • made it clearer that cupy GPU arrays are not supported by this function

Question:
I've added this extra info to the docstring about the prefilter kwarg. I believe it's correct, but would like to double-check with other people:

    Note: prefilter is not necessary when:
      - you are using nearest neighbour interpolation, by setting order=0,
      - you are using linear interpolation, by setting order=1, or
      - you have already prefiltered the input array using the
        spline_filter or spline_filter1d functions.

If my understanding is correct, does this mean we should direct users to run their own prefilter first, using the dask_image.ndinterp.spline_filter / spline_filter1d functions that exist in dask-image (also thanks to Marvin), whenever order>1 (i.e. when they want a higher-order spline rather than nearest neighbour or linear interpolation)?

Presumably that approach would lead to correct output values, matching the output results from scipy more exactly?
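The equivalence in question can be checked directly with plain scipy: with a matching mode, explicit prefiltering followed by prefilter=False reproduces the default behaviour (for dask arrays, dask_image.ndinterp.spline_filter would play the same role):

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((50, 50))
coords = np.array([[10.3, 20.7, 40.1],
                   [5.5, 30.1, 12.9]])

# Letting map_coordinates prefilter internally (order > 1) ...
expected = ndimage.map_coordinates(image, coords, order=3, mode="mirror")

# ... should match prefiltering explicitly and then skipping it
filtered = ndimage.spline_filter(image, order=3, mode="mirror")
result = ndimage.map_coordinates(filtered, coords, order=3,
                                 prefilter=False, mode="mirror")
```

Note that the mode passed to spline_filter must match the one used for interpolation, otherwise the boundary handling of the two steps disagrees.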

@GenevieveBuckley (Collaborator)

I'm going to merge this and try to make a pre-release in the next day or so.

If anyone has comments on improving the docstring information on prefilter, let me know (we can always make a separate mini-PR to improve the docs)

@GenevieveBuckley GenevieveBuckley merged commit 5a594e7 into dask:main Nov 11, 2025
17 checks passed
@GenevieveBuckley GenevieveBuckley mentioned this pull request Nov 12, 2025
@GenevieveBuckley (Collaborator)

fyi there is a dask-image pre-release with map_coordinates now available: https://github.com/dask/dask-image/releases/tag/v2025.11.0rc1
You should be able to try it out with: pip install --pre dask-image

@astrofrog are you able to try that out and let me know if you find any problems before I make the full release?

@astrofrog

Thank you! I will test this out today

@astrofrog

@GenevieveBuckley It looks like the PyPI upload failed:

https://github.com/dask/dask-image/actions/runs/19283237380/job/55138875502

I think you should be able to fix the issue on PyPI and just rerun the deploy step. In the meantime, I'll install from the tag.

@astrofrog

If my understanding is correct, does this mean we should be directing users to run their own prefilter first, using the dask_image.ndinterp.spline_filter / spline_filter1d functions that exist in dask-image (also thanks to Marvin), in cases where order>1 (i.e. when they are not using nearest neighbour or linear interpolation, but want a higher order spline?)

What if dask_image.ndinterp.map_coordinates did this automatically, e.g. using dask_image.ndinterp.spline_filter automatically if prefilter=True?
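Sketched with plain scipy stand-ins (for dask inputs, dask_image.ndinterp's spline_filter and map_coordinates would take these roles), the suggestion might look like this hypothetical wrapper:

```python
import numpy as np
from scipy import ndimage

def map_coordinates_auto(image, coordinates, order=3, prefilter=True,
                         mode="mirror"):
    """Hypothetical wrapper: run the explicit spline prefilter up front
    when needed, then disable the internal prefiltering step."""
    if prefilter and order > 1:
        # Same mode must be used for filtering and interpolation
        image = ndimage.spline_filter(image, order=order, mode=mode)
        prefilter = False  # already done explicitly
    return ndimage.map_coordinates(image, coordinates, order=order,
                                   prefilter=prefilter, mode=mode)
```

In the dask-image setting, this would let prefilter=True default safely, since the chunked spline_filter handles the filtering instead of each per-chunk scipy call.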

@astrofrog commented Nov 12, 2025

@GenevieveBuckley - I can confirm this works great for reproject (I have a PR at astropy/reproject#532 that adds support for it).

I think once this release is out, it would be worth investigating properly implementing prefilter=True using dask-image's spline_filter as you noted, so that prefilter could actually default to True here.

@GenevieveBuckley (Collaborator)

What if dask_image.ndinterp.map_coordinates did this automatically, e.g. using dask_image.ndinterp.spline_filter automatically if prefilter=True?

That's a great idea, I've made an issue with your suggestion here: #419
I don't know if anyone is likely to have time to pick that up, but it'd be an excellent addition

@GenevieveBuckley (Collaborator)

Also, release 2025.11.0 is out now and available on PyPI, which includes the new map_coordinates function.

@m-albert (Collaborator Author) commented Nov 13, 2025

Thanks so much for getting this over the finish line @GenevieveBuckley and @astrofrog 🙏🙏

And @jni of course!
