Conversation
@Synicix would you mind outlining the exact form in which you are serializing the Arrow array? I'd like the serialization rule to be well documented so that, in theory, we could replicate the implementation elsewhere.
Pull request overview
This PR implements a custom Arrow data hashing system to replace the arrow-digest dependency, addressing field order dependency issues and improving support for decimal types by including metadata (precision/scale) in the hash to prevent collisions.
Key Changes:
- Custom `ArrowDigester` struct with support for multiple Arrow data types and nested structures
- Field-order-independent hashing using alphabetically sorted field names
- Enhanced decimal type hashing that includes precision and scale metadata
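The two design points above can be illustrated with a small self-contained sketch. This is not the PR's implementation: the real `ArrowDigester` hashes Arrow buffers with SHA-256, while the `digest_record` function and `DefaultHasher` here are stand-ins chosen so the example needs only the standard library.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy digest over (field name, unscaled decimal value, precision, scale)
/// tuples. Illustrates two properties of the hashing scheme described above:
/// fields are hashed in alphabetical order, so reordering them does not
/// change the digest, and decimal metadata (precision/scale) is mixed into
/// the hash so values that only differ in scale cannot collide.
fn digest_record(fields: &[(&str, i128, u8, u8)]) -> u64 {
    let mut sorted: Vec<_> = fields.to_vec();
    // Field-order independence: sort by field name before hashing.
    sorted.sort_by_key(|(name, _, _, _)| *name);

    let mut hasher = DefaultHasher::new();
    for (name, unscaled, precision, scale) in sorted {
        name.hash(&mut hasher);
        // Include decimal metadata so identical unscaled values with
        // different precision/scale produce different digests.
        precision.hash(&mut hasher);
        scale.hash(&mut hasher);
        unscaled.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    let a = digest_record(&[("price", 1999, 10, 2), ("qty", 3, 5, 0)]);
    let b = digest_record(&[("qty", 3, 5, 0), ("price", 1999, 10, 2)]);
    assert_eq!(a, b); // field order does not affect the digest

    let c = digest_record(&[("price", 1999, 10, 3), ("qty", 3, 5, 0)]);
    assert_ne!(a, c); // a different scale changes the digest
    println!("order-independent digest: {:x}", a);
}
```

The same idea carries over to the SHA-256 case: as long as the metadata bytes are fed to the hasher before the value bytes, two logically different decimal columns can never serialize to the same byte stream.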
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| src/arrow_digester.rs | New custom hasher implementation with support for various Arrow data types, nested structures, and comprehensive test coverage |
| src/pyarrow.rs | Updated FFI integration to use custom ArrowDigester instead of arrow-digest, with improved safety annotations |
| src/lib.rs | Added arrow_digester module declaration |
| Cargo.toml | Updated dependencies (removed arrow-digest, added digest/postcard/serde), configured strict clippy lints, changed edition to '2024' |
| README.md | Added documentation explaining the hashing system architecture and design choices |
| cspell.json | Added "uids" to dictionary and cleaned up empty ignoreWords |
| .vscode/settings.json | Added VSCode workspace configuration for Rust development |
| .github/workflows/clippy.yml | Added CI workflow for Rust syntax, style, and format checks |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Not sure why clippy didn't catch this Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
As far as I can tell, this only runs the checks but doesn't apply the formatting, does it? Shall we make it so that the code gets auto-formatted?
In VS Code, with the settings I set, it auto-formats on save. Or do you mean having the GitHub Action push a commit that applies the formatting?
Generally I don't think we should include VS Code settings, and if we do, we should keep them minimal and limited to things that apply to everyone.
A lot of it lets people who use VS Code get everything set up with the correct configuration, like rust-analyzer and formatting. Ideally I'd like to keep it somewhere, but I can trim it down to the essentials.
Let's reduce this to the bare minimum. There are definitely some entries, like the Python 3 interpreter path, that shouldn't be set and expected to be the same across different working environments.
    reason = "Need to convert raw pointers to Arrow data structures"
)]
#[expect(
    clippy::multiple_unsafe_ops_per_block,
    clippy::expect_used,
    reason = "Okay since we are doing the same operation of dereferencing pointers, Will add proper errors later"
I agree with the suggestion here. @Synicix mind making the changes?
… a sha256 as public
… in the same hash, and fix bug related to it
@eywalker
Will tag all the Linear issues here tomorrow. Fixes PLT-583, PLT-560, PLT-558
src/arrow_digester.rs
Outdated
}

/// Hash an array directly without needing to create an `ArrowDigester` instance on the user side
pub fn hash_array(array: &dyn Array) -> Vec<u8> {
Could you expose hashing of the schema alone? I'd have a use case for that.
I have updated the code so that the schema-hashing function is also exported to the Python side.
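A minimal sketch of what schema-only hashing can look like, using the same ordering rules as the array digest. The `Field` tuple type and `hash_schema` signature here are hypothetical stand-ins; the actual code hashes Arrow `Schema` objects with SHA-256 and exposes the result through the Python bindings.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for an Arrow field: (field name, type name,
/// optional (precision, scale) for decimal types).
type Field = (&'static str, &'static str, Option<(u8, u8)>);

/// Hash the schema alone, independent of any data, so callers can
/// fingerprint a table layout without materializing arrays.
fn hash_schema(fields: &[Field]) -> u64 {
    let mut sorted: Vec<_> = fields.to_vec();
    // Same field-order independence as the array digest.
    sorted.sort_by_key(|(name, _, _)| *name);

    let mut h = DefaultHasher::new();
    for (name, dtype, decimal_meta) in sorted {
        name.hash(&mut h);
        dtype.hash(&mut h);
        decimal_meta.hash(&mut h); // precision/scale for decimal fields
    }
    h.finish()
}

fn main() {
    let s1 = hash_schema(&[("id", "Int64", None), ("price", "Decimal128", Some((10, 2)))]);
    let s2 = hash_schema(&[("price", "Decimal128", Some((10, 2))), ("id", "Int64", None)]);
    assert_eq!(s1, s2); // same schema, different field order
    println!("schema digest: {:x}", s1);
}
```

Keeping the schema digest separate from the data digest means two tables with identical layouts share a schema fingerprint even when their contents differ.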
Update:
Fixes: PLT-657
This is a replacement for arrow-digest, since its hash depends on the order of fields and elements, and it is missing support for some additional data types that we may need in the future.
Other fixes:
Fixes PLT-451, PLT-544, PLT-561, PLT-560, PLT-594, PLT-657, PLT-583, PLT-558