-
Notifications
You must be signed in to change notification settings - Fork 41
feat(rust/sedona-pointcloud) Initial LAZ format support #471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Looking forward to this! Just a note that the |
|
Sorry for being slow here...I spent today catching up on reviews that had accumulated over the holidays and will take a look tomorrow! |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! I left some comments inline but most of them can be follow-ups.
The main thing this needs before it can be merged is test coverage. I didn't spot any tests but feel free to let me know if I missed them! I would suggest:
- Adding .laz file(s) that cover the range of input options you expose here to apache/sedona-testing, maybe with scripts to generate them if that's possible
- Read the test files and check output with
assert_batches_equal!(). There are some tests in sedona-geoparquet that do this kind of thing except yours can be easier because you can use your Plain geometry output to avoid having to check WKB.
sedona-db/rust/sedona-geoparquet/src/format.rs
Lines 564 to 573 in 68970b0
| fn setup_context() -> SessionContext { | |
| let mut state = SessionStateBuilder::new().build(); | |
| state | |
| .register_file_format(Arc::new(GeoParquetFormatFactory::new()), true) | |
| .unwrap(); | |
| SessionContext::new_with_state(state).enable_url_table() | |
| } | |
| #[tokio::test] | |
| async fn format_from_url_table() { |
- For some lower-level components you could also hard-code some byte ranges (e.g., the header bytes or the bytes for a records). For the builder you could load the builders with some data to check
finish().
| // TODO: proper size | ||
| std::mem::size_of_val(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chunk_tablecapacity in bytes +extra_attributescapacity in bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved in cd2b8ba
Maybe also need to calculate the header, which can blow up with arbitrarily sized (E)VLRs
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this (and for handling the Arrow/DataFusion update!)
I left some inline comments about ways to test these files. I am mostly worried that a future contributor or an LLM acting on their behalf will roll through and break something and there won't be a failing test to detect a regression for (e.g.) Int8s with a nodata value.
I think probably a golden file that exercises the full matrix of extra attribute data types / with or without offset / with or without scale / with or without nodata value + assertions that we read the file's content correctly would be sufficient. If it's easy to build that file using the Builder on demand we can do that too.
| #[allow(static_mut_refs)] | ||
| #[tokio::test] | ||
| async fn reader_basic_e2e() { | ||
| // create laz file | ||
| static mut LAZ: Vec<u8> = Vec::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this!
Does this really need to be static mut? For some other tests we use a temporary directory, which would also let you check a multi file read.
sedona-db/rust/sedona/src/context.rs
Line 622 in 9065867
| let tmpdir = tempdir().unwrap(); |
| #[allow(static_mut_refs)] | ||
| #[tokio::test] | ||
| async fn header_basic_e2e() { | ||
| // create laz file | ||
| static mut LAZ: Vec<u8> = Vec::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also just a tempfile unless this really needs to be static/mutable?
| .await | ||
| .unwrap(); | ||
|
|
||
| assert_eq!(batch.num_rows(), 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This level of granularity is OK if all the lower-level pieces are tested, although in this case they are mostly not.
| } | ||
|
|
||
| Ok(width) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions for tests that should live in this file:
- Create a builder with zero rows, check all of the output options to ensure they give you the schema (or at least number of columns) you are expecting
- The building of attributes (tests should cover each branch here). One of the nice parts about refactoring this to use GATs if you can would be that there are fewer branches to test (although probably easier to use an
rstestparameterized test like#[values(DataType::Int8, DataType::Int16, ...)]).
I don't think you'll need a test file for any of that (but you might need to create a mock header and mock attributes with/without offset, scale, and nodata).
|
|
||
| pub(crate) fn extra_bytes_attributes( | ||
| header: &Header, | ||
| ) -> Result<Vec<ExtraAttribute>, Box<dyn Error + Send + Sync>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe?
| ) -> Result<Vec<ExtraAttribute>, Box<dyn Error + Send + Sync>> { | |
| ) -> Result<Vec<ExtraAttribute>, DataFusionError> { |
| store: &(impl ObjectStore + ?Sized), | ||
| object_meta: &ObjectMeta, | ||
| header: &Header, | ||
| ) -> Result<Vec<ChunkMeta>, Box<dyn Error + Send + Sync>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ) -> Result<Vec<ChunkMeta>, Box<dyn Error + Send + Sync>> { | |
| ) -> Result<Vec<ChunkMeta>, DataFusionError> { |
Probably use plan_err!() for these?
| reader.header(), | ||
| &metadata_reader.fetch_header().await.unwrap() | ||
| ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a dummy header (perhaps you can inline the bytes of a known test file, or perhaps you can create one using the Builder that exercises the code path for the full matrix of data types by offset/scale/nodata, or create a function that accepts DataType, offset, scale, nodata, and outputs a file with exactly one extra attribute that we can use in a parameterized test to check that it roundtrips.
This provides basic listing table provider functionality for LAZ files, featuring partitioned/parallel read capabilities with low memory footprint.