Skip to content

Conversation

@b4l
Copy link
Contributor

@b4l b4l commented Dec 30, 2025

This provides basic listing table provider functionality for LAZ files, featuring partitioned/parallel read capabilities with low memory footprint.

@paleolimbot
Copy link
Member

Looking forward to this!

Just a note that the packaging CI is mad because your files don't have Apache license headers (you can copy them from any other .rs file in the repo).

@b4l b4l marked this pull request as ready for review January 2, 2026 10:50
@b4l b4l changed the title [WIP] LAZ format support feat(rust/sedona-pointcloud) Initial LAZ format support Jan 2, 2026
@paleolimbot
Copy link
Member

Sorry for being slow here...I spent today catching up on reviews that had accumulated over the holidays and will take a look tomorrow!

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! I left some comments inline but most of them can be follow-ups.

The main thing this needs before it can be merged is test coverage. I didn't spot any tests but feel free to let me know if I missed them! I would suggest:

  • Adding .laz file(s) that cover the range of input options you expose here to apache/sedona-testing, maybe with scripts to generate them if that's possible
  • Read the test files and check output with assert_batches_equal!(). There are some tests in sedona-geoparquet that do this kind of thing except yours can be easier because you can use your Plain geometry output to avoid having to check WKB.

fn setup_context() -> SessionContext {
let mut state = SessionStateBuilder::new().build();
state
.register_file_format(Arc::new(GeoParquetFormatFactory::new()), true)
.unwrap();
SessionContext::new_with_state(state).enable_url_table()
}
#[tokio::test]
async fn format_from_url_table() {

  • For some lower-level components you could also hard-code some byte ranges (e.g., the header bytes or the bytes for a records). For the builder you could load the builders with some data to check finish().

Comment on lines 56 to 57
// TODO: proper size
std::mem::size_of_val(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • chunk_table capacity in bytes + extra_attributes capacity in bytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved in cd2b8ba

Maybe also need to calculate the header, which can blow up with arbitrarily sized (E)VLRs

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this (and for handling the Arrow/DataFusion update!)

I left some inline comments about ways to test these files. I am mostly worried that a future contributor or an LLM acting on their behalf will roll through and break something and there won't be a failing test to detect a regression for (e.g.) Int8s with a nodata value.

I think probably a golden file that exercises the full matrix of extra attribute data types / with or without offset / with or without scale / with or without nodata value + assertions that we read the file's content correctly would be sufficient. If it's easy to build that file using the Builder on demand we can do that too.

Comment on lines +189 to +193
#[allow(static_mut_refs)]
#[tokio::test]
async fn reader_basic_e2e() {
// create laz file
static mut LAZ: Vec<u8> = Vec::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this!

Does this really need to be static mut? For some other tests we use a temporary directory, which would also let you check a multi file read.

let tmpdir = tempdir().unwrap();

Comment on lines +417 to +421
#[allow(static_mut_refs)]
#[tokio::test]
async fn header_basic_e2e() {
// create laz file
static mut LAZ: Vec<u8> = Vec::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also just a tempfile unless this really needs to be static/mutable?

.await
.unwrap();

assert_eq!(batch.num_rows(), 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This level of granularity is OK if all the lower-level pieces are tested, although in this case they are mostly not.

}

Ok(width)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions for tests that should live in this file:

  • Create a builder with zero rows, check all of the output options to ensure they give you the schema (or at least number of columns) you are expecting
  • The building of attributes (tests should cover each branch here). One of the nice parts about refactoring this to use GATs if you can would be that there are fewer branches to test (although probably easier to use an rstest parameterized test like #[values(DataType::Int8, DataType::Int16, ...)]).

I don't think you'll need a test file for any of that (but you might need to create a mock header and mock attributes with/without offset, scale, and nodata).


pub(crate) fn extra_bytes_attributes(
header: &Header,
) -> Result<Vec<ExtraAttribute>, Box<dyn Error + Send + Sync>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe?

Suggested change
) -> Result<Vec<ExtraAttribute>, Box<dyn Error + Send + Sync>> {
) -> Result<Vec<ExtraAttribute>, DataFusionError> {

store: &(impl ObjectStore + ?Sized),
object_meta: &ObjectMeta,
header: &Header,
) -> Result<Vec<ChunkMeta>, Box<dyn Error + Send + Sync>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Result<Vec<ChunkMeta>, Box<dyn Error + Send + Sync>> {
) -> Result<Vec<ChunkMeta>, DataFusionError> {

Probably use plan_err!() for these?

reader.header(),
&metadata_reader.fetch_header().await.unwrap()
);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a dummy header (perhaps you can inline the bytes of a known test file, or perhaps you can create one using the Builder that exercises the code path for the full matrix of data types by offset/scale/nodata, or create a function that accepts DataType, offset, scale, nodata, and outputs a file with exactly one extra attribute that we can use in a parameterized test to check that it roundtrips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants