Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Dec 14, 2024

This is based on #50 and adds an extra test and fixes a bug

@andygrove andygrove changed the title wip: bug fix & tests Upgrade to DataFusion 43, fix a bug, add more tests Dec 14, 2024
@andygrove andygrove marked this pull request as ready for review December 14, 2024 16:53

/// Get the input partition count. This is the same as the number of concurrent tasks
/// when we schedule this query stage for execution
pub fn get_input_partition_count(&self) -> usize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why in the case of a leaf node, the input partition is the same as the output partition of the plan, while in case of a plan with children it is the output partitioning of the first child? This is assuming that all children have the same partition count?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in context.py we have this logic:

    # if the query stage has a single output partition then we need to execute for the output
    # partition, otherwise we need to execute in parallel for each input partition
    concurrency = stage.get_input_partition_count()
    output_partitions_count = stage.get_output_partition_count()

This is based on the assumption that the query stage is a shuffle write, which perhaps was always true when running TPC-H, so the existing code worked.

With the new simple SELECT * FROM table test that you added, we had a query stage where the plan was a CsvExec and had no children so we had to handle this as a special case. There is no input partitioning in this case. We use the output partitioning because DataFusion will have already decided that based on the files that are available.

This code is all confusing and I would like to make it less so.

@andygrove andygrove merged commit 151a0e2 into apache:main Dec 14, 2024
2 checks passed
@andygrove andygrove deleted the tests branch December 14, 2024 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants