[SPARK-54617][PYTHON][SQL] Enable Arrow Grouped Iter Aggregate UDF registration for SQL #53357

Yicong-Huang · 2025-12-05T23:11:16Z

What changes were proposed in this pull request?

This PR enables Arrow grouped iter aggregate UDFs to be registered and used in SQL queries. Previously, Arrow iter aggregate UDFs could be used only via the DataFrame API, not in SQL.

The main change adds SQL_GROUPED_AGG_ARROW_ITER_UDF to the eval types allowed by the UDFRegistration.register() method, along with test cases.
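To illustrate the kind of check described above, here is a minimal, self-contained sketch of an eval-type allow-list. It is not the actual PySpark code: the `EvalType` enum, the `SQL_REGISTRABLE` set, and `check_registrable` are hypothetical stand-ins, and the set membership is illustrative only.

```python
from enum import Enum, auto


class EvalType(Enum):
    """Hypothetical stand-in for PySpark's eval type constants."""
    SQL_BATCHED_UDF = auto()
    SQL_GROUPED_AGG_ARROW_UDF = auto()
    SQL_GROUPED_AGG_ARROW_ITER_UDF = auto()
    SQL_GROUPED_MAP_PANDAS_UDF = auto()


# Hypothetical allow-list. The change described in this PR corresponds to
# adding SQL_GROUPED_AGG_ARROW_ITER_UDF to a set like this one.
SQL_REGISTRABLE = frozenset({
    EvalType.SQL_BATCHED_UDF,
    EvalType.SQL_GROUPED_AGG_ARROW_UDF,
    EvalType.SQL_GROUPED_AGG_ARROW_ITER_UDF,
})


def check_registrable(eval_type: EvalType) -> EvalType:
    """Return eval_type if it may be registered for SQL, else raise ValueError."""
    if eval_type not in SQL_REGISTRABLE:
        raise ValueError(f"Cannot register a UDF with eval type {eval_type.name}")
    return eval_type
```

With the iter aggregate type in the set, `check_registrable(EvalType.SQL_GROUPED_AGG_ARROW_ITER_UDF)` passes, while a type left out of the set is rejected before it reaches SQL.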

Why are the changes needed?

Arrow iter aggregate UDFs provide a memory-efficient way to perform grouped aggregations by processing data iteratively, batch by batch. However, they could be used only via the DataFrame API, not in SQL queries, which prevented users from using these UDFs in SQL-based workflows.
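The batch-wise pattern can be sketched without Spark or PyArrow. In the sketch below, plain Python lists stand in for Arrow arrays, and `mean_over_batches` and `make_batches` are hypothetical helpers, not part of this PR. Only a running sum and count are kept, so no more than one batch needs to be held at a time.

```python
from typing import Iterator, List, Optional, Sequence


def mean_over_batches(
    batches: Iterator[Sequence[Optional[float]]],
) -> Optional[float]:
    """Compute a mean from a stream of batches using constant extra memory."""
    total = 0.0
    count = 0
    for batch in batches:
        for value in batch:
            # Skip nulls (an assumption of this sketch).
            if value is not None:
                total += value
                count += 1
    # Return None when there were no non-null values.
    return total / count if count else None


def make_batches(
    values: List[Optional[float]], batch_size: int
) -> Iterator[List[Optional[float]]]:
    """Yield consecutive slices of values, each of at most batch_size items."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


result = mean_over_batches(make_batches([1.0, 2.0, None, 4.0, 5.0], 2))
```

Note that this sketch skips nulls and returns None for empty input, which is a design choice made here for illustration and differs from the example in this PR description.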

Does this PR introduce any user-facing change?

Yes. Users can now register Arrow grouped iter aggregate UDFs and use them in SQL queries.

Example:

from typing import Iterator
from pyspark.sql.functions import arrow_udf
import pyarrow as pa
import pyarrow.compute as pc

@arrow_udf("double")
def arrow_mean_iter(it: Iterator[pa.Array]) -> float:
    # Keep a running sum and count across the batches of one group.
    sum_val = 0.0
    cnt = 0
    for v in it:
        sum_val += pc.sum(v).as_py()
        cnt += len(v)
    # Guard against division by zero for an empty group.
    return sum_val / cnt if cnt > 0 else 0.0

# Now this works:
spark.udf.register("arrow_mean_iter", arrow_mean_iter)
spark.sql("SELECT id, arrow_mean_iter(v) as mean FROM test_table GROUP BY id").show()

How was this patch tested?

Added comprehensive test cases covering:

Single-column Arrow iter aggregate UDF in SQL
Multi-column Arrow iter aggregate UDF in SQL
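The PR description shows no multi-column example, so the following is only a sketch of the accumulation logic such a UDF might use. It assumes the UDF receives an iterator of tuples, one array per input column. Plain Python lists stand in for Arrow arrays, and the name `weighted_mean_over_batches` is hypothetical.

```python
from typing import Iterator, Sequence, Tuple


def weighted_mean_over_batches(
    batches: Iterator[Tuple[Sequence[float], Sequence[float]]],
) -> float:
    """Compute a weighted mean from a stream of (values, weights) batch pairs."""
    weighted_sum = 0.0
    weight_total = 0.0
    for values, weights in batches:
        # Each tuple carries one batch per input column, aligned row by row.
        for v, w in zip(values, weights):
            weighted_sum += v * w
            weight_total += w
    # Return 0.0 when the weights sum to zero (an assumption of this sketch).
    return weighted_sum / weight_total if weight_total else 0.0


# Two batches of (values, weights) for a single group.
batches = iter([([1.0, 2.0], [1.0, 3.0]), ([4.0], [2.0])])
result = weighted_mean_over_batches(batches)
```

In SQL, a registered UDF of this shape would presumably be called with two column arguments in a GROUP BY query, analogous to the single-column example above.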

Was this patch authored or co-authored using generative AI tooling?

No.

Yicong-Huang added 3 commits December 5, 2025 14:53

feat: register to be used with SQL

e4ce9db

test: add test cases

1249c03

fix: format

61a8877

github-actions bot added SQL PYTHON labels Dec 5, 2025

Yicong-Huang changed the title ~~[SPARK-54617] Enable Arrow Grouped Iter Aggregate UDF registration for SQL~~ [SPARK-54617][PYTHON][SQL] Enable Arrow Grouped Iter Aggregate UDF registration for SQL Dec 5, 2025

Yicong-Huang added 2 commits December 5, 2025 15:14

test: remove one basic test

f311b06

feat: add types to supported range

0e66f81

github-actions bot added the CONNECT label Dec 6, 2025

fix: test

b64a430
