Skip to content

Further improve performance of IN list evaluation #19241

@geoffreyclaude

Description

@geoffreyclaude

Related Issues

Motivation

IN list filters have become critical-path operations for dynamic filter pushdown. When scanning large tables with partition pruning or dynamic filters, the IN expression is evaluated millions of times per query. The current generic implementation leaves significant performance on the table.

The POC demonstrates 30-78% speedups on primitive types and up to 43% speedups on string types by exploiting type-specific optimizations that the compiler cannot infer from generic code.

High-Level Optimization Strategy

1. Const-Generic Branchless Evaluation

For small IN lists (≤16 elements), use compile-time-known array sizes ([T; N] instead of Vec<T>). This enables:

  • Loop unrolling and SIMD vectorization
  • Branch elimination (no conditional jumps)
  • Register-resident data (no heap access)

2. Type Normalization

For equality comparison, only bit patterns matter. Normalize types to reduce code paths:

  • Signed integers → Unsigned (Int32UInt32)
  • Floats → Unsigned (Float32UInt32)
  • Short Utf8View strings (≤12 bytes) → 128-bit integers

This is zero-cost at runtime (buffer pointer cast, no data copy).

3. Tiered Lookup Strategy

Select the optimal algorithm based on list size:

List Size Strategy Rationale
1-16 Branchless OR-chain Parallel comparison in registers
17-32 Binary search O(log n) with good cache locality
>32 Hash set O(1) amortized

4. Short-String Fast Path

Utf8View stores strings ≤12 bytes inline. Reinterpret the 16-byte view struct as a 128-bit integer to turn string comparison into a single integer comparison.

Benchmark Highlights

Most impactful improvements (POC vs. #18832):

Benchmark Speedup
Float32/list=3/nulls=0% -78% (2.20 µs → 485 ns)
Float32/list=8/nulls=0% -77% (2.94 µs → 677 ns)
Int32/list=8/nulls=0% -63% (1.88 µs → 688 ns)
Int32/list=3/nulls=0% -62% (1.41 µs → 531 ns)
Utf8View/list=3/str=3 -43% (2.18 µs → 1.25 µs)
Utf8View/list=3/str=12 -42% (2.19 µs → 1.27 µs)
Utf8/list=100/str=3 -30% (8.02 µs → 5.62 µs)
Slice Search Micro-Benchmark Image

Proposed Implementation Plan

To make this reviewable, I propose splitting into multiple PRs:

PR 1: Const-Generic Branchless Filter for Primitives

  • Add BranchlessFilter<T, N> with compile-time known sizes (1-16)
  • Add FilterStrategy enum and tiered select_strategy (branchless/binary/hash)
  • Add unified PrimitiveFilter<T, S> with LookupStrategy trait
  • Cover unsigned types: UInt8/16/32/64

PR 2: Type Normalization

  • Add TransformingFilter wrapper for zero-cost type reinterpretation
  • Signed → Unsigned: Int8/16/32/64UInt8/16/32/64
  • Float → Unsigned: Float32UInt32, Float64UInt64

PR 3: Utf8View Short-String Optimization

  • Implement short-string (≤12 bytes) reinterpretation as i128/Decimal128
  • Integrate with branchless filter for small lists
  • Fallback to hash for long strings or large lists

PR 4: Remaining Types (if needed)

  • Boolean
  • Decimal128/256
  • Binary/LargeBinary/BinaryView

Open Questions

  1. Threshold tuning: The cutoffs (16 for branchless, 32 for binary) are based on microbenchmarks. Should we tune for specific workloads?
  2. Code size: Const-generic instantiation creates multiple monomorphized versions. Is the binary size increase acceptable?
  3. Dictionary arrays: Current POC unpacks dictionaries. Should we optimize the dictionary case separately?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions