Further improve performance of IN list evaluation

## Related Issues

- Follow-up to #18824 (Restore IN_LIST performance -- Implement specialized `StaticFilters` for different data types)
- POC implementation: https://github.com/pydantic/datafusion/pull/48
- Micro Benchmarks of the different search algorithms: https://github.com/geoffreyclaude/datafusion/pull/14

## Motivation

IN list filters have become **critical-path operations** for dynamic filter pushdown. When scanning large tables with partition pruning or dynamic filters, the IN expression is evaluated millions of times per query. The current generic implementation leaves significant performance on the table.

The POC demonstrates **30-78% speedups** on primitive types and **up to 43% speedups** on string types by exploiting type-specific optimizations that the compiler cannot infer from generic code.

## High-Level Optimization Strategy

### 1. Const-Generic Branchless Evaluation

For small IN lists (≤16 elements), use compile-time-known array sizes (`[T; N]` instead of `Vec<T>`). This enables:
- Loop unrolling and SIMD vectorization
- Branch elimination (no conditional jumps)
- Register-resident data (no heap access)

### 2. Type Normalization

For equality comparison, only bit patterns matter. Normalize types to reduce code paths:
- Signed integers → Unsigned (`Int32` → `UInt32`)
- Floats → Unsigned (`Float32` → `UInt32`)
- Short `Utf8View` strings (≤12 bytes) → 128-bit integers

This is zero-cost at runtime (buffer pointer cast, no data copy).

### 3. Tiered Lookup Strategy

Select the optimal algorithm based on list size:
| List Size | Strategy | Rationale |
|-----------|----------|-----------|
| 1-16 | Branchless OR-chain | Parallel comparison in registers |
| 17-32 | Binary search | O(log n) with good cache locality |
| >32 | Hash set | O(1) amortized |

### 4. Short-String Fast Path

`Utf8View` stores strings ≤12 bytes inline. Reinterpret the 16-byte view struct as a 128-bit integer to turn string comparison into a single integer comparison.

## Benchmark Highlights

Most impactful improvements (POC vs. [#18832](https://github.com/apache/datafusion/pull/18832)):

| Benchmark | Speedup |
|-----------|---------|
| `Float32/list=3/nulls=0%` | **-78%** (2.20 µs → 485 ns) |
| `Float32/list=8/nulls=0%` | **-77%** (2.94 µs → 677 ns) |
| `Int32/list=8/nulls=0%` | **-63%** (1.88 µs → 688 ns) |
| `Int32/list=3/nulls=0%` | **-62%** (1.41 µs → 531 ns) |
| `Utf8View/list=3/str=3` | **-43%** (2.18 µs → 1.25 µs) |
| `Utf8View/list=3/str=12` | **-42%** (2.19 µs → 1.27 µs) |
| `Utf8/list=100/str=3` | **-30%** (8.02 µs → 5.62 µs) |

<details>
  <summary>Slice Search Micro-Benchmark</summary>

  <img width="3025" height="4528" alt="Image" src="https://github.com/user-attachments/assets/a50e8f38-ca54-4e1f-bc55-5118b46d4849" />

</details>

## Proposed Implementation Plan

To make this reviewable, I propose splitting into multiple PRs:

### PR 1: Const-Generic Branchless Filter for Primitives
- [ ] Add `BranchlessFilter<T, N>` with compile-time known sizes (1-16)
- [ ] Add `FilterStrategy` enum and tiered `select_strategy` (branchless/binary/hash)
- [ ] Add unified `PrimitiveFilter<T, S>` with `LookupStrategy` trait
- [ ] Cover unsigned types: `UInt8/16/32/64`

### PR 2: Type Normalization
- [ ] Add `TransformingFilter` wrapper for zero-cost type reinterpretation
- [ ] Signed → Unsigned: `Int8/16/32/64` → `UInt8/16/32/64`
- [ ] Float → Unsigned: `Float32` → `UInt32`, `Float64` → `UInt64`

### PR 3: Utf8View Short-String Optimization  
- [ ] Implement short-string (≤12 bytes) reinterpretation as `i128`/`Decimal128`
- [ ] Integrate with branchless filter for small lists
- [ ] Fallback to hash for long strings or large lists

### PR 4: Remaining Types (if needed)
- [ ] Boolean
- [ ] Decimal128/256
- [ ] Binary/LargeBinary/BinaryView

## Open Questions

1. **Threshold tuning**: The cutoffs (16 for branchless, 32 for binary) are based on microbenchmarks. Should we tune for specific workloads?
2. **Code size**: Const-generic instantiation creates multiple monomorphized versions. Is the binary size increase acceptable?
3. **Dictionary arrays**: Current POC unpacks dictionaries. Should we optimize the dictionary case separately?



Benchmark	Speedup
`Float32/list=3/nulls=0%`	-78% (2.20 µs → 485 ns)
`Float32/list=8/nulls=0%`	-77% (2.94 µs → 677 ns)
`Int32/list=8/nulls=0%`	-63% (1.88 µs → 688 ns)
`Int32/list=3/nulls=0%`	-62% (1.41 µs → 531 ns)
`Utf8View/list=3/str=3`	-43% (2.18 µs → 1.25 µs)
`Utf8View/list=3/str=12`	-42% (2.19 µs → 1.27 µs)
`Utf8/list=100/str=3`	-30% (8.02 µs → 5.62 µs)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Further improve performance of IN list evaluation #19241

Related Issues

Motivation

High-Level Optimization Strategy

1. Const-Generic Branchless Evaluation

2. Type Normalization

3. Tiered Lookup Strategy

4. Short-String Fast Path

Benchmark Highlights

Proposed Implementation Plan

PR 1: Const-Generic Branchless Filter for Primitives

PR 2: Type Normalization

PR 3: Utf8View Short-String Optimization

PR 4: Remaining Types (if needed)

Open Questions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

List Size	Strategy	Rationale
1-16	Branchless OR-chain	Parallel comparison in registers
17-32	Binary search	O(log n) with good cache locality
>32	Hash set	O(1) amortized

Further improve performance of IN list evaluation #19241

Description

Related Issues

Motivation

High-Level Optimization Strategy

1. Const-Generic Branchless Evaluation

2. Type Normalization

3. Tiered Lookup Strategy

4. Short-String Fast Path

Benchmark Highlights

Proposed Implementation Plan

PR 1: Const-Generic Branchless Filter for Primitives

PR 2: Type Normalization

PR 3: Utf8View Short-String Optimization

PR 4: Remaining Types (if needed)

Open Questions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions