Motivation

  • Reduce the single-threaded runtime of MulPIR by optimizing the hot BFV inner-product accumulation path used when computing encrypted database responses.
  • Replace the higher-overhead ndarray-based accumulation with a cache-friendlier contiguous accumulator to lower the per-coefficient cost.

Description

  • Replace the multi-dimensional ndarray accumulation in dot_product_scalar with a contiguous Vec<u128> accumulator and explicit indexing, reducing iterator/Array2 overhead and improving cache locality.
  • Perform the per-part, per-modulus fused multiply-adds into the contiguous buffer using the existing fma helper, then reduce the u128 accumulators to u64 via the modulus operators and reconstruct the Poly parts (a simplified sketch of this pattern follows the list).
  • Add a unit test test_dot_product_scalar_step_by to validate strided/step_by iterator usage and ensure correctness for the access patterns MulPIR relies on (a sketch of the check appears below).
  • Remove an unused import and replace expect calls on coefficient slices with proper Result error returns when the coefficient memory is not contiguous (see the last sketch below).
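
For illustration, here is a minimal, self-contained sketch of the contiguous-accumulator pattern described above. The names (fma_u128, dot_product_contiguous) and the nested-Vec input layout are hypothetical stand-ins; the real code operates on the crate's RNS Poly parts and its existing fma helper.

```rust
/// Fused multiply-add of two u64 coefficient rows into a u128 accumulator row.
fn fma_u128(acc: &mut [u128], a: &[u64], b: &[u64]) {
    debug_assert!(acc.len() == a.len() && acc.len() == b.len());
    for ((acc_i, &a_i), &b_i) in acc.iter_mut().zip(a).zip(b) {
        *acc_i += (a_i as u128) * (b_i as u128);
    }
}

/// Inner product of (ciphertext, plaintext) rows, accumulated in one contiguous
/// buffer laid out as [part][modulus][coefficient], then reduced back to u64.
///
/// Assumes the plaintext coefficients are small enough (as in MulPIR) that the
/// u128 accumulators cannot overflow; otherwise a periodic reduction is needed.
fn dot_product_contiguous(
    ct_rows: &[Vec<Vec<Vec<u64>>>], // [row][part][modulus][coefficient]
    pt_rows: &[Vec<Vec<u64>>],      // [row][modulus][coefficient]
    moduli: &[u64],
    degree: usize,
    num_parts: usize,
) -> Vec<u64> {
    let num_moduli = moduli.len();
    let mut acc = vec![0u128; num_parts * num_moduli * degree];

    for (ct, pt) in ct_rows.iter().zip(pt_rows) {
        for (part_idx, part) in ct.iter().enumerate() {
            for (m_idx, ct_coeffs) in part.iter().enumerate() {
                let offset = (part_idx * num_moduli + m_idx) * degree;
                fma_u128(&mut acc[offset..offset + degree], ct_coeffs, &pt[m_idx]);
            }
        }
    }

    // Reduce every u128 accumulator to u64 modulo the modulus of its row.
    acc.iter()
        .enumerate()
        .map(|(i, &v)| (v % (moduli[(i / degree) % num_moduli] as u128)) as u64)
        .collect()
}
```

Laying the accumulator out as one flat buffer keeps each [part][modulus] row contiguous, so the inner fma loop walks memory sequentially instead of hopping across an Array2.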
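
A hedged sketch of the step_by check, adapted to the simplified helper above: the real test_dot_product_scalar_step_by exercises actual Ciphertext/Plaintext values, and the toy data here is made up.

```rust
// Feeding every other row through a step_by iterator must agree with a naive
// per-coefficient reference computation.
#[test]
fn dot_product_contiguous_step_by() {
    let (degree, num_parts, moduli) = (4usize, 2usize, vec![97u64, 193]);

    // Deterministic toy data: ct[row][part][modulus][coeff], pt[row][modulus][coeff].
    let ct: Vec<Vec<Vec<Vec<u64>>>> = (0..6usize)
        .map(|r| {
            (0..num_parts)
                .map(|p| {
                    moduli
                        .iter()
                        .map(|&q| (0..degree).map(|c| ((r * 31 + p * 7 + c) as u64) % q).collect())
                        .collect()
                })
                .collect()
        })
        .collect();
    let pt: Vec<Vec<Vec<u64>>> = (0..6usize)
        .map(|r| moduli.iter().map(|_| (0..degree).map(|c| ((r * 13 + c) as u64) % 17).collect()).collect())
        .collect();

    // Strided selection of every other row, mimicking MulPIR's access pattern.
    let ct_s: Vec<_> = ct.iter().step_by(2).cloned().collect();
    let pt_s: Vec<_> = pt.iter().step_by(2).cloned().collect();
    let out = dot_product_contiguous(&ct_s, &pt_s, &moduli, degree, num_parts);

    // Naive reference on the same strided rows.
    for part in 0..num_parts {
        for (m, &q) in moduli.iter().enumerate() {
            for c in 0..degree {
                let expected = ct_s
                    .iter()
                    .zip(&pt_s)
                    .map(|(a, b)| a[part][m][c] * b[m][c] % q)
                    .sum::<u64>()
                    % q;
                assert_eq!(out[(part * moduli.len() + m) * degree + c], expected);
            }
        }
    }
}
```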
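
Finally, a small sketch of the expect-to-Result change on non-contiguous coefficient views; DotProductError and contiguous_coefficients are placeholders for the crate's own error type and call site.

```rust
use ndarray::ArrayView2;

/// Placeholder error type; the real code returns the crate's own error variant.
#[derive(Debug)]
enum DotProductError {
    NonContiguousCoefficients,
}

/// Borrow a coefficient view as one contiguous slice, returning an error
/// instead of panicking (as the previous `expect` did) when the view is not
/// in standard, contiguous layout.
fn contiguous_coefficients<'a>(
    view: &'a ArrayView2<'_, u64>,
) -> Result<&'a [u64], DotProductError> {
    view.as_slice()
        .ok_or(DotProductError::NonContiguousCoefficients)
}
```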

Testing

  • Ran cargo +nightly fmt --all and formatting completed successfully.
  • Ran cargo test; all unit tests and doc-tests passed (72 + 73 tests across the crates, with every test suite succeeding).
  • Ran cargo clippy --all-targets -- -D warnings and it completed without warnings.
