Motivation

  • Reduce the single-threaded runtime of MulPIR by optimizing the hot BFV inner-product accumulation path used when computing encrypted database responses.
  • Replace the higher-overhead ndarray-based accumulation with a cache-friendlier contiguous accumulator to lower the per-coefficient cost.

Description

  • Replace the multi-dimensional ndarray accumulation in dot_product_scalar with a contiguous Vec<u128> accumulator and explicit indexing, reducing iterator/Array2 overhead and improving cache locality.
  • Perform the per-part, per-modulus fused multiply-adds into the contiguous buffer using the existing fma helper, then reduce the u128 accumulators to u64 via the modulus operators and reconstruct the Poly parts (a simplified sketch of this pattern follows the list).
  • Add a unit test test_dot_product_scalar_step_by to validate strided/step_by iterator usage and ensure correctness for the access patterns MulPIR relies on (a sketch of the check appears below).
  • Remove an unused import and replace expect calls on coefficient slices with proper Result error returns when the coefficient memory is not contiguous (see the last sketch below).
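
For illustration, here is a minimal, self-contained sketch of the contiguous-accumulator pattern described above. The names (fma_u128, dot_product_contiguous) and the nested-Vec input layout are hypothetical stand-ins; the real code operates on the crate's RNS Poly parts and its existing fma helper.

```rust
/// Fused multiply-add of two u64 coefficient rows into a u128 accumulator row.
fn fma_u128(acc: &mut [u128], a: &[u64], b: &[u64]) {
    debug_assert!(acc.len() == a.len() && acc.len() == b.len());
    for ((acc_i, &a_i), &b_i) in acc.iter_mut().zip(a).zip(b) {
        *acc_i += (a_i as u128) * (b_i as u128);
    }
}

/// Inner product of (ciphertext, plaintext) rows, accumulated in one contiguous
/// buffer laid out as [part][modulus][coefficient], then reduced back to u64.
///
/// Assumes the plaintext coefficients are small enough (as in MulPIR) that the
/// u128 accumulators cannot overflow; otherwise a periodic reduction is needed.
fn dot_product_contiguous(
    ct_rows: &[Vec<Vec<Vec<u64>>>], // [row][part][modulus][coefficient]
    pt_rows: &[Vec<Vec<u64>>],      // [row][modulus][coefficient]
    moduli: &[u64],
    degree: usize,
    num_parts: usize,
) -> Vec<u64> {
    let num_moduli = moduli.len();
    let mut acc = vec![0u128; num_parts * num_moduli * degree];

    for (ct, pt) in ct_rows.iter().zip(pt_rows) {
        for (part_idx, part) in ct.iter().enumerate() {
            for (m_idx, ct_coeffs) in part.iter().enumerate() {
                let offset = (part_idx * num_moduli + m_idx) * degree;
                fma_u128(&mut acc[offset..offset + degree], ct_coeffs, &pt[m_idx]);
            }
        }
    }

    // Reduce every u128 accumulator to u64 modulo the modulus of its row.
    acc.iter()
        .enumerate()
        .map(|(i, &v)| (v % (moduli[(i / degree) % num_moduli] as u128)) as u64)
        .collect()
}
```

Laying the accumulator out as one flat buffer keeps each [part][modulus] row contiguous, so the inner fma loop walks memory sequentially instead of hopping across an Array2.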
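
A hedged sketch of the step_by check, adapted to the simplified helper above: the real test_dot_product_scalar_step_by exercises actual Ciphertext/Plaintext values, and the toy data here is made up.

```rust
// Feeding every other row through a step_by iterator must agree with a naive
// per-coefficient reference computation.
#[test]
fn dot_product_contiguous_step_by() {
    let (degree, num_parts, moduli) = (4usize, 2usize, vec![97u64, 193]);

    // Deterministic toy data: ct[row][part][modulus][coeff], pt[row][modulus][coeff].
    let ct: Vec<Vec<Vec<Vec<u64>>>> = (0..6usize)
        .map(|r| {
            (0..num_parts)
                .map(|p| {
                    moduli
                        .iter()
                        .map(|&q| (0..degree).map(|c| ((r * 31 + p * 7 + c) as u64) % q).collect())
                        .collect()
                })
                .collect()
        })
        .collect();
    let pt: Vec<Vec<Vec<u64>>> = (0..6usize)
        .map(|r| moduli.iter().map(|_| (0..degree).map(|c| ((r * 13 + c) as u64) % 17).collect()).collect())
        .collect();

    // Strided selection of every other row, mimicking MulPIR's access pattern.
    let ct_s: Vec<_> = ct.iter().step_by(2).cloned().collect();
    let pt_s: Vec<_> = pt.iter().step_by(2).cloned().collect();
    let out = dot_product_contiguous(&ct_s, &pt_s, &moduli, degree, num_parts);

    // Naive reference on the same strided rows.
    for part in 0..num_parts {
        for (m, &q) in moduli.iter().enumerate() {
            for c in 0..degree {
                let expected = ct_s
                    .iter()
                    .zip(&pt_s)
                    .map(|(a, b)| a[part][m][c] * b[m][c] % q)
                    .sum::<u64>()
                    % q;
                assert_eq!(out[(part * moduli.len() + m) * degree + c], expected);
            }
        }
    }
}
```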
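
Finally, a small sketch of the expect-to-Result change on non-contiguous coefficient views; DotProductError and contiguous_coefficients are placeholders for the crate's own error type and call site.

```rust
use ndarray::ArrayView2;

/// Placeholder error type; the real code returns the crate's own error variant.
#[derive(Debug)]
enum DotProductError {
    NonContiguousCoefficients,
}

/// Borrow a coefficient view as one contiguous slice, returning an error
/// instead of panicking (as the previous `expect` did) when the view is not
/// in standard, contiguous layout.
fn contiguous_coefficients<'a>(
    view: &'a ArrayView2<'_, u64>,
) -> Result<&'a [u64], DotProductError> {
    view.as_slice()
        .ok_or(DotProductError::NonContiguousCoefficients)
}
```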

Testing

  • Ran cargo +nightly fmt --all and formatting completed successfully.
  • Ran cargo test; all unit tests and doc-tests passed (72 + 73 tests across the crates, with every test suite succeeding).
  • Ran cargo clippy --all-targets -- -D warnings and it completed without warnings.
