Optimize VMAF core: AVX2 PSNR, thread pool job pool, and micro-optimizations #1464

Open
tolgaki wants to merge 3 commits into Netflix:master from tolgaki:master

Conversation

tolgaki commented Feb 15, 2026

  • AVX2 PSNR SSE computation (32 pixels/iteration with runtime CPU dispatch)
  • AVX2 SAD for motion feature
  • Thread pool job object pool (free list + 64-byte inline data buffer)
  • Thread pool thundering herd fix (pthread_cond_signal instead of pthread_cond_broadcast on job enqueue)
  • Feature collector initial capacity 8 -> 512
  • integer_adm.c: pow(2, N) -> bit shifts/constants; eliminate redundant float conversions
  • integer_vif.c: Remove unnecessary epsilon; cache g*g
  • predict.c: Stack-allocate SVM nodes; lazy-cache name_with_opt
  • convolution.c: Hoist stride multiplication out of inner loops
  • Comprehensive test suite (11 tests covering all optimized paths)
  • All 18 meson tests pass

claude and others added 3 commits February 15, 2026 15:56
Phase 1 - Hot-path computation optimizations:
- Hoist stride multiplications out of inner convolution loops (convolution.c)
- Replace runtime pow(2,N) calls with bit shifts, constants, and ldexp
  across integer_adm.c and integer_motion.c (~20 instances)
- Remove unnecessary epsilon and cache g*g in VIF statistic loops,
  eliminating redundant FP division per pixel (integer_vif.c)
- Eliminate redundant float conversions in ADM decouple by using
  integer-domain angle/sign checks instead (integer_adm.c)
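
As an illustration of the stride-hoisting item above, here is a minimal sketch, not the code from convolution.c. The function names `vfilter_naive` and `vfilter_hoisted`, the float sample type, and the "valid rows only" boundary handling are all assumptions made for the example. Both versions accumulate taps in the same order, so they should produce identical output.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: every tap recomputes (i + t) * stride inside the innermost loop. */
void vfilter_naive(const float *src, float *dst, int w, int h, int stride,
                   const float *k, int klen)
{
    assert(klen > 0 && stride >= w);
    for (int i = 0; i + klen <= h; i++) {
        for (int j = 0; j < w; j++) {
            float sum = 0.0f;
            for (int t = 0; t < klen; t++)
                sum += k[t] * src[(i + t) * stride + j];
            dst[i * stride + j] = sum;
        }
    }
}

/* Hoisted: row base pointers are computed once per row, and the tap loop
 * advances an offset by one addition instead of one multiplication. */
void vfilter_hoisted(const float *src, float *dst, int w, int h, int stride,
                     const float *k, int klen)
{
    assert(klen > 0 && stride >= w);
    for (int i = 0; i + klen <= h; i++) {
        const float *src_row = src + (size_t)i * (size_t)stride;
        float *dst_row = dst + (size_t)i * (size_t)stride;
        for (int j = 0; j < w; j++) {
            size_t off = (size_t)j;   /* offset of tap 0 within src_row */
            float sum = 0.0f;
            for (int t = 0; t < klen; t++) {
                sum += k[t] * src_row[off];
                off += (size_t)stride;
            }
            dst_row[j] = sum;
        }
    }
}
```

An optimizing compiler may already perform this strength reduction on its own, so any gain from the manual version is best confirmed by measurement.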

Phase 1 - Threading and allocation optimizations:
- Fix thundering herd: use pthread_cond_signal instead of
  pthread_cond_broadcast in thread pool job enqueue (thread_pool.c)
- Use stack allocation for SVM node array in predict path, avoiding
  per-frame malloc/free churn (predict.c)
- Cache generated feature names in model to avoid repeated context
  creation/destruction per prediction (predict.c, model.h, model.c)
- Increase feature vector initial capacity from 8 to 512 to reduce
  realloc frequency for typical workloads (feature_collector.c)
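
To make the capacity change above concrete, the following sketch counts how often a growable array has to reallocate. It assumes a doubling growth policy, which this PR text does not state. `ScoreVec` and its functions are hypothetical names for the example, not the feature_collector.c API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable score array that records how often it had to grow. */
typedef struct {
    double *v;
    size_t cap;
    size_t len;
    size_t reallocs;
} ScoreVec;

int score_vec_init(ScoreVec *s, size_t initial_capacity)
{
    assert(s != NULL);
    if (initial_capacity == 0)
        return -1;
    s->v = malloc(initial_capacity * sizeof(*s->v));
    if (s->v == NULL)
        return -1;
    s->cap = initial_capacity;
    s->len = 0;
    s->reallocs = 0;
    return 0;
}

int score_vec_append(ScoreVec *s, double x)
{
    if (s->len == s->cap) {
        /* Assumed policy: double the capacity whenever the array is full. */
        size_t new_cap = s->cap * 2;
        double *p = realloc(s->v, new_cap * sizeof(*p));
        if (p == NULL)
            return -1;
        s->v = p;
        s->cap = new_cap;
        s->reallocs++;
    }
    s->v[s->len++] = x;
    return 0;
}

void score_vec_free(ScoreVec *s)
{
    free(s->v);
    s->v = NULL;
    s->cap = s->len = 0;
}
```

Under that assumption, appending 1000 values takes 7 reallocations from an initial capacity of 8 (8 doubling up to 1024) and 1 from an initial capacity of 512.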

Phase 2 - SIMD coverage:
- Add AVX2 SAD (Sum of Absolute Differences) for motion feature,
  processing 16 uint16 elements per iteration (motion_avx2.c/h)
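
A rough sketch of a 16-elements-per-iteration SAD over uint16 data with runtime dispatch follows. It is not the contents of motion_avx2.c: the function names are hypothetical, and the dispatch uses the GCC/Clang `target` attribute and `__builtin_cpu_supports`, which may differ from how libvmaf selects its SIMD paths. On other compilers or non-x86 targets the sketch falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference; also handles the tail of the vector loop. */
static uint64_t sad_u16_scalar(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (a[i] > b[i]) ? (uint64_t)(a[i] - b[i]) : (uint64_t)(b[i] - a[i]);
    return sum;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SAD_HAVE_AVX2_PATH 1

__attribute__((target("avx2")))
static uint64_t sad_u16_avx2(const uint16_t *a, const uint16_t *b, size_t n)
{
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero;   /* four 64-bit partial sums */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        /* |a - b| for unsigned 16-bit lanes: max(a, b) - min(a, b). */
        __m256i d = _mm256_sub_epi16(_mm256_max_epu16(va, vb),
                                     _mm256_min_epu16(va, vb));
        /* Zero-extend to 32 bits, then to 64 bits, so sums cannot wrap. */
        __m256i d32 = _mm256_add_epi32(_mm256_unpacklo_epi16(d, zero),
                                       _mm256_unpackhi_epi16(d, zero));
        acc = _mm256_add_epi64(acc, _mm256_unpacklo_epi32(d32, zero));
        acc = _mm256_add_epi64(acc, _mm256_unpackhi_epi32(d32, zero));
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3]
         + sad_u16_scalar(a + i, b + i, n - i);
}
#endif

/* Picks the AVX2 path at runtime when the CPU reports support for it. */
uint64_t sad_u16(const uint16_t *a, const uint16_t *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
#ifdef SAD_HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2"))
        return sad_u16_avx2(a, b, n);
#endif
    return sad_u16_scalar(a, b, n);
}
```

Widening to 64-bit lanes on every iteration keeps the sketch simple and overflow-safe; a tuned kernel would more likely accumulate in 32-bit lanes and flush less often.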

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G
- Add AVX2 SSE computation for 8-bit PSNR (32 pixels/iteration via
  cvtepu8_epi16 + madd_epi16), with scalar C fallback and runtime
  CPU dispatch
- Add thread pool job object pool with free list recycling and 64-byte
  inline data buffer to eliminate per-job malloc/free overhead
- Add comprehensive test suite (test_perf_optimizations) covering:
  thread pool (1000 jobs, data passing), feature collector (capacity
  512, 20 features), predict (score consistency with name caching),
  PSNR/VIF/ADM/motion feature extractors, and end-to-end VMAF scoring
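
The job object pool described above could look roughly like the sketch below. It is single-threaded for brevity, so a real thread pool would presumably guard the free list with its queue mutex. `Job`, `JobPool`, and the function names are hypothetical; only the free-list recycling and the 64-byte inline buffer come from the PR text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define JOB_INLINE_SIZE 64   /* inline buffer size quoted in the PR text */

typedef struct Job {
    void (*func)(void *data);
    void *data;              /* inline_data, a heap block, or NULL */
    struct Job *next;        /* free-list link */
    _Alignas(16) unsigned char inline_data[JOB_INLINE_SIZE];
} Job;

typedef struct JobPool {
    Job *free_list;          /* recycled Job objects */
} JobPool;

/* Take a Job from the free list if possible, otherwise allocate one.
 * Payloads of up to JOB_INLINE_SIZE bytes are copied into the Job itself. */
Job *job_acquire(JobPool *pool, void (*func)(void *), const void *data,
                 size_t size)
{
    assert(pool != NULL);
    Job *job = pool->free_list;
    if (job != NULL)
        pool->free_list = job->next;
    else if ((job = malloc(sizeof(*job))) == NULL)
        return NULL;

    job->func = func;
    job->next = NULL;
    job->data = NULL;
    if (data != NULL && size > 0) {
        job->data = (size <= JOB_INLINE_SIZE) ? (void *)job->inline_data
                                              : malloc(size);
        if (job->data == NULL) {
            /* Large-payload allocation failed: recycle the Job and bail. */
            job->next = pool->free_list;
            pool->free_list = job;
            return NULL;
        }
        memcpy(job->data, data, size);
    }
    return job;
}

/* Return a finished Job to the free list instead of freeing it. */
void job_release(JobPool *pool, Job *job)
{
    if (job->data != NULL && job->data != (void *)job->inline_data)
        free(job->data);
    job->data = NULL;
    job->next = pool->free_list;
    pool->free_list = job;
}

/* Free every recycled Job when the pool shuts down. */
void job_pool_destroy(JobPool *pool)
{
    while (pool->free_list != NULL) {
        Job *job = pool->free_list;
        pool->free_list = job->next;
        free(job);
    }
}
```

In steady state a small payload costs no malloc or free at all, which is the overhead the commit message says it removes; payloads larger than 64 bytes still pay for one heap allocation.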

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G

2 participants