Optimize VMAF core: AVX2 PSNR, thread pool job pool, and micro-optimizations by tolgaki · Pull Request #1464 · Netflix/vmaf

tolgaki · 2026-02-15T19:46:30Z

AVX2 PSNR SSE computation (32 pixels/iteration with runtime CPU dispatch)
AVX2 SAD for motion feature
Thread pool job object pool (free list + 64-byte inline data buffer)
Thread pool thundering herd fix (signal vs broadcast)
Feature collector initial capacity 8 -> 512
integer_adm.c: pow(2, N) -> bit shifts/constants; eliminate redundant float conversions
integer_vif.c: Remove unnecessary epsilon; cache g*g
predict.c: Stack-allocate SVM nodes; lazy-cache name_with_opt
convolution.c: Hoist stride multiplication out of inner loops
Comprehensive test suite (11 tests covering all optimized paths)
All 18 meson tests pass
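For readers unfamiliar with the pattern, the job object pool listed above (free list plus a 64-byte inline data buffer) might look roughly like the sketch below. This is not the code from this PR: `Job`, `JobPool`, `JobInlineBuf`, `job_acquire`, `job_release`, and `job_pool_destroy` are hypothetical names, and locking is omitted on the assumption that the caller already holds the thread pool mutex.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define JOB_INLINE_SIZE 64 /* inline payload size named in the PR summary */

/* Union keeps the inline bytes suitably aligned for common payload types. */
typedef union {
    unsigned char bytes[JOB_INLINE_SIZE];
    long long ll;
    double d;
    void *p;
} JobInlineBuf;

typedef struct Job {
    void (*func)(void *data);
    void *data;              /* inline_buf.bytes, a heap copy, or NULL */
    JobInlineBuf inline_buf;
    struct Job *next;        /* link used while the job sits on the free list */
} Job;

typedef struct JobPool {
    Job *free_list;          /* recycled jobs; caller is assumed to hold the lock */
} JobPool;

/* Take a Job from the free list (or malloc one) and copy the payload into it. */
static Job *job_acquire(JobPool *pool, void (*func)(void *),
                        const void *data, size_t size)
{
    Job *job = pool->free_list;
    if (job) {
        pool->free_list = job->next;
    } else {
        job = malloc(sizeof(*job));
        if (!job) return NULL;
    }
    job->func = func;
    job->next = NULL;
    job->data = NULL;
    if (data && size > 0) {
        if (size <= JOB_INLINE_SIZE) {
            job->data = job->inline_buf.bytes;  /* no extra allocation */
        } else {
            job->data = malloc(size);           /* rare large payload */
            if (!job->data) {
                job->next = pool->free_list;    /* recycle the job on failure */
                pool->free_list = job;
                return NULL;
            }
        }
        memcpy(job->data, data, size);
    }
    return job;
}

/* Return a finished Job to the free list instead of freeing it. */
static void job_release(JobPool *pool, Job *job)
{
    if (job->data != (void *)job->inline_buf.bytes)
        free(job->data);                        /* free(NULL) is a no-op */
    job->data = NULL;
    job->next = pool->free_list;
    pool->free_list = job;
}

/* Free every recycled Job when the pool is torn down. */
static void job_pool_destroy(JobPool *pool)
{
    while (pool->free_list) {
        Job *next = pool->free_list->next;
        free(pool->free_list);
        pool->free_list = next;
    }
}
```

With this shape, a steady stream of small jobs reuses the same few `Job` objects and never touches the allocator after warm-up; only payloads larger than 64 bytes fall back to `malloc`.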

Phase 1 - Hot-path computation optimizations:
- Hoist stride multiplications out of inner convolution loops (convolution.c)
- Replace runtime pow(2,N) calls with compile-time bit shifts and ldexp across integer_adm.c and integer_motion.c (~20 instances)
- Remove unnecessary epsilon and cache g*g in VIF statistic loops, eliminating redundant FP division per pixel (integer_vif.c)
- Eliminate redundant float conversions in ADM decouple by using integer-domain angle/sign checks instead (integer_adm.c)

Phase 1 - Threading and allocation optimizations:
- Fix thundering herd: use pthread_cond_signal instead of broadcast in thread pool job enqueue (thread_pool.c)
- Use stack allocation for SVM node array in predict path, avoiding per-frame malloc/free churn (predict.c)
- Cache generated feature names in model to avoid repeated context creation/destruction per prediction (predict.c, model.h, model.c)
- Increase feature vector initial capacity from 8 to 512 to reduce realloc frequency for typical workloads (feature_collector.c)

Phase 2 - SIMD coverage:
- Add AVX2 SAD (Sum of Absolute Differences) for motion feature, processing 16 uint16 elements per iteration (motion_avx2.c/h)

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G
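As an illustration of the pow(2,N) replacement described in the first Phase 1 list, the sketch below shows the general idea with shifts for integer powers of two and `ldexp` for floating-point scaling. The helper names (`pow2_u64`, `round_shift_u64`, `scale_pow2`) are hypothetical and the rounding pattern is only an assumed example of the kind of expression involved; the actual call sites in integer_adm.c and integer_motion.c are not reproduced here.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Integer 2^n as a shift; stands in for (uint64_t)pow(2, n), 0 <= n < 64. */
static inline uint64_t pow2_u64(unsigned n)
{
    assert(n < 64);            /* shifting by >= 64 is undefined behaviour */
    return UINT64_C(1) << n;
}

/* Round-to-nearest division by 2^shift for unsigned values, in place of
 * an expression such as (x + pow(2, shift - 1)) / pow(2, shift). */
static inline uint64_t round_shift_u64(uint64_t x, unsigned shift)
{
    assert(shift >= 1 && shift < 64);
    return (x + (UINT64_C(1) << (shift - 1))) >> shift;
}

/* Floating-point x * 2^n without a pow() call; n may be negative. */
static inline double scale_pow2(double x, int n)
{
    return ldexp(x, n);
}
```

When N is a compile-time constant the shift folds to a constant, so the per-pixel cost of a libm call disappears. `ldexp` only adjusts the exponent, so it is exact for results that stay in the normal floating-point range.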

- Add AVX2 SSE computation for 8-bit PSNR (32 pixels/iteration via cvtepu8_epi16 + madd_epi16), with scalar C fallback and runtime CPU dispatch
- Add thread pool job object pool with free list recycling and 64-byte inline data buffer to eliminate per-job malloc/free overhead
- Add comprehensive test suite (test_perf_optimizations) covering: thread pool (1000 jobs, data passing), feature collector (capacity 512, 20 features), predict (score consistency with name caching), PSNR/VIF/ADM/motion feature extractors, and end-to-end VMAF scoring

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G
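The AVX2 sum-of-squared-errors path described above (cvtepu8_epi16 + madd_epi16, 32 pixels per iteration, scalar fallback, runtime dispatch) can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the function names `sse_8b_c`, `sse_8b_avx2`, and `sse_8b` are hypothetical, the 64-bit accumulation strategy is a choice made for this sketch, and the GCC/Clang builtin `__builtin_cpu_supports` stands in for whatever CPU-flag query the PR actually uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: sum of squared differences over n 8-bit samples. */
static uint64_t sse_8b_c(const uint8_t *ref, const uint8_t *dis, size_t n)
{
    uint64_t sse = 0;
    for (size_t i = 0; i < n; i++) {
        const int d = (int)ref[i] - (int)dis[i];
        sse += (uint64_t)(d * d);
    }
    return sse;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_HAVE_AVX2 1

/* AVX2 path: 32 pixels per iteration, widened to 16-bit before squaring. */
__attribute__((target("avx2")))
static uint64_t sse_8b_avx2(const uint8_t *ref, const uint8_t *dis, size_t n)
{
    __m256i acc = _mm256_setzero_si256();      /* 4 x uint64 partial sums */
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        const __m256i r = _mm256_loadu_si256((const __m256i *)(ref + i));
        const __m256i d = _mm256_loadu_si256((const __m256i *)(dis + i));
        /* Zero-extend each 16-byte half to 16 x int16 and subtract. */
        const __m256i lo = _mm256_sub_epi16(
            _mm256_cvtepu8_epi16(_mm256_castsi256_si128(r)),
            _mm256_cvtepu8_epi16(_mm256_castsi256_si128(d)));
        const __m256i hi = _mm256_sub_epi16(
            _mm256_cvtepu8_epi16(_mm256_extracti128_si256(r, 1)),
            _mm256_cvtepu8_epi16(_mm256_extracti128_si256(d, 1)));
        /* madd squares each difference and adds adjacent pairs: 8 x int32. */
        const __m256i sq = _mm256_add_epi32(_mm256_madd_epi16(lo, lo),
                                            _mm256_madd_epi16(hi, hi));
        /* Widen to 64-bit lanes so long inputs cannot overflow. */
        acc = _mm256_add_epi64(acc,
            _mm256_cvtepu32_epi64(_mm256_castsi256_si128(sq)));
        acc = _mm256_add_epi64(acc,
            _mm256_cvtepu32_epi64(_mm256_extracti128_si256(sq, 1)));
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    const uint64_t sse = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    /* Scalar tail for the last n % 32 samples. */
    return sse + sse_8b_c(ref + i, dis + i, n - i);
}
#endif

/* Runtime dispatch: AVX2 when the CPU reports it, else the C fallback. */
static uint64_t sse_8b(const uint8_t *ref, const uint8_t *dis, size_t n)
{
#ifdef SKETCH_HAVE_AVX2
    if (__builtin_cpu_supports("avx2"))
        return sse_8b_avx2(ref, dis, n);
#endif
    return sse_8b_c(ref, dis, n);
}
```

Widening to 16-bit before subtracting keeps the signed difference exact (range -255 to 255), and `madd_epi16` of a vector with itself yields the squared differences already pair-summed into 32-bit lanes. A production kernel would more likely accumulate in 32-bit lanes per row and widen less often; the per-iteration widening here trades some speed for an obviously overflow-free sketch.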

claude and others added 3 commits February 15, 2026 15:56

Merge pull request #1 from tolgaki/claude/optimize-vmaf-code-Odccg

86e945a

tolgaki wants to merge 3 commits into Netflix:master from tolgaki:master

tolgaki commented Feb 15, 2026

2 participants
