Optimize VMAF core: AVX2 PSNR, thread pool job pool, and micro-optimizations #1464
Open
tolgaki wants to merge 3 commits into Netflix:master from
tolgaki commented Feb 15, 2026
- AVX2 PSNR SSE computation (32 pixels/iteration with runtime CPU dispatch)
- AVX2 SAD for motion feature
- Thread pool job object pool (free list + 64-byte inline data buffer)
- Thread pool thundering herd fix (signal vs broadcast)
- Feature collector initial capacity 8 -> 512
- integer_adm.c: pow(2, N) -> bit shifts/constants; eliminate redundant float conversions
- integer_vif.c: Remove unnecessary epsilon; cache g*g
- predict.c: Stack-allocate SVM nodes; lazy-cache name_with_opt
- convolution.c: Hoist stride multiplication out of inner loops
- Comprehensive test suite (11 tests covering all optimized paths)
- All 18 meson tests pass
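The job object pool item above (free list plus a 64-byte inline data buffer) could look roughly like the sketch below. This is not the code from the PR: the names `Job`, `JobPool`, `job_acquire`, `job_release`, `job_pool_destroy`, and `job_increment_int` are hypothetical. It also assumes the caller already holds the thread pool's mutex, so the free list itself is not synchronized.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define JOB_INLINE_DATA_SIZE 64  /* payloads up to this size avoid a second malloc */

typedef struct Job {
    void (*func)(void *data);
    void *data;                  /* points at inline_buf.bytes or at a heap block */
    union {
        max_align_t align;       /* keeps the inline buffer suitably aligned */
        unsigned char bytes[JOB_INLINE_DATA_SIZE];
    } inline_buf;
    struct Job *next;            /* link used only while the job sits on the free list */
} Job;

typedef struct JobPool {
    Job *free_list;              /* singly linked LIFO list of recycled Job objects */
} JobPool;

/* Take a Job from the free list (or malloc one) and copy the payload into it. */
Job *job_acquire(JobPool *pool, void (*func)(void *), const void *data, size_t size)
{
    Job *job = pool->free_list;
    if (job) {
        pool->free_list = job->next;
    } else {
        job = malloc(sizeof(*job));
        if (!job) return NULL;
    }
    job->func = func;
    job->next = NULL;
    if (size <= JOB_INLINE_DATA_SIZE) {
        job->data = job->inline_buf.bytes;
    } else {
        job->data = malloc(size);      /* large payloads still go to the heap */
        if (!job->data) {
            job->next = pool->free_list;
            pool->free_list = job;
            return NULL;
        }
    }
    if (size) memcpy(job->data, data, size);
    return job;
}

/* Return a finished Job to the free list instead of freeing it. */
void job_release(JobPool *pool, Job *job)
{
    if (job->data != (void *)job->inline_buf.bytes) free(job->data);
    job->data = NULL;
    job->next = pool->free_list;
    pool->free_list = job;
}

/* Free every Job currently parked on the free list. */
void job_pool_destroy(JobPool *pool)
{
    while (pool->free_list) {
        Job *next = pool->free_list->next;
        free(pool->free_list);
        pool->free_list = next;
    }
}

/* Hypothetical demo job: increments the int stored in the payload. */
void job_increment_int(void *data)
{
    *(int *)data += 1;
}
```

With this layout, a steady stream of small jobs reuses the same few `Job` objects, so the allocator is only touched while the pool warms up or when a payload exceeds 64 bytes.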
Phase 1 - Hot-path computation optimizations:
- Hoist stride multiplications out of inner convolution loops (convolution.c)
- Replace runtime pow(2,N) calls with compile-time bit shifts and ldexp across integer_adm.c and integer_motion.c (~20 instances)
- Remove unnecessary epsilon and cache g*g in VIF statistic loops, eliminating redundant FP division per pixel (integer_vif.c)
- Eliminate redundant float conversions in ADM decouple by using integer-domain angle/sign checks instead (integer_adm.c)

Phase 1 - Threading and allocation optimizations:
- Fix thundering herd: use pthread_cond_signal instead of broadcast in thread pool job enqueue (thread_pool.c)
- Use stack allocation for SVM node array in predict path, avoiding per-frame malloc/free churn (predict.c)
- Cache generated feature names in model to avoid repeated context creation/destruction per prediction (predict.c, model.h, model.c)
- Increase feature vector initial capacity from 8 to 512 to reduce realloc frequency for typical workloads (feature_collector.c)

Phase 2 - SIMD coverage:
- Add AVX2 SAD (Sum of Absolute Differences) for motion feature, processing 16 uint16 elements per iteration (motion_avx2.c/h)

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G
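To illustrate the stride hoisting mentioned for convolution.c, here is a minimal sketch of a vertical filter written both ways. The function names and signatures are hypothetical and do not mirror the actual libvmaf convolution routines. The sketch only shows the pattern: compute the row base pointer once per row, then step by `stride` in the tap loop rather than recomputing `(i + k) * stride` for every tap.

```c
#include <stddef.h>

/* Baseline: the index (i + k) * stride + j is recomputed for every tap of every pixel. */
void conv_vertical_naive(const float *src, float *dst, int width, int height,
                         int stride, const float *filter, int taps)
{
    int out_h = height - taps + 1;   /* "valid" output rows only */
    for (int i = 0; i < out_h; i++) {
        for (int j = 0; j < width; j++) {
            float acc = 0.0f;
            for (int k = 0; k < taps; k++)
                acc += filter[k] * src[(i + k) * stride + j];
            dst[i * stride + j] = acc;
        }
    }
}

/* Hoisted: row base pointers are computed once per row, and the tap loop
 * advances a pointer by stride instead of multiplying. */
void conv_vertical_hoisted(const float *src, float *dst, int width, int height,
                           int stride, const float *filter, int taps)
{
    int out_h = height - taps + 1;
    for (int i = 0; i < out_h; i++) {
        const float *src_row = src + (size_t)i * (size_t)stride;
        float *dst_row = dst + (size_t)i * (size_t)stride;
        for (int j = 0; j < width; j++) {
            const float *p = src_row + j;
            float acc = 0.0f;
            for (int k = 0; k < taps; k++) {
                acc += filter[k] * *p;
                p += stride;         /* move down one row per tap */
            }
            dst_row[j] = acc;
        }
    }
}
```

Both versions accumulate the taps in the same order, so they should produce the same output. An optimizing compiler may already perform this strength reduction on its own, so any gain is best confirmed by measurement.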
- Add AVX2 SSE computation for 8-bit PSNR (32 pixels/iteration via cvtepu8_epi16 + madd_epi16), with scalar C fallback and runtime CPU dispatch
- Add thread pool job object pool with free list recycling and 64-byte inline data buffer to eliminate per-job malloc/free overhead
- Add comprehensive test suite (test_perf_optimizations) covering: thread pool (1000 jobs, data passing), feature collector (capacity 512, 20 features), predict (score consistency with name caching), PSNR/VIF/ADM/motion feature extractors, and end-to-end VMAF scoring

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G
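The 8-bit sum of squared errors kernel described above (widen with `cvtepu8_epi16`, square and pair-sum with `madd_epi16`, 32 pixels per iteration) might be sketched as follows. This is an illustration under assumptions, not the PR's code. The names `sse_u8`, `sse_u8_scalar`, and `sse_u8_avx2` are hypothetical. Dispatch here uses the GCC/Clang builtin `__builtin_cpu_supports`, whereas libvmaf has its own CPU flag handling. Widening to 64-bit accumulators on every iteration is a conservative choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAVE_AVX2 1
#else
#define SKETCH_HAVE_AVX2 0
#endif

/* Scalar reference: sum of squared differences over n 8-bit samples. */
uint64_t sse_u8_scalar(const uint8_t *ref, const uint8_t *dis, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)ref[i] - (int)dis[i];
        sum += (uint64_t)(d * d);
    }
    return sum;
}

#if SKETCH_HAVE_AVX2
/* AVX2 path: 32 pixels per iteration, handled as two groups of 16. */
__attribute__((target("avx2")))
static uint64_t sse_u8_avx2(const uint8_t *ref, const uint8_t *dis, size_t n)
{
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc64 = zero;
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        /* Zero-extend 16 bytes to 16 x int16 so the difference fits in 16 bits. */
        __m256i r_lo = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(ref + i)));
        __m256i r_hi = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(ref + i + 16)));
        __m256i d_lo = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(dis + i)));
        __m256i d_hi = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(dis + i + 16)));
        __m256i e_lo = _mm256_sub_epi16(r_lo, d_lo);
        __m256i e_hi = _mm256_sub_epi16(r_hi, d_hi);
        /* madd squares each int16 and adds adjacent pairs into 8 x int32. */
        __m256i sq = _mm256_add_epi32(_mm256_madd_epi16(e_lo, e_lo),
                                      _mm256_madd_epi16(e_hi, e_hi));
        /* Zero-extend the non-negative 32-bit sums to 64 bits before accumulating. */
        acc64 = _mm256_add_epi64(acc64, _mm256_unpacklo_epi32(sq, zero));
        acc64 = _mm256_add_epi64(acc64, _mm256_unpackhi_epi32(sq, zero));
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc64);
    uint64_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    /* Tail of fewer than 32 pixels falls back to the scalar loop. */
    return sum + sse_u8_scalar(ref + i, dis + i, n - i);
}
#endif

/* Runtime dispatch: use AVX2 when the CPU reports it, else the scalar path. */
uint64_t sse_u8(const uint8_t *ref, const uint8_t *dis, size_t n)
{
#if SKETCH_HAVE_AVX2
    if (__builtin_cpu_supports("avx2"))
        return sse_u8_avx2(ref, dis, n);
#endif
    return sse_u8_scalar(ref, dis, n);
}
```

Accumulating in 32-bit lanes and widening only once per row would save a few instructions per iteration, but that is only safe while each lane's running sum stays below 2^31. The per-iteration widening above avoids having to reason about that bound.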