AVX Revectorization Evaluation Guide
The latest update is tracked in issue #12716. The revectorization pass has been ported from V8’s TurboFan compiler to the new Turboshaft pipeline. The Turboshaft wasm pipeline is enabled by default after Chrome 132.0.6829.1 by CL.
The command line flags are as listed below:
- Baseline turboshaft (by default): "--turboshaft-wasm --turboshaft-wasm-instruction-selection-staged"
- Enable revectorization: "--experimental-wasm-revectorize"
- Trace revectorization: "--trace-wasm-revectorize"
To enable wasm revectorization with node, make sure your build includes the PR below, which updates the build config files: https://github.com/nodejs/node/pull/54896 (merged after Dec 8th, 2024, in version 23.5.0). Also update node to the latest version, or to a version after Aug 25 that has been patched to V8 12.8.374.22.
- Baseline: Starting from 24.0, turboshaft wasm is enabled by default and no additional flags are needed. If you run an older version of node.js, you need to enable turboshaft wasm manually as below:
$ node --turboshaft-wasm --turboshaft-wasm-instruction-selection-staged
- Revec
$ node [--turboshaft-wasm --turboshaft-wasm-instruction-selection-staged] --experimental-wasm-revectorize
By default, V8 enables lazy compilation and Liftoff baseline compilation before tiering up to TurboFan/Turboshaft for advanced optimization. The AVX revectorization phase runs in Turboshaft. If the test runs only a few times, the code may never tier up to Turboshaft and so never gets optimized to AVX-256. To force Turboshaft compilation up front, disable Liftoff and lazy compilation:
- Baseline
$ node [--turboshaft-wasm --turboshaft-wasm-instruction-selection-staged] --no-liftoff --no-wasm_lazy_compilation
- Revec
$ node [--turboshaft-wasm --turboshaft-wasm-instruction-selection-staged] --no-liftoff --no-wasm_lazy_compilation --experimental-wasm-revectorize
By default, concurrent compilation is enabled, which interleaves the output messages of different functions. It is recommended to disable concurrent compilation when tracing revectorization:
$ node [--turboshaft-wasm --turboshaft-wasm-instruction-selection-staged] --experimental-wasm-revectorize --wasm-num-compilation-tasks=1
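The node command lines above can be collected into a small helper. The following is only a sketch: it prints the command for each mode rather than running it, `bench.js` is a hypothetical script that loads and runs the wasm module, and the `LEGACY` variable and `node_cmd` function are names made up for this example. The `trace` mode additionally passes `--trace-wasm-revectorize` from the flag list at the top of this page.

```shell
# Flags only needed on node.js versions older than 24.0 (see above).
LEGACY_FLAGS="--turboshaft-wasm --turboshaft-wasm-instruction-selection-staged"
# Skip Liftoff and lazy compilation so every function is compiled by Turboshaft.
EAGER_FLAGS="--no-liftoff --no-wasm_lazy_compilation"

# node_cmd MODE SCRIPT: print the node command line for baseline, revec or trace.
# SCRIPT is a placeholder for your own benchmark entry point.
node_cmd() {
  mode=$1
  script=$2
  flags=$EAGER_FLAGS
  if [ "${LEGACY:-0}" = 1 ]; then
    flags="$LEGACY_FLAGS $flags"
  fi
  case $mode in
    baseline) ;;
    revec) flags="$flags --experimental-wasm-revectorize" ;;
    trace) flags="$flags --experimental-wasm-revectorize --trace-wasm-revectorize --wasm-num-compilation-tasks=1" ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  echo "node $flags $script"
}

node_cmd baseline bench.js
node_cmd revec bench.js
node_cmd trace bench.js
```

To run a printed command, pass it to eval, for example `eval "$(node_cmd revec bench.js)"`. Set `LEGACY=1` when using a node.js version older than 24.0.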
Steps:
- Download a Canary Chrome from https://www.google.com/chrome/canary/ or an internal version after 132.0.6829.1.
- [For manual test] Open a command window and go to the directory where chrome.exe is located. It is normally located at "C:\Program Files\Google\Chrome\Application", or you can find the Executable Path by opening "chrome://version/" in the Chrome browser.
- [For manual test] Run the commands below to launch the Chrome browser with a clean disk cache:
- Baseline:
>chrome.exe --user-data-dir="%TEMP%\base"
- Revec:
>chrome.exe --user-data-dir="%TEMP%\revec" --js-flags=--experimental-wasm-revectorize
[For automation] Set the Chrome flags with "--js-flags=--experimental-wasm-revectorize". Note that there are no additional quotation marks (") after --js-flags=; adding them does not seem to work with some automation APIs.
We can quickly verify that revectorization is enabled by turning on logging and printing the trace to stderr. However, this may generate too many messages, and the console buffer has to be copied out manually. (On Linux, you can redirect the output to a file.)
>chrome.exe --user-data-dir="%TEMP%\revec" --js-flags="--experimental-wasm-revectorize --trace-wasm-revectorize" --no-sandbox --enable-logging
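For the Linux case mentioned above, the redirect could look roughly like the following. This is a sketch under assumptions: the binary name `google-chrome`, the profile directory `/tmp/revec` and the log file name `revec-trace.log` are placeholders, and `chrome_trace_cmd` is a helper name invented for this example. It prints the command line instead of launching Chrome.

```shell
# chrome_trace_cmd PROFILE_DIR LOG_FILE: print a Chrome command line that
# enables revectorization tracing and redirects stdout and stderr to LOG_FILE.
# "google-chrome" is an assumed binary name; adjust it to your installation.
chrome_trace_cmd() {
  profile_dir=$1
  log_file=$2
  js_flags="--experimental-wasm-revectorize --trace-wasm-revectorize"
  printf '%s\n' "google-chrome --user-data-dir=\"$profile_dir\" --js-flags=\"$js_flags\" --no-sandbox --enable-logging > \"$log_file\" 2>&1"
}

chrome_trace_cmd /tmp/revec revec-trace.log
```

Using a fresh profile directory for each run keeps the disk cache empty, as in the Windows commands above.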
Or verify through the TensorFlow.js local benchmark: https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html Select wasm as the backend and MobileNetV2 as the model; you will likely see an obvious speedup in the “Subsequent average(50 runs)” time.
To generate a complete trace log, add the build flags below to “args.gn” and build Chromium manually:
is_debug = false
is_official_build = true
disable_fieldtrial_testing_config = true
win_console_app = true
Then run the command below to launch Chrome and redirect the output to a log file:
>chrome.exe --user-data-dir="%TEMP%\revec" --js-flags="--experimental-wasm-revectorize --trace-wasm-revectorize --wasm-num-compilation-tasks=1" --no-sandbox --enable-logging > run.log 2>&1
There are three methods; choose whichever is most convenient:
- Compile the native AVX/AVX2 intrinsic code to wasm through Emscripten directly. We started to support 256-bit AVX/AVX2 intrinsics for wasm by PRs:
- Use the Highway C++ library with WASM_EMU256 (a 2x unrolled version of wasm128) or a similar code pattern.
- For general C/C++ compilation, make sure interleaved unrolling is enabled. In practice, pass the flags below to the clang compiler:
-mllvm -force-vector-interleave=2
-mllvm --pre-RA-sched=source
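As a rough illustration of the third method, the sketch below prints an emcc command line that passes the two flags. The file names `kernel.c` and `kernel.js` are placeholders, `emcc_cmd` is a helper name invented for this example, and `-O3` and `-msimd128` are assumed settings that this page does not specify; adjust them to the actual build.

```shell
# The two interleaving flags listed above, forwarded to LLVM via -mllvm.
INTERLEAVE_FLAGS="-mllvm -force-vector-interleave=2 -mllvm --pre-RA-sched=source"

# emcc_cmd SOURCE OUTPUT: print an emcc command line with the flags applied.
# -O3 and -msimd128 are assumptions, not settings taken from this guide.
emcc_cmd() {
  echo "emcc -O3 -msimd128 $INTERLEAVE_FLAGS $1 -o $2"
}

emcc_cmd kernel.c kernel.js
```

The printed command can be run with `eval "$(emcc_cmd kernel.c kernel.js)"` once Emscripten is on the PATH.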