Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream #769
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff            @@
##             main     #769    +/-  ##
=======================================
  Coverage   85.70%   85.70%
=======================================
  Files         102      102
  Lines        7703     7704     +1
=======================================
+ Hits         6602     6603     +1
  Misses       1101     1101
```
Pull request overview

Updates the GPU STREAM microbenchmark to support runtime-selectable FP32/FP64 execution and improve GPU memory bandwidth utilization, while aligning the SuperBench integration (CLI, output tags, docs, and tests) with the new behavior.

Changes:
- Add `--data_type <float|double>` to select FP32/FP64 at runtime and propagate it through the Python benchmark wrapper and unit tests.
- Refactor the CUDA kernels to use 128-bit vectorized accesses (`double2`/`float4`) and move the template kernel implementations into a header for cross-TU instantiation.
- Adjust execution/output to a single visible GPU (device 0 via `CUDA_VISIBLE_DEVICES`) and update the metric/tag formats (removing `gpu_id`), plus docs, examples, and the test log.
Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.

Summary per file:
| File | Description |
|---|---|
| `tests/data/gpu_stream.log` | Updates the golden log output to include the data type and the new tag format (no `gpu_id`). |
| `tests/benchmarks/micro_benchmarks/test_gpu_stream.py` | Extends the command-generation assertions to include `--data_type` (currently only covers `double`). |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp` | Removes the NUMA/GPU iteration fields from the args and adds `Opts::data_type`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp` | Adds CLI parsing/printing for `--data_type`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp` | New entry point replacing the previous main file. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp` | Introduces the vector-type mapping and templated kernel definitions (128-bit loads/stores). |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu` | Keeps a CUDA compilation unit and moves the template implementations to the header. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp` | Expands the bench-args variant to support `float` and `double`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu` | Uses local NUMA allocation, enforces 16 B/thread sizing, launches templated vectorized kernels, updates the tag format, and runs only CUDA device 0. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt` | Switches the target sources to the new `gpu_stream_main.cpp`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream.py` | Adds the `--data_type` argument and forwards it to the binary. |
| `examples/benchmarks/gpu_stream.py` | Updates the example invocation to include `--data_type double`. |
| `docs/user-tutorial/benchmarks/micro-benchmarks.md` | Updates the gpu-stream metric patterns to include `(double\|float)`. |
```diff
  // Generate data to copy - use local NUMA node for best CPU access
  args->sub.data_buf = static_cast<T *>(numa_alloc_local(args->size * sizeof(T)));

  for (int j = 0; j < args->size / sizeof(T); j++) {
      args->sub.data_buf[j] = static_cast<T>(j % kUInt8Mod);
  }

- // Allocate check buffer
- args->sub.check_buf = static_cast<T *>(numa_alloc_onnode(args->size * sizeof(T), args->numa_id));
+ // Allocate check buffer on local NUMA node
+ args->sub.check_buf = static_cast<T *>(numa_alloc_local(args->size * sizeof(T)));
```
Copilot AI commented on Feb 3, 2026:
In the check_data path, `opts.size`/`args->size` is consistently treated as a byte count (see `kDefaultBufferSizeInBytes`, `PrintUsage`/`PrintInputInfo`, the bandwidth calculation, and `cudaMemcpy(..., args->size, ...)`), but here the host buffers are allocated with `args->size * sizeof(T)` and then only initialized/copied for `args->size` bytes. This over-allocates (e.g., 4 GiB becomes 32 GiB for `double`), can OOM, and mismatches the later `numa_free(..., args->size)` size. Fix by using a single unit (bytes) throughout: allocate, free, and copy exactly `args->size` bytes and size the initialization loop accordingly. Apply the same correction to the `cudaMalloc` sizing in this function so that the host and device allocations stay consistent.
```cpp
constexpr uint64_t kBytesPerThread = 16; // 128-bit aligned access
uint64_t num_bytes_in_thread_block = num_threads_per_block * kBytesPerThread;
if (args->size % num_bytes_in_thread_block) {
    std::cerr << "RunCopy: Data size should be multiple of " << num_bytes_in_thread_block << std::endl;
```
Copilot AI commented on Feb 3, 2026:
The error message uses the prefix RunCopy, but this validation is in RunStreamKernel and applies to all kernels. Updating the message to reference RunStreamKernel (and ideally mention the 16-byte-per-thread constraint) will make failures much easier to diagnose.
Suggested change:

```diff
- std::cerr << "RunCopy: Data size should be multiple of " << num_bytes_in_thread_block << std::endl;
+ std::cerr << "RunStreamKernel: Data size should be a multiple of "
+           << num_bytes_in_thread_block
+           << " bytes (16 bytes per thread)." << std::endl;
```
| STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size. |
| STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size. |
| STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size. |
| STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size. |
| STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the add operation with specified buffer size and block size. |
| STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the triad operation with specified buffer size and block size. |
Copilot AI commented on Feb 3, 2026:
These metric descriptions still say "fp64" even though the metric names now explicitly allow both double and float. Please update the descriptions (and the gpu-stream introduction above) to reflect that the datatype is selected at runtime and that float results correspond to fp32 bandwidth/efficiency.
Suggested change:

```diff
- | STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size. |
- | STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size. |
- | STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size. |
- | STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size. |
- | STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the add operation with specified buffer size and block size. |
- | STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the triad operation with specified buffer size and block size. |
+ | STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the add operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+ | STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the triad operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+ | STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the copy operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+ | STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the scale operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+ | STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the add operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
+ | STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the triad operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
```