
Conversation

@WenqingLan1 (Contributor) commented Dec 19, 2025

Refinements:

  • Use 128-bit aligned vector types (double2/float4) to improve memory bandwidth utilization (see the kernel sketch after this list).
  • Add support for FP32 (float) execution.
  • Add a --data_type <float|double> CLI option for runtime type selection.
  • Move template kernel implementations to a header file (required for CUDA template instantiation across compilation units).
  • Rename the entry point file from gpu_stream_test.cpp to gpu_stream_main.cpp.
  • Replace the hard-coded iteration over all GPUs with a single-device run so the benchmark composes with SuperBench's distributed execution in config.yaml.
  • Switch NUMA assignment from the hard-coded numa_alloc_onnode to numa_alloc_local so host buffers are allocated on the NUMA node local to the calling thread.
  • Update the micro-benchmark doc to reflect the new metric names (gpu_id removed).
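
For reference, a minimal sketch of the vectorized access pattern described above. The type trait and kernel names (VecType, CopyKernel) are illustrative placeholders, not the PR's actual identifiers:

    #include <cstdint>
    #include <cuda_runtime.h>

    // Map a scalar type to its 128-bit vector counterpart (illustrative sketch).
    template <typename T> struct VecType;
    template <> struct VecType<double> { using type = double2; }; // 2 x 8 B = 16 B
    template <> struct VecType<float>  { using type = float4;  }; // 4 x 4 B = 16 B

    // Copy kernel: each thread moves 16 bytes per grid-stride step, so every
    // load/store is a single 128-bit transaction and stays fully coalesced.
    template <typename T>
    __global__ void CopyKernel(const T *__restrict__ src, T *__restrict__ dst, uint64_t num_bytes) {
        using Vec = typename VecType<T>::type;
        const Vec *vsrc = reinterpret_cast<const Vec *>(src);
        Vec *vdst = reinterpret_cast<Vec *>(dst);
        uint64_t num_vecs = num_bytes / sizeof(Vec);
        for (uint64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < num_vecs;
             i += static_cast<uint64_t>(gridDim.x) * blockDim.x) {
            vdst[i] = vsrc[i];
        }
    }

Because double2 and float4 are both 16 bytes, the same sizing constraint (buffer size a multiple of 16 bytes per thread) applies to both data types.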

New config:

    gpu-stream:fp64:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 1308622848
        data_type: double
    gpu-stream:fp64-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: double
        check_data: true
    gpu-stream:fp32:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 2617245696
        data_type: float
    gpu-stream:fp32-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: float
        check_data: true

New rule:

    gpu-stream:
      statistics:
        - mean
      categories: GPU-STREAM
      aggregate: True
      metrics:
        - gpu-stream:fp(?:32|64)/STREAM_.*_(?:bw|ratio):(\d+)

Example results:

"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:0": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:1": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:2": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:3": 1234

Processed by rules:

| metric | statistic | value |
|---|---|---|
| gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw | mean | 1234 |

@WenqingLan1 WenqingLan1 requested a review from a team as a code owner December 19, 2025 20:05
@WenqingLan1 WenqingLan1 added the micro-benchmarks Micro Benchmark Test for SuperBench Benchmarks label Dec 19, 2025
codecov bot commented Dec 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.70%. Comparing base (8b805d9) to head (e445363).

Additional details and impacted files
    @@           Coverage Diff           @@
    ##             main     #769   +/-   ##
    =======================================
      Coverage   85.70%   85.70%
    =======================================
      Files         102      102
      Lines        7703     7704    +1
    =======================================
    + Hits         6602     6603    +1
      Misses       1101     1101

| Flag | Coverage Δ |
|---|---|
| cpu-python3.10-unit-test | 70.96% <50.00%> (+<0.01%) ⬆️ |
| cpu-python3.12-unit-test | 70.96% <50.00%> (+<0.01%) ⬆️ |
| cpu-python3.7-unit-test | 70.44% <50.00%> (+<0.01%) ⬆️ |
| cuda-unit-test | 83.59% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

@guoshzhao guoshzhao self-assigned this Dec 19, 2025
Copilot AI review requested due to automatic review settings February 3, 2026 22:14
Copilot AI left a comment

Pull request overview

Updates the GPU STREAM microbenchmark to support runtime-selectable FP32/FP64 execution and improve GPU memory bandwidth utilization, while aligning SuperBench integration (CLI, output tags, docs, and tests) to the new behavior.

Changes:

  • Add --data_type <float|double> to select FP32/FP64 at runtime and propagate it through the Python benchmark wrapper + unit tests.
  • Refactor CUDA kernels to use 128-bit vectorized accesses (double2 / float4) and move template kernel implementations into a header for cross-TU instantiation.
  • Adjust execution/output to a single visible GPU (device 0 via CUDA_VISIBLE_DEVICES) and update the metric/tag format (gpu_id removed), along with the docs, examples, and test log (see the sketch after this list).
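
For context, a sketch of the single-device execution model this implies, assuming SuperBench's launcher pins CUDA_VISIBLE_DEVICES to one GPU per process. The function name SelectLocalDevice is illustrative, not the PR's code:

    #include <cuda_runtime.h>
    #include <iostream>

    // Under SuperBench local mode each benchmark process is launched with
    // CUDA_VISIBLE_DEVICES restricted to one GPU, so the process-local device
    // numbering always starts at 0 and the benchmark targets device 0.
    int SelectLocalDevice() {
        int device_count = 0;
        if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count < 1) {
            std::cerr << "No visible CUDA device." << std::endl;
            return -1;
        }
        return cudaSetDevice(0) == cudaSuccess ? 0 : -1;
    }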

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| tests/data/gpu_stream.log | Updates golden log output to include data type and new tag format (no gpu_id). |
| tests/benchmarks/micro_benchmarks/test_gpu_stream.py | Extends command-generation assertions to include --data_type (currently only covers double). |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp | Removes NUMA/GPU iteration fields from args and adds Opts::data_type. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp | Adds CLI parsing/printing for --data_type. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp | New entry point replacing the previous main file. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp | Introduces vector-type mapping and templated kernel definitions (128-bit loads/stores). |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu | Keeps a CUDA compilation unit and moves template implementations to the header. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp | Expands bench-args variant to support float and double. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu | Uses local NUMA allocation, enforces 16 B/thread sizing, launches templated vectorized kernels, updates tag format, and runs only CUDA device 0. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt | Switches target sources to the new gpu_stream_main.cpp. |
| superbench/benchmarks/micro_benchmarks/gpu_stream.py | Adds --data_type argument and forwards it to the binary. |
| examples/benchmarks/gpu_stream.py | Updates example invocation to include --data_type double. |
| docs/user-tutorial/benchmarks/micro-benchmarks.md | Updates gpu-stream metric patterns to include `(double\|float)`. |


Comment on lines +238 to +246:

      // Generate data to copy - use local NUMA node for best CPU access
      args->sub.data_buf = static_cast<T *>(numa_alloc_local(args->size * sizeof(T)));

      for (int j = 0; j < args->size / sizeof(T); j++) {
          args->sub.data_buf[j] = static_cast<T>(j % kUInt8Mod);
      }

    - // Allocate check buffer
    - args->sub.check_buf = static_cast<T *>(numa_alloc_onnode(args->size * sizeof(T), args->numa_id));
    + // Allocate check buffer on local NUMA node
    + args->sub.check_buf = static_cast<T *>(numa_alloc_local(args->size * sizeof(T)));
Copilot AI commented Feb 3, 2026
In the check_data path, opts.size/args->size is consistently treated as a byte count (see kDefaultBufferSizeInBytes, PrintUsage/PrintInputInfo, bandwidth calculation, and cudaMemcpy(..., args->size, ...)), but here host buffers are allocated with args->size * sizeof(T) and then only initialized/copied for args->size bytes. This will over-allocate (e.g., 4GiB -> 32GiB for double), can OOM, and also mismatches the later numa_free(..., args->size) size. Fix by using a single unit throughout (bytes): allocate/free/copy exactly args->size bytes and size the initialization loop accordingly; also ensure the same correction is applied to the GPU cudaMalloc sizing in this function to keep host/device allocations consistent.
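
A minimal sketch of the byte-consistent allocation this comment asks for. Variable names (args->size, kUInt8Mod) follow the excerpt above; treat this as an illustration of the fix, not the PR's final code:

    // Treat args->size as a byte count everywhere: allocate, initialize,
    // copy, and free exactly args->size bytes.
    args->sub.data_buf = static_cast<T *>(numa_alloc_local(args->size));
    for (uint64_t j = 0; j < args->size / sizeof(T); j++) {
        args->sub.data_buf[j] = static_cast<T>(j % kUInt8Mod);
    }

    args->sub.check_buf = static_cast<T *>(numa_alloc_local(args->size));

    // ... and later, free with the same byte count:
    numa_free(args->sub.data_buf, args->size);
    numa_free(args->sub.check_buf, args->size);

Per the comment, the cudaMalloc sizing in the same function should use the same byte unit so host and device allocations stay consistent.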

Comment on the following lines:

    constexpr uint64_t kBytesPerThread = 16; // 128-bit aligned access
    uint64_t num_bytes_in_thread_block = num_threads_per_block * kBytesPerThread;
    if (args->size % num_bytes_in_thread_block) {
        std::cerr << "RunCopy: Data size should be multiple of " << num_bytes_in_thread_block << std::endl;
Copilot AI commented Feb 3, 2026
The error message uses the prefix RunCopy, but this validation is in RunStreamKernel and applies to all kernels. Updating the message to reference RunStreamKernel (and ideally mention the 16-byte-per-thread constraint) will make failures much easier to diagnose.

Suggested change:

    - std::cerr << "RunCopy: Data size should be multiple of " << num_bytes_in_thread_block << std::endl;
    + std::cerr << "RunStreamKernel: Data size should be a multiple of "
    +           << num_bytes_in_thread_block
    +           << " bytes (16 bytes per thread)." << std::endl;

Comment on lines +278 to +283:

    | STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size. |
    | STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size. |
    | STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size. |
    | STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size. |
    | STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the add operation with specified buffer size and block size. |
    | STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the triad operation with specified buffer size and block size. |
Copilot AI commented Feb 3, 2026
These metric descriptions still say "fp64" even though the metric names now explicitly allow both double and float. Please update the descriptions (and the gpu-stream introduction above) to reflect that the datatype is selected at runtime and that float results correspond to fp32 bandwidth/efficiency.

Suggested change:

    - | STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size. |
    - | STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size. |
    - | STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size. |
    - | STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size. |
    - | STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the add operation with specified buffer size and block size. |
    - | STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the triad operation with specified buffer size and block size. |
    + | STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the add operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
    + | STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the triad operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. |
    + | STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the copy operation with the selected data type, where double results correspond to fp64 efficiency and float results correspond to fp32 efficiency, for the specified buffer size and block size. |
    + | STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the scale operation with the selected data type, where double results correspond to fp64 efficiency and float results correspond to fp32 efficiency, for the specified buffer size and block size. |
    + | STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the add operation with the selected data type, where double results correspond to fp64 efficiency and float results correspond to fp32 efficiency, for the specified buffer size and block size. |
    + | STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the triad operation with the selected data type, where double results correspond to fp64 efficiency and float results correspond to fp32 efficiency, for the specified buffer size and block size. |

