@yewentao256 commented Dec 5, 2025

Purpose

We are optimizing the GLM-4.6 model. This kernel accounts for a large share of the runtime, so we reduce its cost first.

The kernel is optimized mainly by:

  1. Using a template parameter for the scoring function
  2. Unrolling for some common ngroup values
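
As a rough illustration of the two points above, here is a minimal host-side C++ sketch, not the actual CUDA kernel from this PR. All names (`ScoringFunc`, `apply_scoring`, `best_group`, `select_best_group`) are hypothetical, and the scoring modes and the specialized group counts (4 and 8) are assumptions chosen only for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical scoring modes; the real kernel's options may differ.
enum class ScoringFunc { kNone, kSigmoid };

// The scoring function is a template parameter, so the choice is resolved
// at compile time instead of being branched on for every element.
template <ScoringFunc SF>
inline float apply_scoring(float x) {
  if constexpr (SF == ScoringFunc::kSigmoid) {
    return 1.0f / (1.0f + std::exp(-x));
  } else {
    return x;
  }
}

// Returns the index of the group whose maximum score is largest.
// NGroup > 0: the group count is a compile-time constant, so the compiler
//             can fully unroll the outer loop.
// NGroup == 0: generic fallback that uses the run-time group count.
template <ScoringFunc SF, int NGroup>
int best_group(const float* scores, int n_group_runtime, int group_size) {
  const int n_group = (NGroup > 0) ? NGroup : n_group_runtime;
  int best_idx = 0;
  float best_score = std::numeric_limits<float>::lowest();
  for (int g = 0; g < n_group; ++g) {
    float group_max = std::numeric_limits<float>::lowest();
    for (int i = 0; i < group_size; ++i) {
      group_max = std::max(group_max,
                           apply_scoring<SF>(scores[g * group_size + i]));
    }
    if (group_max > best_score) {
      best_score = group_max;
      best_idx = g;
    }
  }
  return best_idx;
}

// Dispatch common group counts to specialized instantiations
// (4 and 8 are assumed values for illustration only).
template <ScoringFunc SF>
int select_best_group(const float* scores, int n_group, int group_size) {
  switch (n_group) {
    case 4:  return best_group<SF, 4>(scores, n_group, group_size);
    case 8:  return best_group<SF, 8>(scores, n_group, group_size);
    default: return best_group<SF, 0>(scores, n_group, group_size);
  }
}
```

Calling `select_best_group<ScoringFunc::kSigmoid>(scores, 4, group_size)` takes the specialized path, while an uncommon group count falls back to the generic instantiation with the same result semantics.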

Test

export MODEL="zai-org/GLM-4.6-FP8"

Acc

lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=$MODEL,num_concurrent=1024" --tasks gsm8k
Now
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.9356|±  |0.0068|
|     |       |strict-match    |     5|exact_match||0.9310|±  |0.0070|
main
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.9325|±  |0.0069|
|     |       |strict-match    |     5|exact_match||0.9280|±  |0.0071|

Perf

vllm bench serve --model $MODEL --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 128 --request-rate inf --num-prompts 1024

Now
============ Serving Benchmark Result ============
Successful requests:                     1024      
Failed requests:                         0         
Benchmark duration (s):                  21.81     
Total input tokens:                      2048      
Total generated tokens:                  131072    
Request throughput (req/s):              46.95     
Output token throughput (tok/s):         6009.90   
Peak output token throughput (tok/s):    7157.00   
Peak concurrent requests:                1024.00   
Total Token throughput (tok/s):          6103.80   
---------------Time to First Token----------------
Mean TTFT (ms):                          969.39    
Median TTFT (ms):                        1037.20   
P99 TTFT (ms):                           1195.60   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          162.53    
Median TPOT (ms):                        162.71    
P99 TPOT (ms):                           162.94    
---------------Inter-token Latency----------------
Mean ITL (ms):                           162.54    
Median ITL (ms):                         161.80    
P99 ITL (ms):                            188.35    
==================================================

main
============ Serving Benchmark Result ============
Successful requests:                     1024      
Failed requests:                         0         
Benchmark duration (s):                  22.24     
Total input tokens:                      2048      
Total generated tokens:                  131072    
Request throughput (req/s):              46.05     
Output token throughput (tok/s):         5894.52   
Peak output token throughput (tok/s):    6715.00   
Peak concurrent requests:                1024.00   
Total Token throughput (tok/s):          5986.63   
---------------Time to First Token----------------
Mean TTFT (ms):                          966.52    
Median TTFT (ms):                        1066.92   
P99 TTFT (ms):                           1080.13   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          166.03    
Median TPOT (ms):                        166.09    
P99 TPOT (ms):                           166.41    
---------------Inter-token Latency----------------
Mean ITL (ms):                           166.05    
Median ITL (ms):                         164.11    
P99 ITL (ms):                            206.18    
==================================================

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces performance optimizations to the group_topk kernel by leveraging C++ templates for compile-time specialization based on the scoring function, renormalization, and group size. These changes appear to correctly implement the intended optimizations and should yield the performance improvements described. My review focuses on several instances of significant code duplication that have been introduced. While the optimizations are valuable, the duplicated code harms maintainability and increases the risk of future bugs. I've provided suggestions to refactor these sections to be more DRY (Don't Repeat Yourself) while retaining the performance benefits.
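
One common way to keep compile-time specialization without duplicating every branch is to convert run-time flags into compile-time constants through a small helper. The sketch below is an assumption about what such a refactor could look like, not the reviewer's actual suggestion or the code in this PR; `dispatch_bool`, `kernel_variant_id`, and `launch` are hypothetical names, and `kernel_variant_id` merely stands in for a templated kernel launch.

```cpp
#include <type_traits>
#include <utility>

// Turn a run-time bool into a compile-time constant and hand it to `f`.
// `f` must return the same type for both std::true_type and std::false_type.
template <typename F>
decltype(auto) dispatch_bool(bool value, F&& f) {
  if (value) {
    return std::forward<F>(f)(std::true_type{});
  }
  return std::forward<F>(f)(std::false_type{});
}

// Hypothetical stand-in for a kernel specialized on two flags.
// It returns an id so each instantiation can be told apart.
template <bool UseSigmoid, bool Renormalize>
int kernel_variant_id() {
  return (UseSigmoid ? 2 : 0) + (Renormalize ? 1 : 0);
}

// A single call site covers all four instantiations; no branch body is
// written out more than once.
int launch(bool use_sigmoid, bool renormalize) {
  return dispatch_bool(use_sigmoid, [&](auto sig) {
    return dispatch_bool(renormalize, [&](auto renorm) {
      return kernel_variant_id<decltype(sig)::value,
                               decltype(renorm)::value>();
    });
  });
}
```

The compiler still emits one specialized instantiation per flag combination, so the performance benefit of templating is retained while the dispatch logic lives in one place.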

@yewentao256 added the ready label Dec 5, 2025