[Perf] Optimize group_topk kernel, 1.9% Throughput improvement, 2.1% TPOT improvement
#30159
+128
−47
Purpose
We are optimizing the GLM-4.6 model. The group_topk kernel takes a large share of the time, so we reduce its cost first.
Optimize the kernel, mainly:
Test
export MODEL="zai-org/GLM-4.6-FP8"
Acc
Perf
vllm bench serve --model $MODEL --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 128 --request-rate inf --num-prompts 1024

Now

============ Serving Benchmark Result ============
Successful requests:                     1024
Failed requests:                         0
Benchmark duration (s):                  21.81
Total input tokens:                      2048
Total generated tokens:                  131072
Request throughput (req/s):              46.95
Output token throughput (tok/s):         6009.90
Peak output token throughput (tok/s):    7157.00
Peak concurrent requests:                1024.00
Total Token throughput (tok/s):          6103.80
---------------Time to First Token----------------
Mean TTFT (ms):                          969.39
Median TTFT (ms):                        1037.20
P99 TTFT (ms):                           1195.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          162.53
Median TPOT (ms):                        162.71
P99 TPOT (ms):                           162.94
---------------Inter-token Latency----------------
Mean ITL (ms):                           162.54
Median ITL (ms):                         161.80
P99 ITL (ms):                            188.35
==================================================

Main

============ Serving Benchmark Result ============
Successful requests:                     1024
Failed requests:                         0
Benchmark duration (s):                  22.24
Total input tokens:                      2048
Total generated tokens:                  131072
Request throughput (req/s):              46.05
Output token throughput (tok/s):         5894.52
Peak output token throughput (tok/s):    6715.00
Peak concurrent requests:                1024.00
Total Token throughput (tok/s):          5986.63
---------------Time to First Token----------------
Mean TTFT (ms):                          966.52
Median TTFT (ms):                        1066.92
P99 TTFT (ms):                           1080.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          166.03
Median TPOT (ms):                        166.09
P99 TPOT (ms):                           166.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           166.05
Median ITL (ms):                         164.11
P99 ITL (ms):                            206.18
==================================================
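
For reviewers who want to check the headline numbers, the relative changes can be recomputed from the two runs above. This is a minimal sketch, assuming the title's throughput figure refers to output token throughput and its TPOT figure to mean TPOT. The helper name `pct_change` is illustrative and not part of vLLM.

```python
# Figures copied from the two benchmark runs above.
# "now" is this PR, "main" is the baseline branch.
now = {"output_tok_per_s": 6009.90, "mean_tpot_ms": 162.53}
main = {"output_tok_per_s": 5894.52, "mean_tpot_ms": 166.03}


def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# Higher throughput is better, so a positive change is a gain.
throughput_gain = pct_change(now["output_tok_per_s"], main["output_tok_per_s"])

# Lower TPOT is better, so negate the change to report a reduction.
tpot_reduction = -pct_change(now["mean_tpot_ms"], main["mean_tpot_ms"])

print(f"Output token throughput gain: {throughput_gain:.2f}%")
print(f"Mean TPOT reduction: {tpot_reduction:.2f}%")
```

Both values come from a single run per branch, so they do not capture run-to-run variance.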