[BugFix] Fix assert batch_descriptor.num_tokens == num_tokens_padded #30173
base: main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
```diff
-        return (should_ubatch, num_tokens_after_padding)
+        return (should_ubatch, num_tokens_after_padding, synced_cudagraph_mode)
```
Align coordinate_batch_across_dp unpacking with new return
`coordinate_batch_across_dp` now returns three values including the synchronized cudagraph mode (return at line 240), but callers such as `set_forward_context` in forward_context.py (around lines 295–300) and `_pad_batch_across_dp` in v1/spec_decode/eagle.py (around lines 1261–1269) still unpack only two items. In multi-DP runs where these paths invoke `coordinate_batch_across_dp`, Python will raise `ValueError: too many values to unpack` before padding or execution begins, breaking DP execution for forward contexts and EAGLE. Callers need to accept the third element, or the function must preserve the previous 2-tuple interface.
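A minimal sketch of the caller-side fix the review is asking for (a hypothetical stub stands in for the real function; only the unpacking pattern is the point):

```python
def coordinate_batch_across_dp_stub():
    # Hypothetical stand-in: the real coordinate_batch_across_dp now
    # returns (should_ubatch, num_tokens_after_padding, synced_cudagraph_mode).
    return False, 128, 0

# Old callers unpacked two values and now raise
# "ValueError: too many values to unpack (expected 2)":
# should_ubatch, num_tokens_after_padding = coordinate_batch_across_dp_stub()

# Fixed callers accept the synchronized cudagraph mode as well:
should_ubatch, num_tokens_after_padding, synced_cudagraph_mode = (
    coordinate_batch_across_dp_stub()
)
```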
Code Review
This pull request addresses a bug related to CUDA graph mode synchronization in a data parallel setup. The changes ensure that all ranks agree on a common CUDA graph mode by communicating their preferred mode and selecting the minimum one. This prevents assertion failures when different ranks attempt to use incompatible modes (e.g., one eager, others with cudagraphs). The implementation correctly passes the cudagraph mode during the all-reduce synchronization and uses the synchronized mode to make dispatching decisions. The logic appears sound and effectively fixes the described issue. I have one suggestion to improve code maintainability by replacing magic numbers with named constants.
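For context, a minimal stand-in sketch of the "pick the minimum mode" idea (plain Python with illustrative `CUDAGraphMode` values, not the PR's actual code):

```python
from enum import IntEnum

class CUDAGraphMode(IntEnum):
    # Illustrative ordering: lower modes are valid in strictly more
    # situations, so the min across ranks is safe for every rank.
    NONE = 0       # eager execution
    PIECEWISE = 1
    FULL = 2

def sync_cudagraph_mode(per_rank_modes: list[CUDAGraphMode]) -> CUDAGraphMode:
    # Stand-in for the all-reduce: all ranks agree on the most
    # conservative (minimum) mode that any rank requested.
    return min(per_rank_modes)

# One rank must fall back to eager, so every rank runs eager.
modes = [CUDAGraphMode.FULL, CUDAGraphMode.NONE, CUDAGraphMode.FULL]
assert sync_cudagraph_mode(modes) == CUDAGraphMode.NONE
```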
```python
tensor = torch.zeros(5, dp_size, device=device, dtype=torch.int32)
tensor[0][dp_rank] = orig_num_tokens_per_ubatch
tensor[1][dp_rank] = padded_num_tokens_per_ubatch
tensor[2][dp_rank] = 1 if should_ubatch else 0
tensor[3][dp_rank] = 1 if should_dp_pad else 0
tensor[4][dp_rank] = cudagraph_mode
```
The use of magic numbers 0, 1, 2, 3, 4 for indexing into the tensor makes the code hard to read and maintain. It's not immediately clear what each index represents without looking at the surrounding code or comments. This pattern is also present in `_post_process_cudagraph_mode` with `tensor[4, :]`. This can lead to bugs if the order or size of the tensor changes.
I recommend defining these indices as constants at the module level, for example, using an Enum. This would make the code self-documenting and less error-prone across all functions that use this tensor (`_run_ar`, `_post_process_ubatch`, `_post_process_dp_padding`, `_post_process_cudagraph_mode`).
For example:

```python
from enum import IntEnum

class DPSync(IntEnum):
    ORIG_NUM_TOKENS_PER_UBATCH = 0
    PADDED_NUM_TOKENS_PER_UBATCH = 1
    SHOULD_UBATCH = 2
    SHOULD_DP_PAD = 3
    CUDAGRAPH_MODE = 4
    TENSOR_SIZE = 5
```

Then you could use `tensor[DPSync.CUDAGRAPH_MODE]` instead of `tensor[4]`.
There's a bug with `FULL_DECODE_ONLY` (`FULL_AND_PIECEWISE` works fine) with DP. There is an edge case where one rank runs eager but all other ranks want to run with cudagraphs, so we now synchronize the cudagraph mode each rank wants to run across all ranks. Since currently `PIECEWISE` can be treated as eager (valid in all the same situations), it is sufficient to just disable full cudagraphs unless all ranks want them; we may want to pass an explicit list of valid modes in the future.

FIXES: #28579 (comment)
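Illustratively, the post-processing implied by the diff above boils down to a min over the mode row of the all-reduced tensor (single-process simulation; the `_post_process_cudagraph_mode` internals are assumed here, not copied from the PR):

```python
import torch

dp_size = 4
# Simulate the tensor after the all-reduce: each rank contributed one
# column, and row 4 holds the cudagraph mode each rank wants to run.
tensor = torch.zeros(5, dp_size, dtype=torch.int32)
tensor[4] = torch.tensor([2, 0, 2, 2], dtype=torch.int32)  # rank 1 needs eager

# Every rank takes the minimum, so all ranks dispatch with the same mode.
synced_cudagraph_mode = int(tensor[4, :].min().item())
assert synced_cudagraph_mode == 0  # full cudagraphs disabled everywhere
```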
FULL_DECODE_ONLY(FULL_AND_PIECEWISEworks fine) with DP. There is an edge case where one rank runs eager but all other ranks want to run with cudagraphs, so now we synchronize the cudagraph mode each rank wants to run as across all ranks. Since currently PIECEWISE can be treated as eager (valid in all the same situations) it is sufficient to just disable full cudagraphs if all ranks want to; we make want to pass an explicit list of valid modes in the future.FIXES: #28579 (comment)