[BugFix] Fix assert batch_descriptor.num_tokens == num_tokens_padded #30173
base: main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
```diff
-        return (should_ubatch, num_tokens_after_padding)
+        return (should_ubatch, num_tokens_after_padding, synced_cudagraph_mode)
```
Align coordinate_batch_across_dp unpacking with new return
`coordinate_batch_across_dp` now returns three values including the synchronized cudagraph mode (return at line 240), but callers such as `set_forward_context` in forward_context.py (around lines 295–300) and `_pad_batch_across_dp` in v1/spec_decode/eagle.py (around lines 1261–1269) still unpack only two items. In multi-DP runs where these paths invoke `coordinate_batch_across_dp`, Python will raise `ValueError: too many values to unpack` before padding or execution begins, breaking DP execution for forward contexts and EAGLE. Callers need to accept the third element, or the function must preserve the previous 2-tuple interface.
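A minimal sketch of the caller-side fix the review is asking for (a hypothetical stub stands in for the real function; only the unpacking pattern is the point):

```python
def coordinate_batch_across_dp_stub():
    # Hypothetical stand-in: the real coordinate_batch_across_dp now
    # returns (should_ubatch, num_tokens_after_padding, synced_cudagraph_mode).
    return False, 128, 0

# Old callers unpacked two values and now raise
# "ValueError: too many values to unpack (expected 2)":
# should_ubatch, num_tokens_after_padding = coordinate_batch_across_dp_stub()

# Fixed callers accept the synchronized cudagraph mode as well:
should_ubatch, num_tokens_after_padding, synced_cudagraph_mode = (
    coordinate_batch_across_dp_stub()
)
```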
Code Review
This pull request addresses a bug related to CUDA graph mode synchronization in a data parallel setup. The changes ensure that all ranks agree on a common CUDA graph mode by communicating their preferred mode and selecting the minimum one. This prevents assertion failures when different ranks attempt to use incompatible modes (e.g., one eager, others with cudagraphs). The implementation correctly passes the cudagraph mode during the all-reduce synchronization and uses the synchronized mode to make dispatching decisions. The logic appears sound and effectively fixes the described issue. I have one suggestion to improve code maintainability by replacing magic numbers with named constants.
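For context, a minimal stand-in sketch of the "pick the minimum mode" idea (plain Python with illustrative `CUDAGraphMode` values, not the PR's actual code):

```python
from enum import IntEnum

class CUDAGraphMode(IntEnum):
    # Illustrative ordering: lower modes are valid in strictly more
    # situations, so the min across ranks is safe for every rank.
    NONE = 0       # eager execution
    PIECEWISE = 1
    FULL = 2

def sync_cudagraph_mode(per_rank_modes: list[CUDAGraphMode]) -> CUDAGraphMode:
    # Stand-in for the all-reduce: all ranks agree on the most
    # conservative (minimum) mode that any rank requested.
    return min(per_rank_modes)

# One rank must fall back to eager, so every rank runs eager.
modes = [CUDAGraphMode.FULL, CUDAGraphMode.NONE, CUDAGraphMode.FULL]
assert sync_cudagraph_mode(modes) == CUDAGraphMode.NONE
```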
```python
tensor = torch.zeros(5, dp_size, device=device, dtype=torch.int32)
tensor[0][dp_rank] = orig_num_tokens_per_ubatch
tensor[1][dp_rank] = padded_num_tokens_per_ubatch
tensor[2][dp_rank] = 1 if should_ubatch else 0
tensor[3][dp_rank] = 1 if should_dp_pad else 0
tensor[4][dp_rank] = cudagraph_mode
```
The use of magic numbers 0, 1, 2, 3, 4 for indexing into the tensor makes the code hard to read and maintain. It's not immediately clear what each index represents without looking at the surrounding code or comments. This pattern is also present in `_post_process_cudagraph_mode` with `tensor[4, :]`. This can lead to bugs if the order or size of the tensor changes.
I recommend defining these indices as constants at the module level, for example, using an Enum. This would make the code self-documenting and less error-prone across all functions that use this tensor (`_run_ar`, `_post_process_ubatch`, `_post_process_dp_padding`, `_post_process_cudagraph_mode`).
For example:

```python
from enum import IntEnum

class DPSync(IntEnum):
    ORIG_NUM_TOKENS_PER_UBATCH = 0
    PADDED_NUM_TOKENS_PER_UBATCH = 1
    SHOULD_UBATCH = 2
    SHOULD_DP_PAD = 3
    CUDAGRAPH_MODE = 4
    TENSOR_SIZE = 5
```

Then you could use `tensor[DPSync.CUDAGRAPH_MODE]` instead of `tensor[4]`.
There's a bug with `FULL_DECODE_ONLY` (`FULL_AND_PIECEWISE` works fine) with DP. There is an edge case where one rank runs eager but all other ranks want to run with cudagraphs, so we now synchronize the cudagraph mode each rank wants to run across all ranks. Since currently `PIECEWISE` can be treated as eager (valid in all the same situations), it is sufficient to just disable full cudagraphs unless all ranks want them; we may want to pass an explicit list of valid modes in the future.

FIXES: #28579 (comment)
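Illustratively, the post-processing implied by the diff above boils down to a min over the mode row of the all-reduced tensor (single-process simulation; the `_post_process_cudagraph_mode` internals are assumed here, not copied from the PR):

```python
import torch

dp_size = 4
# Simulate the tensor after the all-reduce: each rank contributed one
# column, and row 4 holds the cudagraph mode each rank wants to run.
tensor = torch.zeros(5, dp_size, dtype=torch.int32)
tensor[4] = torch.tensor([2, 0, 2, 2], dtype=torch.int32)  # rank 1 needs eager

# Every rank takes the minimum, so all ranks dispatch with the same mode.
synced_cudagraph_mode = int(tensor[4, :].min().item())
assert synced_cudagraph_mode == 0  # full cudagraphs disabled everywhere
```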
FULL_DECODE_ONLY(FULL_AND_PIECEWISEworks fine) with DP. There is an edge case where one rank runs eager but all other ranks want to run with cudagraphs, so now we synchronize the cudagraph mode each rank wants to run as across all ranks. Since currently PIECEWISE can be treated as eager (valid in all the same situations) it is sufficient to just disable full cudagraphs if all ranks want to; we make want to pass an explicit list of valid modes in the future.FIXES: #28579 (comment)