[Misc] Remove pad_for_cudagraphs from config #30143
base: main
Conversation
Code Review
This pull request refactors the CUDA graph padding logic by removing pad_for_cudagraphs from the configuration and centralizing it within the CudagraphDispatcher. The changes are well-contained and improve the code structure by making the dispatcher solely responsible for padding, which simplifies debugging and future extensions.
My review found one potential issue in vllm/v1/spec_decode/eagle.py where the refactoring might unintentionally change the behavior for EagleProposer when cudagraph_mode is set to FULL. I've left a comment with a suggested fix to ensure the original behavior is preserved. Other than that, the changes look solid and achieve the stated goal.
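To make the "dispatcher solely responsible for padding" idea concrete, here is a minimal illustrative sketch of a dispatcher that rounds a runtime batch size up to the nearest captured CUDA graph size. The class and method names here are hypothetical and simplified, not vLLM's actual `CudagraphDispatcher` API:

```python
import bisect

class CudagraphDispatcher:
    """Toy sketch: owns the list of captured graph sizes and all padding logic."""

    def __init__(self, capture_sizes):
        # Sizes for which CUDA graphs were captured, sorted for binary search.
        self.capture_sizes = sorted(capture_sizes)

    def pad_num_tokens(self, num_tokens):
        # Smallest captured size >= num_tokens; if the batch is larger than
        # every captured size, return it unchanged (eager fallback).
        i = bisect.bisect_left(self.capture_sizes, num_tokens)
        if i == len(self.capture_sizes):
            return num_tokens
        return self.capture_sizes[i]
```

With padding centralized like this, callers never consult the config for a pad size; they ask the dispatcher, which keeps FULL and PIECEWISE size lists free to diverge later.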
```python
eagle_cudagraph_mode = (
    cudagraph_mode.mixed_mode()
    if not self.speculative_config.enforce_eager
    else CUDAGraphMode.NONE
)
```
This change might introduce a regression. The previous implementation for EagleProposer only enabled CUDA graphs if PIECEWISE mode was supported. With this change, if the main model's cudagraph_mode is FULL, cudagraph_mode.mixed_mode() will return FULL, and EagleProposer will attempt to use FULL CUDA graphs, which it may not support (as indicated by the old logic and removed warning). This could lead to runtime errors.
To preserve the original behavior and ensure only PIECEWISE graphs are used for Eagle, I suggest restoring the check for PIECEWISE support.
Suggested change:

```python
# before
eagle_cudagraph_mode = (
    cudagraph_mode.mixed_mode()
    if not self.speculative_config.enforce_eager
    else CUDAGraphMode.NONE
)

# after
if (not self.speculative_config.enforce_eager
        and cudagraph_mode.has_mode(CUDAGraphMode.PIECEWISE)):
    eagle_cudagraph_mode = CUDAGraphMode.PIECEWISE
else:
    eagle_cudagraph_mode = CUDAGraphMode.NONE
```
Hi @LucasWilkinson, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then commit the changes and push to your branch. For future commits, the installed hooks will run pre-commit automatically.
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
# before
vllm_config = EngineArgs(model=model, trust_remote_code=True).create_engine_config()
while len(example_prompts) == vllm_config.pad_for_cudagraph(len(example_prompts)):

# after
cudagraph_dispatcher = CudagraphDispatcher(vllm_config)
while (
    len(example_prompts)
    == cudagraph_dispatcher.dispatch(len(example_prompts))[1].num_tokens
):
```
Dispatch loop never terminates in hybrid padding test
In test_mamba_cache_cg_padding a CudagraphDispatcher is created and immediately used in the while condition without ever calling initialize_cudagraph_keys, so dispatch returns a BatchDescriptor with num_tokens unchanged when keys_initialized is False. The condition len(example_prompts) == ...num_tokens therefore stays true on every iteration and the loop appends forever, hanging the test before it exercises any logic. This test now never completes under any configuration.
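The failure mode described above can be sketched with a toy dispatcher (illustrative names and logic; not vLLM's actual implementation): before its keys are initialized, `dispatch` passes the token count through unchanged, so a loop waiting for padding to alter the size never terminates.

```python
class ToyDispatcher:
    """Toy model of the reported hang, not the real CudagraphDispatcher."""

    def __init__(self, capture_sizes):
        self.capture_sizes = sorted(capture_sizes)
        self.keys_initialized = False

    def initialize_cudagraph_keys(self):
        self.keys_initialized = True

    def dispatch(self, num_tokens):
        if not self.keys_initialized:
            # Uninitialized: num_tokens comes back unchanged, so a loop
            # testing `len(prompts) == dispatch(len(prompts))` never exits.
            return num_tokens
        for size in self.capture_sizes:
            if size >= num_tokens:
                return size  # padded up to a captured graph size
        return num_tokens

d = ToyDispatcher([4, 8, 16])
assert d.dispatch(3) == 3   # unchanged: the test's while-condition stays true
d.initialize_cudagraph_keys()
assert d.dispatch(3) == 4   # padded: the loop can now terminate
```

The fix suggested by the finding is simply to call `initialize_cudagraph_keys` (or its real equivalent) before the dispatcher is used in the test's loop condition.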
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
With #28579 we can strip `pad_for_cudagraph` from the config, forcing cudagraph padding through the dispatcher consistently. This makes it easier to debug and enables support for separate sizes for FULL and PIECEWISE graphs.