[Feat] Eagle Proposer support FULL_DECODE_ONLY graph mode #4763
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for FULL_DECODE_ONLY graph mode for the Eagle Proposer, which involves renaming is_mtp_model to is_draft_model for better clarity and adding logic to handle graph parameters and attention metadata for the draft model. While the changes are generally in the right direction, I've identified several critical issues in vllm_ascend/spec_decode/eagle_proposer.py that will break graph replay. These issues involve tensor reassignments instead of in-place updates for metadata attributes like block_tables and query_start_loc. Additionally, there are some misleading comments and potentially unnecessary tensor transfers to the CPU that could impact performance. Addressing these issues is crucial for the correctness and efficiency of the graph mode implementation.
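To make the failure mode concrete, here is a minimal, self-contained sketch (illustrative only; `AttnMetadata` below is a stand-in class, not the project's actual metadata type):

```python
import torch

# A captured graph keeps reading from the tensor address recorded at capture
# time, so rebinding the attribute leaves the graph on stale data, while an
# in-place copy_ into the captured buffer is visible on replay.
class AttnMetadata:
    def __init__(self):
        self.block_tables = torch.zeros(4, 8, dtype=torch.int32)

meta = AttnMetadata()
captured_view = meta.block_tables        # the address the "graph" recorded

meta.block_tables = torch.ones(4, 8, dtype=torch.int32)  # reassignment: new object
assert captured_view.sum() == 0          # replay would still see the stale zeros

meta.block_tables = captured_view        # keep using the captured buffer
fresh = torch.ones(2, 8, dtype=torch.int32)
meta.block_tables[:fresh.shape[0]].copy_(fresh)  # in-place update
assert captured_view.sum() == 16         # replay now sees the new contents
```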
```python
self.hidden_states[:num_tokens] = target_hidden_states
# NOTE: **FullGraph: Why we need to change this block_tables? It wasn't changed before.
# If we really need to change it, It should be copied to the attn_metadata.block_tables, not assigned.
attn_metadata.block_tables = block_table.to(device)
```
Reassigning attn_metadata.block_tables will break graph replay. The graph is captured with a tensor at a specific memory address, and reassigning it to a new tensor object will cause the graph to use stale data during replay. The author's own comment correctly points this out. Please use an in-place copy to update the tensor's content.
Suggested change:
```python
attn_metadata.block_tables[:block_table.shape[0]].copy_(block_table.to(device))
```
```python
attn_metadata.max_query_len = 1
# NOTE: **FullGraph: Here make a new tensor with a new address.
# Once it was used in forward, it may cases any errors in fullgraph mode.
attn_metadata.query_start_loc = self.arange[:batch_size + 1]
```
Creating a new tensor view for attn_metadata.query_start_loc by slicing self.arange will break graph replay, as the memory address of the tensor will change if batch_size varies. The graph expects a tensor at a fixed memory address. You should update the content of a pre-allocated buffer in-place.
Suggested change:
```python
new_query_start_loc = self.arange[:batch_size + 1]
attn_metadata.query_start_loc[:new_query_start_loc.shape[0]].copy_(new_query_start_loc)
```
```python
attn_metadata.attn_mask = attn_mask
# NOTE: **FullGraph: If we really need to change it, It should be
# copied to the attn_metadata.block_tables, not assigned.
attn_metadata.block_tables = block_table.to(device)
```
This is another instance of reassigning attn_metadata.block_tables, which will break graph replay. As with the previous occurrence, please use an in-place copy to update the tensor's content instead of creating a new tensor object.
Suggested change:
```python
attn_metadata.block_tables[:block_table.shape[0]].copy_(block_table.to(device))
```
```python
# NOTE: We do not need to send the block_table to cpu.
block_table = block_table.cpu()
num_tokens = target_token_ids.shape[0]
batch_size = next_token_ids.shape[0]
last_token_indices = cu_num_tokens[1:] - 1
# NOTE: We do not need to send the target_positions to cpu.
target_positions = target_positions.cpu()
```
The comments on lines 497 and 502 contradict the code. The comments state that block_table and target_positions are not needed on the CPU, but the subsequent lines of code move these tensors to the CPU. This is misleading and could cause confusion for future maintenance. If the CPU transfer is necessary for subsequent operations, please remove these comments. If not, the .cpu() calls should be removed to avoid unnecessary device-to-host synchronization and potential performance degradation.
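If the second option applies, a sketch of the resulting patch (assumption: all downstream consumers of these tensors accept device-resident values):

```python
num_tokens = target_token_ids.shape[0]
batch_size = next_token_ids.shape[0]
last_token_indices = cu_num_tokens[1:] - 1
# block_table and target_positions stay on the NPU: dropping the .cpu()
# calls removes a blocking device-to-host copy on the hot decode path.
```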
Force-pushed from 212af97 to 28b4946 (Compare)
This pull request has conflicts, please resolve those before we can evaluate the pull request.
```python
    update_graph_params_workspaces)
from vllm_ascend.compilation.acl_graph import (
    get_graph_params, get_mtp_graph_params, update_graph_params_workspaces,
    update_mtp_graph_params_workspaces)
```
Change mtp to draft, keeping it consistent with ascend_forward_context.py: update_draft_graph_params_workspaces.
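A sketch of the requested rename applied to the import block above (the draft-named functions are the reviewer's proposal, not yet in the codebase):

```python
from vllm_ascend.compilation.acl_graph import (
    get_draft_graph_params, get_graph_params,
    update_draft_graph_params_workspaces, update_graph_params_workspaces)
```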
```python
graph_params = get_graph_params()
forward_context = get_forward_context()
if forward_context.is_draft_model:
    graph_params = get_mtp_graph_params()
```
Same here: rename to get_draft_graph_params.
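Likewise at the call site, the renamed lookup would read (sketch; get_draft_graph_params is the proposed name, not an existing function):

```python
graph_params = get_graph_params()
forward_context = get_forward_context()
if forward_context.is_draft_model:
    graph_params = get_draft_graph_params()  # proposed name, mirroring is_draft_model
```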
wangxiyuan left a comment
add eagle + full graph e2e test. Thanks.
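A hedged sketch of what such an e2e test might look like; the model names, speculative_config keys, and compilation_config wiring below are assumptions based on vLLM's public interfaces, not this PR's actual test:

```python
import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("num_speculative_tokens", [1, 2])
def test_eagle_full_decode_only(num_speculative_tokens):
    # Target + Eagle draft model pair (assumed checkpoints), with the graph
    # mode this PR enables for the proposer.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        speculative_config={
            "method": "eagle",
            "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
            "num_speculative_tokens": num_speculative_tokens,
        },
        compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
    )
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(temperature=0.0, max_tokens=32))
    # Decoding under full-graph mode should still produce tokens.
    assert outputs[0].outputs[0].text
```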
…raph capture and execution for the draft model's forward pass to improve performance. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…ec > 1 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Force-pushed from 28b4946 to 2e26ae9 (Compare)
What this PR does / why we need it?
This PR makes the Eagle Proposer support FULL_DECODE_ONLY graph mode in vllm_ascend.
The changes include:
1). If full graph mode is enabled, wrap the model in a FullGraph wrapper at model-load time (see the sketch after this list).
2). Build new metadata, set the running mode to FULL, and mark the attention update in dummy_run when in full-graph mode.
3). Fix and fill attn_metadata fields, such as attn_metadata.slot_mapping.
4). Add a descriptor.
5). Set the running mode and trigger the metadata update.
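For step 1), a minimal sketch of the intended wiring, assuming an ACLGraphWrapper analogous to vLLM's CUDAGraphWrapper; the exact names, import paths, and call site are assumptions, not this PR's verbatim code:

```python
from vllm.config import CUDAGraphMode
from vllm_ascend.compilation.acl_graph import ACLGraphWrapper  # assumed import path

# Assumed wiring: wrap the draft model's forward in a full-graph wrapper
# only when the compilation config requests FULL_DECODE_ONLY.
if self.compilation_config.cudagraph_mode == CUDAGraphMode.FULL_DECODE_ONLY:
    self.model = ACLGraphWrapper(self.model,
                                 self.vllm_config,
                                 runtime_mode=CUDAGraphMode.FULL)
```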
Does this PR introduce any user-facing change?
How was this patch tested?