Conversation

@anon189Ty (Contributor) commented Dec 6, 2025

What this PR does / why we need it?

This PR makes the Eagle Proposer support the FULL_DECODE_ONLY graph mode in vllm_ascend.

The changes include:

  1. Distinguish between processing graph_params and draft_graph_params in attention_v1 (see the sketch after this list).
  2. Adapt eagle_proposer to the full-graph mode, including:
    1). When full graph is enabled, wrap the model with the FullGraph wrapper at model-load time.
    2). Build new metadata, set the running mode to FULL, and mark the attention update in dummy_run when in full-graph mode.
    3). Fix and fill the required attn_metadata fields, such as attn_metadata.slot_mapping.
    4). Add a descriptor.
    5). Set the running mode and trigger the metadata update.
  3. Rename is_mtp_model to is_draft_model, and add the workspace update.
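
A minimal sketch of the dispatch in item 1, based on the snippet quoted later in this thread; the import path for get_forward_context is an assumption, not part of this patch:

    # Sketch only, not the exact patch. Assumes the helpers quoted later in
    # this thread and that the forward context carries an is_draft_model flag.
    from vllm.forward_context import get_forward_context  # assumed import path

    from vllm_ascend.compilation.acl_graph import (get_graph_params,
                                                   get_mtp_graph_params)

    def select_graph_params():
        """Pick the draft-model graph buffers when running the Eagle proposer."""
        if get_forward_context().is_draft_model:
            return get_mtp_graph_params()  # draft (Eagle/MTP) graph params
        return get_graph_params()          # main-model graph params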

Does this PR introduce any user-facing change?

How was this patch tested?

github-actions bot commented Dec 6, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot left a comment

Code Review

This pull request adds support for FULL_DECODE_ONLY graph mode for the Eagle Proposer, which involves renaming is_mtp_model to is_draft_model for better clarity and adding logic to handle graph parameters and attention metadata for the draft model. While the changes are generally in the right direction, I've identified several critical issues in vllm_ascend/spec_decode/eagle_proposer.py that will break graph replay. These issues involve tensor reassignments instead of in-place updates for metadata attributes like block_tables and query_start_loc. Additionally, there are some misleading comments and potentially unnecessary tensor transfers to the CPU that could impact performance. Addressing these issues is crucial for the correctness and efficiency of the graph mode implementation.

self.hidden_states[:num_tokens] = target_hidden_states
# NOTE: **FullGraph: why do we need to change block_tables here? It wasn't changed before.
# If we really need to change it, it should be copied into attn_metadata.block_tables, not assigned.
attn_metadata.block_tables = block_table.to(device)

critical

Reassigning attn_metadata.block_tables will break graph replay. The graph is captured with a tensor at a specific memory address, and reassigning it to a new tensor object will cause the graph to use stale data during replay. The author's own comment correctly points this out. Please use an in-place copy to update the tensor's content.

attn_metadata.block_tables[:block_table.shape[0]].copy_(block_table.to(device))
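
A minimal illustration of the failure mode (plain PyTorch, not the vLLM code): a captured graph reads from a fixed address, which copy_ preserves and reassignment does not:

    import torch

    buf = torch.zeros(4, 8, dtype=torch.int32)   # buffer the graph was captured with
    addr = buf.data_ptr()

    new_table = torch.ones(2, 8, dtype=torch.int32)

    buf[:new_table.shape[0]].copy_(new_table)    # in-place update: same address
    assert buf.data_ptr() == addr

    buf = new_table                              # reassignment: new address;
    assert buf.data_ptr() != addr                # a replayed graph would still read the old buffer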

attn_metadata.max_query_len = 1
# NOTE: **FullGraph: this creates a new tensor with a new address.
# Once it is used in forward, it may cause errors in full-graph mode.
attn_metadata.query_start_loc = self.arange[:batch_size + 1]

critical

Creating a new tensor view for attn_metadata.query_start_loc by slicing self.arange will break graph replay, as the memory address of the tensor will change if batch_size varies. The graph expects a tensor at a fixed memory address. You should update the content of a pre-allocated buffer in-place.

new_query_start_loc = self.arange[:batch_size + 1]
attn_metadata.query_start_loc[:new_query_start_loc.shape[0]].copy_(new_query_start_loc)
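
In other words, the pattern is a persistent buffer written in place (a sketch, with a hypothetical pre-allocated self.query_start_loc_buf sized for the maximum batch; not part of this patch):

    # Hypothetical setup, done once:
    #   self.query_start_loc_buf = torch.zeros(max_batch_size + 1,
    #                                          dtype=torch.int32, device=device)
    n = batch_size + 1
    self.query_start_loc_buf[:n].copy_(self.arange[:n])       # contents change,
    attn_metadata.query_start_loc = self.query_start_loc_buf  # address does not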

attn_metadata.attn_mask = attn_mask
# NOTE: **FullGraph: if we really need to change it, it should be
# copied into attn_metadata.block_tables, not assigned.
attn_metadata.block_tables = block_table.to(device)

critical

This is another instance of reassigning attn_metadata.block_tables, which will break graph replay. As with the previous occurrence, please use an in-place copy to update the tensor's content instead of creating a new tensor object.

attn_metadata.block_tables[:block_table.shape[0]].copy_(block_table.to(device))

Comment on lines +497 to 501
# NOTE: We do not need to send the block_table to cpu.
block_table = block_table.cpu()
num_tokens = target_token_ids.shape[0]
batch_size = next_token_ids.shape[0]
last_token_indices = cu_num_tokens[1:] - 1
# NOTE: We do not need to send the target_positions to cpu.
target_positions = target_positions.cpu()

high

The comments on lines 497 and 502 contradict the code. The comments state that block_table and target_positions are not needed on the CPU, but the subsequent lines of code move these tensors to the CPU. This is misleading and could cause confusion for future maintenance. If the CPU transfer is necessary for subsequent operations, please remove these comments. If not, the .cpu() calls should be removed to avoid unnecessary device-to-host synchronization and potential performance degradation.
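
If the host copies are indeed unnecessary, the fix is simply to drop them (a sketch; whether downstream code needs host tensors must still be verified):

    num_tokens = target_token_ids.shape[0]      # shape access needs no transfer
    batch_size = next_token_ids.shape[0]
    last_token_indices = cu_num_tokens[1:] - 1  # elementwise op can run on device
    # block_table and target_positions stay on the device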

@anon189Ty force-pushed the eagle_fullgraph branch 2 times, most recently from 212af97 to 28b4946, on December 6, 2025 at 14:44
github-actions bot commented Dec 8, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

from vllm_ascend.compilation.acl_graph import (
    get_graph_params, get_mtp_graph_params, update_graph_params_workspaces,
    update_mtp_graph_params_workspaces)
@AlbertCheese commented Dec 8, 2025

Change mtp to draft, to keep it consistent with ascend_forward_context.py:
update_draft_graph_params_workspaces

graph_params = get_graph_params()
forward_context = get_forward_context()
if forward_context.is_draft_model:
    graph_params = get_mtp_graph_params()


Likewise, rename the getter: get_draft_graph_params

@anon189Ty anon189Ty changed the title [WIP][Feat] Eagle Proposer support FULL_DECODE_ONLY graph mode [Feat] Eagle Proposer support FULL_DECODE_ONLY graph mode Dec 8, 2025
@wangxiyuan (Collaborator) left a comment

Add an eagle + full-graph e2e test. Thanks.

yiz-liu and others added 2 commits December 10, 2025 10:22
…raph capture and execution for the draft model's forward pass to improve performance.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…ec > 1

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>