[Feat] Eagle Proposer support FULL_DECODE_ONLY graph mode #4763
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for FULL_DECODE_ONLY graph mode for the Eagle Proposer, which involves renaming is_mtp_model to is_draft_model for better clarity and adding logic to handle graph parameters and attention metadata for the draft model. While the changes are generally in the right direction, I've identified several critical issues in vllm_ascend/spec_decode/eagle_proposer.py that will break graph replay. These issues involve tensor reassignments instead of in-place updates for metadata attributes like block_tables and query_start_loc. Additionally, there are some misleading comments and potentially unnecessary tensor transfers to the CPU that could impact performance. Addressing these issues is crucial for the correctness and efficiency of the graph mode implementation.
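To make the failure mode concrete, here is a minimal, self-contained sketch (illustrative only; `AttnMetadata` below is a stand-in class, not the project's actual metadata type):

```python
import torch

# A captured graph keeps reading from the tensor address recorded at capture
# time, so rebinding the attribute leaves the graph on stale data, while an
# in-place copy_ into the captured buffer is visible on replay.
class AttnMetadata:
    def __init__(self):
        self.block_tables = torch.zeros(4, 8, dtype=torch.int32)

meta = AttnMetadata()
captured_view = meta.block_tables        # the address the "graph" recorded

meta.block_tables = torch.ones(4, 8, dtype=torch.int32)  # reassignment: new object
assert captured_view.sum() == 0          # replay would still see the stale zeros

meta.block_tables = captured_view        # keep using the captured buffer
fresh = torch.ones(2, 8, dtype=torch.int32)
meta.block_tables[:fresh.shape[0]].copy_(fresh)  # in-place update
assert captured_view.sum() == 16         # replay now sees the new contents
```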
```python
self.hidden_states[:num_tokens] = target_hidden_states
# NOTE: **FullGraph: Why we need to change this block_tables? It wasn't changed before.
# If we really need to change it, It should be copied to the attn_metadata.block_tables, not assigned.
attn_metadata.block_tables = block_table.to(device)
```
Reassigning attn_metadata.block_tables will break graph replay. The graph is captured with a tensor at a specific memory address, and reassigning it to a new tensor object will cause the graph to use stale data during replay. The author's own comment correctly points this out. Please use an in-place copy to update the tensor's content.
Suggested change:
```python
attn_metadata.block_tables[:block_table.shape[0]].copy_(block_table.to(device))
```
```python
attn_metadata.max_query_len = 1
# NOTE: **FullGraph: Here make a new tensor with a new address.
# Once it was used in forward, it may cases any errors in fullgraph mode.
attn_metadata.query_start_loc = self.arange[:batch_size + 1]
```
Creating a new tensor view for attn_metadata.query_start_loc by slicing self.arange will break graph replay, as the memory address of the tensor will change if batch_size varies. The graph expects a tensor at a fixed memory address. You should update the content of a pre-allocated buffer in-place.
Suggested change:
```python
new_query_start_loc = self.arange[:batch_size + 1]
attn_metadata.query_start_loc[:new_query_start_loc.shape[0]].copy_(new_query_start_loc)
```
```python
attn_metadata.attn_mask = attn_mask
# NOTE: **FullGraph: If we really need to change it, It should be
# copied to the attn_metadata.block_tables, not assigned.
attn_metadata.block_tables = block_table.to(device)
```
This is another instance of reassigning attn_metadata.block_tables, which will break graph replay. As with the previous occurrence, please use an in-place copy to update the tensor's content instead of creating a new tensor object.
Suggested change:
```python
attn_metadata.block_tables[:block_table.shape[0]].copy_(block_table.to(device))
```
```python
# NOTE: We do not need to send the block_table to cpu.
block_table = block_table.cpu()
num_tokens = target_token_ids.shape[0]
batch_size = next_token_ids.shape[0]
last_token_indices = cu_num_tokens[1:] - 1
# NOTE: We do not need to send the target_positions to cpu.
target_positions = target_positions.cpu()
```
The comments on lines 497 and 502 contradict the code. The comments state that block_table and target_positions are not needed on the CPU, but the subsequent lines of code move these tensors to the CPU. This is misleading and could cause confusion for future maintenance. If the CPU transfer is necessary for subsequent operations, please remove these comments. If not, the .cpu() calls should be removed to avoid unnecessary device-to-host synchronization and potential performance degradation.
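If the second option applies, a sketch of the resulting patch (assumption: all downstream consumers of these tensors accept device-resident values):

```python
num_tokens = target_token_ids.shape[0]
batch_size = next_token_ids.shape[0]
last_token_indices = cu_num_tokens[1:] - 1
# block_table and target_positions stay on the NPU: dropping the .cpu()
# calls removes a blocking device-to-host copy on the hot decode path.
```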
Force-pushed from 212af97 to 28b4946 (Compare)
This pull request has conflicts, please resolve those before we can evaluate the pull request.
```python
    update_graph_params_workspaces)
from vllm_ascend.compilation.acl_graph import (
    get_graph_params, get_mtp_graph_params, update_graph_params_workspaces,
    update_mtp_graph_params_workspaces)
```
Change mtp to draft, keeping it consistent with ascend_forward_context.py: update_draft_graph_params_workspaces.
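A sketch of the requested rename applied to the import block above (the draft-named functions are the reviewer's proposal, not yet in the codebase):

```python
from vllm_ascend.compilation.acl_graph import (
    get_draft_graph_params, get_graph_params,
    update_draft_graph_params_workspaces, update_graph_params_workspaces)
```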
```python
graph_params = get_graph_params()
forward_context = get_forward_context()
if forward_context.is_draft_model:
    graph_params = get_mtp_graph_params()
```
Same here: rename to get_draft_graph_params.
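Likewise at the call site, the renamed lookup would read (sketch; get_draft_graph_params is the proposed name, not an existing function):

```python
graph_params = get_graph_params()
forward_context = get_forward_context()
if forward_context.is_draft_model:
    graph_params = get_draft_graph_params()  # proposed name, mirroring is_draft_model
```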
wangxiyuan left a comment
add eagle + full graph e2e test. Thanks.
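A hedged sketch of what such an e2e test might look like; the model names, speculative_config keys, and compilation_config wiring below are assumptions based on vLLM's public interfaces, not this PR's actual test:

```python
import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("num_speculative_tokens", [1, 2])
def test_eagle_full_decode_only(num_speculative_tokens):
    # Target + Eagle draft model pair (assumed checkpoints), with the graph
    # mode this PR enables for the proposer.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        speculative_config={
            "method": "eagle",
            "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
            "num_speculative_tokens": num_speculative_tokens,
        },
        compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
    )
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(temperature=0.0, max_tokens=32))
    # Decoding under full-graph mode should still produce tokens.
    assert outputs[0].outputs[0].text
```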
…raph capture and execution for the draft model's forward pass to improve performance. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…ec > 1 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Force-pushed from 28b4946 to 2e26ae9 (Compare)
What this PR does / why we need it?
This PR makes the Eagle Proposer support FULL_DECODE_ONLY graph mode in vllm_ascend.
The changes include:
1). If full graph mode is enabled, wrap the model in a FullGraph wrapper at model-load time (see the sketch after this list).
2). Build new metadata, set the running mode to FULL, and mark the attention update in dummy_run when in full-graph mode.
3). Fix and fill attn_metadata fields, such as attn_metadata.slot_mapping.
4). Add a descriptor.
5). Set the running mode and trigger the metadata update.
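For step 1), a minimal sketch of the intended wiring, assuming an ACLGraphWrapper analogous to vLLM's CUDAGraphWrapper; the exact names, import paths, and call site are assumptions, not this PR's verbatim code:

```python
from vllm.config import CUDAGraphMode
from vllm_ascend.compilation.acl_graph import ACLGraphWrapper  # assumed import path

# Assumed wiring: wrap the draft model's forward in a full-graph wrapper
# only when the compilation config requests FULL_DECODE_ONLY.
if self.compilation_config.cudagraph_mode == CUDAGraphMode.FULL_DECODE_ONLY:
    self.model = ACLGraphWrapper(self.model,
                                 self.vllm_config,
                                 runtime_mode=CUDAGraphMode.FULL)
```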
Does this PR introduce any user-facing change?
How was this patch tested?