-
Notifications
You must be signed in to change notification settings - Fork 646
[Bugfix] Fix MoE MLP related issues (ref #4490) #4743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request primarily addresses a bug in the MoE MLP implementation by correctly formatting the group_index parameter for the npu_dequant_swiglu_quant operation, which appears to be a critical fix. The PR also includes several updates to test files to adapt to other changes and improve robustness. However, I've identified a critical issue in one of the test file changes, tests/ut/eplb/core/test_eplb_device_transfer_loader.py, which seems to introduce a failing test and points to an underlying bug in the expert weight loading logic that is not addressed in this PR.
| loader_obj.generate_expert_d2d_transfer_task([], [], {}, 0) | ||
| assert loader_obj.comm_op_list is None | ||
| # assert loader_obj.comm_op_list is None | ||
| assert loader_obj.comm_op_list == [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion change will cause the test to fail with the current implementation.
The generate_expert_d2d_transfer_task function returns early when expert_send_info and expert_recv_info are empty, without modifying loader_obj.comm_op_list. Since comm_op_list is None from initialization and is not changed by the first call in this test, it remains None. Therefore, the assertion assert loader_obj.comm_op_list == [] will fail. The original assertion assert loader_obj.comm_op_list is None was correct.
This change likely points to an underlying bug in generate_expert_d2d_transfer_task located in vllm_ascend/eplb/core/eplb_device_transfer_loader.py. The function doesn't reset self.comm_op_list when no new transfer tasks are generated. This could lead to asyn_expert_weight_transfer re-executing stale communication operations from a previous call, which is a serious bug.
A potential fix in vllm_ascend/eplb/core/eplb_device_transfer_loader.py would be to reset self.comm_op_list at the beginning of generate_expert_d2d_transfer_task. For example:
# In vllm_ascend/eplb/core/eplb_device_transfer_loader.py
def generate_expert_d2d_transfer_task(self, ...):
if self.state != ExpertWeightUpdateState.WAITING:
# ...
return
self.comm_op_list = [] # Reset here
if not (expert_send_info or expert_recv_info):
return
# ...With such a fix in the implementation, this test change would be correct. As it stands, this PR introduces a failing test.
Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
ef6c5c3 to
1021a57
Compare
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
e26453e to
945aab4
Compare
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?