Description
The load-balanced partition split significantly improves write throughput for huge partitions (validated with up to 5 TB of data per partition). However, there are still some corner cases that need to be handled.
Background of the problem
I found that some Spark jobs failed due to retry time exceeded when a huge partition occurred in one stage. The Uniffle client logs are shown below (with additional information added to the existing logs):
26/02/06 20:57:14 ERROR RssShuffleWriter: Block of blockId[3518521950613624] partitionId[10102] in taskId[34299_0] retry exceeding the max retry times: [1]. Fast fail! faulty server list: [ShuffleServerInfo{host[xxxxx], grpc port[29100], netty port[29104]}]
After adding additional logs and reproducing the issue, I found that there are some unhandled corner cases in the current partition-split implementation when LOAD_BALANCE mode is enabled.
In the failed job, a huge partition failed to write to its assigned shuffle server and then fell back to other servers. After several reassignments, the operation eventually failed due to exceeding the maximum retry limit.
Assume the huge partition is partition-1. Once it is marked as a huge partition, it is split across 20 shuffle servers (the number configured by the Uniffle client), which take over the shuffle data for partition-1. However, the total data size of partition-1 is extremely large, about 1 TB. This means the average data size per shuffle server is approximately 50 GB, while the configured capacity per shuffle server is only 20 GB. As a result, a second partition split is triggered.
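To make the numbers concrete, here is a minimal sketch of the arithmetic that triggers the second split (all values are illustrative, not taken from Uniffle's code):

```java
public class SplitMath {
    public static void main(String[] args) {
        long partitionSizeGb = 1024;     // ~1 TB of data in partition-1
        int loadBalancedServers = 20;    // number of servers configured for the load-balanced split
        long perServerCapacityGb = 20;   // configured capacity per shuffle server

        long perServerShareGb = partitionSizeGb / loadBalancedServers; // ~51 GB per server
        // Each server's share still exceeds its capacity, so another split is triggered.
        System.out.printf("share per server = %d GB, capacity = %d GB, needs second split = %b%n",
                perServerShareGb, perServerCapacityGb, perServerShareGb > perServerCapacityGb);
    }
}
```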
In this scenario, the second partition split is not handled properly by the current implementation. Instead, partition-1 keeps selecting the next shuffle server, one by one, for writing, driven by reassignOnPartitionNeedSplit. These frequent partition assignment updates lead to stale assignments (as defined in ShuffleBlockInfo#isStaleAssignment). Consequently, shuffle blocks accumulated in the data pusher queue are repeatedly resent through the resend logic. After several retries, the process ultimately fails because the maximum retry limit is exceeded.
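A simplified illustration of why repeated reassignment makes queued blocks stale (all names are hypothetical except ShuffleBlockInfo#isStaleAssignment, which is only referenced here, not reimplemented):

```java
// Hypothetical sketch: a block remembers the assignment it was created with; if the
// writer has since reassigned the partition, the queued block no longer matches the
// current target and is treated as stale.
class BlockAssignmentSketch {
    static class Block {
        final int assignmentVersionAtCreation;
        Block(int version) { this.assignmentVersionAtCreation = version; }

        // Analogous in spirit to ShuffleBlockInfo#isStaleAssignment.
        boolean isStaleAssignment(int currentAssignmentVersion) {
            return assignmentVersionAtCreation < currentAssignmentVersion;
        }
    }

    public static void main(String[] args) {
        int currentVersion = 0;
        Block queued = new Block(currentVersion);
        // reassignOnPartitionNeedSplit bumps the assignment again and again...
        currentVersion += 3;
        // ...so blocks still sitting in the data pusher queue become stale and must be
        // resent, each resend consuming one of the limited retries.
        System.out.println("stale = " + queued.isStaleAssignment(currentVersion));
    }
}
```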
Investigation
After applying the patch from #2726, I re-ran the failed job. It still failed, but with a different error; the logs are as follows:
org.apache.uniffle.common.exception.RssWaitFailedException: Timeout: Task[32412_0] failed because 95 blocks can't be sent to shuffle server in 600000 ms.
There were no error messages such as "cannot require buffer" and no network failures in the logs. It appeared that the block sending events were simply lost.
After thorough investigation, I found that the root cause was the stale assignment filter.
In the current logic, if all blocks in a sending event are filtered out as stale assignments, they are added to the failedBlockTracker, but the event callback is not executed. As a result, the RssShuffleWriter never receives completion feedback and hangs until the timeout is reached.
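This is not the actual patch in #2729; the following is only a minimal sketch (with hypothetical names) of the idea that the sending event must signal completion even when every block in it is filtered out:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the hang described above: if every block in a sending event
// is filtered out as a stale assignment and the event callback is skipped, the writer
// waits forever. The fix is to always signal completion.
class EventCallbackSketch {
    static void processEvent(List<Long> blockIds, boolean allStale, CountDownLatch done) {
        if (allStale) {
            // Blocks go to the failedBlockTracker here...
            // Before the fix, the method returned without signalling completion, leaving
            // RssShuffleWriter blocked until the send timeout fired.
            done.countDown(); // the fix: always notify the writer
            return;
        }
        // ...otherwise send the blocks, then notify on completion.
        done.countDown();
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        processEvent(List.of(1L, 2L, 3L), true, done);
        done.await(); // returns immediately instead of hanging
        System.out.println("event completed");
    }
}
```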
The fix is implemented in #2729.
Possible solutions
Option 1
Increase the configured number of load-balanced shuffle servers.
However, this value is currently static, and ideally it should be adaptive.
Option 2
In the second partition-split scenario, stop searching for the next shuffle server and continue writing to the current one.
This may slightly degrade performance if the throughput limit is reached, but it avoids excessive reassignment and retries.
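A rough sketch of what Option 2 could look like (method and variable names are hypothetical, not Uniffle APIs): once a partition has already been load-balanced, a further "need split" signal keeps the current server instead of rotating to the next one.

```java
import java.util.List;

class SecondSplitPolicySketch {
    static String chooseServer(String currentServer, List<String> candidates,
                               boolean alreadyLoadBalanced, boolean needSplitAgain) {
        if (alreadyLoadBalanced && needSplitAgain) {
            // Avoid another reassignment round: accept possible backpressure on the
            // current server rather than risking stale assignments and retry exhaustion.
            return currentServer;
        }
        // Normal path: pick the next candidate (round-robin, simplified).
        int next = (candidates.indexOf(currentServer) + 1) % candidates.size();
        return candidates.get(next);
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("server-a", "server-b", "server-c");
        System.out.println(chooseServer("server-a", candidates, true, true));   // stays on server-a
        System.out.println(chooseServer("server-a", candidates, false, true));  // rotates to server-b
    }
}
```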
Option 3
Introduce an adaptive mechanism to dynamically increase the number of load-balanced shuffle servers.
This mechanism could be triggered by the writer when it detects that all assigned servers are backpressured.
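A possible shape for the trigger condition in Option 3 (again a hypothetical sketch, not existing Uniffle code): the writer tracks backpressure per assigned server and asks for additional load-balanced servers only when all of them are backpressured.

```java
import java.util.Map;

class AdaptiveSplitSketch {
    static boolean shouldRequestMoreServers(Map<String, Boolean> backpressuredByServer) {
        // Trigger only when every currently assigned server reports backpressure.
        return !backpressuredByServer.isEmpty()
                && backpressuredByServer.values().stream().allMatch(Boolean::booleanValue);
    }

    public static void main(String[] args) {
        System.out.println(shouldRequestMoreServers(Map.of("s1", true, "s2", true)));  // true
        System.out.println(shouldRequestMoreServers(Map.of("s1", true, "s2", false))); // false
    }
}
```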
Option 4
Remove the logic that filters out staleAssignment blocks before resending; this might bypass the max-retry-times limitation described above.