Description
The load-balanced partition split significantly improves write throughput for huge partitions (validated with up to 5 TB of data per partition). However, there are still some corner cases that need to be handled.
Background of the problem
I found that some Spark jobs failed due to retry time exceeded when a huge partition occurred in one stage. The Uniffle client logs are shown below (with additional information added to the existing logs):
26/02/06 20:57:14 ERROR RssShuffleWriter: Block of blockId[3518521950613624] partitionId[10102] in taskId[34299_0] retry exceeding the max retry times: [1]. Fast fail! faulty server list: [ShuffleServerInfo{host[xxxxx], grpc port[29100], netty port[29104]}]
After adding additional logs and reproducing the issue, I found that there are some unhandled corner cases in the current partition-split implementation when LOAD_BALANCE mode is enabled.
In the failed job, a huge partition failed to write to its assigned shuffle server and then fell back to other servers. After several reassignments, the operation eventually failed due to exceeding the maximum retry limit.
Assume the huge partition is partition-1. Once it is marked as a huge partition, it is split across 20 shuffle servers (the number configured by the Uniffle client), which take over the shuffle data for partition-1. However, the total data size of partition-1 is extremely large, about 1 TB. This means the average data size per shuffle server is approximately 50 GB, while the configured capacity per shuffle server is only 20 GB. As a result, a second partition split is triggered.
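To make the numbers concrete, here is a minimal sketch of the arithmetic that triggers the second split (all values are illustrative, not taken from Uniffle's code):

```java
public class SplitMath {
    public static void main(String[] args) {
        long partitionSizeGb = 1024;     // ~1 TB of data in partition-1
        int loadBalancedServers = 20;    // number of servers configured for the load-balanced split
        long perServerCapacityGb = 20;   // configured capacity per shuffle server

        long perServerShareGb = partitionSizeGb / loadBalancedServers; // ~51 GB per server
        // Each server's share still exceeds its capacity, so another split is triggered.
        System.out.printf("share per server = %d GB, capacity = %d GB, needs second split = %b%n",
                perServerShareGb, perServerCapacityGb, perServerShareGb > perServerCapacityGb);
    }
}
```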
In this scenario, the second partition split is not handled properly by the current implementation. Instead, partition-1 keeps selecting the next shuffle server, one by one, for writing, driven by reassignOnPartitionNeedSplit. These frequent partition assignment updates lead to stale assignments (as defined in ShuffleBlockInfo#isStaleAssignment). Consequently, shuffle blocks accumulated in the data pusher queue are repeatedly resent through the resend logic. After several retries, the process ultimately fails because the maximum retry limit is exceeded.
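A simplified illustration of why repeated reassignment makes queued blocks stale (all names are hypothetical except ShuffleBlockInfo#isStaleAssignment, which is only referenced here, not reimplemented):

```java
// Hypothetical sketch: a block remembers the assignment it was created with; if the
// writer has since reassigned the partition, the queued block no longer matches the
// current target and is treated as stale.
class BlockAssignmentSketch {
    static class Block {
        final int assignmentVersionAtCreation;
        Block(int version) { this.assignmentVersionAtCreation = version; }

        // Analogous in spirit to ShuffleBlockInfo#isStaleAssignment.
        boolean isStaleAssignment(int currentAssignmentVersion) {
            return assignmentVersionAtCreation < currentAssignmentVersion;
        }
    }

    public static void main(String[] args) {
        int currentVersion = 0;
        Block queued = new Block(currentVersion);
        // reassignOnPartitionNeedSplit bumps the assignment again and again...
        currentVersion += 3;
        // ...so blocks still sitting in the data pusher queue become stale and must be
        // resent, each resend consuming one of the limited retries.
        System.out.println("stale = " + queued.isStaleAssignment(currentVersion));
    }
}
```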
Investigation
After applying the patch from #2726, I re-ran the failed job. It still failed, but with a different error; the logs are as follows:
org.apache.uniffle.common.exception.RssWaitFailedException: Timeout: Task[32412_0] failed because 95 blocks can't be sent to shuffle server in 600000 ms.
There were no error messages such as "cannot require buffer" and no network failures in the logs. It appeared that the block sending events were simply lost.
After thorough investigation, I found that the root cause was the stale assignment filter.
In the current logic, if all blocks in a sending event are filtered out as stale assignments, they are added to the failedBlockTracker, but the event callback is not executed. As a result, the RssShuffleWriter never receives completion feedback and hangs until the timeout is reached.
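This is not the actual patch in #2729; the following is only a minimal sketch (with hypothetical names) of the idea that the sending event must signal completion even when every block in it is filtered out:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the hang described above: if every block in a sending event
// is filtered out as a stale assignment and the event callback is skipped, the writer
// waits forever. The fix is to always signal completion.
class EventCallbackSketch {
    static void processEvent(List<Long> blockIds, boolean allStale, CountDownLatch done) {
        if (allStale) {
            // Blocks go to the failedBlockTracker here...
            // Before the fix, the method returned without signalling completion, leaving
            // RssShuffleWriter blocked until the send timeout fired.
            done.countDown(); // the fix: always notify the writer
            return;
        }
        // ...otherwise send the blocks, then notify on completion.
        done.countDown();
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        processEvent(List.of(1L, 2L, 3L), true, done);
        done.await(); // returns immediately instead of hanging
        System.out.println("event completed");
    }
}
```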
The fix is implemented in #2729.
Possible solutions
Option 1
Increase the configured number of load-balanced shuffle servers.
However, this value is currently static, and ideally it should be adaptive.
Option 2
In the second partition-split scenario, stop searching for the next shuffle server and continue writing to the current one.
This may slightly degrade performance if the throughput limit is reached, but it avoids excessive reassignment and retries.
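A rough sketch of what Option 2 could look like (method and variable names are hypothetical, not Uniffle APIs): once a partition has already been load-balanced, a further "need split" signal keeps the current server instead of rotating to the next one.

```java
import java.util.List;

class SecondSplitPolicySketch {
    static String chooseServer(String currentServer, List<String> candidates,
                               boolean alreadyLoadBalanced, boolean needSplitAgain) {
        if (alreadyLoadBalanced && needSplitAgain) {
            // Avoid another reassignment round: accept possible backpressure on the
            // current server rather than risking stale assignments and retry exhaustion.
            return currentServer;
        }
        // Normal path: pick the next candidate (round-robin, simplified).
        int next = (candidates.indexOf(currentServer) + 1) % candidates.size();
        return candidates.get(next);
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("server-a", "server-b", "server-c");
        System.out.println(chooseServer("server-a", candidates, true, true));   // stays on server-a
        System.out.println(chooseServer("server-a", candidates, false, true));  // rotates to server-b
    }
}
```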
Option 3
Introduce an adaptive mechanism to dynamically increase the number of load-balanced shuffle servers.
This mechanism could be triggered by the writer when it detects that all assigned servers are backpressured.
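A possible shape for the trigger condition in Option 3 (again a hypothetical sketch, not existing Uniffle code): the writer tracks backpressure per assigned server and asks for additional load-balanced servers only when all of them are backpressured.

```java
import java.util.Map;

class AdaptiveSplitSketch {
    static boolean shouldRequestMoreServers(Map<String, Boolean> backpressuredByServer) {
        // Trigger only when every currently assigned server reports backpressure.
        return !backpressuredByServer.isEmpty()
                && backpressuredByServer.values().stream().allMatch(Boolean::booleanValue);
    }

    public static void main(String[] args) {
        System.out.println(shouldRequestMoreServers(Map.of("s1", true, "s2", true)));  // true
        System.out.println(shouldRequestMoreServers(Map.of("s1", true, "s2", false))); // false
    }
}
```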
Option 4
Remove the logic that filters out staleAssignment blocks before resending; this might bypass the max-retry-times limitation described above.