Skip to content

[CELEBORN-2257] Fix remote disks not being reported on registration#3597

Open
Dzeri96 wants to merge 2 commits intoapache:mainfrom
Dzeri96:CELEBORN-2257
Open

[CELEBORN-2257] Fix remote disks not being reported on registration#3597
Dzeri96 wants to merge 2 commits intoapache:mainfrom
Dzeri96:CELEBORN-2257

Conversation

@Dzeri96
Copy link

@Dzeri96 Dzeri96 commented Feb 6, 2026

What changes were proposed in this pull request?

  1. Disks reported to the master on registration now include remote disks (HDFS, S3, OSS)
  2. Refactored method names to clarify difference between local and remote disks.
  3. Embedded disk type information into the enum.
  4. Refactored unnecessarily complicated code in the slot assignment and worker registration path.

Why are the changes needed?

  1. Before the first heartbeat, the master won't be able to assign slots from the remote disks on the worker.
  2. All other changes are in preparation for better support of remote disks.

Does this PR resolve a correctness bug?

Yes

Does this PR introduce any user-facing change?

No

How was this patch tested?

Important: I want help from the community on how to write tests for this.

@SteNicholas SteNicholas changed the title [CELEBORN-2257] Fixed remote disks not being reported on registration [CELEBORN-2257] Fix remote disks not being reported on registration Feb 8, 2026
@SteNicholas SteNicholas requested a review from Copilot February 8, 2026 04:33
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes worker registration disk reporting so the master can see remote storage (HDFS/S3/OSS) immediately (before the first heartbeat), and refactors disk snapshot APIs / slot-allocation logic to distinguish local vs remote disks more clearly.

Changes:

  • Renamed disk snapshot / healthy-dir helpers to explicitly mean “local” and added an “all disks” snapshot.
  • Updated worker registration/heartbeat disk reporting to incorporate remote disks.
  • Simplified master slot-allocation filtering by embedding disk-type metadata into StorageInfo.Type and using it in allocation logic.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
worker/src/test/scala/org/apache/celeborn/service/deploy/worker/storage/StorageManagerSuite.scala Updates mocks to the renamed localDisksSnapshot() API.
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/StorageManager.scala Introduces localDisksSnapshot() / allDisksSnapshot() and renames “healthy working dirs” to local-only.
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Worker.scala Switches registration to report all disks; refactors heartbeat disk update flow.
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala Uses local-only healthy working dirs check for slot reservation.
tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/CelebornHashCheckDiskSuite.scala Updates test to use localDisksSnapshot().
master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java Simplifies disk filtering using StorageInfo.Type metadata; refactors usable-slot bookkeeping.
common/src/main/scala/org/apache/celeborn/common/meta/WorkerInfo.scala Refactors slot recomputation / propagation logic and uses isDFS.
common/src/main/java/org/apache/celeborn/common/protocol/StorageInfo.java Adds isDFS + mask metadata into StorageInfo.Type and introduces isAvailable(...).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov
Copy link

codecov bot commented Feb 8, 2026

Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 67.07%. Comparing base (2dd1b7a) to head (d5ae63a).
⚠️ Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
...a/org/apache/celeborn/common/meta/WorkerInfo.scala 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3597      +/-   ##
==========================================
- Coverage   67.13%   67.07%   -0.06%     
==========================================
  Files         357      357              
  Lines       21860    21935      +75     
  Branches     1943     1947       +4     
==========================================
+ Hits        14674    14711      +37     
- Misses       6166     6213      +47     
+ Partials     1020     1011       -9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +287 to +290
private val diskInfos = storageManager
.allDisksSnapshot()
.map { diskInfo => diskInfo.mountPoint -> diskInfo }
.toMap.asJava
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR changes worker registration/heartbeat disk reporting to include remote disks (allDisksSnapshot) and introduces new slot-availability semantics (StorageInfo.isAvailable, Type.isDFS). There doesn’t appear to be a test asserting that remote disk infos are (a) included in the initial registration payload and (b) preserved across subsequent heartbeats so the master can allocate slots from them before/without the first heartbeat. Adding a focused unit/integration test around worker->master disk info propagation would help prevent regressions here.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I wrote in the original PR. I need someone from the existing community to guide me on writing an integration test.

Copy link
Member

@SteNicholas SteNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dzeri96, thanks for contribution. Could you explain which fix mainly provided in this pull request?

Type(int value) {
Type(int value, boolean isDFS, int mask) {
this.value = value;
this.isDFS = isDFS;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it's unnecessary to add isDFS variable. The isDFS method is enough for usage.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +217 to +220
if (estimatedPartitionSize.nonEmpty && !newDisk.storageType.isDFS) {
newDisk.maxSlots = newDisk.totalSpace / estimatedPartitionSize.get
newDisk.availableSlots = newDisk.actualUsableSpace / estimatedPartitionSize.get
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces new slot-update semantics for DFS vs local disks inside updateThenGetDiskInfos, but there are no unit tests covering that remote/DFS disk slot fields are preserved across successive updates (e.g., registration updateDiskSlots(...) followed by heartbeat updateThenGetDiskInfos(...)). Adding a focused test in WorkerInfoSuite for a DFS DiskInfo would help prevent regressions like remote disks becoming unavailable after the first heartbeat.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants