
Conversation

@hongyunyan
Collaborator

@hongyunyan hongyunyan commented Feb 3, 2026

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added node draining capability for graceful node maintenance and workload migration.
    • Introduced node liveness tracking with states: Alive, Draining, Stopping, and Unknown.
    • Enhanced scheduler to intelligently avoid placing work on draining or stopping nodes.
    • Added drain endpoint to initiate node draining operations.
  • Tests

    • Comprehensive test coverage for drain operations and node liveness management.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-linked-issue, do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress), and release-note (denotes a PR that will be considered when it comes time to generate release notes) labels on Feb 3, 2026
@ti-chi-bot

ti-chi-bot bot commented Feb 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign charlescheung96 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Feb 3, 2026
@coderabbitai

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a comprehensive node draining feature for TiCDC, enabling graceful node evacuation. It adds a DrainController to orchestrate draining state transitions, a three-state node liveness model (Alive → Draining → Stopping), and a DrainScheduler to migrate maintainers away from draining nodes. The system integrates new protobuf messages for heartbeats and liveness control, updates the drain API endpoint to use the coordinator, and refactors schedulers to use a DestNodeSelector abstraction for filtering schedulable destinations.
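To make the DestNodeSelector abstraction mentioned above concrete, here is a minimal Go sketch of liveness-based destination filtering. The method name GetSchedulableDestNodeIDs and the Alive/Draining/Stopping/Unknown states come from this PR's summary; the concrete types and the map-based storage are illustrative assumptions, not the actual implementation.

// Sketch of liveness-based destination filtering (assumed types).
package sketch

type NodeID string

type NodeLiveness int

const (
	LivenessAlive NodeLiveness = iota
	LivenessDraining
	LivenessStopping
	LivenessUnknown
)

// destNodeSelector keeps a view of per-node liveness and exposes only the
// nodes that are still safe scheduling destinations.
type destNodeSelector struct {
	liveness map[NodeID]NodeLiveness
}

// GetSchedulableDestNodeIDs returns the IDs of nodes that are Alive,
// skipping draining, stopping, and unknown nodes.
func (s *destNodeSelector) GetSchedulableDestNodeIDs() []NodeID {
	ids := make([]NodeID, 0, len(s.liveness))
	for id, l := range s.liveness {
		if l == LivenessAlive {
			ids = append(ids, id)
		}
	}
	return ids
}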

Changes

Cohort / File(s) | Summary
API Drain Endpoint
api/v1/api.go, api/v1/api_test.go
Replaced the atomic counter-based drain implementation with a coordinator-driven approach. Added error handling for unsupported coordinators and new test coverage for the drain response (202 Accepted with table count; 500 for an unsupported coordinator).
Drain Controller
coordinator/drain_controller.go, coordinator/drain_controller_test.go
New DrainController orchestrating node draining via SetNodeLiveness requests. Tracks per-node drain state, enforces retry intervals, waits for DRAINING observation before promoting to STOPPING. Includes comprehensive unit tests validating drain initiation, state transitions, remaining work computation, and unknown node handling.
Node Liveness Tracking
coordinator/node_liveness.go, coordinator/node_liveness_test.go
New NodeLivenessView providing thread-safe per-node liveness tracking with TTL-based timeout. Derives liveness state (Alive, Draining, Stopping, Unknown) from heartbeats and SetLiveness responses. Includes node epoch tracking and schedulability checks with tests covering timeouts, draining states, and filtering.
Destination Node Selector
coordinator/dest_node_selector.go, coordinator/scheduler/dest_node_selector.go
New DestNodeSelector interface and internal implementation filtering schedulable nodes by liveness. Provides GetSchedulableDestNodes and GetSchedulableDestNodeIDs excluding draining/stopping/unknown nodes.
Drain Scheduler
coordinator/scheduler/drain.go, coordinator/scheduler/drain_test.go
New DrainScheduler moving maintainers away from draining nodes via MoveMaintainer operators. Uses round-robin scheduling with load balancing and respects batch capacity constraints. Tests validate operator creation and handling of in-flight operators.
Scheduler Refactoring
coordinator/scheduler/basic.go, coordinator/scheduler/balance.go
Replaced NodeManager dependency with DestNodeSelector abstraction in both BasicScheduler and BalanceScheduler. Updated constructor signatures and scheduling logic to use GetSchedulableDestNodeIDs/GetSchedulableDestNodes instead of GetAliveNodeIDs.
Coordinator Integration
coordinator/controller.go, coordinator/coordinator.go, coordinator/operator/operator_controller.go
Integrated DrainController and NodeLivenessView into Controller. Added DrainNode public API. Wired DrainScheduler into scheduler set. Added node liveness message handling (TypeNodeHeartbeatRequest, TypeSetNodeLivenessResponse). Added CountOperatorsInvolvingNode helper for drain eligibility checks.
Liveness State API
pkg/api/util.go, pkg/api/util_test.go
Reworked the Liveness enum into a three-state monotonic lifecycle: Alive (0) → Draining (1) → Stopping (2). Enforces stepwise upgrades without downgrades and idempotent transitions. Tests validate monotonic progression and rejection of state skipping (a sketch of this transition rule follows the table below).
Message Types
pkg/messaging/message.go, pkg/scheduler/scheduler.go
Added three new IOType constants (TypeNodeHeartbeatRequest=42, TypeSetNodeLivenessRequest=43, TypeSetNodeLivenessResponse=44) with String() support and decoding logic. Added DrainScheduler constant.
Proto Messages
heartbeatpb/heartbeat.proto
Added NodeLiveness enum (ALIVE, DRAINING, STOPPING) and messages: NodeHeartbeat, SetNodeLivenessRequest, SetNodeLivenessResponse for liveness control flow. Extended AddMaintainerRequest with checkpoint_ts, is_new_changefeed, keyspace_id fields.
Maintainer Liveness Integration
maintainer/maintainer_manager.go, maintainer/maintainer_manager_test.go, maintainer/node_agent_test.go
Updated NewMaintainerManager to accept Liveness parameter. Added SetNodeLiveness request handling with API↔PB conversion, immediate heartbeat on successful transition, and stopping-state validation for additions. Integrated node heartbeat mechanism. Tests validate epoch mismatch rejection and transition-triggered heartbeats.
Server Initialization & Election
server/server.go, server/module_election.go
Updated MaintainerManager construction to pass liveness reference. Enhanced coordinator election with non-alive gate checks, background liveness monitoring with automatic resignation on stopping transition, and improved context management for resign operations.
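As a rough illustration of the "Liveness State API" row above, the following sketch shows one way the monotonic Alive → Draining → Stopping transition could be enforced with compare-and-swap. The type name, numeric encoding, and boolean return convention are assumptions for illustration, not the PR's actual pkg/api/util.go code.

// Sketch: a monotonic three-state liveness value (assumed encoding 0/1/2).
package sketch

import "sync/atomic"

type Liveness int32

const (
	LivenessCaptureAlive    Liveness = 0
	LivenessCaptureDraining Liveness = 1
	LivenessCaptureStopping Liveness = 2
)

// Store upgrades the state by exactly one step. Re-storing the current value
// is an idempotent no-op; downgrades and skipped states are rejected.
func (l *Liveness) Store(v Liveness) bool {
	for {
		old := Liveness(atomic.LoadInt32((*int32)(l)))
		if v == old {
			return true // idempotent
		}
		if v != old+1 {
			return false // no downgrades, no skipping Draining
		}
		if atomic.CompareAndSwapInt32((*int32)(l), int32(old), int32(v)) {
			return true
		}
	}
}

// Load returns the current liveness state.
func (l *Liveness) Load() Liveness {
	return Liveness(atomic.LoadInt32((*int32)(l)))
}

The CAS loop is what makes the monotonic guarantee hold under concurrent callers: a stale writer simply loses the race and re-reads, so the state can never move backwards.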

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

lgtm, approved, size/XL, release-note-none

Suggested reviewers

  • wk989898
  • asddongmen
  • tenfyzhong

Poem

🐰 Behold the draining node, so graceful and fine,
From Alive to Draining, then Stopping—divine!
The maintainers hop away, light as can be,
While schedulers dance to the liveness decree.

🚥 Pre-merge checks | ❌ 3 failed (2 warnings, 1 inconclusive)

  • Description check (⚠️ Warning): The PR description is entirely template boilerplate with all sections unfilled. It contains only placeholder text like 'close #xxx' and checkbox options, with no actual implementation details, problem statement, or changes explanation. Resolution: complete all required sections, specify the issue number, explain the problem being solved, describe what changed and how it works, confirm which tests were added, address performance/compatibility concerns, and provide a release note.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 11.11%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The PR title 'wip drain capture' is vague and generic. 'wip' suggests work-in-progress, and 'drain capture' lacks specificity about what aspect is being implemented or changed. Resolution: replace it with a more descriptive title that clearly indicates the main feature being added, such as 'Add node drain controller and scheduler for graceful node draining' or 'Implement drain capture API with node liveness state machine'.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.





@gemini-code-assist

Summary of Changes

Hello @hongyunyan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust node draining mechanism, allowing for the graceful removal or shutdown of nodes by systematically migrating their assigned workloads. It establishes a clear, monotonic lifecycle for nodes (Alive, Draining, Stopping) and integrates this state management across the system's core components. The changes enable operators to initiate drains via an API, while the coordinator intelligently orchestrates task migration and ensures that only healthy nodes participate in scheduling and leadership elections, significantly enhancing operational flexibility and system stability.

Highlights

  • Node Liveness States: Introduced a new NodeLiveness enum with ALIVE, DRAINING, and STOPPING states, along with a monotonic state machine in pkg/api/util.go to manage node lifecycle transitions.
  • Drain Capture API: Implemented the /api/v1/captures/drain API endpoint to initiate the draining process for a specified capture; it now interacts with the coordinator to perform the actual draining logic and returns the count of remaining tasks (see the sketch after this list).
  • Drain Controller and Scheduler: Added a DrainController to the coordinator responsible for orchestrating the draining process by sending liveness requests to nodes. A new DrainScheduler was also introduced to actively migrate changefeeds from draining nodes to other available, schedulable nodes.
  • Liveness-Aware Scheduling: Modified existing BasicScheduler and BalanceScheduler to utilize a new DestNodeSelector interface, ensuring that new tasks are only assigned to nodes currently in the ALIVE state, preventing assignment to draining or stopping nodes.
  • Active Coordinator Resignation: Enhanced the election modules for both the coordinator and log coordinator to check node liveness before campaigning and to actively resign their leadership if the node transitions to a non-ALIVE state, improving system resilience during shutdowns.
  • Node Agent Liveness Handling: The maintainer manager (node agent) now processes SetNodeLivenessRequest messages from the coordinator, updates its local liveness state, and sends periodic node heartbeats reflecting its current liveness status.
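The drain endpoint behaviour described in the "Drain Capture API" highlight can be pictured with the following sketch: respond 202 Accepted with the remaining task count when the coordinator supports draining, and 500 otherwise. The drainer interface, handler wiring, query parameter, and JSON field name are assumptions for illustration; the real api/v1 handler and coordinator interface in this PR differ in detail.

// Sketch of the drain endpoint flow (all names assumed).
package sketch

import (
	"fmt"
	"net/http"
)

type NodeID string

// drainer is the capability the handler needs from the coordinator.
type drainer interface {
	// DrainNode starts draining the node and returns how many tasks remain on it.
	DrainNode(id NodeID) int
}

func drainCaptureHandler(co any) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		d, ok := co.(drainer)
		if !ok {
			// The current coordinator implementation cannot drain nodes.
			http.Error(w, "drain is not supported by this coordinator", http.StatusInternalServerError)
			return
		}
		remaining := d.DrainNode(NodeID(r.URL.Query().Get("capture_id")))
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusAccepted)
		fmt.Fprintf(w, `{"current_table_count": %d}`, remaining)
	}
}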


Changelog
  • api/v1/api.go
    • Removed dummy implementation for drainCapture API.
    • Implemented actual drainCapture logic to interact with the coordinator.
    • Removed sync/atomic import and drainCaptureCounter.
    • Added pkg/node import.
  • api/v1/api_test.go
    • Added unit tests for the drainCapture API, covering successful drain and unsupported coordinator scenarios.
  • coordinator/controller.go
    • Added drainController and nodeLivenessView fields.
    • Initialized DrainController and NodeLivenessView in NewController.
    • Modified NewController to pass destSelector to BasicScheduler and BalanceScheduler.
    • Added DrainScheduler to the list of schedulers.
    • Added drainController.Tick() to onPeriodTask.
    • Added message handlers for TypeNodeHeartbeatRequest and TypeSetNodeLivenessResponse to update liveness views.
    • Implemented DrainNode method to delegate to drainController.
  • coordinator/coordinator.go
    • Added DrainNode method to the coordinator interface and implementation.
  • coordinator/dest_node_selector.go
    • New file: Implemented DestNodeSelector to filter schedulable nodes based on liveness.
  • coordinator/drain_controller.go
    • New file: Implemented DrainController to manage node draining, send liveness requests, and track node state.
  • coordinator/drain_controller_test.go
    • New file: Added unit tests for DrainController.
  • coordinator/node_liveness.go
    • New file: Implemented NodeLivenessView to track and derive node liveness states.
  • coordinator/node_liveness_test.go
    • New file: Added unit tests for NodeLivenessView.
  • coordinator/operator/operator_controller.go
    • Added CountOperatorsInvolvingNode method.
  • coordinator/scheduler/balance.go
    • Updated balanceScheduler to use DestNodeSelector for node selection.
    • Removed direct dependency on watcher.NodeManager.
  • coordinator/scheduler/basic.go
    • Updated basicScheduler to use DestNodeSelector for node selection.
    • Removed direct dependency on watcher.NodeManager.
  • coordinator/scheduler/dest_node_selector.go
    • New file: Defined DestNodeSelector interface.
  • coordinator/scheduler/drain.go
    • New file: Implemented drainScheduler to move changefeeds from draining nodes.
  • coordinator/scheduler/drain_test.go
    • New file: Added unit tests for drainScheduler.
  • heartbeatpb/heartbeat.pb.go
    • Generated code updated for new protobuf messages and enum.
  • heartbeatpb/heartbeat.proto
    • Added NodeLiveness enum (ALIVE, DRAINING, STOPPING).
    • Added NodeHeartbeat, SetNodeLivenessRequest, SetNodeLivenessResponse messages.
  • maintainer/maintainer_manager.go
    • Added liveness and nodeEpoch fields.
    • Modified NewMaintainerManager to accept api.Liveness.
    • Implemented sendNodeHeartbeat and onSetNodeLivenessRequest.
    • Added logic to send immediate heartbeat on coordinator bootstrap.
    • Prevented adding maintainers to stopping nodes.
  • maintainer/maintainer_manager_test.go
    • Updated test setup to pass api.Liveness to NewMaintainerManager.
  • maintainer/node_agent_test.go
    • New file: Added tests for node agent's liveness handling.
  • pkg/api/util.go
    • Added LivenessCaptureDraining constant.
    • Modified Liveness.Store to enforce monotonic state transitions.
    • Updated Liveness.String() method.
  • pkg/api/util_test.go
    • New file: Added unit tests for Liveness.Store behavior.
  • pkg/messaging/message.go
    • Added TypeNodeHeartbeatRequest, TypeSetNodeLivenessRequest, TypeSetNodeLivenessResponse constants.
    • Updated message encoding/decoding logic for new types.
  • pkg/scheduler/scheduler.go
    • Added DrainScheduler constant.
  • server/module_election.go
    • Modified coordinator and log coordinator election logic to respect node liveness (only campaign if ALIVE, resign if STOPPING).
    • Added an active resignation mechanism for coordinators if liveness changes to STOPPING (sketched after this changelog).
  • server/server.go
    • Passed server's liveness to MaintainerManager during initialization.
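For the election changes listed under server/module_election.go, here is a rough sketch of the active-resignation loop: poll the node's liveness while holding coordinator leadership and step down once it reaches Stopping. The interfaces, polling interval, and numeric liveness encoding are assumptions, not the PR's actual code.

// Sketch: resign coordinator leadership once the local node starts stopping.
package sketch

import (
	"context"
	"time"
)

// livenessReader exposes the node's current liveness
// (assumed encoding: 0=Alive, 1=Draining, 2=Stopping).
type livenessReader interface {
	Load() int32
}

type elector interface {
	Resign(ctx context.Context) error
}

// watchLivenessAndResign runs while this node holds coordinator leadership.
// When liveness reaches Stopping it resigns with a bounded timeout and cancels
// the local coordinator context so leadership can move to a healthy node.
func watchLivenessAndResign(ctx context.Context, lv livenessReader, e elector, cancelCoordinator context.CancelFunc) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if lv.Load() >= 2 { // Stopping
				resignCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
				_ = e.Resign(resignCtx) // best effort; error handling elided in this sketch
				cancel()
				cancelCoordinator()
				return
			}
		}
	}
}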
Activity
  • The pull request is currently marked as 'work in progress' (wip).
  • The description indicates a placeholder for an issue number, suggesting this is an early draft or a feature under active development.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements the drain capture feature, which is a significant piece of work involving a new liveness state machine for capture nodes, a DrainController to orchestrate the draining process, and a DrainScheduler to move workloads away from draining nodes. The changes are well-structured and include comprehensive tests for the new components. I've identified one bug in error handling within the election module and one suggestion for improving robustness in the maintainer manager. Overall, this is a solid implementation of a complex feature.

zap.String("nodeID", nodeID),
zap.Error(resignErr))
cancel()
return errors.Trace(err)


high

There seems to be a bug in the error handling here. If e.resign(resignCtx) fails, resignErr will be non-nil, but the function returns errors.Trace(err). At this point, err (from the successful Campaign call) is nil, so the function will return nil, effectively masking the resignation error. It should return errors.Trace(resignErr) to properly propagate the failure, similar to the logic in campaignLogCoordinator.

Suggested change
-	return errors.Trace(err)
+	return errors.Trace(resignErr)

func (m *Manager) onAddMaintainerRequest(req *heartbeatpb.AddMaintainerRequest) *heartbeatpb.MaintainerStatus {
changefeedID := common.NewChangefeedIDFromPB(req.Id)

if m.liveness != nil && m.liveness.Load() == api.LivenessCaptureStopping {


medium

This check correctly refuses to add a maintainer if the node is Stopping. For improved robustness, it would be better to also refuse requests when the node is Draining. A draining node's purpose is to shed its workload, so it shouldn't accept new changefeeds. While the schedulers are designed to prevent this, this check acts as a final safeguard against race conditions where a move is scheduled just before the node's state changes to draining.

Suggested change
-	if m.liveness != nil && m.liveness.Load() == api.LivenessCaptureStopping {
+	if m.liveness != nil && m.liveness.Load() >= api.LivenessCaptureDraining {

@ti-chi-bot

ti-chi-bot bot commented Feb 3, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@coordinator/drain_controller.go`:
- Around line 16-27: The SendCommand call returns a third-party error that must
be wrapped with errors.Trace(err) before logging to preserve stack context;
locate the SendCommand invocation (symbol: SendCommand) and change the error
handling to wrap the returned err with errors.Trace(err) when calling
processLogger/zap logger (e.g., logger.Error or log.Error), and add the import
for "github.com/pingcap/errors" if not present so errors.Trace can be used.

In `@server/module_election.go`:
- Around line 122-131: The resign error path returns the wrong variable; when
e.resign(resignCtx) fails inside the block (resignErr), the function currently
returns errors.Trace(err) where err is nil—change the return to
errors.Trace(resignErr). Update the code around the resign call (the
resignCtx/resignErr handling inside the coordinator resignation logic that logs
with zap.String("nodeID", nodeID) and zap.Error(resignErr)) to cancel the
context and return errors.Trace(resignErr) instead of errors.Trace(err).
🧹 Nitpick comments (3)
api/v1/api_test.go (1)

35-83: Consider reducing duplication between fake coordinator types.

fakeCoordinator and fakeCoordinatorWithoutDrain share nearly identical implementations. You could embed a common base or use a single struct with an optional DrainNode function field.

♻️ Suggested refactor to reduce duplication
-type fakeCoordinator struct {
-	remaining int
-}
-
-func (c *fakeCoordinator) Stop()                     {}
-func (c *fakeCoordinator) Run(context.Context) error { return nil }
-// ... all other methods ...
-func (c *fakeCoordinator) DrainNode(node.ID) int { return c.remaining }
-
-type fakeCoordinatorWithoutDrain struct{}
-
-func (c *fakeCoordinatorWithoutDrain) Stop()                     {}
-// ... duplicated methods ...

+type fakeCoordinatorBase struct{}
+
+func (c *fakeCoordinatorBase) Stop()                     {}
+func (c *fakeCoordinatorBase) Run(context.Context) error { return nil }
+// ... common methods ...
+
+type fakeCoordinator struct {
+	fakeCoordinatorBase
+	remaining int
+}
+
+func (c *fakeCoordinator) DrainNode(node.ID) int { return c.remaining }
+
+type fakeCoordinatorWithoutDrain struct {
+	fakeCoordinatorBase
+}
server/module_election.go (1)

287-316: Redundant cancel() call after coordinator exits.

cancel() is called on Line 308 inside the goroutine when resigning, and again on Line 316 after co.Run(coCtx) returns. While calling cancel() multiple times is safe, consider restructuring to make the intent clearer. The goroutine already cancels the context on resign, so the outer cancel() is only needed for non-resign exits.

maintainer/maintainer_manager.go (1)

467-488: Minor: Redundant interval check when called with force=true.

When sendNodeHeartbeat(force=true) calls sendNodeHeartbeatTo(targetID, true), the interval check at lines 468-470 is redundant since force=true will skip it anyway. This is harmless but slightly confusing.

♻️ Optional: Simplify by removing duplicate check
 func (m *Manager) sendNodeHeartbeatTo(targetID node.ID, force bool) {
-	if !force && time.Since(m.lastNodeHeartbeatTime) < nodeHeartbeatInterval {
-		return
-	}
 	m.lastNodeHeartbeatTime = time.Now()

Move the interval check to only sendNodeHeartbeat, since sendNodeHeartbeatTo is also called directly from onSetNodeLivenessRequest where the force parameter controls the behavior.

Comment on lines +16 to +27
import (
"sync"
"time"

"github.com/pingcap/log"
"github.com/pingcap/ticdc/coordinator/changefeed"
"github.com/pingcap/ticdc/coordinator/operator"
"github.com/pingcap/ticdc/heartbeatpb"
"github.com/pingcap/ticdc/pkg/messaging"
"github.com/pingcap/ticdc/pkg/node"
"go.uber.org/zap"
)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's explore the structure and read the drain_controller.go file
cd /tmp && git ls-files pingcap/ticdc coordinator/drain_controller.go 2>/dev/null || echo "Not a git repo, searching directly"

# Find the file
fd drain_controller.go

Repository: pingcap/ticdc

Length of output: 91


🏁 Script executed:

# Read the complete drain_controller.go file
cat -n coordinator/drain_controller.go

Repository: pingcap/ticdc

Length of output: 7552


🏁 Script executed:

# Also check if errors package is already imported
rg "github.com/pingcap/errors" coordinator/drain_controller.go

Repository: pingcap/ticdc

Length of output: 39


🏁 Script executed:

# Search for SendCommand calls in the file to understand the context
rg "SendCommand" coordinator/drain_controller.go -B 3 -A 3

Repository: pingcap/ticdc

Length of output: 344


Wrap SendCommand error with stack trace metadata.

The SendCommand call at line 194 returns a library error that should be wrapped with errors.Trace(err) before logging to preserve stack context, per the coding guidelines for third-party error handling.

🔧 Suggested fix
 import (
 	"sync"
 	"time"

+	"github.com/pingcap/errors"
 	"github.com/pingcap/log"
 	"github.com/pingcap/ticdc/coordinator/changefeed"
 	"github.com/pingcap/ticdc/coordinator/operator"
 	"github.com/pingcap/ticdc/heartbeatpb"
 	"github.com/pingcap/ticdc/pkg/messaging"
 	"github.com/pingcap/ticdc/pkg/node"
 	"go.uber.org/zap"
 )
@@
 	msg := messaging.NewSingleTargetMessage(target, messaging.MaintainerManagerTopic, req)
 	if err := d.mc.SendCommand(msg); err != nil {
+		err = errors.Trace(err)
 		log.Warn("send set node liveness request failed",
 			zap.Stringer("target", target),
 			zap.Int32("liveness", int32(liveness)),
 			zap.Error(err))
 	}

Comment on lines +122 to 131
resignCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
if resignErr := e.resign(resignCtx); resignErr != nil {
log.Warn("resign coordinator actively failed",
zap.String("nodeID", nodeID),
zap.Error(resignErr))
cancel()
return errors.Trace(err)
}
cancel()
return nil

⚠️ Potential issue | 🔴 Critical

Bug: Incorrect error variable returned on resign failure.

When e.resign(resignCtx) fails, the code returns errors.Trace(err) but err is nil at this point (the campaign succeeded). This should return errors.Trace(resignErr) instead.

🐛 Proposed fix
 			if resignErr := e.resign(resignCtx); resignErr != nil {
 				log.Warn("resign coordinator actively failed",
 					zap.String("nodeID", nodeID),
 					zap.Error(resignErr))
 				cancel()
-				return errors.Trace(err)
+				return errors.Trace(resignErr)
 			}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 	resignCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
 	if resignErr := e.resign(resignCtx); resignErr != nil {
 		log.Warn("resign coordinator actively failed",
 			zap.String("nodeID", nodeID),
 			zap.Error(resignErr))
 		cancel()
-		return errors.Trace(err)
+		return errors.Trace(resignErr)
 	}
 	cancel()
 	return nil

