
Conversation

@ipochi (Contributor) commented Oct 27, 2025

Currently, the server only logs when it receives a drain request from an agent. Ideally, in that scenario, the server should mark the agent as draining and not select it for new requests. This PR implements that.

Signed-off-by: Imran Pochi <imranpochi@microsoft.com>
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ipochi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes, approved, and size/M labels on Oct 27, 2025
@jkh52 (Contributor) commented Oct 30, 2025

I'm happy to see interest in completing this feature! When I was looking into this, here are the questions I ran into:

  1. Whether the server should keep agent draining state by raw connection or by agent identifier. The current BackendStorage makes the latter much more complicated, so I agree with the former (as this PR does). But note that, currently, the agent sync loop will continue to issue new syncs, and a server could end up with conflicting drain state for a given agent. So without an accompanying agent change (to stop syncing new connections after entering the draining state), it is possible for the server to continue picking these agents.

That potential agent change is probably an improvement but needs careful risk analysis.

  2. Fallback logic: if no non-draining agent is found, continue using one that is draining. I think the argument for this is strong, because it is "no worse" than the current server behavior, and may avoid degrading some clusters (those that are limping along with very few agents, e.g. 1).

@ipochi (Contributor, Author) commented Oct 30, 2025

> I'm happy to see interest in completing this feature! When I was looking into this, here are the questions I ran into:
>
>   1. Whether the server should keep agent draining state by raw connection or by agent identifier. The current BackendStorage makes the latter much more complicated, so I agree with the former (as this PR does). But note that, currently, the agent sync loop will continue to issue new syncs, and a server could end up with conflicting drain state for a given agent. So without an accompanying agent change (to stop syncing new connections after entering the draining state), it is possible for the server to continue picking these agents.
>
> That potential agent change is probably an improvement but needs careful risk analysis.
>
>   2. Fallback logic: if no non-draining agent is found, continue using one that is draining. I think the argument for this is strong, because it is "no worse" than the current server behavior, and may avoid degrading some clusters (those that are limping along with very few agents, e.g. 1).

Thank you for the feedback @jkh52

Your point on 1 hadn't crossed my mind. It seems like the logical thing to do, to stop the agent sync loop if the agent is draining. I'll give this more thought: whether the server should do more so that it doesn't get into a conflicted state, or whether the agent should stop the sync loops.

On point 2, it's a valid point to continue using a draining agent if there are no non-draining agents; it's akin to falling back on the current behaviour.

This commit adds a fallback for the case where all the agents in the system
are draining. Rather than drop the request with an error, we fall back to
the existing behavior, i.e. continue to send the request to the agent
even if it is draining.

As for the agent-side issue: if the agent has sent the DRAIN signal to
the server, ideally it should stop doing syncOnce with the server.
Otherwise the server is misled into thinking the agent is ready again.

Signed-off-by: Imran Pochi <imranpochi@microsoft.com>
@ipochi (Contributor, Author) commented Nov 11, 2025

@jkh52 @cheftako

Current PR limitations:

  1. no agent-side implementation
  2. no fallback from the server when all agents are draining.

Expanding on 1 (agent-side implementation):

The simplest option is to stop the agent from doing connectOnce if it is draining. I don't want to complicate the logic by making the agents do anything fancy or complex, like sending DIAL_CLS or something similar to stop accepting new connections, because the agent doesn't have a whole view of the system (maybe it's a one-agent setup, or all agents are draining), in which case server retries combined with the agent doing something fancy could lead to a deadlock (back and forth). A sketch of that check follows.
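
A minimal sketch of that agent-side check, assuming a `drainCh` channel on ClientSet that is closed once draining starts; the struct here is a pared-down stand-in for the real ClientSet in pkg/agent, and everything except `drainCh` and `connectOnce` is illustrative:

```go
package agent

import "fmt"

// Pared-down stand-in for the real ClientSet; only the field relevant to the
// sketch is shown. drainCh is assumed to be closed when draining begins.
type ClientSet struct {
	drainCh <-chan struct{}
}

// connectOnce sketch: bail out before dialing a new server connection if the
// agent is already draining.
func (cs *ClientSet) connectOnce() error {
	select {
	case <-cs.drainCh:
		return fmt.Errorf("agent is draining; skipping new server connection")
	default:
	}
	// ... existing logic that dials the proxy server and registers the
	// connection would run here ...
	return nil
}
```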

Expanding on the fallback (when all agents, or the only agent, are in a draining state): I think the simplest option here is to pick the first draining agent found and send the request to that agent, even though it means the agent is draining. A minimal sketch of that selection logic is below.
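
A minimal sketch of that selection logic, under the assumption of a simplified Backend type; the real types live in pkg/server/backend_manager.go and differ in detail:

```go
package server

import "math/rand"

// Simplified Backend for illustration only.
type Backend struct {
	id       string
	draining bool
}

func (b *Backend) IsDraining() bool { return b.draining }

// pickBackend prefers a non-draining backend, starting from a random index so
// load is still spread, and falls back to the first draining backend seen so
// requests keep flowing when every agent is draining.
func pickBackend(backends []*Backend, rng *rand.Rand) *Backend {
	if len(backends) == 0 {
		return nil
	}
	var firstDraining *Backend
	start := rng.Intn(len(backends))
	for i := 0; i < len(backends); i++ {
		b := backends[(start+i)%len(backends)]
		if !b.IsDraining() {
			return b
		}
		if firstDraining == nil {
			firstDraining = b
		}
	}
	return firstDraining
}
```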

These limitations are now addressed in the second commit of the PR. Please review and let me know your thoughts.

@cheftako (Contributor) left a comment


The changes look pretty good and follow the design goals, with the fallback mechanism in backend selection serving as a good safeguard. However, the complexity of GetRandomBackend has increased from O(1) to O(N), which is worth noting for performance at scale. Please verify the fallback behavior in DestHostBackendManager to ensure it doesn't create hotspots. Most importantly, please add unit tests for GetRandomBackend (and DestHostBackendManager.Backend) to verify that it prefers non-draining backends and correctly falls back to a draining backend if that is the only option.


var firstDrainingBackend *Backend

// Start at a random agent and check each agent in sequence

The complexity here changes from O(1) to O(N). While likely acceptable given the typical number of agents, please ensure this doesn't introduce latency spikes in large clusters.

}

// Keep track of first draining backend as fallback
if firstDrainingBackend == nil {

The fallback logic is critical here. It would be great to see a unit test in pkg/server/backend_manager_test.go ensuring that (see the sketch after this list):

  1. Non-draining backends are prioritized.
  2. Draining backends are returned if no others are available (fallback).
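
For example, a test along these lines could cover both cases; newTestStorage and addTestBackend are hypothetical helpers used only to keep the sketch short, and the real test would reuse the fixtures already in pkg/server/backend_manager_test.go:

```go
package server

import "testing"

func TestGetRandomBackendDraining(t *testing.T) {
	// Case 1: a non-draining backend is preferred over a draining one.
	s := newTestStorage(t)                     // hypothetical helper
	healthy := addTestBackend(t, s, "agent-a") // hypothetical helper
	draining := addTestBackend(t, s, "agent-b")
	draining.SetDraining()
	for i := 0; i < 50; i++ {
		got, err := s.GetRandomBackend()
		if err != nil {
			t.Fatalf("unexpected error: %v", err)
		}
		if got != healthy {
			t.Fatalf("selected a draining backend while a non-draining one was available")
		}
	}

	// Case 2: with only draining backends, the draining one is still returned.
	s2 := newTestStorage(t)
	only := addTestBackend(t, s2, "agent-c")
	only.SetDraining()
	got, err := s2.GetRandomBackend()
	if err != nil || got != only {
		t.Fatalf("expected fallback to the draining backend, got %v (err=%v)", got, err)
	}
}
```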


// Find a non-draining backend for this destination host
for _, backend := range bes {
if !backend.IsDraining() {

As with DefaultBackendStorage, adding a test case for this logic in DestHostBackendManager would be beneficial to ensure the preference and fallback behavior work correctly.


// draining indicates if this backend is draining and should not accept new connections
draining bool
// mu protects draining field

Consider using sync/atomic (e.g., atomic.Bool or int32) for the draining field. This would allow IsDraining() to be lock-free, reducing the overhead inside the O(N) loop in GetRandomBackend, which runs under the global s.mu lock.
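
A sketch of that suggestion; the draining field and the two methods come from the diff, everything else (including omitted fields) is illustrative:

```go
package server

import "sync/atomic"

type Backend struct {
	// draining indicates if this backend is draining and should not accept
	// new connections; stored atomically so reads need no lock.
	draining atomic.Bool
	// ... other fields of the real Backend omitted ...
}

// SetDraining marks the backend as draining.
func (b *Backend) SetDraining() { b.draining.Store(true) }

// IsDraining reports whether the backend is draining, lock-free.
func (b *Backend) IsDraining() bool { return b.draining.Load() }
```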

var firstDrainingBackend *Backend

// Start at a random agent and check each agent in sequence
startIdx := s.random.Intn(len(s.agentIDs))

This loop changes the complexity of GetRandomBackend from O(1) to O(N) while holding the DefaultBackendStorage exclusive lock (s.mu). In clusters with thousands of agents, this extends the critical section and could impact throughput. If performance becomes an issue, consider maintaining a separate list/set of non-draining agents.
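
An illustrative shape for that alternative, with hypothetical names (this is not what the PR does): selection stays O(1) because draining backends are removed from a separate index as soon as the DRAIN signal arrives.

```go
package server

import (
	"math/rand"
	"sync"
)

type Backend struct{ id string }

// drainAwareStorage is a hypothetical variant that keeps a separate index of
// non-draining agent IDs so random selection does not scan every backend.
type drainAwareStorage struct {
	mu          sync.Mutex
	backends    map[string]*Backend
	nonDraining []string // agent IDs still accepting new connections
	rng         *rand.Rand
}

// markDraining removes the agent from the non-draining index.
func (s *drainAwareStorage) markDraining(agentID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i, id := range s.nonDraining {
		if id == agentID {
			s.nonDraining = append(s.nonDraining[:i], s.nonDraining[i+1:]...)
			return
		}
	}
}

// pick returns a random non-draining backend in O(1), falling back to any
// backend when all of them are draining.
func (s *drainAwareStorage) pick() *Backend {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.nonDraining) > 0 {
		return s.backends[s.nonDraining[s.rng.Intn(len(s.nonDraining))]]
	}
	for _, b := range s.backends {
		return b // all draining: return an arbitrary one
	}
	return nil
}
```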

}

func (cs *ClientSet) connectOnce() error {
// Skip establishing new connections if draining

Ensure that cs.drainCh is properly initialized in NewClientSet (or wherever ClientSet is created). If it is nil, this case will block (if it were a lone case) or be skipped (in a select with default), effectively disabling the check.

I believe it's set up by SetupSignalHandler, but given the distance between the two it would be good to be sure.
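
A small, standalone example of the failure mode being described: receiving from a nil channel never proceeds, so a select with a default branch silently skips the drain check.

```go
package main

import "fmt"

func main() {
	var drainCh chan struct{} // nil because it was never initialized

	select {
	case <-drainCh:
		fmt.Println("draining") // never reached while drainCh is nil
	default:
		fmt.Println("not draining") // always taken, so the check is disabled
	}
}
```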


// All agents are draining, use one as fallback
if firstDrainingBackend != nil {
agentID := firstDrainingBackend.id

Inconsistency in log levels: GetRandomBackend logs the fallback event at V(2), whereas DestHostBackendManager logs a similar fallback event at V(4).

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md suggests that V2 is for significant changes in state. If we logged at the point where we detect that we have run out of non-draining backends, I would agree with V2. As it is, I would go more for V3.
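
For instance, the per-request fallback could be logged consistently at V(3) in both managers; the helper name, message text, and key names here are illustrative rather than taken from the PR:

```go
package server

import "k8s.io/klog/v2"

// logDrainingFallback is illustrative only: the routine, per-request fallback
// goes to V(3), leaving V(2) for a genuine state change such as the moment
// the last non-draining backend disappears.
func logDrainingFallback(agentID string) {
	klog.V(3).InfoS("No non-draining backend available; falling back to a draining backend", "agentID", agentID)
}
```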

}

// SetDraining marks the backend as draining
func (b *Backend) SetDraining() {

Verify lock ordering: GetRandomBackend holds s.mu (Lock) and calls IsDraining (which takes b.mu RLock). SetDraining takes b.mu Lock. Ensure no code path attempts to acquire s.mu while holding b.mu to avoid potential deadlocks.
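
A compressed view of the ordering in question, with both types pared down to the relevant fields; the invariant is that b.mu is only ever taken while s.mu is already held (or with no storage lock at all), never the reverse:

```go
package server

import "sync"

type Backend struct {
	mu       sync.RWMutex
	draining bool
}

func (b *Backend) IsDraining() bool {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return b.draining
}

type DefaultBackendStorage struct {
	mu       sync.Mutex
	backends []*Backend
}

// Safe direction: s.mu is acquired first, then each b.mu inside IsDraining.
// A deadlock would require some other path to hold b.mu and then try to take
// s.mu, which is what the review asks to rule out.
func (s *DefaultBackendStorage) anyDraining() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, b := range s.backends {
		if b.IsDraining() {
			return true
		}
	}
	return false
}
```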
