
Conversation

@shubham-pampattiwar
Member

Summary

  • Adds new readiness conditions to DPA status: VeleroReady, NodeAgentReady, NonAdminReady, VMFileRestoreReady
  • Tracks actual pod/deployment readiness, not just resource creation
  • DPA requeues every 30 seconds until all enabled components are ready
  • Disabled components show True with reason ComponentDisabled

Problem

Previously, the DPA could show Reconciled=True while the underlying pods were not running due to deployment issues (e.g., CrashLoopBackOff, image pull errors), which made failures hard to diagnose.

Solution

Added separate readiness conditions that check:

  • Deployments: Status.ReadyReplicas == Spec.Replicas
  • DaemonSets: Status.NumberReady == Status.DesiredNumberScheduled
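The two checks above can be sketched as plain predicates. This is a minimal standalone sketch, not the PR's code: the real functions read these counts from `appsv1.Deployment`/`appsv1.DaemonSet` status objects, and the function names here are illustrative.

```go
package main

import "fmt"

// deploymentReady: a Deployment counts as ready when every desired replica
// is ready and at least one replica exists (mirrors the
// Status.ReadyReplicas == Spec.Replicas check described above).
func deploymentReady(desired, ready int32) bool {
	return desired > 0 && ready == desired
}

// daemonSetReady mirrors Status.NumberReady == Status.DesiredNumberScheduled.
// Zero desired pods (no nodes matched the selector) counts as ready.
func daemonSetReady(desiredScheduled, numberReady int32) bool {
	return numberReady == desiredScheduled
}

func main() {
	fmt.Println(deploymentReady(1, 1)) // true
	fmt.Println(deploymentReady(2, 1)) // false
	fmt.Println(daemonSetReady(0, 0))  // true
}
```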

Test plan

  • Unit tests added and passing (make test)
  • Deploy on cluster and verify conditions appear on DPA status
  • Verify conditions update correctly when pods become ready/not ready
  • Verify disabled components show ComponentDisabled reason

🤖 Generated with Claude Code

This change addresses the gap where DPA could show Reconciled=True
while underlying pods (Velero, Non-Admin, VMFR, NodeAgent) were not
actually running due to deployment issues.

Added new readiness conditions:
- VeleroReady: Tracks Velero deployment readiness
- NodeAgentReady: Tracks NodeAgent DaemonSet readiness
- NonAdminReady: Tracks Non-Admin controller deployment readiness
- VMFileRestoreReady: Tracks VM File Restore controller readiness

Behavior:
- Disabled components show True with reason ComponentDisabled
- Missing deployments show False with reason ComponentNotFound
- DPA requeues every 30 seconds until all components are ready
- Existing Reconciled condition unchanged (tracks resource creation)
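The behavior rules above can be illustrated with a small condition-builder sketch. `Condition` is a stand-in for `metav1.Condition`, and `readinessCondition` is a hypothetical helper (not one of the PR's actual functions); the status/reason strings and message format follow the constants and status output shown in this PR.

```go
package main

import "fmt"

// Condition is a stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// readinessCondition assembles a component's readiness condition per the
// rules above: disabled -> True/ComponentDisabled, missing ->
// False/ComponentNotFound, otherwise ready iff all replicas are ready.
func readinessCondition(condType, component string, enabled, found bool, desired, ready int32) Condition {
	c := Condition{Type: condType}
	switch {
	case !enabled:
		c.Status, c.Reason = "True", "ComponentDisabled"
		c.Message = component + " is disabled"
	case !found:
		c.Status, c.Reason = "False", "ComponentNotFound"
		c.Message = component + " not found"
	case desired > 0 && ready == desired:
		c.Status, c.Reason = "True", "DeploymentReady"
		c.Message = fmt.Sprintf("%s ready: %d/%d replicas", component, ready, desired)
	default:
		c.Status, c.Reason = "False", "DeploymentNotReady"
		c.Message = fmt.Sprintf("%s not ready: %d/%d replicas", component, ready, desired)
	}
	return c
}

func main() {
	c := readinessCondition("NonAdminReady", "Non-Admin controller", true, true, 1, 1)
	fmt.Println(c.Status, c.Reason, c.Message) // True DeploymentReady Non-Admin controller ready: 1/1 replicas
}
```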

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 16, 2025
@openshift-ci

openshift-ci bot commented Dec 16, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai bot commented Dec 16, 2025

Walkthrough

Introduces readiness condition constants for DPA components, implements centralized readiness evaluation logic for Velero, NodeAgent, NonAdmin, and VMFileRestore through a new readiness module, integrates readiness checks into the main reconciliation loop, and adds comprehensive test coverage for all readiness scenarios.

Changes

Cohort / File(s) Summary
API Types Constants
api/v1alpha1/dataprotectionapplication_types.go
Added readiness condition constants (ConditionVeleroReady, ConditionNodeAgentReady, ConditionNonAdminReady, ConditionVMFileRestoreReady) and corresponding reason constants (ReasonDeploymentReady, ReasonDeploymentNotReady, ReasonDaemonSetReady, ReasonDaemonSetNotReady, ReasonComponentDisabled, ReasonComponentNotFound) for semantic labeling of readiness checks.
Readiness Implementation
internal/controller/readiness.go
New module implementing centralized readiness evaluation. Adds updateReadinessConditions orchestrator function and component-specific readiness check functions (updateVeleroReadinessCondition, updateNodeAgentReadinessCondition, updateNonAdminReadinessCondition, updateVMFileRestoreReadinessCondition) that fetch resources, evaluate replica readiness, handle missing components, and update status conditions.
Readiness Tests
internal/controller/readiness_test.go
Comprehensive test suite validating readiness logic across all components using Ginkgo/Gomega, including scenarios for missing deployments, zero ready replicas, disabled features, and combined readiness checks. Includes helper constructors for test resources.
Controller Integration
internal/controller/dataprotectionapplication_controller.go
Integrated readiness evaluation into reconciliation loop. After reconciliation, calls updateReadinessConditions to evaluate all components, requeues with 30-second delay if not all components are ready, and updates DPA status while preserving error state. Added time import for RequeueAfter scheduling.
Cleanup
internal/controller/velero.go
Removed comment and blank line from ReconcileVeleroDeployment with no functional code changes.
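The orchestration the walkthrough describes — per-component checks combined with a logical AND, with errors propagated — can be sketched as follows. `updateAll` is an illustrative stand-in, not the PR's `updateReadinessConditions`.

```go
package main

import "fmt"

// updateAll sketches the orchestrator pattern: each per-component check
// returns (ready, err); errors are propagated, and results are combined
// with a logical AND so every enabled component must be ready.
func updateAll(checks ...func() (bool, error)) (bool, error) {
	allReady := true
	for _, check := range checks {
		ready, err := check()
		if err != nil {
			return false, err
		}
		allReady = allReady && ready
	}
	return allReady, nil
}

func main() {
	veleroReady := func() (bool, error) { return true, nil }
	nodeAgentReady := func() (bool, error) { return false, nil } // still rolling out
	allReady, _ := updateAll(veleroReady, nodeAgentReady)
	fmt.Println(allReady) // false
}
```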

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • readiness.go requires careful review of each component's readiness logic (Deployment replica checks, DaemonSet replica checks, disabled component handling, error propagation)
  • readiness_test.go verification that test cases cover all branch conditions and resource states
  • Controller integration in dataprotectionapplication_controller.go to ensure readiness checks are properly sequenced and error handling preserves meaningful error information
  • Constants naming and usage consistency across new readiness condition types
📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between f75a38d and 93f70f0.

📒 Files selected for processing (5)
  • api/v1alpha1/dataprotectionapplication_types.go (1 hunks)
  • internal/controller/dataprotectionapplication_controller.go (2 hunks)
  • internal/controller/readiness.go (1 hunks)
  • internal/controller/readiness_test.go (1 hunks)
  • internal/controller/velero.go (0 hunks)
💤 Files with no reviewable changes (1)
  • internal/controller/velero.go
🔇 Additional comments (11)
api/v1alpha1/dataprotectionapplication_types.go (2)

37-43: LGTM!

The readiness condition constants are well-named and consistently follow the existing naming pattern. They comprehensively cover all DPA components (Velero, NodeAgent, NonAdmin, VMFileRestore) as intended.


45-53: LGTM!

The readiness reason constants comprehensively cover all deployment and daemonset readiness scenarios (ready, not ready, disabled, not found). Clear and semantic naming.

internal/controller/dataprotectionapplication_controller.go (3)

23-23: LGTM!

Time import correctly added to support the RequeueAfter scheduling at line 156.


140-147: LGTM!

The error handling correctly prioritizes reconciliation errors over readiness errors (line 144-146), preventing the masking of more critical reconciliation failures. The readiness error is returned when no prior error exists, addressing the concern from the previous review.


154-158: LGTM!

The 30-second requeue interval for non-ready components provides a reasonable balance between responsiveness and resource usage. The simpler polling approach (versus watches) is appropriate for this use case and can be optimized later if needed.

internal/controller/readiness_test.go (1)

36-517: LGTM!

Comprehensive test suite with excellent coverage:

  • All four readiness functions tested (Velero, NodeAgent, NonAdmin, VMFileRestore)
  • Multiple scenarios per component (not found, not ready, ready, disabled)
  • Edge cases handled (zero desired pods for DaemonSet at lines 224-245)
  • Combined readiness scenarios validate integration
  • Clean test structure with reusable helpers

The test implementation validates both the boolean return values and the status condition fields, ensuring complete verification of the readiness logic.

internal/controller/readiness.go (5)

32-66: LGTM!

The orchestrator function correctly aggregates readiness across all components with proper error propagation. The logical AND aggregation ensures that all enabled components must be ready for overall readiness. Clear separation between required (Velero) and optional components.


68-114: LGTM!

Velero readiness logic correctly validates that all replicas are ready and at least one exists (lines 94-95). The NotFound case is properly handled (lines 77-85), and informative messages include replica counts for debugging.


116-170: LGTM!

NodeAgent readiness correctly handles the edge case where no nodes match the selector (DesiredNumberScheduled == 0 at line 150), treating this as ready rather than failing. The disabled component check (lines 119-127) ensures optional components report correctly.


172-228: LGTM!

NonAdmin readiness follows the consistent deployment readiness pattern, with proper handling of the disabled state and NotFound cases. The logic correctly ensures all replicas are ready.


230-286: LGTM!

VMFileRestore readiness implementation is consistent with other deployment checks, properly handling disabled and NotFound states. The readiness validation ensures operational stability.



@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 16, 2025
@kaovilai
Member

Summary LGTM, will take a look tomorrow.

@weshayutin
Contributor

        - hypershift
      disableFsBackup: false
      logLevel: debug
  logFormat: text
  nonAdmin:
    enable: true
  vmFileRestore:
    enable: true
status:
  conditions:
    - lastTransitionTime: '2025-12-16T21:15:44Z'
      message: Reconcile complete
      reason: Complete
      status: 'True'
      type: Reconciled
    - lastTransitionTime: '2025-12-16T21:15:54Z'
      message: 'Velero deployment ready: 1/1 replicas'
      reason: DeploymentReady
      status: 'True'
      type: VeleroReady
    - lastTransitionTime: '2025-12-16T21:15:54Z'
      message: 'NodeAgent DaemonSet ready: 3/3 pods ready'
      reason: DaemonSetReady
      status: 'True'
      type: NodeAgentReady
    - lastTransitionTime: '2025-12-16T21:18:45Z'
      message: 'Non-Admin controller ready: 1/1 replicas'
      reason: DeploymentReady
      status: 'True'
      type: NonAdminReady
    - lastTransitionTime: '2025-12-16T21:16:14Z'
      message: 'VM File Restore controller ready: 1/1 replicas'
      reason: DeploymentReady
      status: 'True'
      type: VMFileRestoreReady

@shubham-pampattiwar nice.. looks good.

weshayutin
weshayutin previously approved these changes Dec 16, 2025
Comment on lines +151 to +155
// Requeue if not all components are ready (to check again later)
if !allReady {
return ctrl.Result{RequeueAfter: 30 * time.Second}, err
}

Member


Using watch would be more efficient and reactive?

Member Author

@shubham-pampattiwar shubham-pampattiwar Dec 17, 2025


Good point. For now using requeue to keep it simple - watches could cause frequent reconciles. Can optimize in a follow-up if 30s proves too slow in practice.
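The requeue-until-ready behavior under discussion can be simulated without a cluster. In this sketch, `requeueUntilReady` is illustrative (not code from the PR), and reconcile attempts are counted instead of waiting 30 seconds between them.

```go
package main

import "fmt"

// requeueUntilReady simulates the snippet above: after each reconcile, if
// not all components are ready the request is requeued and checked again
// (ctrl.Result{RequeueAfter: 30 * time.Second} in the real controller).
// Here attempts are counted instead of sleeping.
func requeueUntilReady(check func() bool, maxAttempts int) (attempts int, ready bool) {
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		if check() {
			return attempts, true
		}
	}
	return maxAttempts, false
}

func main() {
	calls := 0
	becameReady := func() bool { calls++; return calls >= 3 } // ready on 3rd check
	n, ok := requeueUntilReady(becameReady, 10)
	fmt.Println(n, ok) // 3 true
}
```

A watch-based alternative — e.g. registering the operand Deployments/DaemonSets via `Owns()` on the controller-runtime builder — would react to pod state changes immediately at the cost of more frequent reconciles; the fixed polling interval trades latency for simplicity, as noted in this thread.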

Address review comment: propagate readinessErr instead of just logging,
so that API errors during readiness checks trigger a reconcile retry.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@shubham-pampattiwar shubham-pampattiwar marked this pull request as ready for review December 17, 2025 18:18
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 17, 2025
@openshift-ci openshift-ci bot requested review from mrnold and sseago December 17, 2025 18:19
@openshift-ci

openshift-ci bot commented Dec 17, 2025

@shubham-pampattiwar: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/4.20-e2e-test-cli-aws
Commit: 93f70f0
Required: false
Rerun command: /test 4.20-e2e-test-cli-aws



@openshift-ci

openshift-ci bot commented Dec 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shubham-pampattiwar, sseago, weshayutin


Details Needs approval from an approver in each of these files:
  • OWNERS [shubham-pampattiwar,sseago]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@mpryc mpryc left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 19, 2025