
Conversation

@shubham-pampattiwar
Member

Summary

  • Adds new readiness conditions to DPA status: VeleroReady, NodeAgentReady, NonAdminReady, VMFileRestoreReady
  • Tracks actual pod/deployment readiness, not just resource creation
  • DPA requeues every 30 seconds until all enabled components are ready
  • Disabled components show True with reason ComponentDisabled

Problem

Previously, the DPA could show Reconciled=True while the underlying pods were not running due to deployment issues (e.g., CrashLoopBackOff, image pull errors), which made failures hard to diagnose.

Solution

Added separate readiness conditions that check:

  • Deployments: Status.ReadyReplicas == Spec.Replicas
  • DaemonSets: Status.NumberReady == Status.DesiredNumberScheduled
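The two checks above can be sketched as plain predicates. This is a minimal standalone sketch, not the PR's code: the real functions read these counts from `appsv1.Deployment`/`appsv1.DaemonSet` status objects, and the function names here are illustrative.

```go
package main

import "fmt"

// deploymentReady: a Deployment counts as ready when every desired replica
// is ready and at least one replica exists (mirrors the
// Status.ReadyReplicas == Spec.Replicas check described above).
func deploymentReady(desired, ready int32) bool {
	return desired > 0 && ready == desired
}

// daemonSetReady mirrors Status.NumberReady == Status.DesiredNumberScheduled.
// Zero desired pods (no nodes matched the selector) counts as ready.
func daemonSetReady(desiredScheduled, numberReady int32) bool {
	return numberReady == desiredScheduled
}

func main() {
	fmt.Println(deploymentReady(1, 1)) // true
	fmt.Println(deploymentReady(2, 1)) // false
	fmt.Println(daemonSetReady(0, 0))  // true
}
```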

Test plan

  • Unit tests added and passing (make test)
  • Deploy on cluster and verify conditions appear on DPA status
  • Verify conditions update correctly when pods become ready/not ready
  • Verify disabled components show ComponentDisabled reason

🤖 Generated with Claude Code

This change addresses the gap where DPA could show Reconciled=True
while underlying pods (Velero, Non-Admin, VMFR, NodeAgent) were not
actually running due to deployment issues.

Added new readiness conditions:
- VeleroReady: Tracks Velero deployment readiness
- NodeAgentReady: Tracks NodeAgent DaemonSet readiness
- NonAdminReady: Tracks Non-Admin controller deployment readiness
- VMFileRestoreReady: Tracks VM File Restore controller readiness

Behavior:
- Disabled components show True with reason ComponentDisabled
- Missing deployments show False with reason ComponentNotFound
- DPA requeues every 30 seconds until all components are ready
- Existing Reconciled condition unchanged (tracks resource creation)
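The behavior rules above can be illustrated with a small condition-builder sketch. `Condition` is a stand-in for `metav1.Condition`, and `readinessCondition` is a hypothetical helper (not one of the PR's actual functions); the status/reason strings and message format follow the constants and status output shown in this PR.

```go
package main

import "fmt"

// Condition is a stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// readinessCondition assembles a component's readiness condition per the
// rules above: disabled -> True/ComponentDisabled, missing ->
// False/ComponentNotFound, otherwise ready iff all replicas are ready.
func readinessCondition(condType, component string, enabled, found bool, desired, ready int32) Condition {
	c := Condition{Type: condType}
	switch {
	case !enabled:
		c.Status, c.Reason = "True", "ComponentDisabled"
		c.Message = component + " is disabled"
	case !found:
		c.Status, c.Reason = "False", "ComponentNotFound"
		c.Message = component + " not found"
	case desired > 0 && ready == desired:
		c.Status, c.Reason = "True", "DeploymentReady"
		c.Message = fmt.Sprintf("%s ready: %d/%d replicas", component, ready, desired)
	default:
		c.Status, c.Reason = "False", "DeploymentNotReady"
		c.Message = fmt.Sprintf("%s not ready: %d/%d replicas", component, ready, desired)
	}
	return c
}

func main() {
	c := readinessCondition("NonAdminReady", "Non-Admin controller", true, true, 1, 1)
	fmt.Println(c.Status, c.Reason, c.Message) // True DeploymentReady Non-Admin controller ready: 1/1 replicas
}
```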

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 16, 2025
@openshift-ci

openshift-ci bot commented Dec 16, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai bot commented Dec 16, 2025

Walkthrough

Introduces readiness condition constants for DPA components, implements centralized readiness evaluation logic for Velero, NodeAgent, NonAdmin, and VMFileRestore through a new readiness module, integrates readiness checks into the main reconciliation loop, and adds comprehensive test coverage for all readiness scenarios.

Changes

Cohort / File(s) Summary
API Types Constants
api/v1alpha1/dataprotectionapplication_types.go
Added readiness condition constants (ConditionVeleroReady, ConditionNodeAgentReady, ConditionNonAdminReady, ConditionVMFileRestoreReady) and corresponding reason constants (ReasonDeploymentReady, ReasonDeploymentNotReady, ReasonDaemonSetReady, ReasonDaemonSetNotReady, ReasonComponentDisabled, ReasonComponentNotFound) for semantic labeling of readiness checks.
Readiness Implementation
internal/controller/readiness.go
New module implementing centralized readiness evaluation. Adds updateReadinessConditions orchestrator function and component-specific readiness check functions (updateVeleroReadinessCondition, updateNodeAgentReadinessCondition, updateNonAdminReadinessCondition, updateVMFileRestoreReadinessCondition) that fetch resources, evaluate replica readiness, handle missing components, and update status conditions.
Readiness Tests
internal/controller/readiness_test.go
Comprehensive test suite validating readiness logic across all components using Ginkgo/Gomega, including scenarios for missing deployments, zero ready replicas, disabled features, and combined readiness checks. Includes helper constructors for test resources.
Controller Integration
internal/controller/dataprotectionapplication_controller.go
Integrated readiness evaluation into reconciliation loop. After reconciliation, calls updateReadinessConditions to evaluate all components, requeues with 30-second delay if not all components are ready, and updates DPA status while preserving error state. Added time import for RequeueAfter scheduling.
Cleanup
internal/controller/velero.go
Removed comment and blank line from ReconcileVeleroDeployment with no functional code changes.
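The orchestration the walkthrough describes — per-component checks combined with a logical AND, with errors propagated — can be sketched as follows. `updateAll` is an illustrative stand-in, not the PR's `updateReadinessConditions`.

```go
package main

import "fmt"

// updateAll sketches the orchestrator pattern: each per-component check
// returns (ready, err); errors are propagated, and results are combined
// with a logical AND so every enabled component must be ready.
func updateAll(checks ...func() (bool, error)) (bool, error) {
	allReady := true
	for _, check := range checks {
		ready, err := check()
		if err != nil {
			return false, err
		}
		allReady = allReady && ready
	}
	return allReady, nil
}

func main() {
	veleroReady := func() (bool, error) { return true, nil }
	nodeAgentReady := func() (bool, error) { return false, nil } // still rolling out
	allReady, _ := updateAll(veleroReady, nodeAgentReady)
	fmt.Println(allReady) // false
}
```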

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • readiness.go requires careful review of each component's readiness logic (Deployment replica checks, DaemonSet replica checks, disabled component handling, error propagation)
  • readiness_test.go verification that test cases cover all branch conditions and resource states
  • Controller integration in dataprotectionapplication_controller.go to ensure readiness checks are properly sequenced and error handling preserves meaningful error information
  • Constants naming and usage consistency across new readiness condition types
📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between f75a38d and 93f70f0.

📒 Files selected for processing (5)
  • api/v1alpha1/dataprotectionapplication_types.go (1 hunks)
  • internal/controller/dataprotectionapplication_controller.go (2 hunks)
  • internal/controller/readiness.go (1 hunks)
  • internal/controller/readiness_test.go (1 hunks)
  • internal/controller/velero.go (0 hunks)
💤 Files with no reviewable changes (1)
  • internal/controller/velero.go
🔇 Additional comments (11)
api/v1alpha1/dataprotectionapplication_types.go (2)

37-43: LGTM!

The readiness condition constants are well-named and consistently follow the existing naming pattern. They comprehensively cover all DPA components (Velero, NodeAgent, NonAdmin, VMFileRestore) as intended.


45-53: LGTM!

The readiness reason constants comprehensively cover all deployment and daemonset readiness scenarios (ready, not ready, disabled, not found). Clear and semantic naming.

internal/controller/dataprotectionapplication_controller.go (3)

23-23: LGTM!

Time import correctly added to support the RequeueAfter scheduling at line 156.


140-147: LGTM!

The error handling correctly prioritizes reconciliation errors over readiness errors (line 144-146), preventing the masking of more critical reconciliation failures. The readiness error is returned when no prior error exists, addressing the concern from the previous review.


154-158: LGTM!

The 30-second requeue interval for non-ready components provides a reasonable balance between responsiveness and resource usage. The simpler polling approach (versus watches) is appropriate for this use case and can be optimized later if needed.

internal/controller/readiness_test.go (1)

36-517: LGTM!

Comprehensive test suite with excellent coverage:

  • All four readiness functions tested (Velero, NodeAgent, NonAdmin, VMFileRestore)
  • Multiple scenarios per component (not found, not ready, ready, disabled)
  • Edge cases handled (zero desired pods for DaemonSet at lines 224-245)
  • Combined readiness scenarios validate integration
  • Clean test structure with reusable helpers

The test implementation validates both the boolean return values and the status condition fields, ensuring complete verification of the readiness logic.

internal/controller/readiness.go (5)

32-66: LGTM!

The orchestrator function correctly aggregates readiness across all components with proper error propagation. The logical AND aggregation ensures that all enabled components must be ready for overall readiness. Clear separation between required (Velero) and optional components.


68-114: LGTM!

Velero readiness logic correctly validates that all replicas are ready and at least one exists (lines 94-95). The NotFound case is properly handled (lines 77-85), and informative messages include replica counts for debugging.


116-170: LGTM!

NodeAgent readiness correctly handles the edge case where no nodes match the selector (DesiredNumberScheduled == 0 at line 150), treating this as ready rather than failing. The disabled component check (lines 119-127) ensures optional components report correctly.


172-228: LGTM!

NonAdmin readiness follows the consistent deployment readiness pattern, with proper handling of the disabled state and NotFound cases. The logic correctly ensures all replicas are ready.


230-286: LGTM!

VMFileRestore readiness implementation is consistent with other deployment checks, properly handling disabled and NotFound states. The readiness validation ensures operational stability.



@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 16, 2025
@kaovilai
Member

Summary LGTM, will take a look tomorrow.

@weshayutin
Contributor

        - hypershift
      disableFsBackup: false
      logLevel: debug
  logFormat: text
  nonAdmin:
    enable: true
  vmFileRestore:
    enable: true
status:
  conditions:
    - lastTransitionTime: '2025-12-16T21:15:44Z'
      message: Reconcile complete
      reason: Complete
      status: 'True'
      type: Reconciled
    - lastTransitionTime: '2025-12-16T21:15:54Z'
      message: 'Velero deployment ready: 1/1 replicas'
      reason: DeploymentReady
      status: 'True'
      type: VeleroReady
    - lastTransitionTime: '2025-12-16T21:15:54Z'
      message: 'NodeAgent DaemonSet ready: 3/3 pods ready'
      reason: DaemonSetReady
      status: 'True'
      type: NodeAgentReady
    - lastTransitionTime: '2025-12-16T21:18:45Z'
      message: 'Non-Admin controller ready: 1/1 replicas'
      reason: DeploymentReady
      status: 'True'
      type: NonAdminReady
    - lastTransitionTime: '2025-12-16T21:16:14Z'
      message: 'VM File Restore controller ready: 1/1 replicas'
      reason: DeploymentReady
      status: 'True'
      type: VMFileRestoreReady

@shubham-pampattiwar nice.. looks good.

weshayutin
weshayutin previously approved these changes Dec 16, 2025
Comment on lines +151 to +155
// Requeue if not all components are ready (to check again later)
if !allReady {
return ctrl.Result{RequeueAfter: 30 * time.Second}, err
}

Member


Using watch would be more efficient and reactive?

Member Author

@shubham-pampattiwar shubham-pampattiwar Dec 17, 2025


Good point. For now using requeue to keep it simple - watches could cause frequent reconciles. Can optimize in a follow-up if 30s proves too slow in practice.
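The requeue-until-ready behavior under discussion can be simulated without a cluster. In this sketch, `requeueUntilReady` is illustrative (not code from the PR), and reconcile attempts are counted instead of waiting 30 seconds between them.

```go
package main

import "fmt"

// requeueUntilReady simulates the snippet above: after each reconcile, if
// not all components are ready the request is requeued and checked again
// (ctrl.Result{RequeueAfter: 30 * time.Second} in the real controller).
// Here attempts are counted instead of sleeping.
func requeueUntilReady(check func() bool, maxAttempts int) (attempts int, ready bool) {
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		if check() {
			return attempts, true
		}
	}
	return maxAttempts, false
}

func main() {
	calls := 0
	becameReady := func() bool { calls++; return calls >= 3 } // ready on 3rd check
	n, ok := requeueUntilReady(becameReady, 10)
	fmt.Println(n, ok) // 3 true
}
```

A watch-based alternative — e.g. registering the operand Deployments/DaemonSets via `Owns()` on the controller-runtime builder — would react to pod state changes immediately at the cost of more frequent reconciles; the fixed polling interval trades latency for simplicity, as noted in this thread.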

Address review comment: propagate readinessErr instead of just logging,
so that API errors during readiness checks trigger a reconcile retry.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@shubham-pampattiwar shubham-pampattiwar marked this pull request as ready for review December 17, 2025 18:18
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 17, 2025
@openshift-ci openshift-ci bot requested review from mrnold and sseago December 17, 2025 18:19
@openshift-ci

openshift-ci bot commented Dec 17, 2025

@shubham-pampattiwar: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/4.20-e2e-test-cli-aws
Commit: 93f70f0
Required: false
Rerun command: /test 4.20-e2e-test-cli-aws



@openshift-ci

openshift-ci bot commented Dec 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shubham-pampattiwar, sseago, weshayutin


Details Needs approval from an approver in each of these files:
  • OWNERS [shubham-pampattiwar,sseago]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@mpryc mpryc left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 19, 2025