Skip to content

Conversation

@clobrano
Copy link
Contributor

Extend cluster-restore.sh to support restoring etcd data in Two Node with Fencing clusters, where etcd is controlled by a Pacemaker resource-agent and runs as a Podman container instead of a Kubernetes static pod.

The script preserves existing static pod restore behavior when Pacemaker is not detected.

Restore process for Pacemaker-managed etcd:

  1. Auto-detect Pacemaker management or use PACEMAKER_MANAGED_ETCD env var [new]
  2. Disable Pacemaker podman-etcd resource agent to stop container [new - Pacemaker alternative to static pod stop]
  3. Backup existing etcd data directory [existing]
  4. Move podman-etcd configuration files to backup location [new]
  5. Restore snapshot using etcdctl with cluster initialization flags [existing, with new Pacemaker-specific flags added]
  6. Set force_new_cluster attribute for single-member bootstrap [new]
  7. Re-enable Pacemaker resource to start etcd with restored data [new - Pacemaker alternative to static pod restart]

New environment variables:

  • PACEMAKER_MANAGED_ETCD: Force Pacemaker restore mode
  • ETCD_CONTAINER_NAME: Podman container name (default: "etcd")
  • ETCD_ADVERTISE_IP: Override advertise IP (auto-detected from etcd.env)

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 10, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 10, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link

coderabbitai bot commented Dec 10, 2025

Walkthrough

Adds TNF-specific etcd restore and disable scripts, podman-etcd helper functions and a restore completion hook, refactors ScriptController to deploy topology-aware scripts with unit tests, and threads an Infrastructure lister into operator startup.

Changes

Cohort / File(s) Summary
TNF-specific etcd scripts
bindata/etcd/cluster-restore-tnf.sh, bindata/etcd/disable-etcd-tnf.sh
New Bash scripts for Pacemaker/TNF environments: cluster-restore-tnf.sh implements etcd snapshot restore flow (root check, env sourcing, advertise IP detection, etcdctl selection, snapshot restore, pacemaker interactions, file backups, and error handling); disable-etcd-tnf.sh stops the Pacemaker-managed etcd resource and waits for the container to stop.
Etcd common tools & restore message
bindata/etcd/etcd-common-tools, bindata/etcd/cluster-restore.sh
Added wait_for_podman_etcd_to_stop() (polling loop to detect podman-managed etcd stop) and print_restore_completion_message() to etcd-common-tools; appended a call to print_restore_completion_message at the end of cluster-restore.sh.
ScriptController topology-aware deployment
pkg/operator/scriptcontroller/scriptcontroller.go
Introduced EnvVarGetter interface, replaced the concrete env-var controller field with an EnvVarGetter, added an infraLister field and updated constructor signature to accept InfrastructureLister and EnvVarGetter. manageScriptConfigMap now selects and deploys TNF-specific vs standard cluster-restore.sh and disable-etcd.sh based on topology while still deploying common scripts.
Unit tests for script controller
pkg/operator/scriptcontroller/scriptcontroller_test.go
Added getTestController helper and tests TestManageScriptConfigMap and TestManageScriptConfigMap_MissingEnvVars covering multiple topology scenarios; asserts presence and contents of generated ConfigMap entries (topology-specific scripts, common scripts, and etcd.env).
Operator startup wiring
pkg/operator/starter.go
Threaded configInformers.Config().V1().Infrastructures().Lister() into operator startup call sites so the Infrastructure lister is passed into controller construction.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Review focus:
    • pkg/operator/scriptcontroller/scriptcontroller.go — constructor signature changes, correct use of infraLister and EnvVarGetter, topology detection and script selection.
    • bindata/etcd/cluster-restore-tnf.sh & bindata/etcd/disable-etcd-tnf.sh — shell safety, pacemaker interactions, timeout/error handling, backup/file operations.
    • bindata/etcd/etcd-common-tools — correctness of polling loop and timeout behavior in wait_for_podman_etcd_to_stop, and content of print_restore_completion_message.
    • pkg/operator/scriptcontroller/scriptcontroller_test.go — adequacy of topology coverage and mocking of infra lister and env vars.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 10, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clobrano
Once this PR has been reviewed and has the lgtm label, please assign dusk125 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

return 0
fi

if systemctl status pacemaker >/dev/null 2>&1 ; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is smart, until somebody else comes around and uses pacemaker on the control plane 🤣
Can we make this more specific? there is likely the infra CRD somewhere on /etc/kubernetes/manifest.

Otherwise, this is also a advantage of having your own restore script installed by the operator only on TNF - that gives you the guarantee that you definitely run on TNF, allowing you to remove this altogether

Comment on lines 82 to 88
function get_etcd_advertise_ip() {
# Try to detect IP from etcd.env (NODE_<nodename>_IP variable)
NODENAME_UNDERSCORE=$(echo "${NODENAME}" | tr '-' '_')
NODE_IP_VAR="NODE_${NODENAME_UNDERSCORE}_IP"
IP="${!NODE_IP_VAR}"
echo "$IP"
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it always be localhost:2379?

done

echo "timed out waiting for etcd resources to start (timeout: $timeout seconds)"
return 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like you would want to exit 1 here directly, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you also don't exit on the return of a 1 below

Comment on lines 104 to 133
function ensure_etcd_container_stopped() {
echo "ensure ${ETCD_CONTAINER_NAME} container is stopped"
local start=$SECONDS
local timeout=120
local disable_etcd_called=0
while [ $((SECONDS - start)) -lt "$timeout" ]; do
if podman container exists "${ETCD_CONTAINER_NAME}" && \
[[ "$(podman inspect -f '{{.State.Status}}' "${ETCD_CONTAINER_NAME}")" != "exited" ]]; then

if [ "$disable_etcd_called" -eq 1 ]; then
sleep 1
continue
fi

if ! pcs resource disable etcd; then
echo "could not disable podman-etcd. Please ensure the ${ETCD_CONTAINER_NAME} container is stopped"
exit 1
fi

disable_etcd_called=1
continue
fi

echo "${ETCD_CONTAINER_NAME} container is stopped"
return 0
done

echo "timed out waiting for ${ETCD_CONTAINER_NAME} to stop"
return 1
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably a good candidate for ./disable-etcd.sh as a TNF specific script


# Extend restore flags with Pacemaker-specific arguments
ETCDCTL_RESTORE_FLAGS+=(
--skip-hash-check=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is definitely something you should keep and not skip

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

podman-etcd also makes copies (actually using cp) of the etcd db when a learner starts, which seemed to be incompatible with the hash check.
I was indeed considering to give some "special" name to such backups and let the script decide whether to use or not the hash check

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which seemed to be incompatible with the hash check.

oof, that's probably something for us to take a deeper look at - I assume those are corrupted?

Comment on lines 155 to 157
--name="$NODENAME"
--initial-cluster="$NODENAME=https://$IP:2380"
--initial-advertise-peer-urls="https://$IP:2380"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you really need this, given that you just go and force-new-cluster in the end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I tried first without and failed with some "mismatch" error, IIRC

echo "could not wait for etcd resources to restart"
fi

cat << 'EOF'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually a really awesome idea, thanks for adding it

Comment on lines 291 to 292
$ oc get co/etcd -w
(Wait to see if the etcd operator completes the rollout successfully)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's also oc adm wait-for-stable-cluster

Then monitor again:
$ oc get co/etcd -w
3. IF revision.json is still not being re-created on this node after the patch, manually restart the static pods:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkill cluster-etcd-operator might also do the trick, even though this is slightly less nice than what you're doing below ;)

$ oc get co/etcd -w
(Wait to see if the etcd operator completes the rollout successfully)
2. IF the rollout does NOT complete, apply the forceRedeploymentReason patch:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also apply the oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TNF clusters already have this set to allow external Etcd management

@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch 3 times, most recently from 17397af to ec62339 Compare December 11, 2025 17:08
@clobrano clobrano marked this pull request as ready for review December 16, 2025 09:48
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 16, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/operator/scriptcontroller/scriptcontroller.go (1)

132-134: Use %w for proper error wrapping.

The error at line 133 uses %v which loses the original error type. Using %w enables callers to use errors.Is() and errors.Unwrap().

-	if isTNF, err := ceohelpers.IsExternalEtcdCluster(ctx, c.infraLister); err != nil {
-		return nil, false, fmt.Errorf("failed to detect cluster topology: %v", err)
+	if isTNF, err := ceohelpers.IsExternalEtcdCluster(ctx, c.infraLister); err != nil {
+		return nil, false, fmt.Errorf("failed to detect cluster topology: %w", err)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 2e46bfd and ec62339.

📒 Files selected for processing (7)
  • bindata/etcd/cluster-restore-tnf.sh (1 hunks)
  • bindata/etcd/cluster-restore.sh (1 hunks)
  • bindata/etcd/disable-etcd-tnf.sh (1 hunks)
  • bindata/etcd/etcd-common-tools (1 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • bindata/etcd/etcd-common-tools
  • pkg/operator/scriptcontroller/scriptcontroller_test.go
  • bindata/etcd/disable-etcd-tnf.sh
  • bindata/etcd/cluster-restore.sh
  • pkg/operator/starter.go
  • bindata/etcd/cluster-restore-tnf.sh
  • pkg/operator/scriptcontroller/scriptcontroller.go
🧬 Code graph analysis (2)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
  • source_required_dependency (21-29)
  • usage (34-40)
bindata/etcd/disable-etcd-tnf.sh (1)
  • source_required_dependency (17-25)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
  • Enqueueable (65-67)
pkg/operator/ceohelpers/external_etcd_status.go (1)
  • IsExternalEtcdCluster (23-35)
bindata/assets.go (1)
  • MustAsset (15-21)
🔇 Additional comments (16)
bindata/etcd/cluster-restore.sh (1)

143-143: LGTM!

Clean addition of the post-restore completion message hook. The function is properly sourced from etcd-common-tools earlier in the script.

bindata/etcd/disable-etcd-tnf.sh (1)

1-28: LGTM!

Script structure follows the established patterns from other etcd scripts. Proper root check, strict mode settings, and dependency sourcing.

bindata/etcd/etcd-common-tools (2)

96-115: Robust implementation with clear timeout handling.

The function properly handles the detection flow and timeout. One consideration: the crm_resource --locate check may return an error when the resource doesn't exist vs. when it's stopped. The current logic handles this by continuing to retry, which is reasonable.


117-144: LGTM!

Good use of single-quoted 'EOF' to prevent variable expansion in the heredoc, keeping the $() constructs literal for user execution. The step-by-step remediation instructions are helpful.

pkg/operator/starter.go (1)

471-478: LGTM!

Clean integration of the infrastructure lister into the script controller. This enables topology-aware script deployment (TNF vs standard) while reusing the existing informer infrastructure already initialized in RunOperator.

pkg/operator/scriptcontroller/scriptcontroller_test.go (3)

20-35: LGTM!

Clean helper function that properly constructs test dependencies with fakes. Good separation of test setup from assertions.


37-117: Well-structured table-driven tests covering topology variants.

Good coverage of the three topology modes with both positive (expected content) and negative (unexpected content) assertions. The verification of common scripts presence ensures shared functionality is not accidentally removed.


119-131: LGTM!

Good error case coverage. Verifying the specific error message ensures the error is from the expected code path.

bindata/etcd/cluster-restore-tnf.sh (5)

43-49: Indirect variable expansion is valid but verify NODENAME is set.

The ${!NODE_IP_VAR} indirect expansion works correctly in Bash. The function assumes NODENAME is set before being called; this is handled in setup_pacemaker_restore at line 73-77 which sets it with a default from hostname.


51-68: Consider making the expected node count configurable.

The check [ "$(echo "$output" | wc -l)" -eq 2 ] assumes exactly 2 nodes for TNF. This is correct for dual-replica topology but hardcodes the assumption. Since this is a TNF-specific script, this is acceptable.


85-92: Verify --skip-hash-check=true is intentional.

Past review discussion indicates this was added due to podman-etcd making copies that are "incompatible with the hash check." If this is due to backup corruption, that's a deeper issue to investigate. If the copies are valid but have different hashes due to the copy mechanism, this is acceptable.

Based on past review comments, this appears to be a known workaround. Please confirm the backups are not corrupted and this is expected behavior for podman-etcd copies.


1-42: LGTM!

Script header, strict mode settings, root check, and dependency sourcing follow the established patterns from other etcd scripts. The simplified usage message correctly reflects that TNF restore only needs the snapshot (not static pod resources).


113-165: LGTM!

The main restore flow is well-structured with proper validation and error handling. Good safeguard at line 114-117 preventing BACKUP_DIR from being the same as ETCD_DATA_DIR.

pkg/operator/scriptcontroller/scriptcontroller.go (3)

28-32: Good abstraction for testability.

The EnvVarGetter interface cleanly abstracts the environment variable controller, enabling easier unit testing through mocking.


129-140: LGTM - topology-aware script selection is well structured.

The conditional logic correctly selects TNF-specific or standard scripts based on the cluster topology, and the scripts are deployed under consistent names for downstream consumers.


152-154: All required script assets are present in bindata/etcd/: both TNF-specific (cluster-restore-tnf.sh, disable-etcd-tnf.sh) and standard (cluster-restore.sh, disable-etcd.sh) scripts exist and are properly referenced by the conditional topology detection logic.

@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from ec62339 to e46a48e Compare December 16, 2025 13:21
@clobrano
Copy link
Contributor Author

/retest-required

@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from e46a48e to 4873375 Compare December 18, 2025 08:39
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
bindata/etcd/cluster-restore-tnf.sh (1)

131-134: Simplify ETCD_CLIENT initialization.

The expression ${ETCD_ETCDCTL_BIN+etcdctl} always evaluates to "etcdctl" when ETCD_ETCDCTL_BIN is set (which it always is from etcd-common-tools). This seems like it should reference the variable value instead.

🔎 Apply this diff to fix the initialization:
-ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}"
+ETCD_CLIENT="${ETCD_ETCDCTL_BIN:-etcdctl}"
 if [ -n "${ETCD_ETCDUTL_BIN}" ]; then
   ETCD_CLIENT="${ETCD_ETCDUTL_BIN}"
 fi

This uses the variable's value if set, or defaults to "etcdctl" if unset.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between e46a48e and 4873375.

📒 Files selected for processing (7)
  • bindata/etcd/cluster-restore-tnf.sh (1 hunks)
  • bindata/etcd/cluster-restore.sh (1 hunks)
  • bindata/etcd/disable-etcd-tnf.sh (1 hunks)
  • bindata/etcd/etcd-common-tools (1 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • bindata/etcd/cluster-restore.sh
  • bindata/etcd/disable-etcd-tnf.sh
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • bindata/etcd/etcd-common-tools
  • pkg/operator/scriptcontroller/scriptcontroller.go
  • bindata/etcd/cluster-restore-tnf.sh
  • pkg/operator/scriptcontroller/scriptcontroller_test.go
  • pkg/operator/starter.go
🧬 Code graph analysis (2)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
  • Enqueueable (65-67)
pkg/operator/ceohelpers/external_etcd_status.go (1)
  • IsExternalEtcdCluster (23-35)
bindata/assets.go (1)
  • MustAsset (15-21)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
  • source_required_dependency (21-29)
  • usage (34-40)
bindata/etcd/disable-etcd-tnf.sh (1)
  • source_required_dependency (17-25)
🔇 Additional comments (11)
bindata/etcd/etcd-common-tools (2)

96-115: LGTM! The wait function implementation is solid.

The polling logic correctly handles three states: command failure (retry), non-empty location (etcd still running), and empty location (stopped). The 120-second timeout is reasonable for Pacemaker-managed etcd stop operations.


117-143: Clear and actionable completion message.

The step-by-step instructions with concrete commands will help operators verify and troubleshoot the restore process. Good use of a heredoc for multi-line output.

pkg/operator/starter.go (1)

471-478: LGTM! Correct wiring of infrastructure lister and env var controller.

The infrastructure lister is properly obtained from the config informers and passed alongside the existing envVarController to enable topology-aware script deployment.

pkg/operator/scriptcontroller/scriptcontroller_test.go (3)

20-35: LGTM! Clean test helper setup.

The helper correctly initializes the controller with fake dependencies and a no-op enqueue function for testing.


37-117: Good topology coverage with bidirectional assertions.

The test cases effectively verify that each topology gets its specific scripts while excluding content from other topologies. Testing both expectedContent and unexpectedContent is a solid approach to catch incorrect script selection.


119-131: LGTM! Error handling test validates graceful failure.

Good coverage for the missing environment variables scenario.

bindata/etcd/cluster-restore-tnf.sh (3)

143-155: LGTM! Data directory backup and restore sequence is well-structured.

The script correctly backs up the existing member directory before removal, and creates a fresh data directory for the restore. The sequence prevents accidental data loss.


167-182: LGTM! Pacemaker integration for cluster bootstrap.

The use of crm_attribute for force_new_cluster and subsequent pcs resource enable correctly implements the Pacemaker-based etcd restart flow. The decision to continue to print_restore_completion_message even on timeout (per past review discussion) provides operators with guidance regardless of wait outcome.


51-68: The hardcoded check for exactly 2 etcd instances may not align with single-node initialization and lacks documentation.

Line 58 checks [ "$(echo "$output" | wc -l)" -eq 2 ], but the script initializes with a single-node cluster (--initial-cluster="$NODENAME=..."). This mismatch is unexplained. Either document why exactly 2 lines are expected (e.g., "TNF always has 2 nodes"), or make this check more robust by verifying >= 1 if the goal is confirming at least one instance started.

pkg/operator/scriptcontroller/scriptcontroller.go (2)

28-32: LGTM! Clean interface design.

The EnvVarGetter interface properly abstracts the environment variable functionality needed by the controller, following Go interface conventions.


129-155: LGTM! Topology-aware script selection is well-implemented.

The logic correctly:

  1. Detects TNF topology via IsExternalEtcdCluster
  2. Handles the error case appropriately
  3. Selects the correct script variants based on topology
  4. Deploys them under standard names (cluster-restore.sh, disable-etcd.sh)

This allows consumers to use consistent script names regardless of the underlying topology.

@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from 4873375 to 5804bac Compare December 18, 2025 08:51
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
bindata/etcd/cluster-restore-tnf.sh (1)

19-27: Consider consolidating duplicate helper functions.

The source_required_dependency function is duplicated across cluster-restore-tnf.sh, disable-etcd-tnf.sh (lines 17-25), and cluster-restore.sh (lines 20-28). While this duplication ensures self-contained scripts, consider moving such common helpers to etcd-common-tools if these functions evolve or grow in complexity.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4873375 and 5804bac.

📒 Files selected for processing (7)
  • bindata/etcd/cluster-restore-tnf.sh (1 hunks)
  • bindata/etcd/cluster-restore.sh (1 hunks)
  • bindata/etcd/disable-etcd-tnf.sh (1 hunks)
  • bindata/etcd/etcd-common-tools (1 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • bindata/etcd/etcd-common-tools
  • pkg/operator/scriptcontroller/scriptcontroller_test.go
  • bindata/etcd/cluster-restore.sh
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • bindata/etcd/disable-etcd-tnf.sh
  • pkg/operator/scriptcontroller/scriptcontroller.go
  • bindata/etcd/cluster-restore-tnf.sh
  • pkg/operator/starter.go
🧬 Code graph analysis (2)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
  • Enqueueable (65-67)
pkg/operator/ceohelpers/external_etcd_status.go (1)
  • IsExternalEtcdCluster (23-35)
bindata/assets.go (1)
  • MustAsset (15-21)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
  • source_required_dependency (21-29)
  • usage (34-40)
bindata/etcd/disable-etcd-tnf.sh (1)
  • source_required_dependency (17-25)
🔇 Additional comments (10)
bindata/etcd/cluster-restore-tnf.sh (4)

43-49: LGTM!

The variable indirection pattern with ${!NODE_IP_VAR} correctly retrieves the IP address from dynamically constructed environment variable names.


51-68: LGTM!

The polling logic with a 5-minute timeout and XML parsing to detect etcd startup is appropriate. The check for exactly 2 lines on line 58 appears to be a topology-specific validation.


94-102: LGTM!

The resource disable flow with pcs resource disable followed by wait_for_podman_etcd_to_stop (from etcd-common-tools) properly ensures a clean shutdown before proceeding with the restore.


178-182: LGTM!

The approach of printing the completion message even on timeout (lines 178-180) is intentional per previous discussion. This ensures operators always receive verification instructions, which is appropriate since timeout doesn't necessarily indicate restore failure.

pkg/operator/starter.go (1)

471-478: LGTM!

The addition of the infrastructure lister at line 475 correctly wires topology information into the ScriptController, enabling topology-aware script deployment. This follows the established pattern of passing listers to controllers.

bindata/etcd/disable-etcd-tnf.sh (2)

1-15: LGTM!

The script setup with strict shell options (errexit, pipefail, errtrace) and root check follows best practices for operational scripts.


27-34: LGTM!

The script correctly sources required dependencies and calls wait_for_podman_etcd_to_stop from etcd-common-tools with appropriate error handling. The past review concern about the undefined function has been resolved.

pkg/operator/scriptcontroller/scriptcontroller.go (3)

28-32: LGTM!

The EnvVarGetter interface provides a clean abstraction over environment variable access and listener management, improving testability and decoupling from concrete implementations.


34-40: LGTM!

The refactoring to use InfraLister and EnvVarGetter interface (lines 37-38, 46-47, 53-54) properly enables topology detection while maintaining clean separation of concerns. The initialization is correct and follows established patterns.

Also applies to: 42-55


129-154: LGTM!

The topology-aware script selection logic is well-implemented:

  • Line 132: Proper error handling for topology detection failure
  • Lines 135-140: Clear conditional selection of TNF vs standard scripts
  • Lines 142-150: Common scripts (quorum-restore.sh, cluster-backup.sh, etcd-common-tools) deployed regardless of topology
  • Lines 153-154: Topology-specific scripts deployed under standardized names (cluster-restore.sh, disable-etcd.sh), ensuring consistent paths for consuming code

This approach maintains backward compatibility while supporting TNF deployments.

Add dedicated etcd management scripts for Two Node with Fencing (TNF)
deployments that use Pacemaker-managed Podman containers instead of
Kubernetes static pods.

Changes:
- Add cluster-restore-tnf.sh with Pacemaker resource management
  - setup_pacemaker_restore() stops etcd resource and prepares restore flags
  - wait_for_podman_etcd_start() verifies etcd resource starts via pcs status
  - get_etcd_advertise_ip() detects advertise IP from etcd.env
  - Uses crm_attribute to set force_new_cluster on restart
- Add disable-etcd-tnf.sh for disabling Pacemaker-managed etcd
  - Uses 'pcs resource disable etcd' instead of moving static pod manifests
- Extend etcd-common-tools with shared TNF utilities
  - wait_for_podman_etcd_to_stop() waits for crm_resource to confirm stop
  - print_restore_completion_message() outputs rollout monitoring instructions

These scripts will be conditionally deployed by the cluster-etcd-operator
based on control plane topology detection.
Implement conditional script deployment in the script controller to
automatically deploy TNF-specific or standard etcd scripts based on the
detected control plane topology.

This maintains backward compatibility while enabling automatic support
for TNF deployments without manual script selection.
@clobrano clobrano force-pushed the enhancement/tnf-support-for-cluster-restore-script branch from 5804bac to 62a5b4c Compare December 18, 2025 10:30
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
bindata/etcd/disable-etcd-tnf.sh (1)

17-25: Consider extracting shared helper to reduce duplication.

The source_required_dependency function is duplicated across cluster-restore.sh, cluster-restore-tnf.sh, and this file. While consistency is valuable, this duplication could lead to maintenance issues if the error handling logic needs to change.

🔎 Optional: Move to etcd-common-tools

Consider moving this helper to etcd-common-tools so it can be shared across all scripts. However, this creates a chicken-and-egg problem (you need the helper to source the file that contains it), so the current duplication may be acceptable.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 5804bac and 62a5b4c.

📒 Files selected for processing (7)
  • bindata/etcd/cluster-restore-tnf.sh (1 hunks)
  • bindata/etcd/cluster-restore.sh (1 hunks)
  • bindata/etcd/disable-etcd-tnf.sh (1 hunks)
  • bindata/etcd/etcd-common-tools (1 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller.go (3 hunks)
  • pkg/operator/scriptcontroller/scriptcontroller_test.go (1 hunks)
  • pkg/operator/starter.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • bindata/etcd/etcd-common-tools
  • pkg/operator/scriptcontroller/scriptcontroller_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/operator/starter.go
  • bindata/etcd/cluster-restore.sh
  • pkg/operator/scriptcontroller/scriptcontroller.go
  • bindata/etcd/disable-etcd-tnf.sh
  • bindata/etcd/cluster-restore-tnf.sh
🧬 Code graph analysis (2)
bindata/etcd/disable-etcd-tnf.sh (2)
bindata/etcd/cluster-restore-tnf.sh (1)
  • source_required_dependency (19-27)
bindata/etcd/cluster-restore.sh (1)
  • source_required_dependency (21-29)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
  • source_required_dependency (21-29)
  • usage (34-40)
bindata/etcd/disable-etcd-tnf.sh (1)
  • source_required_dependency (17-25)
🔇 Additional comments (10)
bindata/etcd/cluster-restore.sh (1)

143-143: LGTM!

The addition of print_restore_completion_message at the end of the restore flow appropriately provides operators with post-restore guidance and verification instructions. The function is properly sourced from etcd-common-tools (line 32).

pkg/operator/starter.go (1)

471-478: LGTM!

Threading the Infrastructure lister into ScriptController initialization properly enables topology-aware script deployment. The lister is sourced from the already-initialized configInformers and aligns with the pattern used by other controllers in this file.

bindata/etcd/disable-etcd-tnf.sh (1)

30-34: LGTM!

The script correctly calls wait_for_podman_etcd_to_stop (from etcd-common-tools) and handles failures appropriately with a clear error message and non-zero exit code.

bindata/etcd/cluster-restore-tnf.sh (4)

70-110: LGTM!

The setup_pacemaker_restore function properly orchestrates the TNF-specific restore preparation:

  • Validates required environment (hostname, IP address)
  • Constructs appropriate etcdctl restore flags for single-member bootstrap
  • Disables the Pacemaker etcd resource and waits for clean shutdown
  • Safely moves configuration files with proper error handling

The array initialization at line 86 uses assignment (=), not append, so no pre-initialization is needed.


142-164: LGTM!

The backup and restore logic properly:

  • Backs up existing data-dir with cleanup of old backups
  • Completely removes and recreates data-dir for a clean restore
  • Uses the appropriate etcd client binary (etcdctl or etcdutl)
  • Includes a clear comment explaining why revision bumping is skipped

The error handling and success messaging provide good operator feedback.


166-181: LGTM!

The restart and completion logic appropriately:

  • Sets the force_new_cluster attribute for single-member bootstrap
  • Re-enables the Pacemaker etcd resource
  • Shows completion message regardless of wait timeout to provide verification instructions

The intentional non-exit on wait_for_podman_etcd_start failure (line 177-179) was confirmed in past reviews as correct behavior for providing operators with verification guidance even on timeout.


51-68: No action required. The 2-line check is appropriate for TNF's constrained topology.

TNF clusters are validated to contain exactly 2 nodes by the configuration layer (pkg/tnf/pkg/config/cluster.go enforces len(nodes.Items) == 2). The pacemaker cluster setup is also hardcoded for 2 nodes. The check for exactly 2 lines in the pcs status xml output correctly verifies that both nodes have their etcd resource started. This is not a fragile assumption but an appropriate implementation for a topology that is constrained by design, not convention.

pkg/operator/scriptcontroller/scriptcontroller.go (3)

28-32: LGTM!

The EnvVarGetter interface properly abstracts environment variable access and listener management, enabling clean dependency injection and testability. The interface methods align with their usage at lines 56 and 119.


42-56: LGTM!

The constructor signature and initialization properly incorporate the infrastructure lister and environment variable getter. The listener registration (line 56) ensures the controller receives updates when environment variables change.


129-154: LGTM!

The topology-aware script deployment logic properly:

  • Detects cluster topology using IsExternalEtcdCluster with appropriate error handling
  • Selects TNF-specific scripts (cluster-restore-tnf.sh, disable-etcd-tnf.sh) for external etcd clusters
  • Falls back to standard scripts for traditional deployments
  • Deploys common scripts (quorum-restore, cluster-backup, etcd-common-tools) for both topologies
  • Maps topology-specific scripts to standard names (cluster-restore.sh, disable-etcd.sh) so consumers don't need topology awareness

This design cleanly separates topology concerns from script consumers.

@clobrano
Copy link
Contributor Author

/retest-required

1 similar comment
@clobrano
Copy link
Contributor Author

/retest-required

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 18, 2025

@clobrano: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/unit 62a5b4c link true /test unit
ci/prow/e2e-agnostic-ovn 62a5b4c link true /test e2e-agnostic-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants