[TNF] support restore for pacemaker-managed etcd #1521

clobrano · 2025-12-10T13:14:53Z

Extend cluster-restore.sh to support restoring etcd data in Two Node with Fencing clusters, where etcd is controlled by a Pacemaker resource-agent and runs as a Podman container instead of a Kubernetes static pod.

The script preserves existing static pod restore behavior when Pacemaker is not detected.

Restore process for Pacemaker-managed etcd:

1. Auto-detect Pacemaker management or use the `PACEMAKER_MANAGED_ETCD` env var [new]
2. Disable the Pacemaker podman-etcd resource agent to stop the container [new - Pacemaker alternative to static pod stop]
3. Back up the existing etcd data directory [existing]
4. Move podman-etcd configuration files to the backup location [new]
5. Restore the snapshot using etcdctl with cluster initialization flags [existing, with new Pacemaker-specific flags added]
6. Set the `force_new_cluster` attribute for single-member bootstrap [new]
7. Re-enable the Pacemaker resource to start etcd with the restored data [new - Pacemaker alternative to static pod restart]
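
The branching between the two restore paths can be sketched as follows. This is a minimal illustration, not the script from this PR: the function names, the `PCS_BIN` override, the assumption that the value `true` forces Pacemaker mode, and the "is the `pcs` CLI installed" heuristic are all hypothetical. It only prints the ordered steps and runs no cluster commands.

```shell
#!/usr/bin/env bash
# Sketch only. Function names, PCS_BIN, and the detection heuristic are
# assumptions made for illustration; they are not taken from the PR.

# Return 0 when the Pacemaker restore path should be used.
# Assumption: PACEMAKER_MANAGED_ETCD=true forces Pacemaker mode; otherwise
# fall back to checking whether the pcs CLI is available on the node.
is_pacemaker_managed() {
  if [ "${PACEMAKER_MANAGED_ETCD:-}" = "true" ]; then
    return 0
  fi
  command -v "${PCS_BIN:-pcs}" >/dev/null 2>&1
}

# Print the ordered restore steps for the selected mode, one per line.
restore_plan() {
  if is_pacemaker_managed; then
    printf '%s\n' \
      "disable Pacemaker podman-etcd resource" \
      "back up etcd data directory" \
      "move podman-etcd configuration files to backup" \
      "restore snapshot with etcdctl" \
      "set force_new_cluster attribute" \
      "re-enable Pacemaker resource"
  else
    printf '%s\n' \
      "stop etcd static pod" \
      "back up etcd data directory" \
      "restore snapshot with etcdctl" \
      "restart etcd static pod"
  fi
}
```

Calling `PACEMAKER_MANAGED_ETCD=true restore_plan` prints the six Pacemaker steps regardless of what is installed on the node, which mirrors the "force" behavior described for the env var.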

New environment variables:

- `PACEMAKER_MANAGED_ETCD`: Force Pacemaker restore mode
- `ETCD_CONTAINER_NAME`: Podman container name (default: "etcd")
- `ETCD_ADVERTISE_IP`: Override advertise IP (auto-detected from etcd.env)
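
One way to resolve these variables, with explicit values taking precedence over defaults and auto-detection, is sketched below. The function name `resolve_restore_env` and the `EXAMPLE_ADVERTISE_IP=` key are hypothetical; the real layout of etcd.env and the lookup used by the script are not shown in this PR description, so treat the parsing as a placeholder.

```shell
#!/usr/bin/env bash
# Sketch only. resolve_restore_env and the EXAMPLE_ADVERTISE_IP key are
# hypothetical; the real etcd.env format may differ.

# Usage: resolve_restore_env /path/to/etcd.env
resolve_restore_env() {
  local env_file="$1"

  # Default documented in the PR description: container name "etcd".
  : "${ETCD_CONTAINER_NAME:=etcd}"

  # Only auto-detect when the caller did not override the advertise IP.
  if [ -z "${ETCD_ADVERTISE_IP:-}" ] && [ -r "$env_file" ]; then
    # Take the value of the first matching KEY=value line (assumed format).
    ETCD_ADVERTISE_IP="$(sed -n 's/^EXAMPLE_ADVERTISE_IP=//p' "$env_file" | sed -n 1p)"
  fi

  if [ -z "${ETCD_ADVERTISE_IP:-}" ]; then
    echo "could not determine the etcd advertise IP" >&2
    return 1
  fi

  export ETCD_CONTAINER_NAME ETCD_ADVERTISE_IP
}
```

Failing early when no IP can be determined keeps a later snapshot restore from running with an empty peer URL.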

openshift-ci · 2025-12-10T13:14:57Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2025-12-10T13:15:00Z

Walkthrough

Adds TNF-specific etcd restore and disable scripts, podman-etcd helper functions and a restore completion hook, refactors ScriptController to deploy topology-aware scripts with unit tests, and threads an Infrastructure lister into operator startup.

Changes

| Cohort / File(s) | Summary |
|---|---|
| TNF-specific etcd scripts: `bindata/etcd/cluster-restore-tnf.sh`, `bindata/etcd/disable-etcd-tnf.sh` | New Bash scripts for Pacemaker/TNF environments: `cluster-restore-tnf.sh` implements etcd snapshot restore flow (root check, env sourcing, advertise IP detection, etcdctl selection, snapshot restore, pacemaker interactions, file backups, and error handling); `disable-etcd-tnf.sh` stops the Pacemaker-managed etcd resource and waits for the container to stop. |
| Etcd common tools & restore message: `bindata/etcd/etcd-common-tools`, `bindata/etcd/cluster-restore.sh` | Added `wait_for_podman_etcd_to_stop()` (polling loop to detect podman-managed etcd stop) and `print_restore_completion_message()` to `etcd-common-tools`; appended a call to `print_restore_completion_message` at the end of `cluster-restore.sh`. |
| ScriptController topology-aware deployment: `pkg/operator/scriptcontroller/scriptcontroller.go` | Introduced `EnvVarGetter` interface, replaced the concrete env-var controller field with an `EnvVarGetter`, added an `infraLister` field and updated constructor signature to accept `InfrastructureLister` and `EnvVarGetter`. `manageScriptConfigMap` now selects and deploys TNF-specific vs standard `cluster-restore.sh` and `disable-etcd.sh` based on topology while still deploying common scripts. |
| Unit tests for script controller: `pkg/operator/scriptcontroller/scriptcontroller_test.go` | Added `getTestController` helper and tests `TestManageScriptConfigMap` and `TestManageScriptConfigMap_MissingEnvVars` covering multiple topology scenarios; asserts presence and contents of generated ConfigMap entries (topology-specific scripts, common scripts, and `etcd.env`). |
| Operator startup wiring: `pkg/operator/starter.go` | Threaded `configInformers.Config().V1().Infrastructures().Lister()` into operator startup call sites so the Infrastructure lister is passed into controller construction. |
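
For readers unfamiliar with the polling pattern mentioned for `wait_for_podman_etcd_to_stop()`, a generic version might look like the sketch below. It is not the function added in this PR: the name `wait_for_container_to_stop`, its arguments, and its defaults are hypothetical. It relies only on `podman ps`, which lists running containers, with the `-q` and `--filter name=` options.

```shell
#!/usr/bin/env bash
# Sketch only. wait_for_container_to_stop is a hypothetical helper, not the
# PR's wait_for_podman_etcd_to_stop; defaults are illustrative.

# Usage: wait_for_container_to_stop [name] [timeout_seconds] [interval_seconds]
# Returns 0 once no running container matches the name, 1 on timeout.
wait_for_container_to_stop() {
  local name="${1:-etcd}"
  local timeout_s="${2:-60}"
  local interval_s="${3:-2}"   # must be a positive integer
  local waited=0

  # `podman ps -q` prints IDs of running containers only, so empty output
  # means nothing matching the name filter is running. Note that the name
  # filter also matches partial names, and that a failing podman call yields
  # empty output and is therefore treated as "stopped" in this sketch.
  while [ -n "$(podman ps -q --filter "name=${name}")" ]; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "timed out after ${timeout_s}s waiting for container ${name} to stop" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
}
```

Returning a non-zero status on timeout lets a caller abort the restore instead of touching the data directory while etcd may still be running.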

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Review focus:
- pkg/operator/scriptcontroller/scriptcontroller.go — constructor signature changes, correct use of infraLister and EnvVarGetter, topology detection and script selection.
- bindata/etcd/cluster-restore-tnf.sh & bindata/etcd/disable-etcd-tnf.sh — shell safety, pacemaker interactions, timeout/error handling, backup/file operations.
- bindata/etcd/etcd-common-tools — correctness of polling loop and timeout behavior in wait_for_podman_etcd_to_stop, and content of print_restore_completion_message.
- pkg/operator/scriptcontroller/scriptcontroller_test.go — adequacy of topology coverage and mocking of infra lister and env vars.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

openshift-ci · 2025-12-10T13:15:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clobrano
Once this PR has been reviewed and has the lgtm label, please assign dusk125 for approval. For more information see the Code Review Process.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tjungblu · 2025-12-10T14:46:57Z