-
Notifications
You must be signed in to change notification settings - Fork 149
[TNF] support restore for pacemaker-managed etcd #1521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[TNF] support restore for pacemaker-managed etcd #1521
Conversation
|
Skipping CI for Draft Pull Request. |
WalkthroughAdds TNF-specific etcd restore and disable scripts, podman-etcd helper functions and a restore completion hook, refactors ScriptController to deploy topology-aware scripts with unit tests, and threads an Infrastructure lister into operator startup. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
✨ Finishing touches
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: clobrano The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
bindata/etcd/cluster-restore.sh
Outdated
| return 0 | ||
| fi | ||
|
|
||
| if systemctl status pacemaker >/dev/null 2>&1 ; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is smart, until somebody else comes around and uses pacemaker on the control plane 🤣
Can we make this more specific? there is likely the infra CRD somewhere on /etc/kubernetes/manifest.
Otherwise, this is also a advantage of having your own restore script installed by the operator only on TNF - that gives you the guarantee that you definitely run on TNF, allowing you to remove this altogether
bindata/etcd/cluster-restore.sh
Outdated
| function get_etcd_advertise_ip() { | ||
| # Try to detect IP from etcd.env (NODE_<nodename>_IP variable) | ||
| NODENAME_UNDERSCORE=$(echo "${NODENAME}" | tr '-' '_') | ||
| NODE_IP_VAR="NODE_${NODENAME_UNDERSCORE}_IP" | ||
| IP="${!NODE_IP_VAR}" | ||
| echo "$IP" | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it always be localhost:2379?
bindata/etcd/cluster-restore.sh
Outdated
| done | ||
|
|
||
| echo "timed out waiting for etcd resources to start (timeout: $timeout seconds)" | ||
| return 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like you would want to exit 1 here directly, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you also don't exit on the return of a 1 below
bindata/etcd/cluster-restore.sh
Outdated
| function ensure_etcd_container_stopped() { | ||
| echo "ensure ${ETCD_CONTAINER_NAME} container is stopped" | ||
| local start=$SECONDS | ||
| local timeout=120 | ||
| local disable_etcd_called=0 | ||
| while [ $((SECONDS - start)) -lt "$timeout" ]; do | ||
| if podman container exists "${ETCD_CONTAINER_NAME}" && \ | ||
| [[ "$(podman inspect -f '{{.State.Status}}' "${ETCD_CONTAINER_NAME}")" != "exited" ]]; then | ||
|
|
||
| if [ "$disable_etcd_called" -eq 1 ]; then | ||
| sleep 1 | ||
| continue | ||
| fi | ||
|
|
||
| if ! pcs resource disable etcd; then | ||
| echo "could not disable podman-etcd. Please ensure the ${ETCD_CONTAINER_NAME} container is stopped" | ||
| exit 1 | ||
| fi | ||
|
|
||
| disable_etcd_called=1 | ||
| continue | ||
| fi | ||
|
|
||
| echo "${ETCD_CONTAINER_NAME} container is stopped" | ||
| return 0 | ||
| done | ||
|
|
||
| echo "timed out waiting for ${ETCD_CONTAINER_NAME} to stop" | ||
| return 1 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably a good candidate for ./disable-etcd.sh as a TNF specific script
bindata/etcd/cluster-restore.sh
Outdated
|
|
||
| # Extend restore flags with Pacemaker-specific arguments | ||
| ETCDCTL_RESTORE_FLAGS+=( | ||
| --skip-hash-check=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is definitely something you should keep and not skip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
podman-etcd also makes copies (actually using cp) of the etcd db when a learner starts, which seemed to be incompatible with the hash check.
I was indeed considering to give some "special" name to such backups and let the script decide whether to use or not the hash check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which seemed to be incompatible with the hash check.
oof, that's probably something for us to take a deeper look at - I assume those are corrupted?
bindata/etcd/cluster-restore.sh
Outdated
| --name="$NODENAME" | ||
| --initial-cluster="$NODENAME=https://$IP:2380" | ||
| --initial-advertise-peer-urls="https://$IP:2380" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you really need this, given that you just go and force-new-cluster in the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I tried first without and failed with some "mismatch" error, IIRC
bindata/etcd/cluster-restore.sh
Outdated
| echo "could not wait for etcd resources to restart" | ||
| fi | ||
|
|
||
| cat << 'EOF' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually a really awesome idea, thanks for adding it
bindata/etcd/cluster-restore.sh
Outdated
| $ oc get co/etcd -w | ||
| (Wait to see if the etcd operator completes the rollout successfully) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's also oc adm wait-for-stable-cluster
bindata/etcd/cluster-restore.sh
Outdated
| Then monitor again: | ||
| $ oc get co/etcd -w | ||
| 3. IF revision.json is still not being re-created on this node after the patch, manually restart the static pods: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkill cluster-etcd-operator might also do the trick, even though this is slightly less nice than what you're doing below ;)
bindata/etcd/cluster-restore.sh
Outdated
| $ oc get co/etcd -w | ||
| (Wait to see if the etcd operator completes the rollout successfully) | ||
| 2. IF the rollout does NOT complete, apply the forceRedeploymentReason patch: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also apply the oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TNF clusters already have this set to allow external Etcd management
17397af to
ec62339
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/operator/scriptcontroller/scriptcontroller.go (1)
132-134: Use%wfor proper error wrapping.The error at line 133 uses
%vwhich loses the original error type. Using%wenables callers to useerrors.Is()anderrors.Unwrap().- if isTNF, err := ceohelpers.IsExternalEtcdCluster(ctx, c.infraLister); err != nil { - return nil, false, fmt.Errorf("failed to detect cluster topology: %v", err) + if isTNF, err := ceohelpers.IsExternalEtcdCluster(ctx, c.infraLister); err != nil { + return nil, false, fmt.Errorf("failed to detect cluster topology: %w", err)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (7)
bindata/etcd/cluster-restore-tnf.sh(1 hunks)bindata/etcd/cluster-restore.sh(1 hunks)bindata/etcd/disable-etcd-tnf.sh(1 hunks)bindata/etcd/etcd-common-tools(1 hunks)pkg/operator/scriptcontroller/scriptcontroller.go(3 hunks)pkg/operator/scriptcontroller/scriptcontroller_test.go(1 hunks)pkg/operator/starter.go(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
bindata/etcd/etcd-common-toolspkg/operator/scriptcontroller/scriptcontroller_test.gobindata/etcd/disable-etcd-tnf.shbindata/etcd/cluster-restore.shpkg/operator/starter.gobindata/etcd/cluster-restore-tnf.shpkg/operator/scriptcontroller/scriptcontroller.go
🧬 Code graph analysis (2)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
source_required_dependency(21-29)usage(34-40)bindata/etcd/disable-etcd-tnf.sh (1)
source_required_dependency(17-25)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
Enqueueable(65-67)pkg/operator/ceohelpers/external_etcd_status.go (1)
IsExternalEtcdCluster(23-35)bindata/assets.go (1)
MustAsset(15-21)
🔇 Additional comments (16)
bindata/etcd/cluster-restore.sh (1)
143-143: LGTM!Clean addition of the post-restore completion message hook. The function is properly sourced from
etcd-common-toolsearlier in the script.bindata/etcd/disable-etcd-tnf.sh (1)
1-28: LGTM!Script structure follows the established patterns from other etcd scripts. Proper root check, strict mode settings, and dependency sourcing.
bindata/etcd/etcd-common-tools (2)
96-115: Robust implementation with clear timeout handling.The function properly handles the detection flow and timeout. One consideration: the
crm_resource --locatecheck may return an error when the resource doesn't exist vs. when it's stopped. The current logic handles this by continuing to retry, which is reasonable.
117-144: LGTM!Good use of single-quoted
'EOF'to prevent variable expansion in the heredoc, keeping the$()constructs literal for user execution. The step-by-step remediation instructions are helpful.pkg/operator/starter.go (1)
471-478: LGTM!Clean integration of the infrastructure lister into the script controller. This enables topology-aware script deployment (TNF vs standard) while reusing the existing informer infrastructure already initialized in
RunOperator.pkg/operator/scriptcontroller/scriptcontroller_test.go (3)
20-35: LGTM!Clean helper function that properly constructs test dependencies with fakes. Good separation of test setup from assertions.
37-117: Well-structured table-driven tests covering topology variants.Good coverage of the three topology modes with both positive (expected content) and negative (unexpected content) assertions. The verification of common scripts presence ensures shared functionality is not accidentally removed.
119-131: LGTM!Good error case coverage. Verifying the specific error message ensures the error is from the expected code path.
bindata/etcd/cluster-restore-tnf.sh (5)
43-49: Indirect variable expansion is valid but verifyNODENAMEis set.The
${!NODE_IP_VAR}indirect expansion works correctly in Bash. The function assumesNODENAMEis set before being called; this is handled insetup_pacemaker_restoreat line 73-77 which sets it with a default fromhostname.
51-68: Consider making the expected node count configurable.The check
[ "$(echo "$output" | wc -l)" -eq 2 ]assumes exactly 2 nodes for TNF. This is correct for dual-replica topology but hardcodes the assumption. Since this is a TNF-specific script, this is acceptable.
85-92: Verify--skip-hash-check=trueis intentional.Past review discussion indicates this was added due to podman-etcd making copies that are "incompatible with the hash check." If this is due to backup corruption, that's a deeper issue to investigate. If the copies are valid but have different hashes due to the copy mechanism, this is acceptable.
Based on past review comments, this appears to be a known workaround. Please confirm the backups are not corrupted and this is expected behavior for podman-etcd copies.
1-42: LGTM!Script header, strict mode settings, root check, and dependency sourcing follow the established patterns from other etcd scripts. The simplified usage message correctly reflects that TNF restore only needs the snapshot (not static pod resources).
113-165: LGTM!The main restore flow is well-structured with proper validation and error handling. Good safeguard at line 114-117 preventing BACKUP_DIR from being the same as ETCD_DATA_DIR.
pkg/operator/scriptcontroller/scriptcontroller.go (3)
28-32: Good abstraction for testability.The
EnvVarGetterinterface cleanly abstracts the environment variable controller, enabling easier unit testing through mocking.
129-140: LGTM - topology-aware script selection is well structured.The conditional logic correctly selects TNF-specific or standard scripts based on the cluster topology, and the scripts are deployed under consistent names for downstream consumers.
152-154: All required script assets are present in bindata/etcd/: both TNF-specific (cluster-restore-tnf.sh,disable-etcd-tnf.sh) and standard (cluster-restore.sh,disable-etcd.sh) scripts exist and are properly referenced by the conditional topology detection logic.
ec62339 to
e46a48e
Compare
|
/retest-required |
e46a48e to
4873375
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
bindata/etcd/cluster-restore-tnf.sh (1)
131-134: Simplify ETCD_CLIENT initialization.The expression
${ETCD_ETCDCTL_BIN+etcdctl}always evaluates to"etcdctl"whenETCD_ETCDCTL_BINis set (which it always is frometcd-common-tools). This seems like it should reference the variable value instead.🔎 Apply this diff to fix the initialization:
-ETCD_CLIENT="${ETCD_ETCDCTL_BIN+etcdctl}" +ETCD_CLIENT="${ETCD_ETCDCTL_BIN:-etcdctl}" if [ -n "${ETCD_ETCDUTL_BIN}" ]; then ETCD_CLIENT="${ETCD_ETCDUTL_BIN}" fiThis uses the variable's value if set, or defaults to "etcdctl" if unset.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (7)
bindata/etcd/cluster-restore-tnf.sh(1 hunks)bindata/etcd/cluster-restore.sh(1 hunks)bindata/etcd/disable-etcd-tnf.sh(1 hunks)bindata/etcd/etcd-common-tools(1 hunks)pkg/operator/scriptcontroller/scriptcontroller.go(3 hunks)pkg/operator/scriptcontroller/scriptcontroller_test.go(1 hunks)pkg/operator/starter.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- bindata/etcd/cluster-restore.sh
- bindata/etcd/disable-etcd-tnf.sh
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
bindata/etcd/etcd-common-toolspkg/operator/scriptcontroller/scriptcontroller.gobindata/etcd/cluster-restore-tnf.shpkg/operator/scriptcontroller/scriptcontroller_test.gopkg/operator/starter.go
🧬 Code graph analysis (2)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
Enqueueable(65-67)pkg/operator/ceohelpers/external_etcd_status.go (1)
IsExternalEtcdCluster(23-35)bindata/assets.go (1)
MustAsset(15-21)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
source_required_dependency(21-29)usage(34-40)bindata/etcd/disable-etcd-tnf.sh (1)
source_required_dependency(17-25)
🔇 Additional comments (11)
bindata/etcd/etcd-common-tools (2)
96-115: LGTM! The wait function implementation is solid.The polling logic correctly handles three states: command failure (retry), non-empty location (etcd still running), and empty location (stopped). The 120-second timeout is reasonable for Pacemaker-managed etcd stop operations.
117-143: Clear and actionable completion message.The step-by-step instructions with concrete commands will help operators verify and troubleshoot the restore process. Good use of a heredoc for multi-line output.
pkg/operator/starter.go (1)
471-478: LGTM! Correct wiring of infrastructure lister and env var controller.The infrastructure lister is properly obtained from the config informers and passed alongside the existing
envVarControllerto enable topology-aware script deployment.pkg/operator/scriptcontroller/scriptcontroller_test.go (3)
20-35: LGTM! Clean test helper setup.The helper correctly initializes the controller with fake dependencies and a no-op enqueue function for testing.
37-117: Good topology coverage with bidirectional assertions.The test cases effectively verify that each topology gets its specific scripts while excluding content from other topologies. Testing both
expectedContentandunexpectedContentis a solid approach to catch incorrect script selection.
119-131: LGTM! Error handling test validates graceful failure.Good coverage for the missing environment variables scenario.
bindata/etcd/cluster-restore-tnf.sh (3)
143-155: LGTM! Data directory backup and restore sequence is well-structured.The script correctly backs up the existing member directory before removal, and creates a fresh data directory for the restore. The sequence prevents accidental data loss.
167-182: LGTM! Pacemaker integration for cluster bootstrap.The use of
crm_attributeforforce_new_clusterand subsequentpcs resource enablecorrectly implements the Pacemaker-based etcd restart flow. The decision to continue toprint_restore_completion_messageeven on timeout (per past review discussion) provides operators with guidance regardless of wait outcome.
51-68: The hardcoded check for exactly 2 etcd instances may not align with single-node initialization and lacks documentation.Line 58 checks
[ "$(echo "$output" | wc -l)" -eq 2 ], but the script initializes with a single-node cluster (--initial-cluster="$NODENAME=..."). This mismatch is unexplained. Either document why exactly 2 lines are expected (e.g., "TNF always has 2 nodes"), or make this check more robust by verifying>= 1if the goal is confirming at least one instance started.pkg/operator/scriptcontroller/scriptcontroller.go (2)
28-32: LGTM! Clean interface design.The
EnvVarGetterinterface properly abstracts the environment variable functionality needed by the controller, following Go interface conventions.
129-155: LGTM! Topology-aware script selection is well-implemented.The logic correctly:
- Detects TNF topology via
IsExternalEtcdCluster- Handles the error case appropriately
- Selects the correct script variants based on topology
- Deploys them under standard names (
cluster-restore.sh,disable-etcd.sh)This allows consumers to use consistent script names regardless of the underlying topology.
4873375 to
5804bac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
bindata/etcd/cluster-restore-tnf.sh (1)
19-27: Consider consolidating duplicate helper functions.The
source_required_dependencyfunction is duplicated acrosscluster-restore-tnf.sh,disable-etcd-tnf.sh(lines 17-25), andcluster-restore.sh(lines 20-28). While this duplication ensures self-contained scripts, consider moving such common helpers toetcd-common-toolsif these functions evolve or grow in complexity.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (7)
bindata/etcd/cluster-restore-tnf.sh(1 hunks)bindata/etcd/cluster-restore.sh(1 hunks)bindata/etcd/disable-etcd-tnf.sh(1 hunks)bindata/etcd/etcd-common-tools(1 hunks)pkg/operator/scriptcontroller/scriptcontroller.go(3 hunks)pkg/operator/scriptcontroller/scriptcontroller_test.go(1 hunks)pkg/operator/starter.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- bindata/etcd/etcd-common-tools
- pkg/operator/scriptcontroller/scriptcontroller_test.go
- bindata/etcd/cluster-restore.sh
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
bindata/etcd/disable-etcd-tnf.shpkg/operator/scriptcontroller/scriptcontroller.gobindata/etcd/cluster-restore-tnf.shpkg/operator/starter.go
🧬 Code graph analysis (2)
pkg/operator/scriptcontroller/scriptcontroller.go (3)
pkg/etcdenvvar/envvarcontroller.go (1)
Enqueueable(65-67)pkg/operator/ceohelpers/external_etcd_status.go (1)
IsExternalEtcdCluster(23-35)bindata/assets.go (1)
MustAsset(15-21)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
source_required_dependency(21-29)usage(34-40)bindata/etcd/disable-etcd-tnf.sh (1)
source_required_dependency(17-25)
🔇 Additional comments (10)
bindata/etcd/cluster-restore-tnf.sh (4)
43-49: LGTM!The variable indirection pattern with
${!NODE_IP_VAR}correctly retrieves the IP address from dynamically constructed environment variable names.
51-68: LGTM!The polling logic with a 5-minute timeout and XML parsing to detect etcd startup is appropriate. The check for exactly 2 lines on line 58 appears to be a topology-specific validation.
94-102: LGTM!The resource disable flow with
pcs resource disablefollowed bywait_for_podman_etcd_to_stop(from etcd-common-tools) properly ensures a clean shutdown before proceeding with the restore.
178-182: LGTM!The approach of printing the completion message even on timeout (lines 178-180) is intentional per previous discussion. This ensures operators always receive verification instructions, which is appropriate since timeout doesn't necessarily indicate restore failure.
pkg/operator/starter.go (1)
471-478: LGTM!The addition of the infrastructure lister at line 475 correctly wires topology information into the ScriptController, enabling topology-aware script deployment. This follows the established pattern of passing listers to controllers.
bindata/etcd/disable-etcd-tnf.sh (2)
1-15: LGTM!The script setup with strict shell options (
errexit,pipefail,errtrace) and root check follows best practices for operational scripts.
27-34: LGTM!The script correctly sources required dependencies and calls
wait_for_podman_etcd_to_stopfrom etcd-common-tools with appropriate error handling. The past review concern about the undefined function has been resolved.pkg/operator/scriptcontroller/scriptcontroller.go (3)
28-32: LGTM!The
EnvVarGetterinterface provides a clean abstraction over environment variable access and listener management, improving testability and decoupling from concrete implementations.
34-40: LGTM!The refactoring to use
InfraListerandEnvVarGetterinterface (lines 37-38, 46-47, 53-54) properly enables topology detection while maintaining clean separation of concerns. The initialization is correct and follows established patterns.Also applies to: 42-55
129-154: LGTM!The topology-aware script selection logic is well-implemented:
- Line 132: Proper error handling for topology detection failure
- Lines 135-140: Clear conditional selection of TNF vs standard scripts
- Lines 142-150: Common scripts (quorum-restore.sh, cluster-backup.sh, etcd-common-tools) deployed regardless of topology
- Lines 153-154: Topology-specific scripts deployed under standardized names (cluster-restore.sh, disable-etcd.sh), ensuring consistent paths for consuming code
This approach maintains backward compatibility while supporting TNF deployments.
Add dedicated etcd management scripts for Two Node with Fencing (TNF) deployments that use Pacemaker-managed Podman containers instead of Kubernetes static pods. Changes: - Add cluster-restore-tnf.sh with Pacemaker resource management - setup_pacemaker_restore() stops etcd resource and prepares restore flags - wait_for_podman_etcd_start() verifies etcd resource starts via pcs status - get_etcd_advertise_ip() detects advertise IP from etcd.env - Uses crm_attribute to set force_new_cluster on restart - Add disable-etcd-tnf.sh for disabling Pacemaker-managed etcd - Uses 'pcs resource disable etcd' instead of moving static pod manifests - Extend etcd-common-tools with shared TNF utilities - wait_for_podman_etcd_to_stop() waits for crm_resource to confirm stop - print_restore_completion_message() outputs rollout monitoring instructions These scripts will be conditionally deployed by the cluster-etcd-operator based on control plane topology detection.
Implement conditional script deployment in the script controller to automatically deploy TNF-specific or standard etcd scripts based on the detected control plane topology. This maintains backward compatibility while enabling automatic support for TNF deployments without manual script selection.
5804bac to
62a5b4c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
bindata/etcd/disable-etcd-tnf.sh (1)
17-25: Consider extracting shared helper to reduce duplication.The
source_required_dependencyfunction is duplicated acrosscluster-restore.sh,cluster-restore-tnf.sh, and this file. While consistency is valuable, this duplication could lead to maintenance issues if the error handling logic needs to change.🔎 Optional: Move to etcd-common-tools
Consider moving this helper to
etcd-common-toolsso it can be shared across all scripts. However, this creates a chicken-and-egg problem (you need the helper to source the file that contains it), so the current duplication may be acceptable.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (7)
bindata/etcd/cluster-restore-tnf.sh(1 hunks)bindata/etcd/cluster-restore.sh(1 hunks)bindata/etcd/disable-etcd-tnf.sh(1 hunks)bindata/etcd/etcd-common-tools(1 hunks)pkg/operator/scriptcontroller/scriptcontroller.go(3 hunks)pkg/operator/scriptcontroller/scriptcontroller_test.go(1 hunks)pkg/operator/starter.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- bindata/etcd/etcd-common-tools
- pkg/operator/scriptcontroller/scriptcontroller_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
pkg/operator/starter.gobindata/etcd/cluster-restore.shpkg/operator/scriptcontroller/scriptcontroller.gobindata/etcd/disable-etcd-tnf.shbindata/etcd/cluster-restore-tnf.sh
🧬 Code graph analysis (2)
bindata/etcd/disable-etcd-tnf.sh (2)
bindata/etcd/cluster-restore-tnf.sh (1)
source_required_dependency(19-27)bindata/etcd/cluster-restore.sh (1)
source_required_dependency(21-29)
bindata/etcd/cluster-restore-tnf.sh (2)
bindata/etcd/cluster-restore.sh (2)
source_required_dependency(21-29)usage(34-40)bindata/etcd/disable-etcd-tnf.sh (1)
source_required_dependency(17-25)
🔇 Additional comments (10)
bindata/etcd/cluster-restore.sh (1)
143-143: LGTM!The addition of
print_restore_completion_messageat the end of the restore flow appropriately provides operators with post-restore guidance and verification instructions. The function is properly sourced frometcd-common-tools(line 32).pkg/operator/starter.go (1)
471-478: LGTM!Threading the Infrastructure lister into
ScriptControllerinitialization properly enables topology-aware script deployment. The lister is sourced from the already-initializedconfigInformersand aligns with the pattern used by other controllers in this file.bindata/etcd/disable-etcd-tnf.sh (1)
30-34: LGTM!The script correctly calls
wait_for_podman_etcd_to_stop(frometcd-common-tools) and handles failures appropriately with a clear error message and non-zero exit code.bindata/etcd/cluster-restore-tnf.sh (4)
70-110: LGTM!The
setup_pacemaker_restorefunction properly orchestrates the TNF-specific restore preparation:
- Validates required environment (hostname, IP address)
- Constructs appropriate etcdctl restore flags for single-member bootstrap
- Disables the Pacemaker etcd resource and waits for clean shutdown
- Safely moves configuration files with proper error handling
The array initialization at line 86 uses assignment (
=), not append, so no pre-initialization is needed.
142-164: LGTM!The backup and restore logic properly:
- Backs up existing data-dir with cleanup of old backups
- Completely removes and recreates data-dir for a clean restore
- Uses the appropriate etcd client binary (etcdctl or etcdutl)
- Includes a clear comment explaining why revision bumping is skipped
The error handling and success messaging provide good operator feedback.
166-181: LGTM!The restart and completion logic appropriately:
- Sets the
force_new_clusterattribute for single-member bootstrap- Re-enables the Pacemaker etcd resource
- Shows completion message regardless of wait timeout to provide verification instructions
The intentional non-exit on
wait_for_podman_etcd_startfailure (line 177-179) was confirmed in past reviews as correct behavior for providing operators with verification guidance even on timeout.
51-68: No action required. The 2-line check is appropriate for TNF's constrained topology.TNF clusters are validated to contain exactly 2 nodes by the configuration layer (
pkg/tnf/pkg/config/cluster.goenforceslen(nodes.Items) == 2). The pacemaker cluster setup is also hardcoded for 2 nodes. The check for exactly 2 lines in thepcs status xmloutput correctly verifies that both nodes have their etcd resource started. This is not a fragile assumption but an appropriate implementation for a topology that is constrained by design, not convention.pkg/operator/scriptcontroller/scriptcontroller.go (3)
28-32: LGTM!The
EnvVarGetterinterface properly abstracts environment variable access and listener management, enabling clean dependency injection and testability. The interface methods align with their usage at lines 56 and 119.
42-56: LGTM!The constructor signature and initialization properly incorporate the infrastructure lister and environment variable getter. The listener registration (line 56) ensures the controller receives updates when environment variables change.
129-154: LGTM!The topology-aware script deployment logic properly:
- Detects cluster topology using
IsExternalEtcdClusterwith appropriate error handling- Selects TNF-specific scripts (
cluster-restore-tnf.sh,disable-etcd-tnf.sh) for external etcd clusters- Falls back to standard scripts for traditional deployments
- Deploys common scripts (quorum-restore, cluster-backup, etcd-common-tools) for both topologies
- Maps topology-specific scripts to standard names (
cluster-restore.sh,disable-etcd.sh) so consumers don't need topology awarenessThis design cleanly separates topology concerns from script consumers.
|
/retest-required |
1 similar comment
|
/retest-required |
|
@clobrano: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Extend cluster-restore.sh to support restoring etcd data in Two Node with Fencing clusters, where etcd is controlled by a Pacemaker resource-agent and runs as a Podman container instead of a Kubernetes static pod.
The script preserves existing static pod restore behavior when Pacemaker is not detected.
Restore process for Pacemaker-managed etcd:
New environment variables: