@fonta-rh

Add etcd troubleshooting skill for Claude Code

Adds a comprehensive Claude Code skill that helps troubleshoot etcd issues on two-node fencing clusters.

The skill enables automated diagnosis and remediation of common etcd/Pacemaker problems.

New feature: Claude Code Skill (.claude/commands/etcd/):

  • Interactive troubleshooting capability with systematic decision trees
  • Automated diagnostic data collection from Pacemaker, etcd, and OpenShift
  • Analysis guidelines for cluster state, resource failures, and error patterns
  • Remediation recommendations with verification steps
  • Permission framework for safe read-only operations vs. requiring approval for state changes

Diagnostic Tools:

  • Ansible playbooks for validation and data collection
  • Shell scripts with automatic proxy.env detection for cluster access
  • Master orchestration script collecting both VM and cluster-level diagnostics

Helper:

  • force-new-cluster.yml - Automated recovery for cluster ID mismatches and split-brain scenarios

Documentation:

  • Etcd operations guide (clustering, recovery, monitoring, failures)
  • Pacemaker administration reference
  • Comprehensive skill usage and permission configuration docs

Tested with:

  • Cluster ID mismatch recovery
  • Resource failure cleanup
  • Failed learner rejoins
  • Transient CIB communication errors

@openshift-ci openshift-ci bot requested review from clobrano and eggfoobar October 29, 2025 10:27
@openshift-ci
openshift-ci bot commented Oct 29, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fonta-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 29, 2025
@fonta-rh fonta-rh changed the title Claude tool: etcd troubleshooting skill NO-JIRA: Claude tool - etcd troubleshooting skill Oct 29, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 29, 2025
@openshift-ci-robot
@fonta-rh: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@fonta-rh fonta-rh added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Oct 29, 2025
@clobrano clobrano left a comment

I left some comments


```bash
# From repository root
./.claude/commands/etcd/scripts/collect-all-diagnostics.sh
```
Having scripts in a hidden directory works against discoverability. What's the reasoning behind this choice?
Also, .claude/commands is a special directory for Claude. One could try to execute the script as a "custom command":

[in Claude CLI] > /etcd:scripts:collect-all-diagnostics.sh

I'm curious to see if this works :D

@@ -0,0 +1,2052 @@
#!/bin/sh
Is there a way to keep this file in sync automatically?
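
One way to approach the question above is a drift check that compares the vendored copy against a reference copy and reports the result, for example from CI. The following is only a sketch: `check_in_sync` is a hypothetical helper, and both file paths are supplied by the caller because the PR does not say where the reference copy lives.

```shell
#!/bin/sh
# Hypothetical helper, not part of the PR: report whether a vendored file
# still matches a reference copy. Both paths are chosen by the caller.
check_in_sync() {
    vendored=$1
    reference=$2

    # Treat a missing file on either side as its own state, not as drift.
    if [ ! -f "$vendored" ] || [ ! -f "$reference" ]; then
        echo "missing"
        return 2
    fi

    # cmp -s compares byte by byte and stays silent; only the exit status is used.
    if cmp -s "$vendored" "$reference"; then
        echo "in-sync"
        return 0
    fi

    echo "drifted"
    return 1
}
```

A CI job could call `check_in_sync <vendored-path> <reference-path>` and fail on any non-zero status, which detects drift but does not by itself update the file.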

  when: podman_inspect.rc == 0

- name: Get podman logs for etcd
  ansible.builtin.command: podman logs --tail 100 etcd
Why limit to 100 lines? Would it be too much to get all logs?
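
A possible middle ground is to keep 100 lines as the default but let the caller ask for more, or for everything. The sketch below only builds the argument list for `podman logs`; the function name `etcd_logs_args` and the `ETCD_LOG_TAIL` variable are assumptions made for illustration and do not appear in the PR.

```shell
#!/bin/sh
# Hypothetical helper: print the podman arguments for the requested amount
# of etcd log output. The argument is a positive integer, or "all" to fetch
# the complete log (podman prints every line when --tail is omitted).
etcd_logs_args() {
    tail_lines=${1:-100}   # keep the playbook's current default of 100
    case $tail_lines in
        all)
            echo "logs etcd"
            ;;
        *[!0-9]*)
            echo "invalid tail value: $tail_lines" >&2
            return 1
            ;;
        *)
            echo "logs --tail $tail_lines etcd"
            ;;
    esac
}

# Example use (ETCD_LOG_TAIL is a hypothetical environment variable):
#   podman $(etcd_logs_args "${ETCD_LOG_TAIL:-100}")
```

Whether fetching the full log is acceptable depends on how large the container log has grown, so a bounded default with an explicit opt-in to "all" may be the safer choice.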

Comment on lines +9 to +10
- Added `force-new-cluster.yml` for automated etcd cluster recovery via CIB attributes
- Ansible conversion of Carlo Lobrano's shell script using `cluster_vms` inventory group
Please add a warning about the playbook assuming the first node in the inventory is the leader.
Another option would be to add some confirmation step in the playbook itself, to ensure the target node is correct.
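
The confirmation step suggested above could look roughly like the following when expressed in shell. This is a sketch under assumptions: `confirm_target` is a hypothetical function, the prompt wording is invented, and inside the playbook itself the equivalent would more likely be a prompt task such as `ansible.builtin.pause` rather than a shell function.

```shell
#!/bin/sh
# Hypothetical guard: ask the operator to retype the node name and succeed
# only on an exact match. The prompt goes to stderr so stdout stays clean.
confirm_target() {
    node=$1
    printf 'About to force a new etcd cluster on "%s".\n' "$node" >&2
    printf 'Retype the node name to continue: ' >&2

    # A closed or empty stdin (for example in unattended runs) counts as "no".
    IFS= read -r answer || return 1

    [ "$answer" = "$node" ]
}

# Example use:
#   confirm_target "$TARGET_NODE" || { echo "aborted" >&2; exit 1; }
```

Requiring the full node name instead of a plain "y" makes it harder to confirm the wrong target by reflex, which matters when the first inventory entry is not the current leader.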
