Skip to content

Comments

[Hyperfleet 547]: Pull Secret Service DDR#89

Open
ldornele wants to merge 2 commits intoopenshift-hyperfleet:mainfrom
ldornele:HYPERFLEET-547
Open

[Hyperfleet 547]: Pull Secret Service DDR#89
ldornele wants to merge 2 commits intoopenshift-hyperfleet:mainfrom
ldornele:HYPERFLEET-547

Conversation

@ldornele
Copy link
Contributor

This PR introduces the Pull Secret Service - Design Decision Record (DDR) documentation for HyperFleet.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 16, 2026

Important

Review skipped

Review was skipped as selected files did not have any reviewable changes.

💤 Files selected but had no reviewable changes (1)
  • hyperfleet/docs/pull-secret-service-ddr.md

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@ciaranRoche ciaranRoche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some work gone into this document, thanks for sticking with it and putting it through


## 2. Architecture Components

> **🔗 Design Principles Applied:**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include error recovery and partial failure handling. There is no discussion of what happens when external API calls fail. For example what if Quay succeeds but RHIT fails?


---

#### 3.2.3 Trigger Credential Rotation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add more details on credential rotation as it is not 100% clear. What is the dual credential period?
How long as both credentials valid?
How does the service know the new creds are being used by the cluster
What triggers deletion of old credentials?
Who is the actor that confirms the cluster is healthy with new creds?


**Endpoint**: `POST /v1/clusters/{cluster_id}/pull-secrets`

**Purpose**: Generate or retrieve pull secret for a cluster. Idempotent operation - returns existing credentials if already created.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some more information as to the how this is idempotent, what is the key? How are concurrent requests handled 🙏

EP4[POST /v1/clusters/ID/pull-secrets/rotations<br/>Trigger rotation]
EP5[GET /v1/clusters/ID/pull-secrets/rotations<br/>List rotations]
EP6[GET /v1/clusters/ID/pull-secrets/rotations/ID<br/>Get rotation status]
EP7[GET /v1/health<br/>Health check]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the standards in hyperfleet we have healthz and readyz

I also think it is worth mentioning what these actually check, liveness vs readiness!

EP5[GET /v1/clusters/ID/pull-secrets/rotations<br/>List rotations]
EP6[GET /v1/clusters/ID/pull-secrets/rotations/ID<br/>Get rotation status]
EP7[GET /v1/health<br/>Health check]
EP8[GET /v1/metrics<br/>Prometheus metrics]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth mentioning some metrics that are exposed, to help later on in determining SLI/SLO's for the service


**Benefits**:
- ✅ Single deployment to manage
- ✅ Shared credential pool (cost efficient)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see credential pool dotted around this document, can you add an explanation to what this pool is please

GeneratePullSecretRequest:
type: object
required:
- cluster_external_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct, prob some claude stuff left over, seems cluster_id, cluster_external_ID and ClusterID get interchanged in this document.

- ❌ **Pool Inefficiency**: Total pool size = N × high_water_mark (over-provisioning)
- ❌ **Operational Overhead**: Must manage N instances (upgrades, monitoring, backups)

**Cost Estimate (3 HyperFleet instances)**:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NP : we can probably remove the cost estimates from this document. Focus more on the use cases vs the cost

}
```

**Response 400 Bad Request**:
Copy link
Contributor

@rafabene rafabene Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error responses bellow are inconsistent with the
https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/stan
dards/error-model.md

All error responses in this DDR should follow the standard format, e.g.:

  {
    "type": "https://api.hyperfleet.io/errors/resource-not-found",
    "title": "Resource Not Found",
    "status": 404,
    "detail": "Pull secret for cluster 'cls-123' not found",
    "instance": "/v1/clusters/cls-123/pull-secrets",
    "code": "HYPERFLEET-NTF-001",
    "timestamp": "2025-01-15T10:31:00.456Z",
    "trace_id": "5cf92f3577b34da6a3ce929d0e0e4737"
  }


### 3.3 Error Handling

All error responses follow RFC 7807 (Problem Details for HTTP APIs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Section 3.3 references RFC 7807, but HyperFleet uses RFC 9457 (the successor) with application/problem+json


redhat:
enabled: true
partner_code: "ocm-service"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RHIT registry configuration example uses partner_code: "ocm-service", which
directly contradicts Principle 6 (Dedicated Partner Code) defined earlier in this same
document.

Principle 6 explicitly states:
"HyperFleet must establish its own partner identity with dedicated partner code hyperfleet"
"Using ocm-service would create quota conflicts and billing confusion"

The configuration should be:

  rhit:
    enabled: true
    base_url: "https://registry.access.redhat.com/api/v1"
    partner_code: "hyperfleet"  # NOT "ocm-service" - see Principle 6
    cert_secret: "rhit-client-cert"

This looks like a copy-paste artifact from the AMS codebase analysis.


**Layer 3: Encryption at Rest** - *Data protected in storage*
- **Purpose**: Encrypt tokens/passwords before saving to database
- **Implementation**: PostgreSQL `pgcrypto` extension with AES-256-GCM encryption
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The document claims "AES-256-GCM via pgcrypto" encryption (line ~554), but the code examples
(line ~611) use pgp_sym_encrypt / pgp_sym_decrypt, which is OpenPGP symmetric encryption, not
AES-256-GCM.

Key issues:

  1. Wrong algorithm: pgcrypto's pgp_sym_encrypt defaults to AES-128 in CFB mode, not
    AES-256-GCM. These are different algorithms with different security properties (GCM provides
    authenticated encryption; CFB does not).
  2. Key exposure risk: Using pgp_sym_encrypt means the encryption key is sent to PostgreSQL in
    every SQL query. This key can appear in:
    - PostgreSQL query logs (log_statement = 'all')
    - pg_stat_statements view
    - pg_stat_activity during execution
    - Database audit logs
  3. No key rotation strategy: The document mentions 90-day key rotation but doesn't discuss how existing rows would be re-encrypted with a new key.

For a credential management service, consider application-level encryption (encrypt in Go
before DB insert) using crypto/aes with GCM mode. This keeps the key out of database queries
entirely and matches the claimed algorithm

Pod3[Pod 3<br/>API :8080<br/>Metrics :8081]
end

Service[Service: pull-secret-service<br/>Type: ClusterIP<br/>Ports: 80 API, 8081 Metrics]
Copy link
Contributor

@rafabene rafabene Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cluster_external_id:
type: string
example: cls-abc123def456
cloud_provider:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This contradicts Principle 2 (Cloud Agnostic), which states:
"No cloud-specific parameters in API contract"

  1. It creates cloud-specific coupling in what should be a cloud-agnostic service
  2. Adding a new cloud provider requires an API schema change (new enum value)
  3. The caller (adapter) shouldn't need to tell the service which cloud it runs on — the
    service shouldn't care

style RHIT fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
```

### 4.4 Kubernetes Deployment Topology (Both Options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Section "4.4 Kubernetes Deployment Topology (Both Options)" appears twice


The Pull Secret Service exposes a **RESTful API** for credential lifecycle management. All endpoints are authenticated using Kubernetes ServiceAccount tokens and follow standard HTTP semantics.

### 3.1 API Overview
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API Overview diagram lists 8 endpoints, but only 5 are specified in the "Core
Endpoints" section. Two endpoints are missing their detailed specification:

1. DELETE /v1/clusters/{cluster_id}/pull-secrets

No documentation for:

  • Expected behavior (soft delete? hard delete? cascade to rotations?)
  • What happens to active/pending rotations when credentials are deleted
  • Response body and status codes (200 vs 204?)
  • Whether it revokes credentials at the registry level (Quay robot account deletion, RHIT SA
    removal) or only deletes the local record

2. GET /v1/clusters/{cluster_id}/pull-secrets/rotations

No documentation for:

  • Pagination strategy (offset-based? cursor-based?)
  • Filter/query parameters (by status? by date range?)
  • Response format and sort order
  • Whether it returns all historical rotations or only recent ones

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants