[Hyperfleet 547]: Pull Secret Service DDR#89
[Hyperfleet 547]: Pull Secret Service DDR#89ldornele wants to merge 2 commits intoopenshift-hyperfleet:mainfrom
Conversation
|
Important Review skippedReview was skipped as selected files did not have any reviewable changes. 💤 Files selected but had no reviewable changes (1)
You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing touches🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Comment |
ciaranRoche
left a comment
There was a problem hiding this comment.
Some work gone into this document, thanks for sticking with it and putting it through
|
|
||
| ## 2. Architecture Components | ||
|
|
||
| > **🔗 Design Principles Applied:** |
There was a problem hiding this comment.
Should we include error recovery and partial failure handling. There is no discussion of what happens when external API calls fail. For example what if Quay succeeds but RHIT fails?
|
|
||
| --- | ||
|
|
||
| #### 3.2.3 Trigger Credential Rotation |
There was a problem hiding this comment.
Can you add more details on credential rotation as it is not 100% clear. What is the dual credential period?
How long as both credentials valid?
How does the service know the new creds are being used by the cluster
What triggers deletion of old credentials?
Who is the actor that confirms the cluster is healthy with new creds?
|
|
||
| **Endpoint**: `POST /v1/clusters/{cluster_id}/pull-secrets` | ||
|
|
||
| **Purpose**: Generate or retrieve pull secret for a cluster. Idempotent operation - returns existing credentials if already created. |
There was a problem hiding this comment.
Can you add some more information as to the how this is idempotent, what is the key? How are concurrent requests handled 🙏
| EP4[POST /v1/clusters/ID/pull-secrets/rotations<br/>Trigger rotation] | ||
| EP5[GET /v1/clusters/ID/pull-secrets/rotations<br/>List rotations] | ||
| EP6[GET /v1/clusters/ID/pull-secrets/rotations/ID<br/>Get rotation status] | ||
| EP7[GET /v1/health<br/>Health check] |
There was a problem hiding this comment.
Following the standards in hyperfleet we have healthz and readyz
I also think it is worth mentioning what these actually check, liveness vs readiness!
| EP5[GET /v1/clusters/ID/pull-secrets/rotations<br/>List rotations] | ||
| EP6[GET /v1/clusters/ID/pull-secrets/rotations/ID<br/>Get rotation status] | ||
| EP7[GET /v1/health<br/>Health check] | ||
| EP8[GET /v1/metrics<br/>Prometheus metrics] |
There was a problem hiding this comment.
Might be worth mentioning some metrics that are exposed, to help later on in determining SLI/SLO's for the service
|
|
||
| **Benefits**: | ||
| - ✅ Single deployment to manage | ||
| - ✅ Shared credential pool (cost efficient) |
There was a problem hiding this comment.
I see credential pool dotted around this document, can you add an explanation to what this pool is please
| GeneratePullSecretRequest: | ||
| type: object | ||
| required: | ||
| - cluster_external_id |
There was a problem hiding this comment.
Is this correct, prob some claude stuff left over, seems cluster_id, cluster_external_ID and ClusterID get interchanged in this document.
| - ❌ **Pool Inefficiency**: Total pool size = N × high_water_mark (over-provisioning) | ||
| - ❌ **Operational Overhead**: Must manage N instances (upgrades, monitoring, backups) | ||
|
|
||
| **Cost Estimate (3 HyperFleet instances)**: |
There was a problem hiding this comment.
NP : we can probably remove the cost estimates from this document. Focus more on the use cases vs the cost
| } | ||
| ``` | ||
|
|
||
| **Response 400 Bad Request**: |
There was a problem hiding this comment.
The error responses bellow are inconsistent with the
https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/stan
dards/error-model.md
All error responses in this DDR should follow the standard format, e.g.:
{
"type": "https://api.hyperfleet.io/errors/resource-not-found",
"title": "Resource Not Found",
"status": 404,
"detail": "Pull secret for cluster 'cls-123' not found",
"instance": "/v1/clusters/cls-123/pull-secrets",
"code": "HYPERFLEET-NTF-001",
"timestamp": "2025-01-15T10:31:00.456Z",
"trace_id": "5cf92f3577b34da6a3ce929d0e0e4737"
}|
|
||
| ### 3.3 Error Handling | ||
|
|
||
| All error responses follow RFC 7807 (Problem Details for HTTP APIs): |
There was a problem hiding this comment.
Section 3.3 references RFC 7807, but HyperFleet uses RFC 9457 (the successor) with application/problem+json
|
|
||
| redhat: | ||
| enabled: true | ||
| partner_code: "ocm-service" |
There was a problem hiding this comment.
The RHIT registry configuration example uses partner_code: "ocm-service", which
directly contradicts Principle 6 (Dedicated Partner Code) defined earlier in this same
document.
Principle 6 explicitly states:
"HyperFleet must establish its own partner identity with dedicated partner code hyperfleet"
"Using ocm-service would create quota conflicts and billing confusion"
The configuration should be:
rhit:
enabled: true
base_url: "https://registry.access.redhat.com/api/v1"
partner_code: "hyperfleet" # NOT "ocm-service" - see Principle 6
cert_secret: "rhit-client-cert"This looks like a copy-paste artifact from the AMS codebase analysis.
|
|
||
| **Layer 3: Encryption at Rest** - *Data protected in storage* | ||
| - **Purpose**: Encrypt tokens/passwords before saving to database | ||
| - **Implementation**: PostgreSQL `pgcrypto` extension with AES-256-GCM encryption |
There was a problem hiding this comment.
The document claims "AES-256-GCM via pgcrypto" encryption (line ~554), but the code examples
(line ~611) use pgp_sym_encrypt / pgp_sym_decrypt, which is OpenPGP symmetric encryption, not
AES-256-GCM.
Key issues:
- Wrong algorithm: pgcrypto's pgp_sym_encrypt defaults to AES-128 in CFB mode, not
AES-256-GCM. These are different algorithms with different security properties (GCM provides
authenticated encryption; CFB does not). - Key exposure risk: Using pgp_sym_encrypt means the encryption key is sent to PostgreSQL in
every SQL query. This key can appear in:
- PostgreSQL query logs (log_statement = 'all')
- pg_stat_statements view
- pg_stat_activity during execution
- Database audit logs - No key rotation strategy: The document mentions 90-day key rotation but doesn't discuss how existing rows would be re-encrypted with a new key.
For a credential management service, consider application-level encryption (encrypt in Go
before DB insert) using crypto/aes with GCM mode. This keeps the key out of database queries
entirely and matches the claimed algorithm
| Pod3[Pod 3<br/>API :8080<br/>Metrics :8081] | ||
| end | ||
|
|
||
| Service[Service: pull-secret-service<br/>Type: ClusterIP<br/>Ports: 80 API, 8081 Metrics] |
There was a problem hiding this comment.
Per the https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/standards/health-endpoints.md and https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/standards/metrics.md should adopt the standard port 9090 for metrics
| cluster_external_id: | ||
| type: string | ||
| example: cls-abc123def456 | ||
| cloud_provider: |
There was a problem hiding this comment.
This contradicts Principle 2 (Cloud Agnostic), which states:
"No cloud-specific parameters in API contract"
- It creates cloud-specific coupling in what should be a cloud-agnostic service
- Adding a new cloud provider requires an API schema change (new enum value)
- The caller (adapter) shouldn't need to tell the service which cloud it runs on — the
service shouldn't care
| style RHIT fill:#f3e5f5,stroke:#4a148c,stroke-width:2px | ||
| ``` | ||
|
|
||
| ### 4.4 Kubernetes Deployment Topology (Both Options) |
There was a problem hiding this comment.
Section "4.4 Kubernetes Deployment Topology (Both Options)" appears twice
|
|
||
| The Pull Secret Service exposes a **RESTful API** for credential lifecycle management. All endpoints are authenticated using Kubernetes ServiceAccount tokens and follow standard HTTP semantics. | ||
|
|
||
| ### 3.1 API Overview |
There was a problem hiding this comment.
The API Overview diagram lists 8 endpoints, but only 5 are specified in the "Core
Endpoints" section. Two endpoints are missing their detailed specification:
1. DELETE /v1/clusters/{cluster_id}/pull-secrets
No documentation for:
- Expected behavior (soft delete? hard delete? cascade to rotations?)
- What happens to active/pending rotations when credentials are deleted
- Response body and status codes (200 vs 204?)
- Whether it revokes credentials at the registry level (Quay robot account deletion, RHIT SA
removal) or only deletes the local record
2. GET /v1/clusters/{cluster_id}/pull-secrets/rotations
No documentation for:
- Pagination strategy (offset-based? cursor-based?)
- Filter/query parameters (by status? by date range?)
- Response format and sort order
- Whether it returns all historical rotations or only recent ones
This PR introduces the Pull Secret Service - Design Decision Record (DDR) documentation for HyperFleet.