Fix issue with reconciliation of resources using UseAsync=true #562
Open
fabiencastarede wants to merge 3 commits into crossplane:main
When a provider pod restarts, the ephemeral Terraform workspace state stored in /tmp/<uid>/ is lost. For resources with UseAsync=true, this causes the Refresh operation to fail to detect existing resources, triggering duplicate resource creation. This fix detects async resources that were previously created by checking for the crossplane.io/external-create-succeeded annotation and uses Import instead of Refresh. Import reconstructs state directly from the cloud provider API, avoiding the duplicate creation issue.
Signed-off-by: Fabien Castarède <fcastarede@waadoo.cloud>
981b4a3 to 6797ec8
Description
This PR fixes a critical issue, #561, where resources configured with UseAsync=true create duplicate cloud resources after provider pod restarts or Kubernetes cluster backup/restore operations (e.g., Velero).
Problem
Resources with UseAsync = true store Terraform workspace state in ephemeral pod storage (/tmp/<workspace-id>/). When the provider pod restarts, this state is lost, Refresh fails to detect the existing resource, a duplicate is created, and external-name is overwritten with the new resource ID, orphaning the original resource.
Reproduction
Solution
When an async resource has the external-create-succeeded annotation (indicating prior successful creation) but workspace state is missing, use Import instead of Refresh to reconstruct state directly from the cloud provider API.
Code Changes
File: pkg/controller/external.go
Added logic in the Observe() function before calling Refresh().
Required import added: "github.com/crossplane/crossplane-runtime/v2/pkg/meta"
How It Works
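A minimal sketch of the decision made in Observe() before Refresh() is called, written as a standalone program rather than the actual change in pkg/controller/external.go. The helper chooseObserveAction, the observeAction type, and the workspaceHasState flag are hypothetical names introduced for illustration. The extra check that external-name is non-empty is also an assumption. Only the two annotation keys are taken from Crossplane.

```go
package main

import "fmt"

// Annotation keys set by Crossplane on managed resources.
const (
	annotationExternalName            = "crossplane.io/external-name"
	annotationExternalCreateSucceeded = "crossplane.io/external-create-succeeded"
)

// observeAction names the Terraform operation Observe would run first.
// Hypothetical type, used only in this sketch.
type observeAction string

const (
	actionRefresh observeAction = "Refresh"
	actionImport  observeAction = "Import"
)

// chooseObserveAction is a hypothetical helper, not part of upjet.
// It returns Import when an async resource was already created
// (the create-succeeded annotation is present) but the local
// Terraform workspace no longer holds state, and Refresh otherwise.
func chooseObserveAction(useAsync bool, annotations map[string]string, workspaceHasState bool) observeAction {
	_, created := annotations[annotationExternalCreateSucceeded]
	// Assumption: Import needs a non-empty external name to use as the resource ID.
	externalName := annotations[annotationExternalName]
	if useAsync && created && externalName != "" && !workspaceHasState {
		return actionImport
	}
	return actionRefresh
}

func main() {
	// Resource created earlier; workspace state lost after a pod restart.
	restarted := map[string]string{
		annotationExternalName:            "cluster-123",
		annotationExternalCreateSucceeded: "2024-01-01T00:00:00Z",
	}
	fmt.Println(chooseObserveAction(true, restarted, false)) // Import

	// Brand-new resource: no annotations yet, so the normal path applies.
	fmt.Println(chooseObserveAction(true, map[string]string{}, false)) // Refresh
}
```

Keeping the decision in a pure function like this makes the restart case easy to unit test without a Terraform workspace, though the real change may be structured differently.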
Checks for the external-create-succeeded annotation
Calls Import() to reconstruct Terraform state from the cloud provider API, using the external-name as the resource ID
Why This Works
external-name annotation persists in Kubernetes (not lost on pod restart)
external-create-succeeded annotation persists in Kubernetes (backed up by Velero)
Testing
Tested successfully with provider-ovh managing OVH Managed Kubernetes Clusters (a resource with UseAsync = true).
Test Scenarios
2. Verify creation succeeded
3. Delete provider pod
4. Wait for pod restart
5. Check for duplicates
✅ external-name unchanged
✅ Resource synced
2. Backup with Velero
3. Reset Kubernetes cluster
4. Restore with Velero
5. Check for duplicates
✅ external-name preserved
✅ Resource synced with existing cloud resource
2. Drain/cordon node A
3. Pod reschedules to node B
4. Check for duplicates
✅ Resource remains synced
Debug Logs Confirming Fix
After applying the fix, provider logs show:
Impact
Applies to resources with UseAsync = true that have been previously created
Alternative Solutions Considered
1. PersistentVolume for Terraform Workspaces
Rejected: Adds infrastructure complexity, requires storage provisioning, doesn't scale well with multiple provider instances
2. Store tfstate in Kubernetes Secrets
Rejected: Large state files could exceed secret size limits (1MB), performance concerns with frequent updates
3. Disable UseAsync
Rejected: Removes async operation tracking capability, breaks long-running operations
4. Velero Filesystem Backup
Rejected: Only solves Velero restore case, doesn't help with pod restarts or node failures
5. Pre-Create Existence Check
Partially Rejected: Doesn't handle all edge cases, Import is more robust and already well-tested
Additional Notes
Provider Implementation Considerations
When using this fix, ensure your provider's GetIDFn configurations handle empty externalName values correctly during initial resource creation. This prevents incomplete IDs (e.g., service_name/ instead of service_name/resource_id) from being set in tfstate before resource creation completes.
Related Issues
This fix addresses duplicate resource creation issues that have been reported by multiple users of upjet-based providers, particularly for resources requiring long-running operations such as:
Checklist
References