Argo-cd: Cache inconsistency of child resources

Created on 5 Aug 2020 · 4 comments · Source: argoproj/argo-cd

On one Argo CD instance (v1.6.2-f282a33), a pod was deleted as a result of an Argo CD Rollout restart action. The pod was part of a Rollout's ReplicaSet. Even though the pod had truly disappeared from Kubernetes, it remained visible in Argo CD.

The inconsistent state persisted for roughly 24 hours, which is our default cache invalidation period, after which the state was corrected and the pod disappeared from the UI.

bug

All 4 comments

I think the offending line is this:
https://github.com/argoproj/gitops-engine/blob/master/pkg/cache/cluster.go#L29

This problem causes everything that is set to auto-sync to sync continuously, over and over. This is not a bug that should be put off; it needs to be addressed before v1.7 is launched.
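For context, the line cited above is a hard-coded 24-hour cluster resync timeout in gitops-engine. Below is a minimal Go sketch of that pattern, with illustrative names (`clusterCache`, `syncIfStale`, and the map-based state are assumptions, not the real identifiers; only the 24h constant corresponds to the cited line): the cache is maintained from watch events, so a missed event (such as the pod deletion above) survives until the next forced resync fires.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch of the pattern behind gitops-engine's cluster cache:
// state is kept up to date from watch events, and a full re-list happens
// only once per clusterResyncTimeout. The 24h value matches the hard-coded
// default referenced above (cluster.go#L29); all other names are illustrative.
const clusterResyncTimeout = 24 * time.Hour

type clusterCache struct {
	lastSync time.Time
	// resources maps a resource key (e.g. "Pod/ns/name") to its cached state.
	resources map[string]string
}

// syncIfStale performs a full re-list only when the cache is older than
// clusterResyncTimeout. If a watch event (e.g. a pod deletion) is missed,
// the stale entry survives until this forced resync runs.
func (c *clusterCache) syncIfStale(list func() map[string]string) {
	if time.Since(c.lastSync) < clusterResyncTimeout {
		return // trust the watch-maintained state, stale or not
	}
	c.resources = list()
	c.lastSync = time.Now()
}

func main() {
	c := &clusterCache{
		lastSync: time.Now(),
		resources: map[string]string{
			// deleted in the cluster, but the delete event was lost
			"Pod/default/rollout-abc": "Running",
		},
	}
	c.syncIfStale(func() map[string]string { return map[string]string{} })
	fmt.Println(c.resources) // still shows the deleted pod until 24h elapse
}
```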

We manage 10 clusters with each of our Argo CD instances, and due to the numerous bugs in the 1.6.2 release (extra lines added to manifests generated in the UI when auto-sync is selected, inability to process Helm hooks properly, resources going Unknown/Missing and not syncing, and a host of other issues) we attempted to use the latest release on our test clusters. Due to this cache nonsense it has been repeatedly reapplying the deployments to all clusters; only after 24 hours does it update correctly. I would consider this a critical bug, as it defeats the entire purpose of GitOps and destroys trust in the status of any object in the clusters as reported by Argo CD.

Agree. This issue used to happen once every few months; something has changed and now we are seeing it much more often. To mitigate the problem we've added the ARGOCD_CLUSTER_CACHE_RESYNC_DURATION env variable, which allows reducing the cluster force-refresh period (e.g. to 1 hour with ARGOCD_CLUSTER_CACHE_RESYNC_DURATION=1hr).
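For illustration, here is a minimal Go sketch of how such an env-var override could shorten the forced-resync period. This assumes plain os.Getenv plus time.ParseDuration handling rather than Argo CD's actual wiring, which may differ (note that Go's standard time.ParseDuration accepts forms like "1h" or "30m"):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// resyncDuration is a hypothetical sketch: it reads
// ARGOCD_CLUSTER_CACHE_RESYNC_DURATION and falls back to the default
// (e.g. the hard-coded 24h) when the variable is unset or unparsable.
func resyncDuration(defaultVal time.Duration) time.Duration {
	if v := os.Getenv("ARGOCD_CLUSTER_CACHE_RESYNC_DURATION"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return defaultVal
}

func main() {
	// With ARGOCD_CLUSTER_CACHE_RESYNC_DURATION=1h set on the controller,
	// the cache would be force-refreshed hourly instead of every 24h.
	fmt.Println(resyncDuration(24 * time.Hour))
}
```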

Which application did you add that env var to? Application-controller I assume?

We see the same kind of issues with v1.6.2, and applications usually get stuck. We have about 1200-1400 applications.
