Argo-cd: Sync gets stuck and must be terminated/restarted manually in order to work

Created on 11 Nov 2020 · 5 Comments · Source: argoproj/argo-cd

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [x] I've included steps to reproduce the bug.
  • [x] I've pasted the output of argocd version.

Describe the bug

I'm trying to deploy an ArgoCD Application that contains a ConfigMap, two certificates (the Certificate custom resource from cert-manager) and a KafkaConnect instance (from the Strimzi operator).

I defined the following annotations: argocd.argoproj.io/sync-wave (to make sure the ConfigMap and certificates exist before the KafkaConnect instance) and argocd.argoproj.io/sync-options on CRDs. When the application is deployed, the sync gets stuck: it keeps showing OutOfSync and Syncing (see attached image). However, if I stop the sync (click on Syncing, then Terminate) and then sync the application again, it successfully deploys all the defined resources.
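For illustration, the ordering described above might look like the following manifests. This is a minimal sketch, not the reporter's actual configuration: all resource names, the issuer, the bootstrap address and the API versions are assumptions, and the argocd.argoproj.io/sync-options annotation is left out because the report does not say which option was set.

```yaml
# Sketch only: every name, address and apiVersion below is an assumption.
apiVersion: v1
kind: ConfigMap
metadata:
  name: connect-config                      # hypothetical name
  annotations:
    argocd.argoproj.io/sync-wave: "0"       # applied in the first wave
data:
  example.key: example-value                # placeholder content
---
apiVersion: cert-manager.io/v1              # assumed API version
kind: Certificate
metadata:
  name: connect-tls                         # hypothetical name
  annotations:
    argocd.argoproj.io/sync-wave: "0"       # same wave as the ConfigMap
spec:
  secretName: connect-tls
  dnsNames:
    - connect.example.com                   # placeholder host
  issuerRef:
    name: example-issuer                    # hypothetical issuer
    kind: ClusterIssuer
---
apiVersion: kafka.strimzi.io/v1beta1        # assumed API version
kind: KafkaConnect
metadata:
  name: example-connect                     # hypothetical name
  annotations:
    argocd.argoproj.io/sync-wave: "1"       # applied after wave 0
spec:
  replicas: 1
  bootstrapServers: example-kafka-bootstrap:9093   # placeholder address
```

With a layout like this, Argo CD applies wave 0 first and moves on to wave 1 only once the wave 0 resources are reported healthy, which is why the health status reported for the Certificate matters for the KafkaConnect instance.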

Although I am using Custom Resources here (Certificate from cert-manager and KafkaConnect from Strimzi), the related custom health checks seem to exist already (https://github.com/argoproj/argo-cd/tree/master/resource_customizations).

The main problem is that I have several applications of this kind, so I would like to be able to automate this (instead of relying on manually stopping the sync and restarting it for all these applications). Any idea?

To Reproduce

Deploy an ArgoCD Application that contains the resources mentioned above. The Sync phase will start by itself and get stuck.

Expected behavior

The sync should continue without getting stuck and without needing any manual action (terminating the sync and starting it again).

Screenshots

[Screenshot: argocd]

Version

argocd: v1.7.6+b04c25e
  BuildDate: 2020-09-19T00:50:44Z
  GitCommit: b04c25eca8f1660359e325acd4be5338719e59a0
  GitTreeState: clean
  GoVersion: go1.14.1
  Compiler: gc
  Platform: linux/amd64
argocd-server: v1.7.6+b04c25e
  BuildDate: 2020-09-19T00:52:04Z
  GitCommit: b04c25eca8f1660359e325acd4be5338719e59a0
  GitTreeState: clean
  GoVersion: go1.14.1
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: {Version:kustomize/v3.6.1 GitCommit:c97fa946d576eb6ed559f17f2ac43b3b5a8d5dbd BuildDate:2020-05-27T20:47:35Z GoOs:linux GoArch:amd64}
  Helm Version: version.BuildInfo{Version:"v3.2.0", GitCommit:"e11b7ce3b12db2941e90399e874513fbd24bcb71", GitTreeState:"clean", GoVersion:"go1.13.10"}
  Kubectl Version: v1.17.8
bug

All 5 comments

I have been playing with a similar issue on version v1.7.8+ef5010c, in an app of apps scenario.

When the sync hit a "Degraded" state during certificate issuance, the application sync seemed to start waiting on everything all over again and never received the healthy notifications.

I think it is an issue with the certificate health check. I overrode the default check by changing the argocd-cm.yaml file as shown below (only replacing Degraded with Progressing):

data:
  resource.customizations: |
    cert-manager.io/Certificate:
      health.lua: |
        hs = {}
        if obj.status ~= nil then
          if obj.status.conditions ~= nil then
            for i, condition in ipairs(obj.status.conditions) do
              if condition.type == "Ready" and condition.status == "False" then
                hs.status = "Progressing"
                hs.message = condition.message
                return hs
              end
              if condition.type == "Ready" and condition.status == "True" then
                hs.status = "Healthy"
                hs.message = condition.message
                return hs
              end
            end
          end
        end

        hs.status = "Progressing"
        hs.message = "Waiting for certificate"
        return hs

This seems to have resolved the issue for me (at least in the very small number of tests I have done since).

Thanks a lot, this seems to work for this specific ArgoCD Application! On other applications I have similar issues that require further investigation, but now at least I know how I can start approaching the problem.

Just wondering: it looks like the Certificate health check proposed by argocd itself (mentioned in https://argoproj.github.io/argo-cd/operator-manual/health/ and in https://github.com/argoproj/argo-cd/tree/master/resource_customizations/cert-manager.io/Certificate) causes this issue. Do you think it would make sense to open a PR to change that check, in order to fix that?

This is also happening to us.

We have a very odd situation: we have 4 environments with pretty much the same configuration, e.g. the same ArgoCD application deployed in all of them, and only ONE of them very frequently has this sync issue.

I'd like to understand how to debug it, because there is no apparent problem, other than random syncs getting stuck D:

Our setup does not include custom health checks, but we do have sync waves.

> Just wondering: it looks like the Certificate health check proposed by argocd itself (mentioned in https://argoproj.github.io/argo-cd/operator-manual/health/ and in https://github.com/argoproj/argo-cd/tree/master/resource_customizations/cert-manager.io/Certificate) causes this issue. Do you think it would make sense to open a PR to change that check, in order to fix that?

Yes, if the health check is not functioning, please send a PR.

Thinking about this a bit more: the application syncing issue may just be a symptom of a wider issue.
The health check is in a "Degraded" state while the certificate issuance is pending, but it does eventually reach a Healthy state.

I can make a PR for the small change above, which reduces issues when using cert-manager certificates, but that won't fix the underlying sync issue: after a resource gets into a Degraded state, the application waits for all resources to report Healthy and seemingly deadlocks, or waits for an event that never arrives.
