Argo-cd: StatefulSet with OnDelete Update Strategy stuck progressing

Created on 3 Jul 2019 · 25 comments · Source: argoproj/argo-cd

Describe the bug
If I have a StatefulSet whose updateStrategy is OnDelete instead of RollingUpdate, Argo CD gets stuck in Progressing once the StatefulSet is synced.
To Reproduce

  1. Create a statefulset and make sure the updateStrategy.type: OnDelete
  2. Make some change to the statefulset so that a sync is required
  3. Health says "waiting for rolling update to complete", which will never happen because it isn't a rolling update

From that point on the StatefulSet resource is marked as Progressing. To clear the state we have to go in and manually remove the Kubernetes update annotations.
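For reproduction, a minimal manifest might look like the following (a sketch; the name, labels, and image are illustrative, not taken from the reporter's setup):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                  # illustrative name
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  updateStrategy:
    type: OnDelete           # pods only pick up changes when manually deleted
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
```

Changing anything under `spec.template` (e.g. the image tag) and syncing should then trigger the stuck Progressing state.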

Version
0.12.0

bug cherry-pick/1.7 core

Most helpful comment

I am also hitting this bug on relatively new versions of k8s and argocd

argocd: 1.5.7
kubernetes: 1.16.9

This is pretty bad as I manage multiple argocd clusters and doing manual operations like this is not safe nor scalable. Is there any other viable workaround for this problem?

All 25 comments

Is this the same issue as https://github.com/argoproj/argo-cd/issues/668 which was a kubernetes bug?

No, we are running k8s 1.12.3. Rollout status is blank/non-existent in this case because there is nothing rolling out. Ideally the app should update the StatefulSet and not go into Progressing. It is up to the operator to manually delete the pods associated with the sts.

Thank you @cchanley2003 - do you have an example repo by any chance to help investigate please?

I don't, but it should be pretty simple to reproduce. Steps:

  1. Create an argocd app with a StatefulSet; it can be any StatefulSet. Make sure updateStrategy.type: OnDelete is set in that StatefulSet, not RollingUpdate
  2. Make a change to that statefulset (label, etc) in git
  3. Tell argo to sync the statefulset
  4. It will correctly update the StatefulSet resource object, but it will then go into Progressing with "waiting for rolling update to complete", even though there is no rolling update to complete.
  5. With an ondelete strategy the statefulset object is updated, but the associated pods don't change until a user deletes them.

Basically, Argo CD seems to assume that all StatefulSets use RollingUpdate and treats them accordingly. But if the sts is OnDelete there is nothing to roll out (so it should never say Progressing in this case). The sync is done, and the user can manually delete the pods associated with the StatefulSet at their leisure.
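To make the distinction concrete, here is a sketch of how a health check could branch on the update strategy instead of always tracking a rollout. The type and function names are hypothetical (the real Argo CD check operates on appsv1.StatefulSet in its health assessment code); this only illustrates the logic being asked for:

```go
package main

import "fmt"

// StatefulSetView is a hypothetical minimal view of the fields the
// health check needs; the real code would read appsv1.StatefulSet.
type StatefulSetView struct {
	UpdateStrategyType string // "RollingUpdate" or "OnDelete"
	Replicas           int32
	ReadyReplicas      int32
	UpdatedReplicas    int32
}

// assessHealth returns "Healthy" or "Progressing" for a synced StatefulSet.
// For OnDelete there is no rollout to track, so readiness alone decides.
func assessHealth(s StatefulSetView) string {
	if s.UpdateStrategyType == "OnDelete" {
		if s.ReadyReplicas == s.Replicas {
			return "Healthy"
		}
		return "Progressing"
	}
	// RollingUpdate: mirror `kubectl rollout status` and wait until all
	// replicas are both updated and ready.
	if s.UpdatedReplicas == s.Replicas && s.ReadyReplicas == s.Replicas {
		return "Healthy"
	}
	return "Progressing"
}

func main() {
	// 3/3 pods ready but none updated yet: fine under OnDelete,
	// since the operator deletes pods at their leisure.
	fmt.Println(assessHealth(StatefulSetView{"OnDelete", 3, 3, 0}))
}
```

Under the rollout-status logic the same status (UpdatedReplicas 0) would report Progressing forever, which is exactly the stuck state described above.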

I'm seeing this also. But even if I delete the pods so that the sts controller creates them with the new config, ArgoCD says it's still progressing: waiting for statefulset rolling update to complete 3 pods at revision pzoo-7877c859dd...

From kubectl get sts -o yaml:

status:
  collisionCount: 0
  currentRevision: pzoo-5c7ddf6f7b
  observedGeneration: 2
  readyReplicas: 3
  replicas: 3
  updateRevision: pzoo-7877c859dd
  updatedReplicas: 3

I was able to force this to progress by first deleting the pzoo pods, one at a time, so they could be rebuilt; then deleting controllerrevision pzoo-5c7ddf6f7b.

Yes, you have to clear out the update annotations on the statefulset to make argo stop reporting this.

FWIW, we should be identical to ‘kubectl rollout status’, because the code was literally copied from there. This means that if there is a bug here, it would also exist in upstream Kubernetes.


There is no rollout status in this case, as no rollout is occurring. I believe Argo CD is assuming there should be a rollout status. Rollout status in this case is blank in k8s, which is correct.

I want to expand on my previous post. The fact that "rollout status" is even in the conversation is an indication of a bug. If an operator updated a StatefulSet whose update policy was OnDelete and then said to me "I am going to run kubectl rollout status to check on it", I would reply "Do not do that. Updating a StatefulSet with this configuration does not kick off a rollout. There is no rollout to get status on." This is the same as updating a secret, configmap, etc.: it is purely a Kubernetes resource update that does not kick off a rollout. So Argo CD should not be running rollout status in this case. That is also why there is no upstream bug in Kubernetes: there is no rollout to get status on, so we should not be checking rollout status at all here.

Thanks for the detailed explanation. This is indeed a bug.

Here is the behavior of kubectl rollout status on a StatefulSet which has OnDelete strategy.

$ kubectl rollout status statefulset/web
error: rollout status is only available for RollingUpdate strategy type

Here is a StatefulSet which is perpetually stuck in Progressing with the error:

"waiting for statefulset rolling update to complete 0 pods at revision web-6cd6b6b6c9..."

kind: StatefulSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"StatefulSet","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"extensions"},"name":"web","namespace":"statefulset"},"spec":{"replicas":3,"selector":{"matchLabels":{"app":"nginx"}},"serviceName":"nginx","template":{"metadata":{"labels":{"app":"nginx"}},"spec":{"containers":[{"image":"k8s.gcr.io/nginx-slim:0.8","name":"nginx","ports":[{"containerPort":80,"name":"web"}]}],"terminationGracePeriodSeconds":10}},"updateStrategy":{"type":"OnDelete"}}}
  creationTimestamp: "2019-09-13T06:30:43Z"
  generation: 2
  labels:
    app.kubernetes.io/instance: extensions
  name: web
  namespace: statefulset
  resourceVersion: "7425954"
  selfLink: /apis/apps/v1/namespaces/statefulset/statefulsets/web
  uid: 0308ade6-d5f0-11e9-9e69-42010aa8005f
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: k8s.gcr.io/nginx-slim:0.9
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80
          name: web
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 10
  updateStrategy:
    type: OnDelete
status:
  collisionCount: 0
  currentReplicas: 3
  currentRevision: web-578cfc4b46
  observedGeneration: 2
  readyReplicas: 3
  replicas: 3
  updateRevision: web-6cd6b6b6c9

I'm not sure what we should decide to report the Health as for a StatefulSet whose spec has changed and whose live pods, because of that, are not running what is reflected in the StatefulSet spec.

Should we continue to consider it "Healthy" if say 3/3 pods are running even if the spec doesn't match the running pods?

This is also a problem with extensions/v1beta1 DaemonSets which have an OnDelete policy

I would consider it healthy even if the pods were out of sync with the sts or ds. I am not aware of any mechanism within k8s to check whether a pod is out of sync with the sts/ds. For this issue my expectation is that once the sts/ds is synced, the application is healthy and in-sync.

My opinion is that it is buyer beware when using the OnDelete strategy and that users must be aware that their pods could be out of sync with their sts/ds. It would be a feature level request to report on the pods being out of sync with the sts/ds.
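As an aside, one possible mechanism for such a feature: StatefulSet pods carry a controller-revision-hash label naming the ControllerRevision they were created from, which could be compared against the StatefulSet's status.updateRevision. A sketch with hypothetical minimal types (a real implementation would read the objects via client-go):

```go
package main

import "fmt"

// Hypothetical minimal views of the live objects.
type stsStatus struct {
	UpdateRevision string // from .status.updateRevision
}
type podView struct {
	RevisionHash string // from the pod's controller-revision-hash label
}

// outOfSyncPods counts pods still running a revision other than the
// StatefulSet's current update revision.
func outOfSyncPods(s stsStatus, pods []podView) int {
	n := 0
	for _, p := range pods {
		if p.RevisionHash != s.UpdateRevision {
			n++
		}
	}
	return n
}

func main() {
	// Revision names mirror the example output earlier in the thread.
	s := stsStatus{UpdateRevision: "web-6cd6b6b6c9"}
	pods := []podView{
		{"web-578cfc4b46"}, // old revision, not yet deleted
		{"web-578cfc4b46"}, // old revision, not yet deleted
		{"web-6cd6b6b6c9"}, // already recreated at the new revision
	}
	fmt.Println(outOfSyncPods(s, pods)) // 2
}
```

Whether that count should affect Health, or only be surfaced as information, is the feature-level question raised above.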

I don’t consider this fixed. The logic needs to change

Do you want to include this in v1.4?

we are experiencing this as well on version v1.5.0+bdda410.

Being hit pretty hard by a combination of https://github.com/argoproj/argo-cd/issues/1460 with this issue

We are having issues with redis coming up and figuring out master in HA mode. To try and fix that, I've changed the updateStrategy of redis to OnDelete but then I hit this. Would you folks suggest any other workaround?

edit: to be fair, it isn't exactly https://github.com/argoproj/argo-cd/issues/1460, but potentially https://github.com/argoproj/argo-cd/issues/3547; it started happening after we upgraded and happens by chance every time there is a rolling update of the redis sts

Hi @dudadornelles,

I've hit the issue a couple more times and found a functioning workaround. If you follow the steps below, it takes a couple of minutes to fix.

Whenever this issue happens, on the argocd application view I see multiple controller revisions. If you check the live manifest of the running pods, you will see that only one of those revisions is in use and the rest can be deleted safely. What I do is delete every controller revision except the one that is in use, then restart one of the running pods (basically just run kubectl delete pod $pod). After the replacement pod becomes healthy and running, the argocd app becomes healthy as well.

I hope it works for you and everyone else who is desperate for a solution :)
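In command form, the workaround above might look like the following (a sketch against a live cluster; the namespace, label selector, and revision names are placeholders you would substitute from your own app):

```
# See which revision the running pods actually use
kubectl -n my-ns get pods -l app=myapp \
  -o jsonpath='{.items[*].metadata.labels.controller-revision-hash}'

# List all controller revisions for the StatefulSet
kubectl -n my-ns get controllerrevision

# Delete every revision EXCEPT the one the pods report above
kubectl -n my-ns delete controllerrevision <stale-revision-name>

# Restart one pod; the StatefulSet controller recreates it
kubectl -n my-ns delete pod myapp-0
```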

I am also hitting this bug on relatively new versions of k8s and argocd

argocd: 1.5.7
kubernetes: 1.16.9

This is pretty bad as I manage multiple argocd clusters and doing manual operations like this is not safe nor scalable. Is there any other viable workaround for this problem?

We are facing the same issue for our daemonsets

argocd-server: v1.5.7+e7d1553
k8s: v1.17.0

Currently running into this as well with a StatefulSet app

  • ArgoCD v1.5.4+36bade7
  • Kubernetes v1.17.2

* Update *
Changing the updateStrategy from OnDelete to RollingUpdate solves this issue for my app.

## Statefulsets rolling update strategy
## Ref: https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#rolling-update
##
-- updateStrategy: OnDelete
++ updateStrategy: RollingUpdate

Noticed this issue in ArgoCD v1.6.1+159674e with Kubernetes v1.17.6 when using Strimzi operator. It creates Statefulsets for Kafka and ZooKeeper with update strategy OnDelete.

Although in my case, while the health of Kafka and ZooKeeper Statefulset is Progressing for eternity, the overall application health is marked Healthy which is what my pipeline is checking (argocd app wait appname) so not a blocker.

Is this solved in a newer version of argocd? If yes which version? Thanks!

This is available in 1.7. Recommended to upgrade to the most recent patch release: https://github.com/argoproj/argo-cd/releases/tag/v1.7.6

Just upgraded to that version today, will give it a try later, thanks for the update!

This is available in 1.7. Recommended to upgrade to the most recent patch release: https://github.com/argoproj/argo-cd/releases/tag/v1.7.6

@alexmt - can we directly jump from v1.5.7 to v1.7.6 ?
