Describe the bug
If I have a StatefulSet whose update strategy is OnDelete instead of RollingUpdate, Argo CD gets stuck in Progressing once the StatefulSet is synced.
To Reproduce
From that point on, the StatefulSet resource will be marked as Progressing. To clear the state we have to go in and clear out the Kubernetes update annotations.
Version
0.12.0
Is this the same issue as https://github.com/argoproj/argo-cd/issues/668 which was a kubernetes bug?
No, we are running k8s 1.12.3. Rollout status is blank/non-existent in this case because there is nothing rolling out. Ideally the app should update the StatefulSet and not go into Progressing. It is up to the operator to manually delete the pods associated with the sts.
Thank you @cchanley2003 - do you have an example repo by any chance to help investigate please?
I don't, but it should be pretty simple to reproduce.
Basically, Argo CD seems to assume that all StatefulSets use RollingUpdate and treats them accordingly. But if the sts is OnDelete there is nothing to roll out, so it should never say Progressing in this case. The sync is done, and the user deletes the pods associated with the StatefulSet at their leisure.
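Roughly, a minimal sketch (the names and image tags below are placeholders, not taken from a real repo; in practice the manifest would live in the git repo the Argo CD app points at):

# StatefulSet with the OnDelete update strategy; the headless service is
# omitted for brevity.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: nginx
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
  updateStrategy:
    type: OnDelete
EOF

# Change the pod template (e.g. bump the image) and sync again via Argo CD,
# or apply the change directly:
kubectl set image statefulset/web nginx=k8s.gcr.io/nginx-slim:0.9

# Nothing rolls out (that is the point of OnDelete), but Argo CD keeps
# reporting the StatefulSet as Progressing until the old pods are deleted.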
I'm seeing this also. But even if I delete the pods so that the sts controller creates them with the new config, ArgoCD says it's still progressing: waiting for statefulset rolling update to complete 3 pods at revision pzoo-7877c859dd...
From kubectl get sts -o yaml:
status:
  collisionCount: 0
  currentRevision: pzoo-5c7ddf6f7b
  observedGeneration: 2
  readyReplicas: 3
  replicas: 3
  updateRevision: pzoo-7877c859dd
  updatedReplicas: 3
I was able to force this to progress by first deleting the pzoo pods, one at a time, so they could be rebuilt; then deleting controllerrevision pzoo-5c7ddf6f7b.
Yes, you have to clear out the update annotations on the statefulset to make argo stop reporting this.
FWIW, we should be identical to 'kubectl rollout status', because the code was literally copied from there. This means that if there is a bug here, it would also exist in upstream Kubernetes.
There is no rollout status in this case, as no rollout is occurring. I believe Argo CD is assuming there should be a rollout status. Rollout status in this case is blank in k8s, which is correct.
I want to expand on my previous post. The fact that "rollout status" is even in the conversation is an indication of a bug. If an operator updated a StatefulSet whose update policy was OnDelete and then said to me "I am going to run kubectl rollout status to check on it", I would reply "Do not do that. Updating a StatefulSet with this configuration does not kick off a rollout. There is no rollout to get status on." This is the same as updating a secret, configmap, etc.: it is purely a Kubernetes resource update that does not kick off a rollout. So Argo CD should not be running rollout status in this case, and that is why there is no upstream bug in Kubernetes: there is no rollout to get status on. We should not check rollout status in this case.
Thanks for the detailed explanation. This is indeed a bug.
Here is the behavior of kubectl rollout status on a StatefulSet which has OnDelete strategy.
$ kubectl rollout status statefulset/web
error: rollout status is only available for RollingUpdate strategy type
Here is a StatefulSet which is perpetually stuck in Progressing with the error:
"waiting for statefulset rolling update to complete 0 pods at revision web-6cd6b6b6c9..."
kind: StatefulSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"StatefulSet","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"extensions"},"name":"web","namespace":"statefulset"},"spec":{"replicas":3,"selector":{"matchLabels":{"app":"nginx"}},"serviceName":"nginx","template":{"metadata":{"labels":{"app":"nginx"}},"spec":{"containers":[{"image":"k8s.gcr.io/nginx-slim:0.8","name":"nginx","ports":[{"containerPort":80,"name":"web"}]}],"terminationGracePeriodSeconds":10}},"updateStrategy":{"type":"OnDelete"}}}
  creationTimestamp: "2019-09-13T06:30:43Z"
  generation: 2
  labels:
    app.kubernetes.io/instance: extensions
  name: web
  namespace: statefulset
  resourceVersion: "7425954"
  selfLink: /apis/apps/v1/namespaces/statefulset/statefulsets/web
  uid: 0308ade6-d5f0-11e9-9e69-42010aa8005f
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: k8s.gcr.io/nginx-slim:0.9
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80
          name: web
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 10
  updateStrategy:
    type: OnDelete
status:
  collisionCount: 0
  currentReplicas: 3
  currentRevision: web-578cfc4b46
  observedGeneration: 2
  readyReplicas: 3
  replicas: 3
  updateRevision: web-6cd6b6b6c9
I'm not sure what we should report the Health as for a StatefulSet whose spec has changed and whose live pods, because of that, are not running what is reflected in the StatefulSet spec.
Should we continue to consider it "Healthy" if, say, 3/3 pods are running even though the spec doesn't match the running pods?
This is also a problem with extensions/v1beta1 DaemonSets which have an OnDelete policy.
I would consider it healthy even if the pods were out of sync with the sts or ds. I am not aware of any mechanism within k8s to check whether a pod is out of sync with the sts/ds. For this issue my expectation is that once the sts/ds is synced, the application is healthy and in sync.
My opinion is that it is buyer-beware when using the OnDelete strategy: users must be aware that their pods could be out of sync with their sts/ds. Reporting on pods being out of sync with the sts/ds would be a feature-level request.
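For what it's worth, a rough way to see which pods are behind the StatefulSet spec is to compare each pod's controller-revision-hash label with the StatefulSet's status.updateRevision (a sketch, assuming the web/nginx names and namespace from the manifest above):

# The revision the StatefulSet currently wants (name of the newest ControllerRevision).
kubectl -n statefulset get sts web -o jsonpath='{.status.updateRevision}'

# The revision each pod is actually running; pods whose hash differs from the
# value above have not been recreated since the spec changed.
kubectl -n statefulset get pods -l app=nginx -L controller-revision-hash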
I don’t consider this fixed. The logic needs to change.
Do you want to include this in v1.4?
We are experiencing this as well on version v1.5.0+bdda410.
Being hit pretty hard by a combination of https://github.com/argoproj/argo-cd/issues/1460 with this issue
We are having issues with redis coming up and figuring out the master in HA mode. To try to fix that, I've changed the updateStrategy of redis to OnDelete, but then I hit this. Would you folks suggest any other workaround?
Edit: to be fair, it isn't exactly https://github.com/argoproj/argo-cd/issues/1460, but potentially https://github.com/argoproj/argo-cd/issues/3547; it started happening after we upgraded and happens by chance every time there is a rolling update of the redis sts.
Hi @dudadornelles,
I've hit the issue a couple more times and found a working workaround. If you follow the steps below, it takes a couple of minutes to fix.
Whenever this issue happens, the Argo CD application view shows multiple controller revisions. If you check the live manifests of the running pods, you will see that only one of those revisions is in use, and the rest can be deleted safely. What I do is delete every controller revision except the one that is in use and restart one of the running pods (basically just run kubectl delete $pod). After the replacement pod becomes healthy and running, the Argo CD app becomes healthy as well.
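In kubectl terms the steps look roughly like this (a sketch; the label selector, revision, and pod names are placeholders based on the earlier pzoo example):

# List the ControllerRevisions for the StatefulSet and see which revision the
# running pods actually reference via their controller-revision-hash label.
kubectl get controllerrevisions -l app=pzoo
kubectl get pods -l app=pzoo -L controller-revision-hash

# Delete every ControllerRevision except the one still in use, then delete one
# running pod so the controller recreates it from the latest revision.
kubectl delete controllerrevision pzoo-5c7ddf6f7b   # placeholder: a revision no pod references
kubectl delete pod pzoo-0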
I hope it works for you and everyone else who is desperate for a solution :)
I am also hitting this bug on relatively new versions of k8s and argocd
argocd: 1.5.7
kubernetes: 1.16.9
This is pretty bad, as I manage multiple Argo CD clusters and doing manual operations like this is neither safe nor scalable. Is there any other viable workaround for this problem?
We are facing the same issue for our daemonsets
argocd-server: v1.5.7+e7d1553
k8s: v1.17.0
Currently running into this as well with a StatefulSet app
Update: Changing the updateStrategy from OnDelete to RollingUpdate solves this issue for my app.
## Statefulsets rolling update update strategy
## Ref: https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#rolling-update
##
-- updateStrategy: OnDelete
++ updateStrategy: RollingUpdate
Noticed this issue in ArgoCD v1.6.1+159674e with Kubernetes v1.17.6 when using Strimzi operator. It creates Statefulsets for Kafka and ZooKeeper with update strategy OnDelete.
Although in my case, while the health of the Kafka and ZooKeeper StatefulSets is Progressing for eternity, the overall application health is marked Healthy, which is what my pipeline checks (argocd app wait appname), so it is not a blocker.
Is this solved in a newer version of argocd? If yes which version? Thanks!
This is available in 1.7. It is recommended to upgrade to the most recent patch release: https://github.com/argoproj/argo-cd/releases/tag/v1.7.6
Just upgraded to that version today, will give it a try later, thanks for the update!
@alexmt - can we directly jump from v1.5.7 to v1.7.6?