Flux: Fluxcd suddenly deletes all resources though git is unchanged

Created on 23 Jun 2020 · 2 Comments · Source: fluxcd/flux

Describe the bug

Flux suddenly deleted all resources it managed though no change was pushed to git.

This is a severe issue - has anybody else observed this?

To Reproduce

Since this seems to be a sporadic issue that we have not seen before, I can't describe how to reproduce it.

Expected behavior

If git remains unchanged, flux should not delete stuff.

Logs

The log below shows the last successful sync and then the start of the delete action. Please observe that the git commit id is unchanged between the last apply and the delete action.

Jun 22, 2020 @ 21:24:27.690 ts=2020-06-22T19:24:27.689816419Z caller=loop.go:133 component=sync-loop event=refreshed url=ssh://[email protected]/mycompany/force-flux.git branch=preprod HEAD=1aaf36045c357db33d83d3c6970da40d28788924
Jun 22, 2020 @ 21:25:48.544 ts=2020-06-22T19:25:48.544604693Z caller=sync.go:73 component=daemon info="trying to sync git changes to the cluster" old=1aaf36045c357db33d83d3c6970da40d28788924 new=1aaf36045c357db33d83d3c6970da40d28788924
Jun 22, 2020 @ 21:25:54.343 ts=2020-06-22T19:25:54.343208824Z caller=sync.go:539 method=Sync cmd=apply args= count=27
Jun 22, 2020 @ 21:25:55.071 ts=2020-06-22T19:25:55.070844159Z caller=sync.go:605 method=Sync cmd="kubectl apply -f -" took=727.543635ms err=null output="namespace/flux-system unchanged\nnamespace/flux-tiller unchanged\nnamespace/storage-operator unchanged\nnamespace/vault unchanged\nclusterrole.rbac.authorization.k8s.io/azure-storage-operator configured\nserviceaccount/azure-storage-operator unchanged\ncustomresourcedefinition.apiextensions.k8s.io/azurestorages.k8s.craft.supply unchanged\nserviceaccount/helm-operator unchanged\nclusterrole.rbac.authorization.k8s.io/helm-operator unchanged\ncustomresourcedefinition.apiextensions.k8s.io/helmreleases.helm.fluxcd.io configured\nserviceaccount/tiller unchanged\nservice/tiller-deploy unchanged\nclusterrolebinding.rbac.authorization.k8s.io/azure-storage-operator unchanged\nsecret/ca-secret unchanged\nclusterrolebinding.rbac.authorization.k8s.io/flux-tiller unchanged\nclusterrolebinding.rbac.authorization.k8s.io/helm-operator unchanged\nsecret/helm-repositories-992k4745f2 unchanged\ndeployment.apps/azure-storage-operator configured\ndeployment.apps/helm-operator unchanged\ndeployment.apps/tiller-deploy configured\nexternalsecret.kubernetes-client.io/azure-service-principal unchanged\nexternalsecret.kubernetes-client.io/azure-service-principal unchanged\nhelmrelease.helm.fluxcd.io/external-secrets unchanged\nhelmrelease.helm.fluxcd.io/prometheus-blackbox-exporter unchanged\npoddisruptionbudget.policy/tiller-deploy unchanged\nhelmrelease.helm.fluxcd.io/vault unchanged\nazurestorage.k8s.craft.supply/vault-azurestorage unchanged"
Jun 22, 2020 @ 21:29:29.074 ts=2020-06-22T19:29:29.074707482Z caller=loop.go:133 component=sync-loop event=refreshed url=ssh://[email protected]/mycompany/force-flux.git branch=preprod HEAD=1aaf36045c357db33d83d3c6970da40d28788924
Jun 22, 2020 @ 21:30:55.997 ts=2020-06-22T19:30:55.997407382Z caller=sync.go:73 component=daemon info="trying to sync git changes to the cluster" old=1aaf36045c357db33d83d3c6970da40d28788924 new=1aaf36045c357db33d83d3c6970da40d28788924
Jun 22, 2020 @ 21:31:00.187 ts=2020-06-22T19:31:00.17682842Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=<cluster>:clusterrolebinding/flux-tiller
Jun 22, 2020 @ 21:31:00.187 ts=2020-06-22T19:31:00.187009366Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=flux-system:helmrelease/external-secrets
Jun 22, 2020 @ 21:31:00.187 ts=2020-06-22T19:31:00.187034366Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=<cluster>:customresourcedefinition/helmreleases.helm.fluxcd.io
Jun 22, 2020 @ 21:31:00.187 ts=2020-06-22T19:31:00.187052367Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=vault:azurestorage/vault-azurestorage
Jun 22, 2020 @ 21:31:00.187 ts=2020-06-22T19:31:00.187070567Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=<cluster>:customresourcedefinition/azurestorages.k8s.craft.supply
Jun 22, 2020 @ 21:31:00.187 ts=2020-06-22T19:31:00.187092067Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=vault:externalsecret/azure-service-principal
[...]
Jun 22, 2020 @ 21:31:00.187 ts=2020-06-22T19:31:00.187523469Z caller=sync.go:539 method=Sync cmd=delete args= count=27

Additional context

  • Flux version: 1.19.0
  • Kubernetes version: 1.15.7 (Azure AKS)
  • Git provider: bitbucket
  • Container registry provider: n/a

Labels: blocked-needs-validation, bug

All 2 comments

The issue just became a bit clearer (though no less scary). The very same thing happened on a different cluster at almost the same time. The one thing the two occurrences had in common was that both clusters sync with the same Git repository.
We suspect that at that time the Bitbucket Cloud Git repo was (partly) unavailable, resulting in an erroneous checkout and consequently empty output from the command in .flux.yaml. fluxd just did what it was supposed to do and deleted everything.

The question is, how can we safeguard against issues like this in the future? Any hints are welcome.

OK... we tracked it down, and since it was such a pain for us, I'd like to share our findings here.

The root cause was in the generator command in .flux.yaml:

   generators:
     # use kustomize as manifest generator and replace ENV variables (i.e. ${VAR}) as post processing step via envsubst
     - command: kustomize build . | envsubst '${CLUSTER} ${STAGE} ${LINE}'

The problem with pipelines is that, by default, the overall exit code is the exit code of the last command. Even if kustomize fails because of invalid input or a failed or interrupted git checkout, envsubst still succeeds, so the pipeline exits with 0 while producing empty output. Flux then treats the empty output as the complete set of manifests, and if Flux garbage collection is on, it deletes every resource that is no longer in that set.
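
The exit-code behaviour described above can be reproduced without kustomize. The sketch below uses `false` as a stand-in for a failing `kustomize build .` and `cat` as a stand-in for `envsubst`; both stand-ins are assumptions for illustration and not part of our actual setup. It also assumes `bash` is installed, since `pipefail` is not available in every `sh`.

```shell
#!/bin/sh
# Stand-ins (assumptions, for illustration only):
#   false -> a failing `kustomize build .`
#   cat   -> `envsubst`, which exits 0 even when its input is empty

# Default behaviour: the pipeline reports the status of its LAST command.
default_status=0
sh -c 'false | cat' || default_status=$?

# With pipefail, a failure anywhere in the pipeline becomes the pipeline status.
pipefail_status=0
bash -c 'set -o pipefail; false | cat' || pipefail_status=$?

echo "default=$default_status pipefail=$pipefail_status"
# prints "default=0 pipefail=1"
```

The first status is what a caller sees for the original generator command: success, with nothing on stdout.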

To fix it, we changed our generator command as follows:

     - command: /bin/bash -c 'set -o pipefail; kustomize build . | envsubst \${CLUSTER},\${STAGE},\${LINE}'
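
Note that `pipefail` only helps when some command in the pipeline exits non-zero. As an extra guard that is not something we have deployed, a wrapper could also refuse to pass on empty output. A minimal sketch, assuming a hypothetical helper named `guarded` (the name and the error message are made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of Flux or kustomize):
# run the given command, fail if it fails, and also fail if it prints nothing.
guarded() {
    guarded_status=0
    guarded_out="$("$@")" || guarded_status=$?

    # Propagate a non-zero exit status from the wrapped command.
    if [ "$guarded_status" -ne 0 ]; then
        return "$guarded_status"
    fi

    # Treat empty output as an error instead of an empty set of manifests.
    if [ -z "$guarded_out" ]; then
        echo "generator produced no output; refusing to continue" >&2
        return 1
    fi

    printf '%s\n' "$guarded_out"
}
```

The generator pipeline would then be passed to the helper as a single command, for example by handing the `/bin/bash -c '...'` invocation above to `guarded`, so that both a failing stage and an empty result end in a non-zero exit status.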
