Flux: consistent timeout on fluxctl sync or fluxctl release (No way to debug)

Created on 4 Sep 2020 · 12 Comments · Source: fluxcd/flux

Is there a standard way to debug a flux release or sync timeout message? Flux has been working for more than a year, and now fluxctl sync and fluxctl release consistently time out. Increasing the timeout to 10 minutes has the same outcome. Verbose flags show no extra information. This is on the latest flux release. Logs show no errors or warnings... Where should I be looking to dig deeper into the cause? It's driving me crazy.

$ helm ls -n flux
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
flux            flux            8               2020-09-03 13:07:10.5118373 -0700 PDT   deployed        flux-1.5.0              1.20.2
helm-operator   flux            6               2020-08-11 23:16:01.8447855 -0700 PDT   deployed        helm-operator-1.2.0     1.2.0
$ fluxctl release --dry-run --update-all-images --all
Submitting dry-run release ...
WORKLOAD                STATUS   UPDATES
staging:deployment/aps  success  aps: 121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-platform:stg-71fead0 -> stg-1659ed5
$ fluxctl release --update-all-images --all
Submitting release ...
Error: timeout
Run 'fluxctl release --help' for usage.

Expected behavior

A useful message somewhere indicating the source of the failure.

Logs

ts=2020-09-03T23:36:25.026086631Z caller=warming.go:206 component=warmer updated=121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-activities successful=1 attempted=1
ts=2020-09-03T23:36:25.165516072Z caller=warming.go:198 component=warmer info="refreshing image" image=121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-platform tag_count=37 to_update=1 of_which_refresh=1 of_which_missing=0
ts=2020-09-03T23:36:25.297228321Z caller=warming.go:206 component=warmer updated=121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-platform successful=1 attempted=1
ts=2020-09-03T23:37:30.297215071Z caller=warming.go:198 component=warmer info="refreshing image" image=121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-platform tag_count=37 to_update=1 of_which_refresh=1 of_which_missing=0
ts=2020-09-03T23:37:30.456804492Z caller=warming.go:206 component=warmer updated=121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-platform successful=1 attempted=1

Additional context

Flux version: 1.20.1 and 1.20.2
Kubernetes version: 1.18.6
Git provider: github
Container registry provider: aws ecr

blocked-needs-validation bug

Source

zoranpopovic

👍2

Most helpful comment

I can briefly document the source of the issue for my cluster. When flux uninstalls a particular resource (in my case the istio operator) and the uninstall fails because finalizers on some resources never finish, no further flux operations will succeed: syncs, additions, or removals. I had to look at all leftover resources in the istio-system namespace (the namespace where the installation and removal happened) to see which ones shouldn't be there. It turns out kubectl cannot easily delete such resources either: you have to remove the finalizer by patching the resource first, and then delete the resource after the patch. As soon as you remove all offending resources (those that flux tried to delete but failed), things will start working again.

In general, the silent flux sync failure seems to happen because the cluster is not in the state flux wants it to be in, due to a dangling resource. Flux doesn't say which one, however, so you've got to do your own debugging work. Hopefully the new version will report the precise reason and the offending resource when it fails.

zoranpopovic on 29 Sep 2020

👍2

All 12 comments

FWIW - I'm having the same issues. Flux just stopped refreshing my container images which are tagged as automated. If I auto-commit to flux-patch.yaml and manually update the image tags, then flux _usually_ will update them. Nothing in my flux logs is interesting, same as in the post above. It is very frustrating, and I have no insight into debugging it.

When I list images on a particular container....

jenkins01:deployment/posapi         posapi             REDACTED.dkr.ecr.us-east-1.amazonaws.com/redacted/posapi
                                                       |   PR-221-dc6f2328                                                                          16 Sep 20 10:35 UTC
                                                       |   PR-221-latest                                                                            16 Sep 20 10:35 UTC
                                                       |   feature_HOSPENG-1274_bridge-role-inherited-entitlements-dc6f2328                         16 Sep 20 10:35 UTC
                                                       |   feature_HOSPENG-1274_bridge-role-inherited-entitlements-latest                           16 Sep 20 10:35 UTC
                                                       |   feature_HOSPENG-1280_bridge-role-inherited-entitlements-dc6f2328                         16 Sep 20 10:35 UTC
                                                       |   feature_HOSPENG-1280_bridge-role-inherited-entitlements-latest                           16 Sep 20 10:35 UTC
                                                       |   feature_HOSPENG-1280_employee-specific-policies-da00e5c5                                 15 Sep 20 23:49 UTC
                                                       |   feature_HOSPENG-1280_employee-specific-policies-latest                                   15 Sep 20 23:49 UTC
                                                       |   master-000f7097                                                                          15 Sep 20 23:49 UTC
                                                       |   master-latest                                                                            15 Sep 20 23:49 UTC
                                                       : (42 image(s) omitted)
                                                       '-> master-48f6faf7                                                                          15 Sep 20 15:41 UTC

Note that the deployed container master-48f6faf7 is older than master-000f7097, but flux will not update it.

When I ran fluxctl release on this ns, it just times out, and prints nothing useful in the flux logs. What can I do to resolve this?

FYI I'm running the latest release, which I believe is flux 1.20.2 if I recall. I just upgraded it two days ago.

davisford on 16 Sep 2020

I am having the exact same problem. Has anyone found a solution yet?

aji-suprana on 20 Sep 2020

Same problem here. Automated releases fail silently... nothing happens.

flux 1.20.2

jfassad on 29 Sep 2020

zoranpopovic on 29 Sep 2020

👍2

Interestingly enough, I too run istio-operator. @zoranpopovic
do you have any tips on how I can find the offending dangling resource? I'm not sure how to start. Thank you.

jfassad on 29 Sep 2020

I would look into what flux did right before it stopped syncing. Did you remove a resource or a helm chart? If you know the namespace where the change occurred, list everything left in it (or, if you don't, look at all namespaces) by running this bash function with the namespace as its argument:

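function kubectlgetall {
  # Iterate over every namespaced resource type that supports "list", skipping events.
  for i in $(kubectl api-resources --verbs=list --namespaced -o name | grep -v "events" | sort -u); do
    echo "Resource: $i"
    # $1 is the namespace to inspect; --ignore-not-found suppresses "not found" errors.
    kubectl -n "${1}" get --ignore-not-found "${i}"
  done
}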

Hopefully you'll see that some of the resources shouldn't be there because flux or helm-operator should have uninstalled them. Most likely culprits are CRDs.
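
To narrow that listing down, here is a minimal sketch of a variant that prints only objects whose deletion has been requested but not completed, i.e. objects with metadata.deletionTimestamp set. The function name kubectl_stuck_deleting is hypothetical, and the sketch assumes kubectl's JSONPath output supports an existence filter of the form [?(@.field)]; treat it as a starting point rather than a verified diagnostic.

```shell
# Hypothetical helper: list namespaced objects that are stuck in deletion.
# $1 is the namespace to inspect.
kubectl_stuck_deleting() {
  local ns="$1"
  local kind
  # Same resource-type enumeration as kubectlgetall, skipping events.
  for kind in $(kubectl api-resources --verbs=list --namespaced -o name | grep -v "events" | sort -u); do
    # Print kind/name and the remaining finalizers for every object that
    # has a deletionTimestamp (assumes the JSONPath existence filter works).
    kubectl -n "$ns" get "$kind" --ignore-not-found \
      -o jsonpath='{range .items[?(@.metadata.deletionTimestamp)]}{.kind}{"/"}{.metadata.name}{"\t"}{.metadata.finalizers}{"\n"}{end}'
  done
}

# Usage (namespace taken from this thread):
#   kubectl_stuck_deleting istio-system
```

Any object it prints is one whose finalizers have not finished, which matches the kind of dangling resource described in this thread.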

Looking at the flux pod log in k9s, you may find entries about flux trying to delete some resources. Those deletes fail, but flux doesn't say so.

Once you find them, it is possible that the resources cannot be removed because their finalizers never finish (istio-operator is an example). The only way to remove such resources is to patch each one to remove the finalizer, and then delete it after the patch (for details, see https://github.com/kubernetes/kubernetes/issues/60538#issuecomment-369099998).
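
A minimal sketch of that patch-then-delete sequence for a single namespaced object follows. The function name remove_finalizers_and_delete is hypothetical, and the arguments in the usage comment are placeholders. It uses kubectl patch with a JSON merge patch that sets metadata.finalizers to null, then issues a delete in case the object is still present. Clearing finalizers skips whatever cleanup they were guarding, so use it only on objects you have confirmed are leftovers.

```shell
# Hypothetical helper: clear finalizers on one object, then delete it.
# Arguments: <namespace> <kind> <name>
remove_finalizers_and_delete() {
  local ns="$1" kind="$2" name="$3"
  # A merge patch with finalizers set to null removes the whole list.
  kubectl -n "$ns" patch "$kind" "$name" --type=merge \
    -p '{"metadata":{"finalizers":null}}' || return 1
  # If deletion was already pending, the object may vanish as soon as the
  # finalizers are gone, so tolerate "not found" here.
  kubectl -n "$ns" delete "$kind" "$name" --ignore-not-found --wait=false
}

# Usage (placeholders, not real object names):
#   remove_finalizers_and_delete <namespace> <kind> <name>
```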

Good luck.

zoranpopovic on 30 Sep 2020

the only way to remove such elements is to patch the resource by removing the finalizer, and then removing it after the patch

Just adding that when you manually delete a Finalizer from a resource, you'll need to do the cleanup that is related to that Finalizer on your own.

stealthybox on 30 Sep 2020

FWIW, I enabled logging in EKS, and am seeing some errors like this:

I1007 22:28:53.857158       1 deployment_controller.go:484] Error syncing deployment shaun/postools: Operation cannot be fulfilled on replicasets.apps "postools-657878cf6c": the object has been modified; please apply your changes to the latest version and try again

Not sure if that has an impact...still debugging.

davisford on 8 Oct 2020

I've also found that a complete uninstall and re-install of flux sometimes resolves it, and it will then release the stale automated deployments. Can anyone from flux comment on this issue? What else can I do to provide debug information to help fix this issue? It seems there are several people who are having it.

davisford on 9 Oct 2020

It seems different people are having different issues here that manifest themselves somewhat similarly. I've managed to reproduce what is causing my problem.

My problem happens when I have cert-manager >= 1.60.0 CRDs in my repository.
this one, for example - https://github.com/jetstack/cert-manager/releases/download/v1.0.2/cert-manager.crds.yaml

If I put it in my repository, fluxd will not be able to generate and add the automatic commits for automatic releases... it will hang and fail silently.

The regular git sync job will work and apply new changes that are committed to the repository until an automatic release happens. Then everything will hang and stop working without errors.

I've tested and reproduced this problem in any flux release >= 1.20.1

jfassad on 9 Oct 2020

Thanks @jfassad I don't have that particular CRD. I have a few CRDs from nats.io and portworx, but none that are lingering.

I do see flux constantly trying to update one of my portworx containers, even though I tried to annotate it with flux.io/frozen=true and it doesn't even show up when I list workloads. I have no idea why it keeps doing this, how to stop it, or whether it may also be contributing to halting flux automations.

Here are some logs which are repeated endlessly by flux:

ts=2020-10-09T15:08:19.687372495Z caller=warming.go:198 component=warmer info="refreshing image" image=portworx/oci-monitor tag_count=3788 to_update=2 of_which_refresh=0 of_which_missing=2
ts=2020-10-09T15:08:19.842812927Z caller=repocachemanager.go:226 component=warmer canonical_name=index.docker.io/portworx/oci-monitor auth={map[]} err="unknown blob" ref=portworx/oci-monitor:b471172_e9ee1d6
ts=2020-10-09T15:08:19.845604863Z caller=repocachemanager.go:226 component=warmer canonical_name=index.docker.io/portworx/oci-monitor auth={map[]} err="unknown blob" ref=portworx/oci-monitor:b471172_e9ee1d6
ts=2020-10-09T15:08:19.84569312Z caller=warming.go:206 component=warmer updated=portworx/oci-monitor successful=0 attempted=2
ts=2020-10-09T15:08:19.845836694Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"

Follow-up question: do any of you that are having problems with this use any operators in your cluster? I have the nats.io operator, but that's it.

davisford on 9 Oct 2020

Thanks @jfassad for the idea.

We do have cert-manager as part of our kustomization.yaml in test.

# file: base/cert-manager/kustomization.yaml
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - https://github.com/jetstack/cert-manager/releases/download/v1.0.3/cert-manager.yaml
  - clusterissuer.yml
patchesStrategicMerge:
  - https://github.com/jetstack/cert-manager/releases/download/v1.0.3/cert-manager.crds.yaml

But it seems more like kustomize build . is too slow to generate the YAML, and so flux fails. We tried all the timeout options we could (--git-timeout, --sync-timeout, --rpc-timeout) but made no progress.
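
To put a number on "too slow", here is a minimal sketch that times kustomize build for a given directory with whole-second resolution. The function name time_kustomize_build is hypothetical. Running it locally only approximates what happens inside the flux pod, and nothing in this thread establishes which flux timeout, if any, the build duration counts against.

```shell
# Hypothetical helper: report how long `kustomize build` takes for a directory.
# $1 is the kustomization directory (defaults to the current directory).
time_kustomize_build() {
  local dir="${1:-.}"
  local start end
  start=$(date +%s)
  # Discard the generated YAML; only the duration is of interest here.
  kustomize build "$dir" > /dev/null || return 1
  end=$(date +%s)
  echo "kustomize build ${dir} took $((end - start))s"
}

# Usage (path taken from the kustomization file shown in this comment):
#   time_kustomize_build base/cert-manager
```

Comparing the result with and without the cert-manager resources would show how much of the build time they account for.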

We also tried to downgrade flux to 1.19.0, but then the built-in kustomize version is too old and so the build fails...

Our only working solution is to remove cert-manager from kustomize, which considerably speeds up the kustomize build . command.

Has anyone been able to successfully increase this timeout?

Thanks for your help =)