Is there a standard way to debug a flux release or sync timeout message? Flux has been working for more than a year, and now fluxctl sync and fluxctl release consistently time out. Increasing the timeout to 10 minutes has the same outcome. Verbose flags show no extra information. This is on the latest flux release. Logs show no errors or warnings... Where should I be looking to dig deeper into the cause? It's driving me crazy.
$ helm ls -n flux
NAME           NAMESPACE  REVISION  UPDATED                                STATUS    CHART                APP VERSION
flux           flux       8         2020-09-03 13:07:10.5118373 -0700 PDT  deployed  flux-1.5.0           1.20.2
helm-operator  flux       6         2020-08-11 23:16:01.8447855 -0700 PDT  deployed  helm-operator-1.2.0  1.2.0
$ fluxctl release --dry-run --update-all-images --all
Submitting dry-run release ...
WORKLOAD                STATUS   UPDATES
staging:deployment/aps  success  aps: 121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-platform:stg-71fead0 -> stg-1659ed5
$ fluxctl release --update-all-images --all
Submitting release ...
Error: timeout
Run 'fluxctl release --help' for usage.
Expected behavior
A useful message about the source of the failure, somewhere.
Logs
ts=2020-09-03T23:36:25.026086631Z caller=warming.go:206 component=warmer updated=121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-activities successful=1 attempted=1
ts=2020-09-03T23:36:25.165516072Z caller=warming.go:198 component=warmer info="refreshing image" image=121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-platform tag_count=37 to_update=1 of_which_refresh=1 of_which_missing=0
ts=2020-09-03T23:36:25.297228321Z caller=warming.go:206 component=warmer updated=121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-platform successful=1 attempted=1
ts=2020-09-03T23:37:30.297215071Z caller=warming.go:198 component=warmer info="refreshing image" image=121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-platform tag_count=37 to_update=1 of_which_refresh=1 of_which_missing=0
ts=2020-09-03T23:37:30.456804492Z caller=warming.go:206 component=warmer updated=121989234971.dkr.ecr.us-west-2.amazonaws.com/kube-platform successful=1 attempted=1
Additional context
FWIW - I'm having the same issues. flux just stopped refreshing my container images which are tagged as automated. If I auto-commit to flux-patch.yaml and manually update the image tags, then flux _usually_ will update them. Nothing in my flux logs that is interesting -- same as the post above. It is very frustrating, and I have no insight into debugging it.
When I list images on a particular container....
jenkins01:deployment/posapi posapi REDACTED.dkr.ecr.us-east-1.amazonaws.com/redacted/posapi
| PR-221-dc6f2328 16 Sep 20 10:35 UTC
| PR-221-latest 16 Sep 20 10:35 UTC
| feature_HOSPENG-1274_bridge-role-inherited-entitlements-dc6f2328 16 Sep 20 10:35 UTC
| feature_HOSPENG-1274_bridge-role-inherited-entitlements-latest 16 Sep 20 10:35 UTC
| feature_HOSPENG-1280_bridge-role-inherited-entitlements-dc6f2328 16 Sep 20 10:35 UTC
| feature_HOSPENG-1280_bridge-role-inherited-entitlements-latest 16 Sep 20 10:35 UTC
| feature_HOSPENG-1280_employee-specific-policies-da00e5c5 15 Sep 20 23:49 UTC
| feature_HOSPENG-1280_employee-specific-policies-latest 15 Sep 20 23:49 UTC
| master-000f7097 15 Sep 20 23:49 UTC
| master-latest 15 Sep 20 23:49 UTC
: (42 image(s) omitted)
'-> master-48f6faf7 15 Sep 20 15:41 UTC
Note that the deployed container master-48f6faf7 is older than master-000f7097, but flux will not update it.
When I ran fluxctl release on this ns, it just times out, and prints nothing useful in the flux logs. What can I do to resolve this?
FYI I'm running the latest release, which I believe is flux 1.20.2 if I recall. I just upgraded it two days ago.
I am having the exact same problem. Has anyone found a solution yet?
Same problem here. Automated releases fail silently... nothing happens.
flux 1.20.2
I can briefly document the source of the issue for my cluster. When flux uninstalls a resource (in my case the istio operator) and the uninstall fails because finalizers on some resources never finish, no further flux operations (syncs, additions, removals) will succeed. I had to look at all the leftover resources in the istio-system namespace (the namespace where the installation and removal happened) to see which ones shouldn't be there. It turns out kubectl cannot easily delete such resources either: you have to remove the finalizer by patching the resource first, then delete the resource after the patch. As soon as you remove all the offending resources (those that flux tried to delete but failed), things will start working again.
In general, the silent flux sync failure seems to happen because the cluster is not in the state flux wants it to be in, due to a dangling resource. It doesn't say which one, however, so you've got to do your own debugging work. Hopefully the new version will provide a precise reason and the offending resource when it fails.
Interestingly enough I too run istio-operator @zoranpopovic
Do you have any tips on how I can find the offending dangling resource? I'm not sure how to start. Thank you.
I would look into what flux did right before it stopped syncing. Did you remove a resource or a helm chart? If you know the namespace where the change occurred, list everything in it (if not, go through all namespaces) by running this bash function:
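# List every namespaced resource in one namespace.
# Usage: kubectlgetall <namespace>
function kubectlgetall {
  local ns="${1:?usage: kubectlgetall <namespace>}"
  # Skip events: they are noisy and not useful when hunting leftover resources.
  for i in $(kubectl api-resources --verbs=list --namespaced -o name | grep -v "events" | sort -u); do
    echo "Resource: $i"
    kubectl -n "$ns" get --ignore-not-found "$i"
  done
}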
Hopefully you'll see that some of the resources shouldn't be there because flux or helm-operator should have uninstalled them. Most likely culprits are CRDs.
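To narrow that listing down, one option is to show only objects that are already marked for deletion but still present, which is what a stuck finalizer looks like. This is a rough sketch, not something from the flux docs: `kubectl_stuck` is a made-up helper name, and it assumes that an object with `metadata.deletionTimestamp` set is the kind of leftover described above.

```shell
# Sketch: list objects in a namespace that are marked for deletion but still
# exist, usually because a finalizer has not completed.
# `kubectl_stuck` is a hypothetical helper name, not part of kubectl or flux.
kubectl_stuck() {
  local ns="${1:?usage: kubectl_stuck <namespace>}"
  local kind
  for kind in $(kubectl api-resources --verbs=list --namespaced -o name | grep -v events | sort -u); do
    # custom-columns prints "<none>" for unset fields, so keep only rows whose
    # second column (deletionTimestamp) holds a real value.
    kubectl -n "$ns" get "$kind" --ignore-not-found --no-headers \
      -o custom-columns='NAME:.metadata.name,DELETING:.metadata.deletionTimestamp,FINALIZERS:.metadata.finalizers' |
      awk -v kind="$kind" '$2 != "<none>" { print kind, $0 }'
  done
}
```

For example, `kubectl_stuck istio-system` would print one line per stuck object with its kind, name, deletion timestamp and finalizers.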
Looking at the flux pod log in k9s, you may find entries about flux trying to delete some resources. Those deletes fail, but flux doesn't say so.
Once you find them, it is possible that the resources cannot be removed because the finalizers on them have not finished (istio-operator is an example). The only way to remove such resources is to patch the resource to remove the finalizer, and then delete it after the patch (for details see https://github.com/kubernetes/kubernetes/issues/60538#issuecomment-369099998).
Good luck.
the only way to remove such elements is to patch the resource by removing the finalizer, and then removing it after the patch
Just adding that when you manually delete a Finalizer from a resource, you'll need to do the cleanup that is related to that Finalizer on your own.
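For reference, the patch-then-delete step described above could look roughly like this. It is a sketch only: `remove_finalizers` is a made-up helper name, and the namespace, kind and name are whatever you found to be stuck. As noted, clearing finalizers skips the cleanup the owning controller would normally do.

```shell
# Sketch: clear the finalizers on one stuck object so a pending delete can finish.
# `remove_finalizers` is a hypothetical helper name. Clearing finalizers skips
# whatever cleanup the owning controller would have performed.
remove_finalizers() {
  local ns="${1:?usage: remove_finalizers <namespace> <kind> <name>}"
  local kind="${2:?missing kind}"
  local name="${3:?missing name}"
  # A JSON merge patch that sets metadata.finalizers to null removes the list.
  kubectl -n "$ns" patch "$kind" "$name" --type=merge -p '{"metadata":{"finalizers":null}}'
  # If the object was never marked for deletion, delete it now; otherwise it
  # goes away once the finalizers are gone and this is a no-op.
  kubectl -n "$ns" delete "$kind" "$name" --ignore-not-found
}
```

Usage would be `remove_finalizers <namespace> <kind> <name>`, once per offending object.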
FWIW, I enabled logging in EKS, and am seeing some errors like this:
I1007 22:28:53.857158 1 deployment_controller.go:484] Error syncing deployment shaun/postools: Operation cannot be fulfilled on replicasets.apps "postools-657878cf6c": the object has been modified; please apply your changes to the latest version and try again
Not sure if that has an impact...still debugging.
I've also found that a complete uninstall and re-install of flux sometimes resolves it, and it will release the stale automated deployments. Can anyone from flux comment on this issue? What else can I do to provide debug information to help fix it? It seems several people are having it.
It seems different people are having different issues here that manifest themselves somewhat similarly. I've managed to reproduce what is causing my problem.
My problem happens when I have cert-manager >= 1.60.0 CRDs in my repository.
This one, for example: https://github.com/jetstack/cert-manager/releases/download/v1.0.2/cert-manager.crds.yaml
If I put it in my repository, fluxd will not be able to generate and add the automatic commits for automatic releases. It will hang and fail silently.
The regular git sync job will work and apply new changes that are committed to the repository until an automatic release happens. Then everything will hang and stop working without errors.
I've tested and reproduced this problem in any flux release >= 1.20.1
Thanks @jfassad I don't have that particular CRD. I have a few CRDs from nats.io and portworx, but none that are lingering.
I do see flux constantly trying to update one of my portworx containers, even though I tried to annotate it with flux.io/frozen=true and it doesn't even show up when I list workloads. So I have no idea why it keeps doing this, how to stop it, or whether it may also be contributing to halting flux automations.
Here are some logs that flux repeats endlessly:
ts=2020-10-09T15:08:19.687372495Z caller=warming.go:198 component=warmer info="refreshing image" image=portworx/oci-monitor tag_count=3788 to_update=2 of_which_refresh=0 of_which_missing=2
ts=2020-10-09T15:08:19.842812927Z caller=repocachemanager.go:226 component=warmer canonical_name=index.docker.io/portworx/oci-monitor auth={map[]} err="unknown blob" ref=portworx/oci-monitor:b471172_e9ee1d6
ts=2020-10-09T15:08:19.845604863Z caller=repocachemanager.go:226 component=warmer canonical_name=index.docker.io/portworx/oci-monitor auth={map[]} err="unknown blob" ref=portworx/oci-monitor:b471172_e9ee1d6
ts=2020-10-09T15:08:19.84569312Z caller=warming.go:206 component=warmer updated=portworx/oci-monitor successful=0 attempted=2
ts=2020-10-09T15:08:19.845836694Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
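Since these lines are key=value pairs, one way to spot repeating failures in a long log is to keep only lines with an `err=` field, drop the timestamp, and count duplicates. This is just a sketch: `flux_log_errors` is a made-up helper name, and it assumes the log format shown above.

```shell
# Sketch: summarize flux log lines that carry an err=... field.
# `flux_log_errors` is a hypothetical helper name; it reads log lines on stdin.
flux_log_errors() {
  # Keep lines with an err= field, strip the leading ts=... so identical
  # errors collapse together, then count them, most frequent first.
  grep -E '(^| )err=' | sed -E 's/^ts=[^ ]+ //' | sort | uniq -c | sort -rn
}
```

It could be fed with something like `kubectl -n flux logs deployment/flux | flux_log_errors`, assuming the flux deployment is named `flux` in the `flux` namespace.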
Follow-up question: do any of you that are having problems with this use any operators in your cluster? I have the nats.io operator, but that's it.
Thanks @jfassad for the idea.
We do have cert-manager as part of our kustomization.yaml in test.
# file: base/cert-manager/kustomization.yaml
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://github.com/jetstack/cert-manager/releases/download/v1.0.3/cert-manager.yaml
- clusterissuer.yml
patchesStrategicMerge:
- https://github.com/jetstack/cert-manager/releases/download/v1.0.3/cert-manager.crds.yaml
But it seems more like kustomize build . is too slow to generate the YAML, and so flux fails. We tried all the timeout options we could (--git-timeout, --sync-timeout, --rpc-timeout) but made no progress.
We also tried to downgrade flux to 1.19.0, but then the built-in kustomize version is too old and the build fails...
Our only working setup is to remove cert-manager from kustomize, which considerably speeds up the kustomize build . command.
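To compare build times with and without cert-manager, a small timing wrapper can be run locally against each directory. This is only a sketch: `time_kustomize_build` is a made-up helper name, and it assumes a local kustomize binary of roughly the same version as the one built into flux. The remote URLs in the kustomization are fetched on each build, which may account for part of the time.

```shell
# Sketch: measure how long `kustomize build` takes for a directory.
# `time_kustomize_build` is a hypothetical helper name.
time_kustomize_build() {
  local dir="${1:-.}"
  local start end
  start=$(date +%s)
  # Discard the generated YAML; only the duration matters here.
  kustomize build "$dir" > /dev/null || return 1
  end=$(date +%s)
  echo "kustomize build ${dir} took $((end - start))s"
}
```

For example, `time_kustomize_build base/cert-manager` times just the cert-manager base from the kustomization above.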
Has anyone successfully been able to increase this timeout?
Thanks for your help =)