Flux: FluxCD seems to have a never-expiring negative DNS cache

Created on 26 May 2020 · 7 Comments · Source: fluxcd/flux

Describe the bug

If Flux fails to resolve the Git repository's DNS name for some reason, it never seems to try resolving it again. It looks like some never-expiring (or very long-lived?) negative caching is going on.

In my case this can happen when bringing up a cluster with Terraform: the DNS service is not yet running when Flux starts.

Expected behavior

When cloning the repo is retried, a full DNS resolution should be attempted again.
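
As a rough illustration of that expectation (this is not Flux's actual code), the sketch below asks the resolver again on every attempt instead of reusing an earlier failure. `resolve_with_retry` and all of its parameters are hypothetical names; the resolver and sleep functions are injectable so the behaviour can be exercised without a network.

```python
import socket
import time


def resolve_with_retry(host, attempts=5, delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve `host`, performing a fresh lookup on every attempt.

    Hypothetical helper for illustration only: nothing is remembered
    between attempts, so a failure at startup cannot outlive the outage.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Port 22 because the repository in this report is cloned over SSH.
            return resolver(host, 22)
        except OSError as err:  # socket.gaierror is a subclass of OSError
            last_error = err
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

Called as `resolve_with_retry("gitlab.com")`, this either returns the address list from the first successful lookup or re-raises the last resolver error once the attempts are used up.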

Logs

Flux gets stuck with:

flux-apps-8b787fd74-9nbmw flux ts=2020-05-26T12:37:16.63374351Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone030946625'...\nssh: Could not resolve hostname gitlab.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"

DNS can resolve the name correctly:
kubectl port-forward -n kube-system service/kube-dns 5353:53

dig @localhost -p 5353 +tcp +short gitlab.com
172.65.251.78

Killing the pod fixes the issue, which suggests, to me at least, that resolution itself is not the real problem and that some permanent negative caching is going on here.
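
Since restarting the pod is the only known workaround here, a watchdog would first need to recognise the stuck state in the logs. A minimal sketch of that detection step, assuming the ssh error text stays exactly as in the log line above; `unresolved_host` is a hypothetical helper, not part of Flux or fluxctl.

```python
import re

# Matches the ssh error text that appears in the log output quoted above.
_RESOLVE_ERROR = re.compile(r"ssh: Could not resolve hostname (\S+): ")


def unresolved_host(log_line):
    """Return the hostname ssh failed to resolve, or None if the line
    does not contain that error."""
    match = _RESOLVE_ERROR.search(log_line)
    return match.group(1) if match else None
```

This only flags the symptom. Deciding how many consecutive failures justify deleting the pod is left to whatever calls it.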

Additional context

Flux version: 1.19.0
Kubernetes version: 1.14.10
Git provider: gitlab.com
Container registry provider:

blocked-needs-validation bug

primeroz

👍12

All 7 comments

I've run into this exact same issue. I've tried updating ndots, etc. This is especially common when the cluster first comes up for some reason.

Also in the environment I am deploying with, it is expected that connectivity will go out for multiple days sometimes. So these environments never recover.

brokenjacobs on 3 Jun 2020

I don't think ndots will help, since the base image is Alpine.

I am planning to build a version with Debian glibc just to see how that does.

In my case, when the cluster comes up it is entirely down to chance whether the DNS server is already available when Flux starts, so if Flux comes up first I hit this issue.

primeroz on 3 Jun 2020

We ran into this recently: a Flux agent was unable to resolve DNS for more than 5 days.
We have multiple instances of Flux running in our clusters on the same node (so connectivity isn't the problem), yet only one of the Flux agents had this issue.

krackjack29 on 5 Jun 2020

I just had this occur when DNS in the cluster was definitely running, so it's not just DNS availability. Of course, there could have been a transient network outage that caused it.

What is odd is that if you exec into the pod, DNS resolves just fine via ssh/git/ping/host. But running fluxctl sync implies that the error comes back from an exec:

Error: can not connect to git repository with URL ssh://git@*********/*/*.git

Full error message: git clone --mirror: fatal: Could not read from remote repository., full output:
 Cloning into bare repository '/tmp/flux-gitclone439914880'...
ssh: Could not resolve hostname ********: Try again
fatal: Could not read from remote repository.

This isn't a golang error, it's an error from the git binary. Why would the git binary be caching NXDOMAIN or whatever it's caching here?

brokenjacobs on 5 Jun 2020

This may be a dupe of #3042

tshak on 19 Jun 2020

👍2

Looks like it is a dupe.

brokenjacobs on 21 Jun 2020

This issue may seem like a dupe, but it certainly has not been fixed in 1.20.x.

We have experienced this across many of our clusters over the past few days on the latest release.