Flux: Git sync stuck in "git repo not ready: ... Could not resolve hostname"

Created on 17 Apr 2020  路  21Comments  路  Source: fluxcd/flux

Describe the bug

We noticed that in one of clusters after period of time (from 4h to 12h) flux stopped to synchronise git state with cluster. In logs for every attempt there was error message:

ts=2020-04-17T13:17:50.333632284Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"

After some debugging it narrowed down to DNS pods getting killed because of preemptable nodes, which caused first error.
Seems like after initial git clone --mirror error repo state machine goes into NotReady state and not recovering from it on next sync.

Possibly sync is stopped here: https://github.com/fluxcd/flux/blob/master/pkg/git/repo.go#L491

Maybe related: https://github.com/fluxcd/flux/commit/65a36ff37507eb72bd72d323de3238a70ed47633

Also seems like issue is not appearing in 1.18.0

To Reproduce

  1. Install fluxcd
  2. Configure flux to clone some git repository
  3. Introduce disruption of DNS service (in our cause it was preemptable nodes)
  4. Wait for flux to try sync next time.
  5. On git sync clone --mirror fails and not recovering

Restart of flux pod solves state issues.

Expected behavior

After git sync failed to clone git repository it is expected to try again next time and continue syncing.

Logs

ts=2020-04-17T13:14:15.392190754Z caller=warming.go:180 component=warmer .... dial tcp [2a00:1450:400c:c07::52]:443: connect: cannot assign requested address"
ts=2020-04-17T13:14:20.409855924Z caller=warming.go:180 component=warmer .... dial tcp: lookup eu.gcr.io on 10.241.0.10:53: read udp 10.9.0.3:49907->10.241.0.10:53: read: connection refused"
ts=2020-04-17T13:14:25.445226769Z caller=warming.go:180 component=warmer .... dial tcp [2a00:1450:400c:c07::52]:443: connect: cannot assign requested address"
ts=2020-04-17T13:14:35.454503407Z caller=warming.go:180 component=warmer .... dial tcp: lookup eu.gcr.io on 10.241.0.10:53: read udp 10.9.0.3:50822->10.241.0.10:53: i/o timeout"
ts=2020-04-17T13:14:50.330833856Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:14:58.125384491Z caller=memcached.go:156 component=memcached err="error updating memcache servers: lookup flux-memcached on 10.241.0.10:53: no such host"
ts=2020-04-17T13:15:01.538772054Z caller=warming.go:198 component=warmer info="refreshing image" image=eu.gcr.io/...
ts=2020-04-17T13:15:01.88779753Z caller=warming.go:206 component=warmer updated=eu.gcr.io/...
ts=2020-04-17T13:15:21.394495122Z caller=warming.go:198 component=warmer info="refreshing image" image=eu.gcr.io/...
ts=2020-04-17T13:15:21.784532144Z caller=warming.go:206 component=warmer updated=eu.gcr.io/...
ts=2020-04-17T13:15:31.241835911Z caller=warming.go:198 component=warmer info="refreshing image" image=eu.gcr.io/...
ts=2020-04-17T13:15:31.651661887Z caller=warming.go:206 component=warmer updated=eu.gcr.io/...
ts=2020-04-17T13:15:50.331528463Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:16:50.332287525Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:16:59.150641747Z caller=warming.go:198 component=warmer info="refreshing image" image=eu.gcr.io/...
ts=2020-04-17T13:16:59.434819987Z caller=warming.go:206 component=warmer updated=eu.gcr.io/...
ts=2020-04-17T13:17:04.587246924Z caller=warming.go:198 component=warmer info="refreshing image" image=eu.gcr.io/...
ts=2020-04-17T13:17:04.907484875Z caller=warming.go:206 component=warmer updated=eu.gcr.io/...
ts=2020-04-17T13:17:11.406797895Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-04-17T13:17:11.407131164Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-04-17T13:17:50.333632284Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:18:50.334127475Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:19:50.334906573Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:20:50.335152806Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:21:50.335959104Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:22:11.409472142Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-04-17T13:22:11.409863167Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-04-17T13:22:50.33668982Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:23:50.337218821Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:24:50.337896126Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:25:50.338393614Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:26:04.359257312Z caller=warming.go:198 component=warmer info="refreshing image" image=eu.gcr.io/...
ts=2020-04-17T13:26:04.721327374Z caller=warming.go:206 component=warmer updated=eu.gcr.io/...
ts=2020-04-17T13:26:04.731628848Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-04-17T13:26:04.731882376Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-04-17T13:26:50.339800202Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:27:50.340519131Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:28:50.341078645Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:29:50.341550926Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:30:50.342268738Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:31:04.744994472Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-04-17T13:31:04.74535697Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-04-17T13:31:50.34279248Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-04-17T13:32:50.344290038Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone460590104'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"

Additional context

  • Flux version: 1.19.0
  • Kubernetes version: 1.15.9
  • Git provider: github.com
  • Container registry provider: eu.gcr.io
blocked-needs-validation bug

Most helpful comment

Simply deleting the flux pod (and letting the flux deployment create a new pod) resolved the issue for me.

$ kubectl -n flux delete pod -l app=flux
pod "flux-xxxxx-xxxxx" deleted

This problem seems to randomly crop up every few months and is a bit frustrating.

All 21 comments

To add, running fluxctl --k8s-fwd-ns=fluxcd sync results in

Error: can not connect to git repository with URL ssh://[email protected]/...

Full error message: git clone --mirror: fatal: Could not read from remote repository., full output:
 Cloning into bare repository '/tmp/flux-gitclone400237376'...
ssh: Could not resolve hostname github.com: Try again
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Run 'fluxctl sync --help' for usage.

But trying to resolve hostname from inside container works.

we experienced the same issue after upgrading to 1.19.0 from 1.18.0. Kubernetes version 1.16. That's not to pin the problem on the new version, but in about a year of running Flux, it's the first time we've seen it.

a liveness check on the container which failed if the remote registry cannot be resolved would be helpful

@willholley hiddeco/flux:issue-3011-unreachable-a0d65e42 from the above PR should simply re-attempt when the failure is detected.

@hiddeco I'm not totally following where it refreshes. Would a state of RepoUnreachable not result in an error returned by https://github.com/fluxcd/flux/blob/65a36ff37507eb72bd72d323de3238a70ed47633/pkg/git/repo.go#L204 and then a failure to retry at https://github.com/fluxcd/flux/blob/65a36ff37507eb72bd72d323de3238a70ed47633/pkg/git/repo.go#L439?

The only way we've been able to recover from this state is by restarting the flux pods everywhere

It actually refreshes here: https://github.com/fluxcd/flux/pull/3013/files#diff-80107e0f51c7558548631f8328cf93ceR305

Log output from above image:

ts=2020-04-20T13:54:13.346147607Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: un
 not resolve host: unreachable-example.com\n"
ts=2020-04-20T13:59:13.346104974Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-04-20T13:59:13.346229632Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-04-20T13:59:13.349916243Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: un
 not resolve host: unreachable-example.com\n"
ts=2020-04-20T14:04:13.346418475Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-04-20T14:04:13.346510134Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-04-20T14:04:13.357056096Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: un
 not resolve host: unreachable-example.com\n"
ts=2020-04-20T14:09:13.346857642Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-04-20T14:09:13.346905568Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-04-20T14:09:13.357218899Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: un
 not resolve host: unreachable-example.com\n"

fluxct sync:

```
Error: git repository https://@unreachable-example.com/foo/test-repo is not ready to sync

Full error message: git clone --mirror: fatal: unable to access 'https://lasldaksdkasdk.com/foo/test-repo/': Could not resolve
Cloning into bare repository '/tmp/flux-gitclone364300143'...
fatal: unable to access 'https://unreachable-example.com/foo/test-repo/': Could not resolve host: unreachable-example.com

ah - I see. So should be fixed in the next release...

@willholley Are you aware that due to this bug no Flux tutorial using fluxctl works anymore?
It would be great to be able to explicitly specify a desired version during the Flux installation. Is there another workaround besides manually editing the image tag of the pod?

Using the currrent helm tutorial, even changing just the image to 1.18.0 using
--set image.tag=1.18.0
results in crash-backoff
Error: unknown flag: --ssh-keygen-format

So is the workaround reverting the pod back to 1.18.0? Because I'm getting the same error reported by @johnrkriter

Edit: Removed --ssh-keygen-format from the spec and now 1.18.0 works fine

As another datapoint, my underlying cluster OS was OOB Centos 7.7 using rancher to create my cluster.
When I switched to Ubuntu 18.04 for my OS, I no longer experience an issue.

This leads me to believe that there may be some underlying DNS reconfiguration on my end, but the error was repeatedly reproducible.

@noamt I did not attempt to use 1.18.0 after experiencing the same formatting spec issue you did.

@anjakammer that's probably a question for the Flux maintainers. @johnrkriter we reverted to 1.18.0 in our deployments - I think 1.19.0 is a bust due to this bug, unfortunately.

I have a similar issue.

Flux sync --k8s-fwd-ns=flux not working.

Error: git repository ssh://[email protected]/XXX-YYY/ZZZ-123 is not ready to sync (status: unreachable)

Logs
ts=2020-05-04T11:39:43.248157504Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-05-04T11:41:24.77104663Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone606183439'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-05-04T11:44:43.248314115Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-05-04T11:44:43.248377503Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-05-04T11:46:24.771300069Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone606183439'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-05-04T11:49:43.248602614Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-05-04T11:49:43.249028249Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-05-04T11:51:24.771622319Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone606183439'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-05-04T11:54:43.24931619Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-05-04T11:54:43.249473129Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-05-04T11:56:24.772004282Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone606183439'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"
ts=2020-05-04T11:59:43.249723538Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-05-04T11:59:43.250083855Z caller=images.go:27 component=sync-loop msg="no automated workloads"
ts=2020-05-04T12:01:24.772483917Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone606183439'...\nssh: Could not resolve hostname github.com: Try again\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"

Additional context

Flux version: 1.19.0
Kubernetes version:
Client Version: v1.15.3
Server Version: v1.14.10-gke.27
Git provider: Github
Container registry provider: GCR

3042

@willholley @johnrkriter @noamt can yoi provide me the command to install 1.18.0?

I tried using below command -
helm upgrade -i flux fluxcd/flux --set git.[email protected]/XXX-YYY-k8s --set git.branch=local --set syncGarbageCollection.enabled=true --namespace flux --set helm.versions=v3 --set image.tag=1.18.0

but its installing 1.19.0. Can you tell me what mistake I am doing here?

I actually found on my end that this might not have been an issue with FluxCD. I found that my cluster wasnt being bootstrapped with the proper upstream DNS in CoreDNS.

I would suggest to anyone that they run these 2 commands and make sure they work properly. If not, then your issue may not be FluxCD but instead CoreDNS being unhealthy(not unstable).
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com

For people who is struggling with rolling back to 1.18.0. Do not change version tag of container, change helm chart version from 1.3.0 to 1.2.0 instead.

@johnrkriter is right. I tried that and it fixed my issue. Thanks for the help

I have the same issue using flex v 1.20.2

# fluxctl sync --k8s-fwd-ns flux
Error: can not connect to git repository with URL ssh://[email protected]/DawidIzydor/flux-get-started

Full error message: git clone --mirror: fatal: Could not read from remote repository., full output:
 Cloning into bare repository '/tmp/flux-gitclone124247822'...
ssh: Could not resolve hostname github.com: Try again
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Run 'fluxctl sync --help' for usage.

Simply deleting the flux pod (and letting the flux deployment create a new pod) resolved the issue for me.

$ kubectl -n flux delete pod -l app=flux
pod "flux-xxxxx-xxxxx" deleted

This problem seems to randomly crop up every few months and is a bit frustrating.

Was this page helpful?
0 / 5 - 0 ratings