Flux: Intermittent errors from Flux during synchronisation

Created on 27 Jul 2020 · 4Comments · Source: fluxcd/flux

Describe the bug

We are seeing intermittent errors from Flux when it synchronises manifests. It happens for <5% of the manifests (see the below graph from Prometheus.

We don't have a consistent way to reproduce it, when it happens the following messages are logged:

ts=2020-07-27T10:00:00.771306649Z caller=sync.go:606 method=Sync cmd="kubectl apply -f -" took=268.630315ms err="running kubectl: " output=

Here is also a graph from Prometheus showing the flakiness.
flux-sync-errors

Additional context

Flux version: 1.20.0
Kubernetes version: 1.17

We believe the underlying Go error is hidden by the error wrapping here https://github.com/fluxcd/flux/blob/master/pkg/cluster/kubernetes/sync.go#L601-L604 and adding more information to the error would help us understand why Flux intermittently is failing for some manifests.

blocked-needs-validation bug

Source

nwv-danvulpe

👍2

All 4 comments

We've upgrade late last week to 1.20.2 and we've not seen any intermittency since.
This could be related to https://github.com/fluxcd/flux/pull/3246 (we apply CRDs with flux).

I believe we can close this issue. If any intermittency happens again we will have access to the kubectl exit code from the logs and we could investigate more.

nwv-danvulpe on 24 Aug 2020

Spoke too soon, apparently. The flakiness is back and now we have more info from the logs.
It took about five days since the upgrade of Flux for the intermittent errors to start showing up.

method=Sync cmd="kubectl apply -f -" took=223.731064ms err="running kubectl: signal: killed, stderr: "

We have flux deployed with a memory limit of 1G and a cpu limit of 1. Occasionally it can get OOMKilled, but normally it consumes less than 100MB of memory.
The time where the above failure was reported it was rather quiet from a release point of view.

Looking further on the node where flux runs, we've found that kubectl was killed by oom killer:

[1067114.007864] Tasks state (memory values in pages):
[1067114.008887] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[1067114.010761] [  25319]     0 25319      256        1    32768        0          -998 pause
[1067114.012508] [   2589]     0  2589      195        1    40960        0           866 tini
[1067114.014220] [   2630]     0  2630   325689   117606  1273856        0           866 fluxd
[1067114.015993] [   9976]     0  9976     1075      548    49152        0           866 git
[1067114.017696] [   9977]     0  9977     1083      735    49152        0           866 ssh
[1067114.019388] [  10010]     0 10010    53463    27563   303104        0           866 kubectl
[1067114.021185] Memory cgroup out of memory: Kill process 2630 (fluxd) score 1315 or sacrifice child
[1067114.023133] Killed process 10010 (kubectl) total-vm:213852kB, anon-rss:81160kB, file-rss:29092kB, shmem-rss:0kB
[1067114.026650] oom_reaper: reaped process 10010 (kubectl), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB```

We will try and increase the memory and see if it happens again.

nwv-danvulpe on 26 Aug 2020

👍1

Flux was quite stable over the past two weeks and seems that for our usage of it the 2GB max memory limit has made the difference.

Closing as it's not really a bug, more a resource limit problem.

nwv-danvulpe on 10 Sep 2020

Had the same issue. 1.20.2 had better error messages as it logged out the signal not just the empty stderr.

Adding more resources fixed the problem.