Dashboard: Extremely dangerous logging

Created on 1 Jan 2018  ·  32 Comments  ·  Source: kubernetes/dashboard

Environment
Dashboard version: 1.8.1
Kubernetes version: 1.9.0
Operating system: Linux
Steps to reproduce

Restarted Kubernetes master node, dashboard was running on another node.

Observed result

Dashboard starts logging the following 3 log lines in a very, VERY tight loop:

2018/01/01 12:20:47 Starting secret synchronizer for kubernetes-dashboard-key-holder in namespace kube-system
2018/01/01 12:20:47 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout
2018/01/01 12:20:47 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.

This was enough to quickly exhaust the free disk space on the node where the dashboard was running and bring the whole node down. :( The amount of this logging is unbelievable, close to 50MB/s?!

Expected result

Dashboard logging should not be nearly so aggressive. Logging the same lines hundreds or thousands of times per second is not useful at all.

Comments

Please, please, this is such dangerous behaviour; put time.Sleep(time.Second) wherever this loop is executing before even starting to contemplate how to fix/avoid it properly. I will be scared to run the dashboard until I see a fix for this, because if the master node even glitches again, I know I'm going to lose another node due to the dashboard going wild.
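(To make the suggestion above concrete: a minimal sketch, assuming a hypothetical synchronizer function, of the kind of restart loop being described and where the proposed sleep would go. This is illustrative only, not the Dashboard's actual code.)

```go
package main

import (
	"errors"
	"log"
	"time"
)

// watchSecrets stands in for the Dashboard's secret synchronizer; here it
// always fails, as it does while the apiserver is unreachable.
func watchSecrets() error {
	return errors.New("kubernetes-dashboard-key-holder-kube-system watch ended with timeout")
}

func main() {
	for {
		log.Println("Starting secret synchronizer for kubernetes-dashboard-key-holder in namespace kube-system")
		if err := watchSecrets(); err != nil {
			log.Printf("Synchronizer exited with error: %v", err)
		}
		log.Println("Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.")
		// The stop-gap suggested above: without this sleep the loop spins as
		// fast as the CPU allows and floods the node's disk; with it, a
		// persistent failure costs a few log lines per second at most.
		time.Sleep(time.Second)
	}
}
```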

Most helpful comment

Why is this issue closed if it is not resolved?

This might become a potentially severe problem for Google Cloud users once the Stackdriver pricing changes take effect; see https://cloud.google.com/stackdriver/pricing_v2#table

All 32 comments

I will add some delay before attempting to reconnect; however, there has to be some other issue that causes the synchronizer to time out. Either the connection to the apiserver is dropping, or Dashboard has no permissions to synchronize resources. After all, this is only a symptom of some other underlying issue.

Restarting the master node will of course make the apiserver unavailable for several minutes, but the dashboard continued having trouble even after the apiserver came back. Only restarting the dashboard pod brings it out of the vicious loop.

We ran into this recently with Kubernetes Dashboard 1.8.0 running on GKE 1.8.7. Kubernetes Dashboard had been in this state (spamming these log lines ~8 times per second) and filling up our log servers rapidly ever since a master upgrade. Deleting the Kubernetes Dashboard pod (restarting it) fixed the issue, so I believe there's some underlying condition from which it never recovers once it encounters an error.

We've upgraded to 1.8.3 to get the sleep noted in the linked PR. It may be helpful to implement an exponential sleep up to 90 seconds or so to further reduce log spam, because 2 seconds isn't a very long sleep during extended periods of failure.
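(A minimal sketch of the capped exponential backoff suggested above, assuming the 2-second base already in the Dashboard and a 90-second cap; the function name and values are illustrative, not the Dashboard's implementation.)

```go
package main

import (
	"fmt"
	"time"
)

// nextDelay doubles the retry delay after each consecutive failure and caps
// it, so a prolonged apiserver outage produces a handful of log lines per
// minute instead of a burst every 2 seconds.
func nextDelay(current time.Duration) time.Duration {
	const (
		base     = 2 * time.Second
		maxDelay = 90 * time.Second
	)
	if current < base {
		return base
	}
	if doubled := 2 * current; doubled < maxDelay {
		return doubled
	}
	return maxDelay
}

func main() {
	// Delay sequence a persistently failing synchronizer would see:
	// 2s, 4s, 8s, 16s, 32s, 64s, then capped at 90s thereafter.
	var d time.Duration
	for i := 0; i < 8; i++ {
		d = nextDelay(d)
		fmt.Println(d)
	}
}
```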

Same issue right now with a freshly installed Kubernetes 1.9.2 (via kubespray).
We've got 90k log records per minute (!)
Resolved with kubectl delete po [kubernetes-dashboard-....] -n [kube-system or whatever]

Still happening on Kubernetes 1.9.3 (via kubespray)

Also happened to us on 1.9.2; it caused 15GB of logs with our logging service, which we pay for by volume.

Same thing for us. Using GKE with
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.8-gke.0", GitCommit:"6e5b33a290a99c067003632e0fd6be0ead48b233", GitTreeState:"clean", BuildDate:"2018-02-16T18:26:58Z", GoVersion:"go1.8.3b4", Compiler:"gc", Platform:"linux/amd64"}

Totally exhausted my papertrail account logs in an hour.

It is all caused by using an outdated version of Dashboard.

Just to expand on the above – going from dashboard version 1.8.0 to 1.8.3 resolved this for me.

Still present on:

  • Kubernetes 1.10.0
  • dashboard v1.8.3

Restarting the dashboard indeed fixes the issue, hence "upgrading" seems to fix it.
Restarting the apiserver (or complete master nodes) causes this bug to appear.

Ah thank you @vdboor, you've just saved me from some expensive logging bills 👍

Yep, got it again:

Kubernetes: 1.9.4-gke.1
dashboard: v1.8.3

When was this fixed?

I deleted the old pod and it's working fine, but I just want to make sure it won't happen again.

My interpretation of @vdboor comments (above) is that this is not fixed:

Restarting the dashboard [i.e. deleting the pod] indeed fixes the issue, hence "upgrading" seems to fix it.
Restarting the apiserver (or complete master nodes) causes this bug to appear.

Why is this issue closed if it is not resolved?

This might become a potentially severe problem for Google Cloud users once the Stackdriver pricing changes take effect; see https://cloud.google.com/stackdriver/pricing_v2#table

Huh, I checked the Stackdriver billing section and noticed 1.9TB of logging caused by just one Kubernetes Dashboard. I'm happy that I found this before the Stackdriver pricing changes.

I see @floreks hasn't been mentioned since they closed this ticket. Perhaps the mention will bring some attention back to this...

This is a very dangerous issue indeed.

@adamcharnock @tuananh It seems, that @floreks fixed it in https://github.com/kubernetes/dashboard/commit/d64874dc725f9da5d8c121d200275015748804ee. This commit is available in version >1.8.2. Can you verify?

@adamcharnock @jwarpenius the core issue that was flooding the logs with thousands of messages every second has been fixed in v1.8.2. I have added a 2-second delay between every synchronizer restart. Of course, there is room for improvement here and we should force a restart after x failed synchronizer restarts; however, the main issue was fixed. You can add some log rotation based on file size to the nodes, and together with the introduced delay there should be no issues with high storage consumption.
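(For readers following along, a rough sketch of the behaviour described above: a fixed 2-second delay between synchronizer restarts, plus the suggested forced restart after a number of consecutive failures. The threshold and function names are hypothetical; this is not the Dashboard's actual code.)

```go
package main

import (
	"errors"
	"log"
	"time"
)

// runSynchronizer stands in for the Dashboard's secret synchronizer.
func runSynchronizer() error {
	return errors.New("watch ended with timeout")
}

func main() {
	const (
		retryDelay  = 2 * time.Second // the delay introduced in v1.8.2
		maxFailures = 10              // illustrative threshold for a forced restart
	)
	failures := 0
	for {
		if err := runSynchronizer(); err != nil {
			failures++
			log.Printf("Synchronizer exited with error: %v (failure %d/%d)", err, failures, maxFailures)
			if failures >= maxFailures {
				// Exiting here lets the kubelet restart the pod with fresh
				// state, which is what deleting the pod achieves manually.
				log.Fatal("too many consecutive synchronizer failures, restarting")
			}
			time.Sleep(retryDelay)
			continue
		}
		failures = 0
	}
}
```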

Thanks. I'm currently on 1.8.1. will update.

Thank you @floreks, and I'm sorry to keep on prolonging the conversation on this issue. However, it seems that @vdboor and @maresja1 were still encountering this issue in v1.8.3 (as per their above comments).

@vdboor & @maresja1 – could you confirm if you are still encountering this?

It has been a while for me and I cannot remember if a) I also encountered the problem in v1.8.3, or if b) I just took it on faith that the problem still existed.

I will retest this scenario once again on the latest release, but I am pretty sure that it should not be an issue anymore, unless 1-2 log lines every 2 seconds is considered dangerous logging 😄.

Thank you @floreks. FWIW, my experience is that the problem takes a while to present itself (hours? days?)

still happening on 1.8.3

Kubernetes: 1.11.2

Restarting kubernetes-dashboard fixed it for me too.

Now I'm seeing "2018/09/06 23:09:08 Metric client health check failed: the server could not find the requested resource (get services heapster). Retrying in 30 seconds.", which is weird since I don't have any reference to heapster in any YAMLs associated with kubernetes-dashboard, or any other YAMLs for that matter.

@jazoom Dashboard always tries to find and connect to Heapster. Ignore this message.

still happening
k8s: 1.11.3
dashboard: 1.10.0

Closing in favor of #3008.
/close

@floreks: Closing this issue.

In response to this:

Closing in favor of #3008.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I had the problem of this logging going crazy: those 2 lines kept appearing every 2 seconds in the logs (and I never really realized it until I was actively looking for CPU/memory logging on the dashboard and had to dig into it).

A scale down then scale up (basically, a restart) did the trick, but a retry mechanism would be nice.

It's already there in v2.
