Dashboard: Extremely dangerous logging

Created on 1 Jan 2018  ·  32 Comments  ·  Source: kubernetes/dashboard

Environment
Dashboard version: 1.8.1
Kubernetes version: 1.9.0
Operating system: Linux
Steps to reproduce

Restarted Kubernetes master node, dashboard was running on another node.

Observed result

Dashboard starts logging the following 3 log lines in a very, VERY tight loop:

2018/01/01 12:20:47 Starting secret synchronizer for kubernetes-dashboard-key-holder in namespace kube-system
2018/01/01 12:20:47 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout
2018/01/01 12:20:47 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.

This was enough to quickly exhaust the free disk space on the node where the dashboard was running and bring the whole node down. :( The amount of this logging is unbelievable, close to 50MB/s?!

Expected result

Dashboard logging should not be nearly so aggressive. Logging the same lines hundreds or thousands of times per second is not useful at all.

Comments

Please, please, this is such dangerous behaviour; put time.Sleep(time.Second) wherever this loop is executing before even starting to contemplate how to fix/avoid it properly. I will be scared to run the dashboard until I see a fix for this, because if the master node even glitches again, I know I'm going to lose another node due to the dashboard going wild.
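(To make the suggestion above concrete: a minimal sketch, assuming a hypothetical synchronizer function, of the kind of restart loop being described and where the proposed sleep would go. This is illustrative only, not the Dashboard's actual code.)

```go
package main

import (
	"errors"
	"log"
	"time"
)

// watchSecrets stands in for the Dashboard's secret synchronizer; here it
// always fails, as it does while the apiserver is unreachable.
func watchSecrets() error {
	return errors.New("kubernetes-dashboard-key-holder-kube-system watch ended with timeout")
}

func main() {
	for {
		log.Println("Starting secret synchronizer for kubernetes-dashboard-key-holder in namespace kube-system")
		if err := watchSecrets(); err != nil {
			log.Printf("Synchronizer exited with error: %v", err)
		}
		log.Println("Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.")
		// The stop-gap suggested above: without this sleep the loop spins as
		// fast as the CPU allows and floods the node's disk; with it, a
		// persistent failure costs a few log lines per second at most.
		time.Sleep(time.Second)
	}
}
```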

Most helpful comment

Why is this issue closed if it is not resolved?

This might become a potentially severe problem for Google Cloud users once the Stackdriver pricing changes take effect; see https://cloud.google.com/stackdriver/pricing_v2#table

All 32 comments

I will add some delay before attempting to reconnect; however, there has to be some other issue that causes the synchronizer to time out. Either the connection to the apiserver is dropping, or Dashboard has no permissions to synchronize resources. After all, this is only a symptom of some other underlying issue.

Restarting the master node will of course make the apiserver unavailable for several minutes, but the dashboard continued having trouble even after the apiserver came back. Only restarting the dashboard pod brings it out of the vicious loop.

We ran into this recently with Kubernetes Dashboard 1.8.0 running on GKE 1.8.7. Kubernetes Dashboard had been in this state (spamming these log lines ~8 times per second) and filling up our log servers rapidly ever since a master upgrade. Deleting the Kubernetes Dashboard pod (restarting it) fixed the issue, so I believe there's some underlying condition from which it never recovers once it encounters an error.

We've upgraded to 1.8.3 to get the sleep noted in the linked PR. It may be helpful to implement an exponential sleep up to 90 seconds or so to further reduce log spam, because 2 seconds isn't a very long sleep during extended periods of failure.
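(A minimal sketch of the capped exponential backoff suggested above, assuming the 2-second base already in the Dashboard and a 90-second cap; the function name and values are illustrative, not the Dashboard's implementation.)

```go
package main

import (
	"fmt"
	"time"
)

// nextDelay doubles the retry delay after each consecutive failure and caps
// it, so a prolonged apiserver outage produces a handful of log lines per
// minute instead of a burst every 2 seconds.
func nextDelay(current time.Duration) time.Duration {
	const (
		base     = 2 * time.Second
		maxDelay = 90 * time.Second
	)
	if current < base {
		return base
	}
	if doubled := 2 * current; doubled < maxDelay {
		return doubled
	}
	return maxDelay
}

func main() {
	// Delay sequence a persistently failing synchronizer would see:
	// 2s, 4s, 8s, 16s, 32s, 64s, then capped at 90s thereafter.
	var d time.Duration
	for i := 0; i < 8; i++ {
		d = nextDelay(d)
		fmt.Println(d)
	}
}
```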

Same issue right now with a freshly installed Kubernetes 1.9.2 (via kubespray).
We've got 90k log records per minute (!)
Resolved with kubectl delete po [kubernetes-dashboard-....] -n [kube-system or whatever]

Still happening on Kubernetes 1.9.3 (via kubespray)

Also happened to us on 1.9.2; it caused 15GB of logs with our logging service, which we pay for by volume.

Same thing for us. Using GKE with
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.8-gke.0", GitCommit:"6e5b33a290a99c067003632e0fd6be0ead48b233", GitTreeState:"clean", BuildDate:"2018-02-16T18:26:58Z", GoVersion:"go1.8.3b4", Compiler:"gc", Platform:"linux/amd64"}

Totally exhausted my papertrail account logs in an hour.

It is all caused by using an outdated version of Dashboard.

Just to expand on the above – going from dashboard version 1.8.0 to 1.8.3 resolved this for me.

Still present on:

  • Kubernetes 1.10.0
  • dashboard v1.8.3

Restarting the dashboard indeed fixes the issue, hence "upgrading" seems to fix it.
Restarting the apiserver (or complete master nodes) causes this bug to appear.

Ah thank you @vdboor, you've just saved me from some expensive logging bills 👍

Yep, got it again:

Kubernetes: 1.9.4-gke.1
dashboard: v1.8.3

When was this fixed?

I deleted the old pod and it's working fine, but I just want to make sure it won't happen again.

My interpretation of @vdboor comments (above) is that this is not fixed:

Restarting the dashboard [i.e. deleting the pod] indeed fixes the issue, hence "upgrading" seems to fix it.
Restarting the apiserver (or complete master nodes) causes this bug to appear.

Why is this issue closed if it is not resolved?

This might become a potentially severe problem for Google Cloud users once the Stackdriver pricing changes take effect; see https://cloud.google.com/stackdriver/pricing_v2#table

Huh, I checked the Stackdriver billing section and noticed 1.9TB of logging caused by just one Kubernetes Dashboard. I'm happy that I found this before the Stackdriver pricing changes.

I see @floreks hasn't been mentioned since they closed this ticket. Perhaps the mention will bring some attention back to this...

This is a very dangerous issue indeed.

@adamcharnock @tuananh It seems, that @floreks fixed it in https://github.com/kubernetes/dashboard/commit/d64874dc725f9da5d8c121d200275015748804ee. This commit is available in version >1.8.2. Can you verify?

@adamcharnock @jwarpenius the core issue that was flooding the logs with thousands of messages every second has been fixed in v1.8.2. I have added a 2-second delay between every synchronizer restart. Of course, there is room for improvement here and we should force a restart after x failed synchronizer restarts; however, the main issue was fixed. You can add some log rotation based on file size to the nodes, and together with the introduced delay there should be no issues with high storage consumption.
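(For readers following along, a rough sketch of the behaviour described above: a fixed 2-second delay between synchronizer restarts, plus the suggested forced restart after a number of consecutive failures. The threshold and function names are hypothetical; this is not the Dashboard's actual code.)

```go
package main

import (
	"errors"
	"log"
	"time"
)

// runSynchronizer stands in for the Dashboard's secret synchronizer.
func runSynchronizer() error {
	return errors.New("watch ended with timeout")
}

func main() {
	const (
		retryDelay  = 2 * time.Second // the delay introduced in v1.8.2
		maxFailures = 10              // illustrative threshold for a forced restart
	)
	failures := 0
	for {
		if err := runSynchronizer(); err != nil {
			failures++
			log.Printf("Synchronizer exited with error: %v (failure %d/%d)", err, failures, maxFailures)
			if failures >= maxFailures {
				// Exiting here lets the kubelet restart the pod with fresh
				// state, which is what deleting the pod achieves manually.
				log.Fatal("too many consecutive synchronizer failures, restarting")
			}
			time.Sleep(retryDelay)
			continue
		}
		failures = 0
	}
}
```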

Thanks. I'm currently on 1.8.1. will update.

Thank you @floreks, and I'm sorry to keep on prolonging the conversation on this issue. However, it seems that @vdboor and @maresja1 were still encountering this issue in v1.8.3 (as per their above comments).

@vdboor & @maresja1 – could you confirm if you are still encountering this?

It has been a while for me and I cannot remember if a) I also encountered the problem in v1.8.3, or if b) I just took it on faith that the problem still existed.

I will retest this scenario once again on the latest release, but I am pretty sure that it should not be an issue anymore, unless 1-2 log lines every 2 seconds is considered dangerous logging 😄.

Thank you @floreks. FWIW, my experience is that the problem takes a while to present itself (hours? days?)

still happening on 1.8.3

Kubernetes: 1.11.2

Restarting kubernetes-dashboard fixed it for me too.

Now I'm seeing "2018/09/06 23:09:08 Metric client health check failed: the server could not find the requested resource (get services heapster). Retrying in 30 seconds.", which is weird since I don't have any reference to heapster in any YAMLs associated with kubernetes-dashboard, or any other YAMLs for that matter.

@jazoom Dashboard always tries to find and connect to Heapster. Ignore this message.

still happening
k8s: 1.11.3
dashboard: 1.10.0

Closing in favor of #3008.
/close

@floreks: Closing this issue.

In response to this:

Closing in favor of #3008.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I had the problem of this logging going crazy: those 2 lines kept appearing every 2 seconds in the logs (and I never really realized it until I was actively looking for CPU/memory logging on the dashboard and had to dig into it).

A scale down then scale up (basically, a restart) did the trick, but a retry mechanism would be nice.

It's already there in v2.
