I've seen reports of the dashboard throwing errors and otherwise not working for large numbers of objects.
Anecdotally, I have seen reports involving large numbers of pods and large numbers of events. It may behoove us to test with large numbers of objects, see which are problematic, and start creating issues and fixing them.
Any specific stacktrace or something to debug? The UI should work fine.
Somewhat similar: on my test cluster with a deployed 1.4.0 dashboard and 10k pods, the dashboard container crashes every time I request a page. I haven't looked much more into it, though.
@bryk I haven't tested it myself, but I heard it a few times (once on Slack and a few times at KubeCon), so I thought it worth investigating.
What @rf232 said is exactly what I'm talking about, but we need more info.
I can reproduce it in my env; I'll take a look at what actually happens.
Interesting. Please reply here with what was/is wrong.
I did a short investigation, and the container crashes with an OOM (Out Of Memory) kill:
2016-11-18T21:29:35.151956377Z container oom bfb8c28f5b70273f475a397bd8be3408f9dc256f33fbe3165940ba187b7a1253 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=20, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_b829007e)
2016-11-18T21:29:37.251943478Z container die bfb8c28f5b70273f475a397bd8be3408f9dc256f33fbe3165940ba187b7a1253 (exitCode=137, image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=20, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_b829007e)
2016-11-18T21:29:37.329794337Z container destroy f013e7a99b7b9513c97b7bb7e5cd9f15d297155e242754432c7c52687f8b7375 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=19, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_d7d1a95d)
2016-11-18T21:29:50.718773431Z container create f0149ffeab479c2620fe941e7cfa625bd8e6fe9bb12dd92cef5bdcc849310a26 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=21, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_4156d221)
2016-11-18T21:29:50.779910144Z container start f0149ffeab479c2620fe941e7cfa625bd8e6fe9bb12dd92cef5bdcc849310a26 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=21, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_4156d221)
Okay, apparently I had some limits set on my pod. After setting the limits a bit higher, I found that for 10k pods the dashboard needs ~200MB of memory.
When meeting with some folks yesterday they said they had problems even with 1000 pods. There are probably a number of issues related to memory, API timeouts, etc. for various API objects. I would leave this issue as pretty generic and solve each specific issue separately as we find them.
We're not enforcing any CPU/memory limits on the dashboard by default. They have to be applied externally, either by adjusting the yaml or by creating a LimitRange (a sketch follows below). We should add a note to the documentation saying that for a high number of resources in the cluster, the memory limits should be raised (if any are applied). 200Mi - 2Gi should be enough.
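For illustration, a minimal LimitRange sketch; the object name and the exact values are my assumptions, chosen from the 200Mi - 2Gi range above:

```yaml
# Illustrative LimitRange: containers in kube-system that declare no
# memory request/limit of their own get these defaults applied.
apiVersion: v1
kind: LimitRange
metadata:
  name: dashboard-memory   # hypothetical name
  namespace: kube-system
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 200Mi        # default request (lower end of the range above)
    default:
      memory: 2Gi          # default limit (upper end of the range above)
```

Note that a LimitRange applies to every container in the namespace that lacks its own values, not just the dashboard, so adjusting the dashboard yaml directly is the more targeted option.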
Can you get us a stacktrace or screenshot of some form? Or guide the folks to report bugs here? It'd help a lot.
I instructed them to create a bug but they were just evaluating Kubernetes. They may or may not post an issue.
I'll see if I can repro the issue at some point.
This issue about paging in the API may be worth following:
https://github.com/kubernetes/kubernetes/issues/2349
Even if we don't apply CPU/memory limits ourselves, in a way the hardware will impose them in the end. Perhaps we should find a way to handle that a bit more gracefully.
But even if we don't crash with a high number of objects, we get really slow. Some findings so far on that front: switch the API transport to protobuf, and fetch smaller, label-selector-filtered lists instead of full object lists (both are discussed further down this thread).
Given that this is a bit larger than a small fix, I'll remove this issue from the 1.5 project but keep it open.
Just wanted to confirm that my team and I are seeing this issue when running as few as 315 pods.
That's sad, @dgreene1.
Do you have any logs to confirm that these are OOMs? (The container's last termination state will show it; see the sketch after this comment.)
A short-term fix for this could be increasing the memory reservation for the UI. Can you try that?
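For reference, an OOM kill is visible in the pod's container status. A sketch of the relevant fragment of `kubectl get pod <dashboard-pod> -o yaml -n kube-system` output; the field names follow the v1 Pod schema, and the values mirror the docker events pasted earlier in this thread:

```yaml
# What to look for: lastState.terminated with reason OOMKilled.
status:
  containerStatuses:
  - name: kubernetes-dashboard
    restartCount: 21
    lastState:
      terminated:
        exitCode: 137      # 128 + SIGKILL(9), matching exitCode=137 above
        reason: OOMKilled
```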
@rf232 Yeah. gRPC might be a long-term stretch goal, but we'll still require paging to get around the memory issues, and there doesn't seem to be a way to request pages of data from the API server at the moment. The issue kubernetes/kubernetes#2349 in the kubernetes repo addresses this, but it doesn't look like it's been seriously considered for implementation yet.
@dgreene1 That seems in line with what I heard from folks I talked to.
As @bryk said, please provide as much info as you can so we can try to address the issue. What actually happens? Does the dashboard app crash with an OOM? Is some other kind of error happening?
BTW, I think setting ~200Mi as the limit for the dashboard is reasonable. If more is required, then users can raise it, but the current 50Mi may be too low.
For instance, fluentd gets 200Mi per node on GKE.
Do we need memory limits anyway? Can we set only a memory reservation and no limit? Or make the limit something like 500 megs.
It's a balance between giving the dashboard enough memory (when the API calls get too big it will time out anyway) and requesting too many resources from the cluster.
The best approach may be to give it a lowish ~200Mi request and a high 1Gi limit (or no limit), but then we risk being unfriendly to other pods on the same node.
@maciaszczykm Can we fix this by increasing the memory limits to O(hundreds) of megs? If you open Dashboard on any large cluster, it crashes.
@bryk Sure, we can do it. Are you thinking of any specific limit?
100Mi requests, 300Mi limits to start with?
And update this in https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dashboard/dashboard-controller.yaml
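For concreteness, a sketch of how those values would look in the container spec of that file (the surrounding controller fields are omitted, and only memory is shown, since no CPU values were proposed here):

```yaml
# Proposed starting point for the dashboard container's resources stanza.
containers:
- name: kubernetes-dashboard
  resources:
    requests:
      memory: 100Mi        # the "100Mi requests"
    limits:
      memory: 300Mi        # the "300Mi limits"
```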
The pull request is on core. Still, we have to fix the issues mentioned by @rf232 in https://github.com/kubernetes/dashboard/issues/1431#issuecomment-262742406.
The switch to using protobuf is done (for the relevant pages; only the yaml editor still uses JSON, but that is for single resources, so it is not worth the effort).
Smaller lists would require us to refactor how we build up pages, since right now we make all requests to the API in parallel; instead we would have to get the resource first, find its label selector, and then do a get to the backend with that label selector. I think this would require quite some work.
@rf232 Oh, I see now that the protobuf switch is done. I had not checked it before.
Yes, I am aware of it. The smaller-lists change is a good enhancement for the future, but right now we should focus on higher-priority issues, as this is a non-blocker IMO.
Let's track https://github.com/kubernetes/kubernetes/pull/44712 from here.