I've seen reports of the dashboard throwing errors and otherwise not working for large numbers of objects.
Anecdotally, I have seen reports involving large numbers of pods and large numbers of events. It may behoove us to test with large numbers of objects, see which are problematic, and start creating issues and fixing them.
Any specific stacktrace or something to debug? The UI should work fine.
Somewhat similar: on my test cluster with a deployed 1.4.0 dashboard and 10k pods, the dashboard container crashes every time I request a page. I haven't looked much more into it, though.
@bryk I haven't tested it myself, but I heard it a few times (once on Slack and a few times at KubeCon), so I thought it worth investigating.
What @rf232 said is exactly what I'm talking about, but we need more info.
I can reproduce it in my env; I'll take a look at what actually happens.
Interesting. Please reply here with what was/is wrong.
I did a short investigation, and the container crashes with an OOM (Out Of Memory) kill:
2016-11-18T21:29:35.151956377Z container oom bfb8c28f5b70273f475a397bd8be3408f9dc256f33fbe3165940ba187b7a1253 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=20, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_b829007e)
2016-11-18T21:29:37.251943478Z container die bfb8c28f5b70273f475a397bd8be3408f9dc256f33fbe3165940ba187b7a1253 (exitCode=137, image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=20, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_b829007e)
2016-11-18T21:29:37.329794337Z container destroy f013e7a99b7b9513c97b7bb7e5cd9f15d297155e242754432c7c52687f8b7375 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=19, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_d7d1a95d)
2016-11-18T21:29:50.718773431Z container create f0149ffeab479c2620fe941e7cfa625bd8e6fe9bb12dd92cef5bdcc849310a26 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=21, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_4156d221)
2016-11-18T21:29:50.779910144Z container start f0149ffeab479c2620fe941e7cfa625bd8e6fe9bb12dd92cef5bdcc849310a26 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=21, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_4156d221)
Okay, apparently I had some limits set on my pod. After setting the limits a bit higher, I found that for 10k pods the dashboard needs ~200MB of memory.
When meeting with some folks yesterday they said they had problems even with 1000 pods. There are probably a number of issues related to memory, API timeouts, etc. for various API objects. I would leave this issue as pretty generic and solve each specific issue separately as we find them.
We're not enforcing any CPU/memory limits on the dashboard by default. They have to be applied externally, either by adjusting the yaml or by creating a LimitRange (a sketch follows below). We should add a note to the documentation saying that for a high number of resources in the cluster, the memory limits should be raised (if any are applied). 200Mi - 2Gi should be enough.
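For illustration, a minimal LimitRange sketch; the object name and the exact values are my assumptions, chosen from the 200Mi - 2Gi range above:

```yaml
# Illustrative LimitRange: containers in kube-system that declare no
# memory request/limit of their own get these defaults applied.
apiVersion: v1
kind: LimitRange
metadata:
  name: dashboard-memory   # hypothetical name
  namespace: kube-system
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 200Mi        # default request (lower end of the range above)
    default:
      memory: 2Gi          # default limit (upper end of the range above)
```

Note that a LimitRange applies to every container in the namespace that lacks its own values, not just the dashboard, so adjusting the dashboard yaml directly is the more targeted option.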
Can you get us a stacktrace or screenshot of some form? Or guide the folks to report bugs here? It'd help a lot.
I instructed them to create a bug but they were just evaluating Kubernetes. They may or may not post an issue.
I'll see if I can repro the issue at some point.
This issue about paging in the API may be worth following:
https://github.com/kubernetes/kubernetes/issues/2349
Even if we don't apply CPU/memory limits ourselves, in a way the hardware will impose them in the end. Perhaps we should find a way to handle that a bit more gracefully.
But even if we don't crash with a high number of objects, we get really slow. Some findings so far on that front: switch the API transport to protobuf, and fetch smaller, label-selector-filtered lists instead of full object lists (both are discussed further down this thread).
Given that this is a bit larger than a small fix, I'll remove this issue from the 1.5 project but keep it open.
Just wanted to confirm that my team and I are seeing this issue when running as few as 315 pods.
That's sad, @dgreene1.
Do you have any logs to confirm that these are OOMs? (The container's last termination state will show it; see the sketch after this comment.)
A short-term fix for this could be increasing the memory reservation for the UI. Can you try that?
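For reference, an OOM kill is visible in the pod's container status. A sketch of the relevant fragment of `kubectl get pod <dashboard-pod> -o yaml -n kube-system` output; the field names follow the v1 Pod schema, and the values mirror the docker events pasted earlier in this thread:

```yaml
# What to look for: lastState.terminated with reason OOMKilled.
status:
  containerStatuses:
  - name: kubernetes-dashboard
    restartCount: 21
    lastState:
      terminated:
        exitCode: 137      # 128 + SIGKILL(9), matching exitCode=137 above
        reason: OOMKilled
```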
@rf232 Yeah. gRPC might be a long-term stretch goal, but we'll still require paging to get around the memory issues, and there doesn't seem to be a way to request pages of data from the API server at the moment. The issue kubernetes/kubernetes#2349 in the kubernetes repo addresses this, but it doesn't look like it's been seriously considered for implementation yet.
@dgreene1 That seems in line with what I heard from folks I talked to.
As @bryk said, please provide as much info as you can so we can try to address the issue. What actually happens? Does the dashboard app crash with an OOM? Is some other kind of error happening?
BTW, I think setting ~200Mi as the limit for the dashboard is reasonable. If more is required, then users can raise it, but the current 50Mi may be too low.
For instance, fluentd gets 200Mi per node on GKE.
Do we need memory limits anyway? Can we set only a memory reservation and no limit? Or make the limit something like 500 megs.
It's a balance between giving the dashboard enough memory (when the API calls get too big it will time out anyway) and requesting too many resources from the cluster.
The best approach may be to give it a lowish ~200Mi request and a high 1Gi limit (or no limit), but then we risk being unfriendly to other pods on the same node.
@maciaszczykm Can we fix this by increasing the memory limits to O(hundreds) of megs? If you open Dashboard on any large cluster, it crashes.
@bryk Sure, we can do it. Are you thinking of any specific limit?
100Mi requests, 300Mi limits to start with?
And update this in https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dashboard/dashboard-controller.yaml
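For concreteness, a sketch of how those values would look in the container spec of that file (the surrounding controller fields are omitted, and only memory is shown, since no CPU values were proposed here):

```yaml
# Proposed starting point for the dashboard container's resources stanza.
containers:
- name: kubernetes-dashboard
  resources:
    requests:
      memory: 100Mi        # the "100Mi requests"
    limits:
      memory: 300Mi        # the "300Mi limits"
```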
The pull request is on core. Still, we have to fix the issues mentioned by @rf232 in https://github.com/kubernetes/dashboard/issues/1431#issuecomment-262742406.
The switch to using protobuf is done (for the relevant pages; only the yaml editor still uses JSON, but that is for single resources, so it is not worth the effort).
Smaller lists would require us to refactor how we build up pages, since right now we make all requests to the API in parallel; instead we would have to get the resource first, find its label selector, and then do a get to the backend with that label selector. I think this would require quite some work.
@rf232 Oh, I see now that the protobuf switch is done. I had not checked it before.
Yes, I am aware of it. The smaller-lists change is a good enhancement for the future, but right now we should focus on higher-priority issues, as this is a non-blocker IMO.
Let's track https://github.com/kubernetes/kubernetes/pull/44712 from here.