e2e tests are frequently failing on CI with error:
--- FAIL: TestKillOneMasterNode/Elasticsearch_endpoint_should_eventually_be_reachable#02 (300.00s)
    require.go:794:
        Error Trace: testutils.go:42
        Error:       Received unexpected error:
                     503 Service Unavailable:
        Test:        TestKillOneMasterNode/Elasticsearch_endpoint_should_eventually_be_reachable#02
FAIL
The cluster lacks CPU.
We are trying to deploy 4 Elasticsearch pods that each request 2 CPU on a k8s cluster with 3 nodes that each have 4 CPU.
After deploying the first 3 ES pods, the fourth is unschedulable. We cannot fit 2 ES pods on a single k8s node, because other pods (mostly in the kube-system namespace) already request some CPU (between 300m and 600m per k8s node).
> k kube-capacity -p
NODE NAMESPACE POD CPU REQUESTS CPU LIMITS MEMORY REQUESTS MEMORY LIMITS
* * * 7444m (63%) 10305m (87%) 7100Mi (10%) 8440Mi (12%)
c1-n1 * * 2491m (63%) 3013m (76%) 2267Mi (9%) 2617Mi (11%)
c1-n1 e2e test-failure-kill-one-master-node-es-b9z4qzl7ht 2000m (51%) 2000m (51%) 1907Mi (8%) 1907Mi (8%)
c1-n1 kube-system event-exporter-v0.2.3-85644fcdf-smjgm 0m (0%) 0m (0%) 0Mi (0%) 0Mi (0%)
c1-n1 kube-system fluentd-gcp-scaler-8b674f786-kvw4t 0m (0%) 0m (0%) 0Mi (0%) 0Mi (0%)
c1-n1 kube-system fluentd-gcp-v3.2.0-p2sw8 100m (2%) 1000m (25%) 200Mi (0%) 500Mi (2%)
c1-n1 kube-system kube-dns-76dbb796c5-6wq4z 260m (6%) 0m (0%) 110Mi (0%) 170Mi (0%)
c1-n1 kube-system kube-dns-autoscaler-67c97c87fb-z4s8m 20m (0%) 0m (0%) 10Mi (0%) 0Mi (0%)
c1-n1 kube-system kube-proxy-c1-n1 100m (2%) 0m (0%) 0Mi (0%) 0Mi (0%)
c1-n1 kube-system l7-default-backend-7ff48cffd7-2zv85 10m (0%) 10m (0%) 20Mi (0%) 20Mi (0%)
c1-n1 kube-system prometheus-to-sd-d2cks 1m (0%) 3m (0%) 20Mi (0%) 20Mi (0%)
c1-n2 * * 2301m (58%) 4003m (102%) 2147Mi (9%) 2527Mi (11%)
c1-n2 e2e test-failure-kill-one-master-node-es-85r6nn8w9q 2000m (51%) 2000m (51%) 1907Mi (8%) 1907Mi (8%)
c1-n2 elastic-system elastic-operator-0 100m (2%) 1000m (25%) 20Mi (0%) 100Mi (0%)
c1-n2 kube-system fluentd-gcp-v3.2.0-xn74z 100m (2%) 1000m (25%) 200Mi (0%) 500Mi (2%)
c1-n2 kube-system kube-proxy-c1-n2 100m (2%) 0m (0%) 0Mi (0%) 0Mi (0%)
c1-n2 kube-system prometheus-to-sd-2d2bn 1m (0%) 3m (0%) 20Mi (0%) 20Mi (0%)
c1-n3 * * 2652m (67%) 3289m (83%) 2686Mi (11%) 3296Mi (14%)
c1-n3 e2e test-failure-kill-one-master-node-es-rvsmx29ddx 2000m (51%) 2000m (51%) 1907Mi (8%) 1907Mi (8%)
c1-n3 kube-system fluentd-gcp-v3.2.0-hzj85 100m (2%) 1000m (25%) 200Mi (0%) 500Mi (2%)
c1-n3 kube-system heapster-v1.6.0-beta.1-586c879b55-v8mlf 138m (3%) 138m (3%) 294Mi (1%) 294Mi (1%)
c1-n3 kube-system kube-dns-76dbb796c5-zd4w2 260m (6%) 0m (0%) 110Mi (0%) 170Mi (0%)
c1-n3 kube-system kube-proxy-c1-n3 100m (2%) 0m (0%) 0Mi (0%) 0Mi (0%)
c1-n3 kube-system metrics-server-v0.2.1-fd596d746-v9nz6 53m (1%) 148m (3%) 154Mi (0%) 404Mi (1%)
c1-n3 kube-system prometheus-to-sd-vf88r 1m (0%) 3m (0%) 20Mi (0%)
Potential solutions:
@thbkrkr Thanks for the info. I will experiment with instance types that have more CPU for the tests. It looks like at the moment we are using instances with a lot of RAM and a small amount of CPU.
Bigger instances didn't help. I tried n1-highcpu-16 with 16 CPU per instance. At peak it used 4 CPU per instance out of 16 and I still got the same error.
I will try to increase the resources for the pod and check whether that helps.
I wasn't able to reproduce it locally, and it is not reproducible on CI right now either. I will close this issue next week if it does not show up again locally or on CI.
I think we can close this one.
Closing this one, as some improvements have been made to these tests.
Reopen if needed.