senlu@senlu:~/work/src/k8s.io/test-infra/prow$ kubectl get po -n=test-pods -a | grep OOM | wc -l
50
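For digging into a specific failure, something like the lines below shows whether the container was OOM killed and what memory requests/limits (if any) it had; the pod name is a placeholder, and this just greps the usual kubectl describe output:

POD=some-oomkilled-pod   # hypothetical pod name
kubectl describe po "$POD" -n=test-pods | grep -E -A2 'State|Limits|Requests'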
We probably want to set a memory limit for each job: regular e2e jobs use ~1Gi, but a bazel job can eat up ~7Gi.
"sacrifice child!", said by the node
/area prow
/area jobs
/assign
/assign @BenTheElder @cjwagner
https://kubernetes.io/docs/tasks/administer-cluster/memory-default-namespace/
We should probably add some default configuration for this, as well as look at configuring some of the more intensive jobs (jobs with builds) to request more memory.
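For reference, the approach in that doc is a LimitRange on the namespace. A minimal sketch of what that could look like for test-pods (the name and numbers here are only illustrative, not a recommendation):

kubectl apply -n test-pods -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-mem-per-container   # hypothetical name
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 1Gi   # default request for containers that don't set one
    default:
      memory: 7Gi   # default limit; illustrative only
EOF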
FYI, current status:
senlu@senlu:~/work/src/k8s.io/test-infra/prow$ kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-prow-default-pool-42819f20-bg3z 3605m 91% 6648Mi 53%
gke-prow-default-pool-42819f20-2jmh 3532m 90% 8902Mi 71%
gke-prow-default-pool-42819f20-1rjm 3249m 82% 7587Mi 61%
gke-prow-default-pool-42819f20-c81m 3342m 85% 6546Mi 52%
gke-prow-default-pool-42819f20-hmx1 2577m 65% 10963Mi 88%
gke-prow-default-pool-42819f20-nlk6 3290m 83% 10185Mi 82%
gke-prow-default-pool-42819f20-z1v4 80m 2% 5811Mi 46%
gke-prow-default-pool-42819f20-frsc 2323m 59% 12538Mi 101%
gke-prow-default-pool-42819f20-nh4x 345m 8% 6499Mi 52%
gke-prow-default-pool-42819f20-v12h 1151m 29% 8257Mi 66%
gke-prow-default-pool-42819f20-pbpz 3577m 91% 4425Mi 35%
gke-prow-default-pool-42819f20-8lxf 3180m 81% 7657Mi 61%
gke-prow-default-pool-42819f20-4j5c 530m 13% 10118Mi 81%
gke-prow-default-pool-42819f20-pl3m 246m 6% 8003Mi 64%
gke-prow-default-pool-42819f20-25fm 272m 6% 8132Mi 65%
gke-prow-default-pool-42819f20-4nvp 3381m 86% 9984Mi 80%
gke-prow-default-pool-42819f20-nwd2 80m 2% 6970Mi 56%
gke-prow-default-pool-42819f20-m0wk 98m 2% 6385Mi 51%
gke-prow-default-pool-42819f20-2l32 2671m 68% 9209Mi 74%
gke-prow-default-pool-42819f20-4dc6 212m 5% 8109Mi 65%
gke-prow-default-pool-42819f20-j3b8 283m 7% 9262Mi 74%
gke-prow-default-pool-42819f20-kh1l 2959m 75% 10672Mi 86%
gke-prow-default-pool-42819f20-dghc 108m 2% 3497Mi 28%
gke-prow-default-pool-42819f20-28z7 114m 2% 8246Mi 66%
gke-prow-default-pool-42819f20-2vp5 993m 25% 5592Mi 45%
gke-prow-default-pool-42819f20-pc5n 2990m 76% 12065Mi 97%
gke-prow-default-pool-42819f20-cmh7 188m 4% 7191Mi 57%
gke-prow-default-pool-42819f20-4kjl 3388m 86% 5398Mi 43%
gke-prow-default-pool-42819f20-7kk1 3958m 100% 6922Mi 55%
gke-prow-default-pool-42819f20-3snv 3367m 85% 7845Mi 63%
gke-prow-default-pool-42819f20-b8d8 2742m 69% 11205Mi 90%
gke-prow-default-pool-42819f20-wz3r 3215m 82% 7935Mi 63%
gke-prow-default-pool-42819f20-spc8 1101m 28% 11321Mi 91%
gke-prow-default-pool-42819f20-jvtd 1755m 44% 7266Mi 58%
gke-prow-default-pool-42819f20-svsn 172m 4% 1414Mi 11%
gke-prow-default-pool-42819f20-q3k3 3350m 85% 2498Mi 20%
gke-prow-default-pool-42819f20-xk8f 394m 10% 7162Mi 57%
gke-prow-default-pool-42819f20-qrzf 284m 7% 992Mi 7%
gke-prow-default-pool-42819f20-l0zx 1406m 35% 2144Mi 17%
gke-prow-default-pool-42819f20-sxtk 2679m 68% 3301Mi 26%
gke-prow-default-pool-42819f20-d2xp 2383m 60% 1113Mi 8%
gke-prow-default-pool-42819f20-t779 404m 10% 2483Mi 20%
gke-prow-default-pool-42819f20-s4gf 2735m 69% 10729Mi 86%
gke-prow-default-pool-42819f20-v9zm 3257m 83% 8000Mi 64%
gke-prow-default-pool-42819f20-m58t 3288m 83% 8708Mi 70%
gke-prow-default-pool-42819f20-xf8k 265m 6% 10336Mi 83%
gke-prow-default-pool-42819f20-wn4n 78m 1% 8999Mi 72%
gke-prow-default-pool-42819f20-2bsd 198m 5% 6057Mi 48%
gke-prow-default-pool-42819f20-mpp3 287m 7% 8714Mi 70%
gke-prow-default-pool-42819f20-t5rd 3283m 83% 8523Mi 68%
gke-prow-default-pool-42819f20-6r8w 2457m 62% 8427Mi 67%
gke-prow-default-pool-42819f20-4tkh 316m 8% 2865Mi 23%
gke-prow-default-pool-42819f20-g532 2398m 61% 9238Mi 74%
gke-prow-default-pool-42819f20-7768 3483m 88% 4516Mi 36%
gke-prow-default-pool-42819f20-zs96 1856m 47% 8527Mi 68%
gke-prow-default-pool-42819f20-34vx 2311m 58% 1334Mi 10%
gke-prow-default-pool-42819f20-9xfn 89m 2% 7234Mi 58%
gke-prow-default-pool-42819f20-kt11 3741m 95% 5641Mi 45%
gke-prow-default-pool-42819f20-kwsv 68m 1% 7879Mi 63%
gke-prow-default-pool-42819f20-02sl 140m 3% 2672Mi 21%
gke-prow-default-pool-42819f20-vw7s 3874m 98% 10005Mi 80%
gke-prow-default-pool-42819f20-1rh9 3134m 79% 9354Mi 75%
gke-prow-default-pool-42819f20-27rp 3233m 82% 8003Mi 64%
gke-prow-default-pool-42819f20-5t9b 3531m 90% 3639Mi 29%
gke-prow-default-pool-42819f20-qqgc 3997m 101% 9240Mi 74%
gke-prow-default-pool-42819f20-fptg 407m 10% 6639Mi 53%
gke-prow-default-pool-42819f20-sx26 3357m 85% 9818Mi 79%
gke-prow-default-pool-42819f20-8h86 798m 20% 7891Mi 63%
gke-prow-default-pool-42819f20-vj85 131m 3% 9127Mi 73%
gke-prow-default-pool-42819f20-pzzv 84m 2% 6506Mi 52%
gke-prow-default-pool-42819f20-4kqg 693m 17% 8760Mi 70%
gke-prow-default-pool-42819f20-vw5l 2132m 54% 11611Mi 93%
gke-prow-default-pool-42819f20-cw5p 231m 5% 8284Mi 66%
gke-prow-default-pool-42819f20-ls8z 2102m 53% 10194Mi 82%
gke-prow-default-pool-42819f20-wwgr 264m 6% 7205Mi 58%
gke-prow-default-pool-42819f20-j8sq 2737m 69% 9195Mi 74%
gke-prow-default-pool-42819f20-p3jz 290m 7% 6655Mi 53%
gke-prow-default-pool-42819f20-j4bw 3354m 85% 10976Mi 88%
gke-prow-default-pool-42819f20-96n1 3514m 89% 4960Mi 39%
gke-prow-default-pool-42819f20-pv9n 3547m 90% 8574Mi 69%
gke-prow-default-pool-42819f20-z495 3406m 86% 9855Mi 79%
gke-prow-default-pool-42819f20-bnfq 2858m 72% 9878Mi 79%
gke-prow-default-pool-42819f20-bv6m 2408m 61% 3066Mi 24%
gke-prow-default-pool-42819f20-lg65 3585m 91% 8852Mi 71%
gke-prow-default-pool-42819f20-d4cf 1943m 49% 9798Mi 78%
gke-prow-default-pool-42819f20-zh2h 377m 9% 6144Mi 49%
gke-prow-default-pool-42819f20-jgn8 110m 2% 6962Mi 56%
gke-prow-default-pool-42819f20-rl9t 107m 2% 2331Mi 18%
gke-prow-default-pool-42819f20-0q1j 3154m 80% 11291Mi 90%
gke-prow-default-pool-42819f20-r9z8 96m 2% 6836Mi 55%
you can get pod-level stats by doing something like
gcloud compute ssh gke-prow-default-pool-42819f20-zs96 --project=k8s-prow-builds -- curl localhost:10255/stats/summary
I'll try to write a script to collect some data.
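A rough sketch of what such a script might do: loop the kubelet read-only summary endpoint over every node and keep the per-container memory numbers. It assumes jq is installed locally and that per-container working set shows up as .memory.workingSetBytes in the summary; adjust if the field names differ.

for node in $(kubectl get nodes -o name | cut -d/ -f2); do
  gcloud compute ssh "$node" --project=k8s-prow-builds -- curl -s localhost:10255/stats/summary \
    | jq -r --arg node "$node" '.pods[] | .podRef.name as $pod | .containers[]? | [$node, $pod, .name, .memory.workingSetBytes] | @tsv'
done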
This seems to still be a problem. #5457 was deployed before the weekend, right?
$ kubectl get po -n=test-pods -a | grep "OOMKilled" | wc -l
56
@cjwagner that adds a lower bound via the default memory request; we still need to add an upper bound for build jobs.
It seems we already have the resources fields in type.go; I'll add them for the jobs that use bazel build.
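If the job's spec is the usual Kubernetes container spec, the stanza should look roughly like this; the job name, image, and numbers below are placeholders rather than the real config, echoed via a heredoc just to show the shape:

cat <<'EOF'
- name: pull-test-infra-bazel            # hypothetical job entry
  spec:
    containers:
    - image: gcr.io/k8s-testimages/bazelbuild:latest   # placeholder image
      resources:
        requests:
          memory: 4Gi
        limits:
          memory: 7Gi
EOF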
I've been monitoring this ad hoc in the background with a small script around kubectl that I had lying around from previous issues, and we're still looking at:
2017-11-14 05:53:37.007705: {'Completed': 235, 'Error': 131, 'Evicted': 2, 'OOMKilled': 74, 'Pending': 63, 'Running': 352}
(Edit: updated to UTC)
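For reference, a tally like the one above can be reproduced with a one-liner along these lines (assuming the STATUS column is the third field of kubectl get po):

kubectl get po -n=test-pods -a --no-headers | awk '{print $3}' | sort | uniq -c | sort -rn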
Trying to turn on some n1-highmem-4 instances. Also, it seems the prow-build cluster only lets me choose 1.7.8-gke0 rather than 1.8.x?
edit: well, all the nodes in the cluster have to be the same version
Status update: with 25 highmem nodes it seems there are no more backlogged pods (for now), but in the meantime we might still want other solutions, e.g. https://stackoverflow.com/questions/37312581/can-i-release-some-memory-of-a-running-docker-container-on-the-fly/37312820#37312820?
Going to add this to kubetest today; I tested that out last night on some GPU jobs with kubectl exec and it seems to work fine.
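Assuming this is the kernel cache-dropping trick from that StackOverflow answer, the kubectl exec invocation would look something like the line below; POD and CONTAINER are placeholders, it only frees page cache, and writing to /proc/sys/vm/drop_caches generally needs a privileged container:

kubectl exec -n=test-pods "$POD" -c "$CONTAINER" -- sh -c 'sync && echo 1 > /proc/sys/vm/drop_caches'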
2017-11-14 19:49:13.335810: {'Completed': 353, 'ContainerCreating': 1, 'Error': 161, 'Evicted': 2, 'OOMKilled': 2, 'Running': 302}
Much better, but not perfect.
2017-11-14 23:30:02.823312: {'Completed': 238, 'Error': 112, 'Evicted': 2, 'Running': 209}
We're in a pretty good place for the moment.
Memory is definitely still something we need to tune: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/test-infra/5585/pull-test-infra-bazel/10242/
I'm planning to try to increase our build cluster capacity, which should help a bit, but we also need to be making reasonably accurate resource requests where possible.
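One quick way to sanity-check requests against actual usage is to look at live pod consumption; this assumes the memory column is the third field and prints in Mi, as in the node table above:

kubectl top pods -n=test-pods | tail -n +2 | sort -k3 -rn | head -n 20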
https://k8s-testgrid.appspot.com/sig-release-1.9-all#gce-1.8-1.9-upgrade-cluster-skew
I think that's still happening; the pod is rescheduled right away, so we won't catch the state.
We need to tighten up the resource requests and flip on that new kubetest flag, but I'm also going to improve the node sizes O(Friday).
@BenTheElder don't worry about this too much :-) 🦃 first
Noting: we did improve the build cluster sizing quite a bit; much more detail is in the cross-referenced issue above (#5700).
This has been improved: I've migrated to a 50/50 split between n1-highmem-8 and n1-standard-4 nodes, so more of our nodes can handle the large tasks we've moved from Jenkins, and I've also increased the total number of nodes a bit.
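For posterity, adding a highmem pool to the build cluster is roughly the command below; the pool name, node count, and zone are placeholders (the cluster and project names come from the output earlier in this thread):

gcloud container node-pools create highmem-pool --cluster=prow --project=k8s-prow-builds --zone=us-central1-f --machine-type=n1-highmem-8 --num-nodes=25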
shall we close this as well?
/close