Test-infra: Pods are OOMKilling themselves

Created on 10 Nov 2017 · 22 comments · Source: kubernetes/test-infra

senlu@senlu:~/work/src/k8s.io/test-infra/prow$ kubectl get po -n=test-pods -a | grep OOM | wc -l
50

We probably want to set a memory limit for each job: regular e2e jobs use ~1Gi, but a bazel job can eat up ~7Gi.

"sacrifice child!", said by the node

/area prow
/area jobs

area/jobs area/prow kind/bug

All 22 comments

/assign
/assign @BenTheElder @cjwagner

https://kubernetes.io/docs/tasks/administer-cluster/memory-default-namespace/

We should probably add some configuration for this, and also look at configuring some of the more intensive jobs (the ones that do builds) to request more.
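
For reference, the doc above does this with a LimitRange on the namespace. A minimal sketch of what that could look like for test-pods, with purely illustrative numbers (not what was actually deployed):

kubectl apply -n test-pods -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: test-pods-mem-defaults
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 1Gi    # default request for jobs that don't set one
    default:
      memory: 7Gi    # default limit, sized for the hungriest (bazel) jobs
EOF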

FYI, current status:

senlu@senlu:~/work/src/k8s.io/test-infra/prow$ kubectl top nodes
NAME                                  CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%   
gke-prow-default-pool-42819f20-bg3z   3605m        91%       6648Mi          53%       
gke-prow-default-pool-42819f20-2jmh   3532m        90%       8902Mi          71%       
gke-prow-default-pool-42819f20-1rjm   3249m        82%       7587Mi          61%       
gke-prow-default-pool-42819f20-c81m   3342m        85%       6546Mi          52%       
gke-prow-default-pool-42819f20-hmx1   2577m        65%       10963Mi         88%       
gke-prow-default-pool-42819f20-nlk6   3290m        83%       10185Mi         82%       
gke-prow-default-pool-42819f20-z1v4   80m          2%        5811Mi          46%       
gke-prow-default-pool-42819f20-frsc   2323m        59%       12538Mi         101%      
gke-prow-default-pool-42819f20-nh4x   345m         8%        6499Mi          52%       
gke-prow-default-pool-42819f20-v12h   1151m        29%       8257Mi          66%       
gke-prow-default-pool-42819f20-pbpz   3577m        91%       4425Mi          35%       
gke-prow-default-pool-42819f20-8lxf   3180m        81%       7657Mi          61%       
gke-prow-default-pool-42819f20-4j5c   530m         13%       10118Mi         81%       
gke-prow-default-pool-42819f20-pl3m   246m         6%        8003Mi          64%       
gke-prow-default-pool-42819f20-25fm   272m         6%        8132Mi          65%       
gke-prow-default-pool-42819f20-4nvp   3381m        86%       9984Mi          80%       
gke-prow-default-pool-42819f20-nwd2   80m          2%        6970Mi          56%       
gke-prow-default-pool-42819f20-m0wk   98m          2%        6385Mi          51%       
gke-prow-default-pool-42819f20-2l32   2671m        68%       9209Mi          74%       
gke-prow-default-pool-42819f20-4dc6   212m         5%        8109Mi          65%       
gke-prow-default-pool-42819f20-j3b8   283m         7%        9262Mi          74%       
gke-prow-default-pool-42819f20-kh1l   2959m        75%       10672Mi         86%       
gke-prow-default-pool-42819f20-dghc   108m         2%        3497Mi          28%       
gke-prow-default-pool-42819f20-28z7   114m         2%        8246Mi          66%       
gke-prow-default-pool-42819f20-2vp5   993m         25%       5592Mi          45%       
gke-prow-default-pool-42819f20-pc5n   2990m        76%       12065Mi         97%       
gke-prow-default-pool-42819f20-cmh7   188m         4%        7191Mi          57%       
gke-prow-default-pool-42819f20-4kjl   3388m        86%       5398Mi          43%       
gke-prow-default-pool-42819f20-7kk1   3958m        100%      6922Mi          55%       
gke-prow-default-pool-42819f20-3snv   3367m        85%       7845Mi          63%       
gke-prow-default-pool-42819f20-b8d8   2742m        69%       11205Mi         90%       
gke-prow-default-pool-42819f20-wz3r   3215m        82%       7935Mi          63%       
gke-prow-default-pool-42819f20-spc8   1101m        28%       11321Mi         91%       
gke-prow-default-pool-42819f20-jvtd   1755m        44%       7266Mi          58%       
gke-prow-default-pool-42819f20-svsn   172m         4%        1414Mi          11%       
gke-prow-default-pool-42819f20-q3k3   3350m        85%       2498Mi          20%       
gke-prow-default-pool-42819f20-xk8f   394m         10%       7162Mi          57%       
gke-prow-default-pool-42819f20-qrzf   284m         7%        992Mi           7%        
gke-prow-default-pool-42819f20-l0zx   1406m        35%       2144Mi          17%       
gke-prow-default-pool-42819f20-sxtk   2679m        68%       3301Mi          26%       
gke-prow-default-pool-42819f20-d2xp   2383m        60%       1113Mi          8%        
gke-prow-default-pool-42819f20-t779   404m         10%       2483Mi          20%       
gke-prow-default-pool-42819f20-s4gf   2735m        69%       10729Mi         86%       
gke-prow-default-pool-42819f20-v9zm   3257m        83%       8000Mi          64%       
gke-prow-default-pool-42819f20-m58t   3288m        83%       8708Mi          70%       
gke-prow-default-pool-42819f20-xf8k   265m         6%        10336Mi         83%       
gke-prow-default-pool-42819f20-wn4n   78m          1%        8999Mi          72%       
gke-prow-default-pool-42819f20-2bsd   198m         5%        6057Mi          48%       
gke-prow-default-pool-42819f20-mpp3   287m         7%        8714Mi          70%       
gke-prow-default-pool-42819f20-t5rd   3283m        83%       8523Mi          68%       
gke-prow-default-pool-42819f20-6r8w   2457m        62%       8427Mi          67%       
gke-prow-default-pool-42819f20-4tkh   316m         8%        2865Mi          23%       
gke-prow-default-pool-42819f20-g532   2398m        61%       9238Mi          74%       
gke-prow-default-pool-42819f20-7768   3483m        88%       4516Mi          36%       
gke-prow-default-pool-42819f20-zs96   1856m        47%       8527Mi          68%       
gke-prow-default-pool-42819f20-34vx   2311m        58%       1334Mi          10%       
gke-prow-default-pool-42819f20-9xfn   89m          2%        7234Mi          58%       
gke-prow-default-pool-42819f20-kt11   3741m        95%       5641Mi          45%       
gke-prow-default-pool-42819f20-kwsv   68m          1%        7879Mi          63%       
gke-prow-default-pool-42819f20-02sl   140m         3%        2672Mi          21%       
gke-prow-default-pool-42819f20-vw7s   3874m        98%       10005Mi         80%       
gke-prow-default-pool-42819f20-1rh9   3134m        79%       9354Mi          75%       
gke-prow-default-pool-42819f20-27rp   3233m        82%       8003Mi          64%       
gke-prow-default-pool-42819f20-5t9b   3531m        90%       3639Mi          29%       
gke-prow-default-pool-42819f20-qqgc   3997m        101%      9240Mi          74%       
gke-prow-default-pool-42819f20-fptg   407m         10%       6639Mi          53%       
gke-prow-default-pool-42819f20-sx26   3357m        85%       9818Mi          79%       
gke-prow-default-pool-42819f20-8h86   798m         20%       7891Mi          63%       
gke-prow-default-pool-42819f20-vj85   131m         3%        9127Mi          73%       
gke-prow-default-pool-42819f20-pzzv   84m          2%        6506Mi          52%       
gke-prow-default-pool-42819f20-4kqg   693m         17%       8760Mi          70%       
gke-prow-default-pool-42819f20-vw5l   2132m        54%       11611Mi         93%       
gke-prow-default-pool-42819f20-cw5p   231m         5%        8284Mi          66%       
gke-prow-default-pool-42819f20-ls8z   2102m        53%       10194Mi         82%       
gke-prow-default-pool-42819f20-wwgr   264m         6%        7205Mi          58%       
gke-prow-default-pool-42819f20-j8sq   2737m        69%       9195Mi          74%       
gke-prow-default-pool-42819f20-p3jz   290m         7%        6655Mi          53%       
gke-prow-default-pool-42819f20-j4bw   3354m        85%       10976Mi         88%       
gke-prow-default-pool-42819f20-96n1   3514m        89%       4960Mi          39%       
gke-prow-default-pool-42819f20-pv9n   3547m        90%       8574Mi          69%       
gke-prow-default-pool-42819f20-z495   3406m        86%       9855Mi          79%       
gke-prow-default-pool-42819f20-bnfq   2858m        72%       9878Mi          79%       
gke-prow-default-pool-42819f20-bv6m   2408m        61%       3066Mi          24%       
gke-prow-default-pool-42819f20-lg65   3585m        91%       8852Mi          71%       
gke-prow-default-pool-42819f20-d4cf   1943m        49%       9798Mi          78%       
gke-prow-default-pool-42819f20-zh2h   377m         9%        6144Mi          49%       
gke-prow-default-pool-42819f20-jgn8   110m         2%        6962Mi          56%       
gke-prow-default-pool-42819f20-rl9t   107m         2%        2331Mi          18%       
gke-prow-default-pool-42819f20-0q1j   3154m        80%       11291Mi         90%       
gke-prow-default-pool-42819f20-r9z8   96m          2%        6836Mi          55% 

You can get pod-level stats by doing something like:

gcloud compute ssh gke-prow-default-pool-42819f20-zs96 --project=k8s-prow-builds -- curl localhost:10255/stats/summary

I'll try to write a script to collect some data.
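
Something along these lines, roughly (a sketch, not the final script; the jq field selection is illustrative): walk every node and dump per-container working-set memory from the kubelet's read-only summary endpoint.

for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== ${node} ==="
  gcloud compute ssh "${node}" --project=k8s-prow-builds -- \
    curl -s localhost:10255/stats/summary |
    jq -r '.pods[]
           | .podRef.name as $pod
           | .containers[]?
           | "\($pod)/\(.name)\t\(.memory.workingSetBytes)"'
done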

This seems to still be a problem. #5457 was deployed before the weekend, right?

$ kubectl get po -n=test-pods -a | grep "OOMKilled" | wc -l
56

@cjwagner that adds a lower bound as the default memory request; we still need to add a higher bound for the build jobs.

Seems we already have the resources fields in type.go; I'll add them for the jobs that use bazel build.
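
To figure out how high to set them, a rough helper along these lines lists the test pods that were OOMKilled together with whatever memory limit they ran with (the jq filter is just a sketch):

kubectl get po -n test-pods -a -o json |
  jq -r '.items[]
         | select([.status.containerStatuses[]?.state.terminated.reason] | index("OOMKilled"))
         | "\(.metadata.name)\t\(.spec.containers[0].resources.limits.memory // "no-limit")"'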

I've been ad-hoc monitoring this in the background with a small script around kubectl that I had lying around from previous issues, and we're still looking at:
2017-11-14 05:53:37.007705: {'Completed': 235, 'Error': 131, 'Evicted': 2, 'OOMKilled': 74, 'Pending': 63, 'Running': 352}
(Edit: updated to UTC)
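
For reference, the monitor is nothing fancy; something along these lines (a sketch, not the exact script): count test pods by status every minute.

while true; do
  date -u +'%Y-%m-%d %H:%M:%S'
  kubectl get po -n test-pods -a --no-headers |
    awk '{count[$3]++} END {for (s in count) printf "%s: %d\n", s, count[s]}'
  sleep 60
done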

Trying to turn on some n1-highmem-4 instances. On the other hand, it seems the prow-build cluster only lets me choose 1.7.8-gke0 rather than 1.8.x?

Edit: well, all the nodes in the cluster have to be the same version.

Status update: with 25 highmem nodes there seem to be no more backlogged pods (for now), but in the meantime we might still want other solutions, e.g. https://stackoverflow.com/questions/37312581/can-i-release-some-memory-of-a-running-docker-container-on-the-fly/37312820#37312820

Going to add this to kubetest today; I tested it out last night on some GPU jobs with kubectl exec and it seems to work fine.
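
If that answer is the usual drop_caches trick, then via kubectl exec it looks roughly like the following (pod name is a placeholder, the container needs enough privilege to write to /proc/sys/vm, and echo 3 instead of 1 also drops dentries and inodes):

POD=pull-kubernetes-bazel-build-12345   # hypothetical pod name
kubectl exec -n test-pods "${POD}" -- \
  sh -c 'sync && echo 1 > /proc/sys/vm/drop_caches'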

2017-11-14 19:49:13.335810: {'Completed': 353, 'ContainerCreating': 1, 'Error': 161, 'Evicted': 2, 'OOMKilled': 2, 'Running': 302}

Much better, but not perfect.

2017-11-14 23:30:02.823312: {'Completed': 238, 'Error': 112, 'Evicted': 2, 'Running': 209}

We're in a pretty good place for the moment.

Memory is definitely still something we need to tune: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/test-infra/5585/pull-test-infra-bazel/10242/

I'm planning to try to increase our build cluster capacity, which should help a bit, but we also need to be making reasonably accurate resource requests where possible.

https://k8s-testgrid.appspot.com/sig-release-1.9-all#gce-1.8-1.9-upgrade-cluster-skew

I think that's still happening; the pod is rescheduled right away, so we won't catch the state.

We need to tighten up the resource requests and flip on that new kubetest
flag, but I'm also going to improve the node sizes O(Friday)

@BenTheElder don't worry about this too much :-) 🦃 first

Noting: we did improve the build cluster sizing quite a bit. Much more detail on the xref'd issue above (#5700).

This has been improved: I've migrated to a 50/50 split of n1-highmem-8 and n1-standard-4 nodes, so more of our nodes can handle the large tasks we've moved from Jenkins, and I've also increased the total number of nodes a bit.

Shall we close this as well?

/close
