Kubeadm: Update CoreDNS to v1.12 to fix OOM & restart

Created on 7 Aug 2018 · 42Comments · Source: kubernetes/kubeadm

BUG REPORT

Versions

kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:50:16Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release): Ubuntu 16.04 LTS X64
Kernel (e.g. uname -a): 4.4.0-91-generic #114-Ubuntu SMP
Others:

What happened?

core dns keep oom & restart, other pod works fine

get pod status

NAMESPACE NAME READY STATUS RESTARTS AGE
....
kube-system coredns-78fcdf6894-ls2q4 0/1 CrashLoopBackOff 12 1h
kube-system coredns-78fcdf6894-xn75c 0/1 CrashLoopBackOff 12 1h
....

describ the pod

Name: coredns-78fcdf6894-ls2q4
Namespace: kube-system
Priority: 0
PriorityClassName:
Node: k8s1/172.21.0.8
Start Time: Tue, 07 Aug 2018 11:59:37 +0800
Labels: k8s-app=kube-dns
pod-template-hash=3497892450
Annotations: cni.projectcalico.org/podIP=192.168.0.7/32
Status: Running
IP: 192.168.0.7
Controlled By: ReplicaSet/coredns-78fcdf6894
Containers:
coredns:
Container ID: docker://519046f837c93439a77d75288e6d630cdbcefe875b0bdb6aa5409d566070ec03
Image: k8s.gcr.io/coredns:1.1.3
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:db2bf53126ed1c761d5a41f24a1b82a461c85f736ff6e90542e9522be4757848
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 07 Aug 2018 13:07:21 +0800
Finished: Tue, 07 Aug 2018 13:08:21 +0800
Ready: False
Restart Count: 12
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Environment:
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-tsv2g (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-tsv2g:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-tsv2g
Optional: false
QoS Class: Burstable
Node-Selectors:
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 44m kubelet, k8s1 Liveness probe failed: Get http://192.168.0.7:8080/health: dial tcp 192.168.0.7:8080: connect: connection refused
Normal Pulled 41m (x5 over 1h) kubelet, k8s1 Container image "k8s.gcr.io/coredns:1.1.3" already present on machine
Normal Created 41m (x5 over 1h) kubelet, k8s1 Created container
Normal Started 41m (x5 over 1h) kubelet, k8s1 Started container
Warning Unhealthy 40m kubelet, k8s1 Liveness probe failed: Get http://192.168.0.7:8080/health: read tcp 172.21.0.8:40972->192.168.0.7:8080: read: connection reset by peer
Warning Unhealthy 34m (x2 over 38m) kubelet, k8s1 Liveness probe failed: Get http://192.168.0.7:8080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning BackOff 4m (x124 over 44m) kubelet, k8s1 Back-off restarting failed container

logs of pod

.:53
CoreDNS-1.1.3
linux/amd64, go1.10.1, b0fd575c
2018/08/07 05:13:27 [INFO] CoreDNS-1.1.3
2018/08/07 05:13:27 [INFO] linux/amd64, go1.10.1, b0fd575c
2018/08/07 05:13:27 [INFO] plugin/reload: Running configuration MD5 = 2a066f12ec80aeb2b92740dd74c17138

ram usage of master

          total        used        free      shared  buff/cache   available

Mem: 1872 711 365 8 795 960
Swap: 0 0 0

ram usage of slave

          total        used        free      shared  buff/cache   available

Mem: 1872 392 78 17 1400 1250
Swap: 0 0 0

What you expected to happen?

core dns keep working and not restart

How to reproduce it (as minimally and precisely as possible)?

kubeadm init --apiserver-advertise-address=10.4.96.3 --pod-network-cidr=192.168.0.0/16
use calico network mode

join on second slave machine

node status is ready for both
kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s1 Ready master 1h v1.11.1
k8s2 Ready 1h v1.11.1

Anything else we need to know?

I'm doing test on host with 2GB RAM, not sure if it is too small for k8s

kinbug lifecyclactive

Source

liheyuan

All 42 comments

There is a known issue on Ubuntu, where kubeadm sets up CoreDNS (and also kube-dns) incorrectly.
If Ubuntu is using resolved as it does by default in recent versions, then its /etc/resolv.conf contains a localhost address (127.0.0.53). Kubernetes pushes this configuration to all "default" dns policy pods, so when they forward lookups upstream, it comes back at them ... looping until oom.

The fixes are to update kubelet to use the correct resolv.conf(the one that resolved uses). Or you can directly configure the upstream proxy in your CoreDNS configmap (but that doesnt fix the issue for other pods that might have the "default" dns policy). Or you can disable resolved on the nodes.

chrisohaver on 7 Aug 2018

FYI, the kubelet flag is --resolv-conf=<path>

chrisohaver on 7 Aug 2018

There is a known issue on Ubuntu, where kubeadm sets up CoreDNS (and also kube-dns) incorrectly.

To be more correct, its kubelet that is set up incorrectly, not coredns/kubedns directly.

Next version of CoreDNS will be able to detect this misconfiguration and put warnings/errors in the logs. But thats not a fix. It just makes the failure less mysterious.

Not sure if its up to kubeadm to detect use of resolved and adjust the kubelet config accordingly during kubeadm init automatically. Perhaps it should do a preflight check it can look for local addresses in the /etc/resolv.conf, or for presence of resolved running, and then warn the user.

chrisohaver on 7 Aug 2018

hi @liheyuan ,

1) when you start kubeadm init kubeadm should generate a file /var/lib/kubelet/kubeadm-flags.env that should handle the systemd-resolved issue automatically for you:
https://kubernetes.io/docs/setup/independent/kubelet-integration/

what are the contents of the file when you start kubeadm init and is your OS using systemd-resolved?

2) have you tried a different CNI e.g. weavenet or flannel?

neolit123 on 7 Aug 2018

Ah... I didn't know about https://github.com/kubernetes/kubernetes#64665. Good to know!

chrisohaver on 7 Aug 2018

👍1

@chrisohaver Thanks for your reply ,I'll have a try
@neolit123 Thank you. flannel not work in our scenario, may be I'll try weavenet in the future.

Also @chrisohaver @neolit123 I try to modify the core dns's Pod define, increase memory limit from 170MB(default) to 256MB, and it works like a charm... May be this is another solution.

liheyuan on 8 Aug 2018

@liheyuan

170MB(default) to 256MB, and it works like a charm... May be this is another solution.

thanks for finding that.

@chrisohaver
do you have an idea why the memory cap causes a problem?
i think it's OK to keep the issue here in case you'd suggest us to bump the memory cap to 256mb in the kubeadm manifest.

neolit123 on 8 Aug 2018

do you have an idea why the memory cap causes a problem?

No - in fact, the CoreDNS manifests don't have a memory cap defined by default. So I don't know where the cap was introduced. Possibly in this cluster, kube-system has a default container memory limit? Though I don't think thats a default setting either.

chrisohaver on 8 Aug 2018

@liheyuan thanks for noticing the low memory cap.

By any chance, did you add the initial 170 memory limit to the coredns deployment, or perhaps add a container memory limit to the kube-system namespace? Trying to understand how the limit was introduced in your case.

chrisohaver on 8 Aug 2018

@chrisohaver I'm not sure, the 170MB limit was found when I export coredns's yaml using kubectl

liheyuan on 12 Aug 2018

I'm also using kubeadm to launch a local kube cluster and am running into the same issue. I also have the 170Mi cap in the yaml for the coredns deployment. I can't seem to get it workingm unlike @liheyuan. After i kubeadm init I see nothing related to systemd-resolved @neolit123, am I doing anything wrong? I have the most recent version of kubeadm.