Rke: Kubernetes 1.11.1 nodes occasionally do not register internal IP address

Created on 14 Aug 2018 · 16 Comments · Source: rancher/rke

Rancher versions:
rancher/server or rancher/rancher: 2.0.7
rancher/agent or rancher/rancher-agent: 2.0.6

Docker version: (docker version,docker info preferred)
Server:
Engine:
Version: 18.03.1-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.6
Git commit: 9ee9f40
Built: Thu Apr 26 04:27:49 2018
OS/Arch: linux/amd64
Experimental: false

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1800.6.0
VERSION_ID=1800.6.0
BUILD_ID=2018-08-04-0323
PRETTY_NAME="Container Linux by CoreOS 1800.6.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
OpenStack

Steps to Reproduce:
Deploy Kubernetes 1.11.1 cluster with RKE using the rke_config.yml

rke config:

addon_job_timeout: 30
authentication: 
  strategy: "x509"
ignore_docker_version: true

cloud_provider:
  name: openstack
  openstackCloudProvider:
    global:
      username: {{ openstack_username }}
      password: {{ openstack_password }}
      auth-url: {{ openstack_auth_url }}
      tenant-id: {{ openstack_tenant_id }}
      domain-id: {{ openstack_domain_id }}
    block_storage:
      ignore-volume-az: false

ingress: 
  provider: "none"

kubernetes_version: 1.11.1

network: 
  plugin: "canal"

services: 
  etcd: 
    extra_args: 
      heartbeat-interval: 500
      election-timeout: 5000
    snapshot: false
  kubelet:
    extra_args:
      authentication-token-webhook: true
  kube_api: 
    pod_security_policy: false
    extra_args:
      requestheader-client-ca-file: "/etc/kubernetes/ssl/kube-ca.pem"
      requestheader-extra-headers-prefix: "X-Remote-Extra-"
      requestheader-group-headers: "X-Remote-Group"
      requestheader-username-headers: "X-Remote-User"
      proxy-client-cert-file: "/etc/kubernetes/ssl/kube-proxy.pem"
      proxy-client-key-file: "/etc/kubernetes/ssl/kube-proxy-key.pem"

ssh_agent_auth: false

Results

~ $ kubectl get nodes -o wide
NAME                                     STATUS    ROLES               AGE       VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                        KERNEL-VERSION      CONTAINER-RUNTIME
k8s-corp-prod-0-master-us-corp-kc-8a-0   Ready     controlplane,etcd   3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8b-1   Ready     controlplane,etcd   3d        v1.11.1   10.144.6.137    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8c-2   Ready     controlplane,etcd   3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-0   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-1   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-2   Ready     worker              3d        v1.11.1   10.144.2.141    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-0   Ready     worker              3d        v1.11.1   10.144.6.142    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-1   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-2   Ready     worker              3d        v1.11.1   10.144.6.145    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-0   Ready     worker              3d        v1.11.1   10.144.10.137   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-1   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-2   Ready     worker              3d        v1.11.1   10.144.10.148   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1

A restart of the kubelet container on the affected nodes resolves this issue.
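
To find which nodes need the kubelet restart without reading the table by eye, the check can be scripted. The following is a minimal sketch, assuming the input is the parsed output of `kubectl get nodes -o json`; the function name `nodes_missing_internal_ip` is made up for illustration.

```python
import json


def nodes_missing_internal_ip(node_list):
    """Return the names of nodes whose status carries no InternalIP address.

    `node_list` is the dict obtained by parsing `kubectl get nodes -o json`.
    """
    missing = []
    for node in node_list.get("items", []):
        # status.addresses is a list of {"type": ..., "address": ...} entries;
        # it can be absent or empty on an affected node.
        addresses = node.get("status", {}).get("addresses") or []
        has_internal = any(
            entry.get("type") == "InternalIP" and entry.get("address")
            for entry in addresses
        )
        if not has_internal:
            missing.append(node["metadata"]["name"])
    return missing


def nodes_missing_internal_ip_from_text(raw_json):
    """Convenience wrapper for raw JSON text, e.g. read from a file or a pipe."""
    return nodes_missing_internal_ip(json.loads(raw_json))
```

One way to use it is to save the output of `kubectl get nodes -o json` to a file, pass its contents to `nodes_missing_internal_ip_from_text`, and restart the kubelet container only on the nodes it returns.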

internal kind/bug status/has-dependency

twittyc

👍2

All 16 comments

I just saw this in our 1.11 cluster yesterday as well. Have not had a chance to investigate. It's causing the key vault flex volume to fail.

HighwayofLife on 14 Aug 2018

If a restart of the kubelet fixes this, can you provide the kubelet logs, @twittyc?

superseb on 14 Aug 2018

I've added the full log file from one of the affected nodes to this gist:
https://gist.github.com/twittyc/1878c60d78979e92acb87bb5997b2777

twittyc on 14 Aug 2018

At some point all the nodes in our cluster lost their internal IP. (The two with IPs had their kubelet container restarted.)

NAME                                     STATUS    ROLES               AGE       VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                        KERNEL-VERSION      CONTAINER-RUNTIME
k8s-corp-prod-0-master-us-corp-kc-8a-0   Ready     controlplane,etcd   4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8b-1   Ready     controlplane,etcd   4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8c-2   Ready     controlplane,etcd   4d        v1.11.1   10.144.10.134   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-0   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-1   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-2   Ready     worker              4d        v1.11.1   10.144.2.141    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-0   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-1   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-2   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-0   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-1   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-2   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1

Here are the kubelet logs from a node that regained its internal IP after a restart of the kubelet container:
https://gist.github.com/twittyc/40ea96e29c51a1c6018d2a045ae537c2

twittyc on 14 Aug 2018

I am unable to reproduce this on a bare-metal setup:

$ kubectl get nodes -o wide
NAME      STATUS    ROLES                      AGE       VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                        KERNEL-VERSION      CONTAINER-RUNTIME
node-1    Ready     controlplane,etcd,worker   1d        v1.11.1   172.31.18.219   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
node-2    Ready     worker                     1d        v1.11.1   172.31.20.39    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
node-3    Ready     worker                     1d        v1.11.1   172.31.22.143   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
node-4    Ready     worker                     1d        v1.11.1   172.31.29.29    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1

I will try with a cloud provider and see how it goes.

moelsayed on 16 Aug 2018

I think this is more an issue with Kubernetes and the OpenStack cloud provider, possibly combined with an unstable OpenStack API endpoint. Please also refer to my issue, which contains some log excerpts: https://github.com/kubernetes/cloud-provider-openstack/issues/280

stieler-it on 4 Sep 2018

👍1

This isn't specific to the OpenStack cloud provider. We're using Azure and encountered the lost Internal IPs as well.

HighwayofLife on 4 Sep 2018

This problem is reported in Kubernetes: https://github.com/kubernetes/kubernetes/issues/68270. The kubelet fails to get the address of the node from the cloud provider and fails to update the node status. I will keep this issue open until the issue in k8s is resolved.
I can see the following logs in @twittyc's gist:

cat kubelet-log.json |  grep -v "Volume not attached" | grep "node status" 
{"log":"E0813 20:38:56.118941   18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-13T20:38:56.121316062Z"}
{"log":"E0813 23:50:34.361091   18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-13T23:50:34.361316592Z"}
{"log":"W0814 01:13:16.823637   18989 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s\n","stream":"stderr","time":"2018-08-14T01:13:16.823960043Z"}
{"log":"E0814 07:01:54.730030   18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-14T07:01:54.731184923Z"}

galal-hussein on 6 Sep 2018

👍1
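
The grep pipeline above can also be expressed as a small script, which helps when the log is large or the timestamps need further processing. This is a rough sketch, assuming the file uses one JSON object per line with `log` and `time` fields, as in the excerpt; the function name `node_status_messages` is hypothetical. Unlike the grep, it matches against the decoded `log` field rather than the raw line.

```python
import json


def node_status_messages(lines):
    """Yield (time, message) pairs for log entries that mention "node status".

    Entries containing "Volume not attached" are skipped, mirroring the
    `grep -v` in the shell pipeline. Lines that are not JSON objects are ignored.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        message = entry.get("log", "")
        if "Volume not attached" in message:
            continue
        if "node status" in message:
            yield entry.get("time", ""), message.rstrip("\n")


def cloud_provider_timeouts(lines):
    """Return only the entries where the cloud provider address lookup failed."""
    marker = "failed to get node address from cloud provider"
    return [(t, m) for t, m in node_status_messages(lines) if marker in m]
```

Passing an open file handle for `kubelet-log.json` to `cloud_provider_timeouts` would return entries like the `Timeout after 10s` warning shown above.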

Looks like the issue has been fixed in k8s 1.12, and there are currently requests to backport it to 1.11, but no confirmation yet that it will be backported.

alena1108 on 22 Oct 2018

The issue has been fixed in k8s 1.11.6; the fix will be available once we make v1.11.6 available in RKE and Rancher.

alena1108 on 18 Dec 2018

We upgraded to 1.11.5 a few days ago and the bug that nodes lose their internal IP has not occurred again.

stieler-it on 18 Dec 2018

@stieler-it the fix is not part of 1.11.5; it is part of k8s v1.11.6. So I wonder whether the bug not occurring after the upgrade to v1.11.5 on your setup is a coincidence.

alena1108 on 19 Dec 2018
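
The version boundary discussed above can be written down as a quick check. This is only a sketch based on what this thread states (fixed in 1.12, backported in 1.11.6); the helper names are made up, and treating releases older than 1.11 as "no fix" is an assumption, since the thread does not discuss them.

```python
def parse_version(version):
    """Turn 'v1.11.6', '1.11.6' or 'v1.11.6-rancher1-1' into a tuple of ints."""
    # Drop a leading "v" and any suffix after "-" or "+".
    core = version.lstrip("v").split("-")[0].split("+")[0]
    return tuple(int(part) for part in core.split("."))


def includes_internal_ip_fix(version):
    """Return True if the version contains the fix, per the comments in this thread."""
    # Pad to three components so that "1.12" is read as 1.12.0.
    major, minor, patch = (parse_version(version) + (0, 0))[:3]
    if (major, minor) == (1, 11):
        # Backport landed in 1.11.6 according to the thread.
        return patch >= 6
    # 1.12 and later are reported as fixed; older lines are assumed unfixed.
    return (major, minor) >= (1, 12)
```

With this check, `v1.11.5` is reported as not containing the fix, which matches the point made above that the improvement seen on 1.11.5 may have another cause.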

@alena1108 Ok, good to know. However, the bug appeared pretty often (every few hours at least) and now didn't appear for like 6 days. Maybe something else mitigated the problem - or it is just luck so far. We'll see.

stieler-it on 19 Dec 2018

Validated # 1 on master 1/4
Steps:

Ran Rancher server version master 1/4 (single install)
Ran test scripts on this cluster:
- Ubuntu 16.04 docker 17.03.2-ce, k8s 1.11.6, 1 control plane, 1 etcd, and 3 workers
Tests cover the following areas:
- workload
- dns
- rbac
- communication
- ingress
- secret
- registry
- service discovery

Validated # 2 on v2.1.5-rc3 which is derived from master 1/2
Steps:

Ran Rancher server version v2.1.5-rc3 (single install)
Ran test scripts on three clusters:
- Ubuntu 16.04 docker 17.03.2-ce, k8s 1.11.6
- RancherOS 1.4.2 docker 17.03.2-ce, k8s 1.11.6, 1 control plane, 1 etcd, and 3 workers
- RHEL 7.5, native docker 1.13 with SELinux enabled, k8s 1.11.6, 1 control plane, 1 etcd, and 3 workers
Tests cover the following areas:
- workload
- dns
- rbac
- communication
- ingress
- secret
- registry
- service discovery