Rancher versions:
rancher/server or rancher/rancher: 2.0.7
rancher/agent or rancher/rancher-agent: 2.0.6
Docker version: (docker version, docker info preferred)
Server:
 Engine:
  Version: 18.03.1-ce
  API version: 1.37 (minimum version 1.12)
  Go version: go1.9.6
  Git commit: 9ee9f40
  Built: Thu Apr 26 04:27:49 2018
  OS/Arch: linux/amd64
  Experimental: false
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1800.6.0
VERSION_ID=1800.6.0
BUILD_ID=2018-08-04-0323
PRETTY_NAME="Container Linux by CoreOS 1800.6.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
OpenStack
Steps to Reproduce:
Deploy a Kubernetes 1.11.1 cluster with RKE using the following rke_config.yml.
rke config:
addon_job_timeout: 30
authentication:
  strategy: "x509"
ignore_docker_version: true
cloud_provider:
  name: openstack
  openstackCloudProvider:
    global:
      username: {{ openstack_username }}
      password: {{ openstack_password }}
      auth-url: {{ openstack_auth_url }}
      tenant-id: {{ openstack_tenant_id }}
      domain-id: {{ openstack_domain_id }}
    block_storage:
      ignore-volume-az: false
ingress:
  provider: "none"
kubernetes_version: 1.11.1
network:
  plugin: "canal"
services:
  etcd:
    extra_args:
      heartbeat-interval: 500
      election-timeout: 5000
    snapshot: false
  kubelet:
    extra_args:
      authentication-token-webhook: true
  kube_api:
    pod_security_policy: false
    extra_args:
      requestheader-client-ca-file: "/etc/kubernetes/ssl/kube-ca.pem"
      requestheader-extra-headers-prefix: "X-Remote-Extra-"
      requestheader-group-headers: "X-Remote-Group"
      requestheader-username-headers: "X-Remote-User"
      proxy-client-cert-file: "/etc/kubernetes/ssl/kube-proxy.pem"
      proxy-client-key-file: "/etc/kubernetes/ssl/kube-proxy-key.pem"
ssh_agent_auth: false
Results
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-corp-prod-0-master-us-corp-kc-8a-0 Ready controlplane,etcd 3d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8b-1 Ready controlplane,etcd 3d v1.11.1 10.144.6.137 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8c-2 Ready controlplane,etcd 3d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-0 Ready worker 3d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-1 Ready worker 3d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-2 Ready worker 3d v1.11.1 10.144.2.141 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-0 Ready worker 3d v1.11.1 10.144.6.142 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-1 Ready worker 3d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-2 Ready worker 3d v1.11.1 10.144.6.145 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-0 Ready worker 3d v1.11.1 10.144.10.137 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-1 Ready worker 3d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-2 Ready worker 3d v1.11.1 10.144.10.148 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
A restart of the kubelet container on the affected nodes resolves this issue.
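To spot affected nodes without reading the table by eye, one option is to inspect the address list that each node reports. The following is a minimal Python sketch and not part of the original report. It assumes the JSON shape produced by `kubectl get nodes -o json` (`items[].status.addresses[]`, where one entry has `type` set to `InternalIP`). The function name `nodes_missing_internal_ip`, the trimmed sample data, and the `Hostname` entries in it are hypothetical.

```python
from typing import List


def nodes_missing_internal_ip(node_list: dict) -> List[str]:
    """Return the names of nodes that report no InternalIP address."""
    missing = []
    for node in node_list.get("items", []):
        addresses = node.get("status", {}).get("addresses") or []
        has_internal_ip = any(
            a.get("type") == "InternalIP" and a.get("address")
            for a in addresses
        )
        if not has_internal_ip:
            missing.append(node["metadata"]["name"])
    return missing


# Hypothetical, heavily trimmed sample in the shape of `kubectl get nodes -o json`.
sample = {
    "items": [
        {
            "metadata": {"name": "k8s-corp-prod-0-master-us-corp-kc-8a-0"},
            "status": {
                "addresses": [
                    {"type": "Hostname",
                     "address": "k8s-corp-prod-0-master-us-corp-kc-8a-0"},
                ]
            },
        },
        {
            "metadata": {"name": "k8s-corp-prod-0-master-us-corp-kc-8b-1"},
            "status": {
                "addresses": [
                    {"type": "InternalIP", "address": "10.144.6.137"},
                    {"type": "Hostname",
                     "address": "k8s-corp-prod-0-master-us-corp-kc-8b-1"},
                ]
            },
        },
    ]
}

print(nodes_missing_internal_ip(sample))
# prints ['k8s-corp-prod-0-master-us-corp-kc-8a-0']
```

Against a live cluster, the same function could be fed `json.load(sys.stdin)` with the output of `kubectl get nodes -o json` piped in. The listed nodes would then be the candidates for a kubelet container restart.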
I just saw this in our 1.11 cluster yesterday as well. Have not had a chance to investigate. It's causing the key vault flex volume to fail.
If a restart of the kubelet fixes this, can you provide the kubelet logs, @twittyc?
I've added the full log file from one of the affected nodes to this gist:
https://gist.github.com/twittyc/1878c60d78979e92acb87bb5997b2777
At some point, all the nodes in our cluster lost their internal IP. (The two that show IPs had their kubelet container restarted.)
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-corp-prod-0-master-us-corp-kc-8a-0 Ready controlplane,etcd 4d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8b-1 Ready controlplane,etcd 4d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8c-2 Ready controlplane,etcd 4d v1.11.1 10.144.10.134 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-0 Ready worker 4d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-1 Ready worker 4d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-2 Ready worker 4d v1.11.1 10.144.2.141 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-0 Ready worker 4d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-1 Ready worker 4d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-2 Ready worker 4d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-0 Ready worker 4d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-1 Ready worker 4d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-2 Ready worker 4d v1.11.1 <none> <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
Here are the kubelet logs from a node that regained its internal IP after a restart of the kubelet container:
https://gist.github.com/twittyc/40ea96e29c51a1c6018d2a045ae537c2
I am unable to reproduce this on a bare-metal setup:
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node-1 Ready controlplane,etcd,worker 1d v1.11.1 172.31.18.219 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
node-2 Ready worker 1d v1.11.1 172.31.20.39 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
node-3 Ready worker 1d v1.11.1 172.31.22.143 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
node-4 Ready worker 1d v1.11.1 172.31.29.29 <none> Container Linux by CoreOS 1800.6.0 (Rhyolite) 4.14.59-coreos-r2 docker://18.3.1
I will try with a cloud provider and see how it goes.
I think this is more an issue with Kubernetes and the OpenStack cloud provider, possibly combined with an unstable OpenStack API endpoint. Please also refer to my issue, which contains some log excerpts: https://github.com/kubernetes/cloud-provider-openstack/issues/280
This isn't specific to the OpenStack cloud provider. We're using Azure and encountered the lost Internal IPs as well.
This problem is reported upstream in Kubernetes: https://github.com/kubernetes/kubernetes/issues/68270. The kubelet fails to get the address of the node from the cloud provider, and it fails to update the node status. I will keep this issue open until the issue in k8s is resolved.
I can see the following logs in @twittyc's gist:
cat kubelet-log.json | grep -v "Volume not attached" | grep "node status"
{"log":"E0813 20:38:56.118941 18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-13T20:38:56.121316062Z"}
{"log":"E0813 23:50:34.361091 18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-13T23:50:34.361316592Z"}
{"log":"W0814 01:13:16.823637 18989 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s\n","stream":"stderr","time":"2018-08-14T01:13:16.823960043Z"}
{"log":"E0814 07:01:54.730030 18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-14T07:01:54.731184923Z"}
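The same filtering can be done on the decoded messages instead of the raw lines. The sketch below is a rough Python equivalent of the `grep` pipeline above, assuming each line of `kubelet-log.json` is one JSON object with `log`, `stream`, and `time` fields, as in the excerpt. The function name `node_status_messages` and the second sample line are hypothetical. Unlike `grep`, it matches only against the `log` field, not the whole line.

```python
import json
from typing import Iterable, Iterator, Tuple


def node_status_messages(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (time, message) for log entries that mention node status."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        message = entry.get("log", "")
        if "node status" in message and "Volume not attached" not in message:
            yield entry.get("time", ""), message.rstrip("\n")


sample_lines = [
    # Taken from the excerpt above.
    r'{"log":"W0814 01:13:16.823637 18989 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s\n","stream":"stderr","time":"2018-08-14T01:13:16.823960043Z"}',
    # Hypothetical line that the filter should drop.
    r'{"log":"E0814 01:13:17.000000 18989 example.go:1] Volume not attached (hypothetical)\n","stream":"stderr","time":"2018-08-14T01:13:17.000000000Z"}',
]

for timestamp, message in node_status_messages(sample_lines):
    print(timestamp, message)
```

To run it over the full file, the list could be replaced with an open file handle, for example `open("kubelet-log.json")`, since the function accepts any iterable of lines.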
Looks like the issue has been fixed in k8s 1.12, and there are currently requests to backport it to 1.11, but there is no confirmation yet that it will be backported.
The issue has been fixed in k8s 1.11.6; the fix will reach RKE and Rancher once we make v1.11.6 available to them.
We upgraded to 1.11.5 a few days ago and the bug that nodes lose their internal IP has not occurred again.
@stieler-it the fix is not a part of 1.11.5; it is a part of k8s v1.11.6. So I wonder whether the fact that the bug hasn't occurred after the upgrade to v1.11.5 on your setup is coincidental.
@alena1108 Ok, good to know. However, the bug used to appear pretty often (every few hours at least) and has now not appeared for about 6 days. Maybe something else mitigated the problem, or it is just luck so far. We'll see.
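Since the distinction between 1.11.5 and 1.11.6 matters here, a quick check of a version tag against the first fixed patch release may help. This is a minimal Python sketch under stated assumptions: it takes v1.11.6 as the first fixed 1.11.x release, as stated in this thread, and treats any later version (including 1.12) as containing the fix. The names `FIXED_IN`, `parse_k8s_version`, and `has_node_address_fix` are hypothetical, and pre-release suffixes are ignored.

```python
import re
from typing import Tuple

# First release said in this thread to contain the fix (assumption of this sketch).
FIXED_IN = (1, 11, 6)


def parse_k8s_version(tag: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from tags such as 'v1.11.5-rancher1-1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError("no x.y.z version found in %r" % tag)
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def has_node_address_fix(tag: str) -> bool:
    """True if the version is at or above the first fixed release."""
    return parse_k8s_version(tag) >= FIXED_IN


for tag in ("v1.11.1", "k8s:v1.11.5-rancher1-1", "v1.11.6", "v1.12.0"):
    print(tag, has_node_address_fix(tag))
```

Tuples compare element by element, so `(1, 11, 5) >= (1, 11, 6)` is false while `(1, 12, 0) >= (1, 11, 6)` is true, which is what the check relies on.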
Validated # 1 on master 1/4
Steps:
Validated # 2 on v2.1.5-rc3 which is derived from master 1/2
Steps:
Is it possible to get this fixed in the 1.6 branch too? The Kubernetes version there is k8s:v1.11.5-rancher1-1, which also has this problem.
@mrmason we are going to address it there as well; here is the corresponding issue: https://github.com/rancher/rancher/issues/14600