kubeadm on AWS cloud provider deadlocks because of insufficient permissions

Created on 1 Jul 2017 · 14 comments · Source: kubernetes/kubeadm

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version: &version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T22:55:19Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: aws
  • OS (e.g. from /etc/os-release): Centos7
  • Kernel (e.g. uname -a): Linux ip-172-20-0-134.us-west-2.compute.internal 3.10.0-514.21.2.el7.x86_64 #1 SMP Tue Jun 20 12:24:47 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

What happened?

[root@ip-172-20-0-134 centos]# kubeadm init --config=/etc/kubernetes/kubeadm.conf
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[init] Using Kubernetes version: v1.7.0
[init] Using Authorization modes: [Node RBAC]
[init] WARNING: For cloudprovider integrations to work --cloud-provider must be set for all kubelets in the cluster.
    (/etc/systemd/system/kubelet.service.d/10-kubeadm.conf should be edited for this purpose)
[preflight] Running pre-flight checks
[certificates] Generated CA certificate and key.
[certificates] Generated API server certificate and key.
[certificates] API Server serving cert is signed for DNS names [ip-172-20-0-134 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 172.20.0.134]
[certificates] Generated API server kubelet client certificate and key.
[certificates] Generated service account token signing key and public key.
[certificates] Generated front-proxy CA certificate and key.
[certificates] Generated front-proxy client certificate and key.
[certificates] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/scheduler.conf"
[apiclient] Created API client, waiting for the control plane to become ready
[apiclient] All control plane components are healthy after 78.500853 seconds

...hangs forever

What you expected to happen?

It's supposed to continue with:

[apiclient] Waiting for at least one node to register
[apiclient] First node has registered after 3.002484 seconds
[token] Using token: cncfci.geneisbatman4242
[apiconfig] Created RBAC rules
[addons] Created essential addon: kube-proxy
[addons] Created essential addon: kube-dns

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run (as a regular user):

  sudo cp /etc/kubernetes/admin.conf $HOME/
  sudo chown $(id -u):$(id -g) $HOME/admin.conf
  export KUBECONFIG=$HOME/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  http://kubernetes.io/docs/admin/addons/

You can now join any number of machines by running the following on each node
as root:

  kubeadm join --token cncfci.geneisbatman4242 172.20.0.134:6443

How to reproduce it (as minimally and precisely as possible)?

  • stop kubelet
  • kubeadm reset
  • yum remove kubeadm
  • yum install kubeadm-1.6.6
  • kubeadm init --config=/etc/kubernetes/kubeadm.conf

That's the only change and this time the init succeeds (which is where the second paste above comes from).

Anything else we need to know?

This seems to be a regression/repeat of https://github.com/kubernetes/kubernetes/issues/43815, which is a bummer.

Kubernetes v1.7.0 got released two days ago and, well, does kubeadm v1.7.0 get built and released in tandem automatically or something? Nobody tests it first?

People who didn't pin to kubeadm v1.6.6 will suddenly get a bunch of broken clusters this week.
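For anyone who wants to stay on the known-good version until this is fixed, a rough sketch of pinning on CentOS (assuming the yum-plugin-versionlock package is available in your repos):

  sudo yum install -y yum-plugin-versionlock
  sudo yum install -y kubeadm-1.6.6
  # keep a later "yum update" from pulling in the broken v1.7.0 package
  sudo yum versionlock add kubeadm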

The main error to look out for is once again:
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
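One way to watch for that message on the node (a sketch, assuming the kubelet runs as a systemd unit, as it does with the kubeadm packages):

  journalctl -u kubelet -f | grep -i "network plugin is not ready"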

This reproducibly happens only on v1.7.0.

I really hope we do better and catch such things in time for v1.8.0. I'm working on an automated test suite for this; if anybody is interested in that, please ping me.

help wanted kind/bug priority/important-soon

All 14 comments

I have the same problem. Also with CentOS 7 / AWS. Also in my case 1.6.6 works fine.

In my case the kubelet seems to not be allowed to register itself with the credentials created by kubeadm (/etc/kubernetes/kubelet.conf):

Jul 03 22:07:41 ip-10-0-0-33 kubelet[12427]: E0703 22:07:41.154368   12427 kubelet_node_status.go:106] Unable to register node "ip-10-0-0-33.eu-central-1.compute.internal" with API server: nodes "ip-10-0-0-33.eu-central-1.compute.internal" is forbidden: node ip-10-0-0-33 cannot modify node ip-10-0-0-33.eu-central-1.compute.internal

However, even when I reconfigure the kubelet so it is able to register (for example by using the /etc/kubernetes/admin.conf credentials for the kubelet), or even get it to the Ready state by installing the network plugin, kubeadm still seems to be stuck at the same point and doesn't move forward. So I'm not sure what kubeadm is actually waiting for.

I have no clue which part is the actual cause and which is the consequence.

This seems to be a regression/repeat of kubernetes/kubernetes#43815, which is a bummer.

No, it's absolutely not that issue.

Kubernetes v1.7.0 got released two days ago and, well, does kubeadm v1.7.0 get built and released in tandem automatically or something? Nobody tests it first?

We absolutely test things a lot. We have automated CI e2e tests that are green: https://k8s-testgrid.appspot.com/sig-cluster-lifecycle#kubeadm-gce-1.7
However, as with everything, it's hard to test in exactly _your_ environment.

I think the cloud provider is automatically detected by the kubelet (see the --cloud-provider flag description on the kubelet).
That means the kubelet uses custom AWS logic when creating the Node object, and it seems to set the node name based on calls to the AWS API; hence Node Name == ip-10-0-0-33.eu-central-1.compute.internal.

I want to remind you that cloud provider integrations are experimental, as kubeadm will only fully support the out-of-tree cloud providers. The current in-tree providers may well work fine, but we can't promise they will.

This is such a case. The kubelet talks to the AWS API and gets a Node Name that is different from the hostname (ip-10-0-0-33.eu-central-1.compute.internal vs ip-10-0-0-33). The kubelet has CN=system:node:ip-10-0-0-33, but needs CN=system:node:ip-10-0-0-33.eu-central-1.compute.internal in order to be able to modify itself. So the latest security feature enabled in kubeadm, the Node Authorizer, sees the Node API object that the cloud provider has created and the kubelet actually _on_ that node as two different identities.
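A quick way to see which identity the kubelet is actually presenting (a sketch, assuming kubeadm embedded the client certificate into kubelet.conf as base64 client-certificate-data, which is what it does here):

  # print the subject/CN of the client certificate kubeadm generated for the kubelet
  grep client-certificate-data /etc/kubernetes/kubelet.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -subject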

So what can we do? As you see, the flow when bootstrapping a cluster with a cloud provider differs from the normal flow, and even differs a lot between providers (AWS is the only provider that does this AFAIK).
I think fixing #64 will fix this issue as well, but you still have to know the node name beforehand and pass it to kubeadm init.

Do you want to contribute a fix for #64?

I really hope we do better and catch such things in time for v1.8.0. I'm working on an automated test suite for this; if anybody is interested in that, please ping me.

I have no AWS credits; if you want to contribute AWS results, great, and thanks :+1:

Thanks @zilman and @scholzj for the bug report; we appreciate it.
Please bear in mind that this is indeed a very specific edge case that only affects AWS, due to how the cloud provider code there works. It has nothing to do with what happened with v1.6.
And it is much better to secure things more rather than less, so I stand by enabling the Node Authorizer.

@GheRivero volunteered to work on this :tada: (i.e. #64, which will solve this problem)
I can't assign him, so I'm commenting instead...

Status report:

The first of two PRs is up: https://github.com/kubernetes/kubernetes/issues/48538

I expect this to be fixed in v1.7.1, so that you can specify the node name yourself with --node-name.
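Once that flag exists, a minimal sketch of what an init on AWS could look like (assuming the private DNS name from the instance metadata is the name the in-tree AWS provider will pick):

  # query the name the AWS cloud provider will register the node under
  NODE_NAME=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)
  kubeadm init --node-name "$NODE_NAME"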

I can confirm the same behavior with Ubuntu 16.04
"Failed creating a mirror pod for "kube-apiserver-ip-10-0-0-51.ec2.internal_kube-system(fe921e27127eb782227d38f55946771e)": pods "kube-apiserver-ip-10-0-0-51.ec2.internal" is forbidden: node ip-10-0-0-51 can only create pods with spec.nodeName set to itself"

The first patch solves that situation for kubeadm join, but something similar should be done for kubeadm init.

Seeing the same issue as @scholzj on CentOS 7.3 with 1.7.0. Falling back to 1.6.6 for now.

Second part of the fix: kubernetes/kubernetes#48594
This adds the --node-name flag to kubeadm init.

Fixed with v1.7.1

I'm still experiencing this with 1.7.3 during kubeadm init.

@gtaylor You must set --node-name specifically to the name the node will have later (you should query the AWS API)

After reading a handful of these issue threads, I think my issue is that I'm setting a non-default hostname, which leads to the indefinite deadlock, even if kubelet's --hostname-override matches --node-name.

If I am understanding correctly, you can't use the node authorizer with a hostname that doesn't match the one returned by the AWS API. So you're always going to end up having to stick with the default AWS ip-<x>-<x>-<x>-<x> hostnames. But you _can_ pass in a --node-name if you have kept the AWS-provided hostname.

I hope I am misunderstanding something, because we strongly prefer to change our master and node hostnames!
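For anyone unsure whether their hostname setup will trip this, a rough check is to compare what the OS reports against what the metadata service reports; on AWS the Node Authorizer effectively requires the two to line up:

  # name the AWS cloud provider will use for the Node object
  curl -s http://169.254.169.254/latest/meta-data/local-hostname
  # name the kubelet derives from the OS (and that ends up in the certificate CN)
  hostname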

I'm getting this hang:
[apiclient] All control plane components are healthy after XX seconds

with Kubernetes v1.7.5 on Ubuntu 17.04.

Multiple online tutorials, including the official one here, do not work out of the box:
https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/

So I found this comment on GitHub:
https://github.com/kubernetes/kubernetes/issues/33544#issuecomment-249937975

After running that script, kubeadm init started working.

I ran into this issue also and got it to work by updating the hostname of my EC2 instance to match the one returned by the EC2 metadata service. On Ubuntu I added the following steps to my master boot script:

sudo apt-get update && sudo apt-get install -y curl
echo 127.0.0.1 $(curl 169.254.169.254/latest/meta-data/hostname) | sudo tee -a /etc/hosts
curl 169.254.169.254/latest/meta-data/hostname | sudo tee /etc/hostname
sudo hostname $(curl 169.254.169.254/latest/meta-data/hostname)

(adding to /etc/hosts and /etc/hostname should ensure that the hostname change survives a reboot)
