What kops version are you running?
Version 1.8.0 (git-5099bc5)
What Kubernetes version are you running?
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T19:11:02Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T05:17:43Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
What cloud provider are you using?
AWS
What commands did you run? What is the simplest way to reproduce this issue?
First, created the cluster spec with
kops create cluster \
--state s3://XXXXXX \
--name XXXXXX \
--authorization AlwaysAllow \
--cloud aws \
--cloud-labels XXXXXX=XXXXXX,XXXXXX=XXXXXX \
--encrypt-etcd-storage \
--master-size m3.medium \
--ssh-public-key XXXXXX \
--zones eu-west-1a,eu-west-1b,eu-west-1c \
--networking calico \
--dry-run \
--output yaml \
--vpc XXXXXX
Then modified the following fields in the cluster spec:
spec.kubeAPIServer.runtimeConfig.autoscaling/v2beta1: "true"
spec.kubeAPIServer.runtimeConfig.batch/v1beta1: "true"
spec.kubelet.enableCustomMetrics: true
spec.kubeControllerManager.horizontalPodAutoscalerSyncPeriod: "15s"
spec.kubeControllerManager.horizontalPodAutoscalerDownscaleDelay: "5m0s"
spec.kubeControllerManager.horizontalPodAutoscalerUpscaleDelay: "2m0s"
spec.kubeControllerManager.horizontalPodAutoscalerUseRestClients: true
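For anyone reproducing this, a rough sketch of that editing step (cluster.yaml is just an example file name; the dry-run output above was captured to it and edited by hand):
```
# cluster.yaml is a hypothetical file name: the dry-run output from the command
# above was captured with "> cluster.yaml" and then edited to add the
# kubeAPIServer / kubelet / kubeControllerManager fields listed here.
${EDITOR:-vi} cluster.yaml

# Quick sanity check that the edits landed in the spec.
grep -E 'runtimeConfig|enableCustomMetrics|horizontalPodAutoscaler' cluster.yaml
```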
Then created the cluster and the following instance groups (a rough sketch of the creation commands follows the two specs):
On Demand:
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
name: on-demand
labels:
kops.k8s.io/cluster: XXXXXX
spec:
rootVolumeSize: 100
rootVolumeType: gp2
machineType: c4.2xlarge
maxSize: 50
minSize: 0
role: Node
subnets:
- eu-west-1a
- eu-west-1b
- eu-west-1c
Spot:
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
name: spot
labels:
kops.k8s.io/cluster: XXXXXX
spec:
rootVolumeSize: 100
rootVolumeType: gp2
machineType: m3.2xlarge
maxPrice: "0.5"
maxSize: 50
minSize: 0
role: Node
subnets:
- eu-west-1a
- eu-west-1b
- eu-west-1c
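The creation step looked roughly like this (a sketch with the same placeholders; file names are examples, and since kops create -f does not register the SSH key, that is added separately):
```
export KOPS_STATE_STORE=s3://XXXXXX

# Register the edited cluster spec and the two node instance groups from files.
kops create -f cluster.yaml
kops create -f on-demand-ig.yaml
kops create -f spot-ig.yaml

# When creating from manifests the SSH public key has to be added explicitly.
kops create secret --name XXXXXX sshpublickey admin -i /path/to/key.pub

# Build the AWS resources (ASGs, launch configurations, etc.).
kops update cluster XXXXXX --yes
```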
What happened after the commands executed?
The cluster came up as usual, but after setting the desired count to X on the AWS Auto Scaling Groups, the on-demand nodes registered successfully while the spot nodes failed to register.
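For reference, the scaling step was done directly on the ASGs, roughly like this (a sketch; kops names node ASGs <instancegroup>.<clustername>, and the desired count here is an arbitrary example):
```
DESIRED=5   # arbitrary example value

aws autoscaling set-desired-capacity --auto-scaling-group-name "on-demand.XXXXXX" --desired-capacity "$DESIRED"
aws autoscaling set-desired-capacity --auto-scaling-group-name "spot.XXXXXX"      --desired-capacity "$DESIRED"

# Only the on-demand instances ever show up here; the spot instances never register.
kubectl get nodes
```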
What did you expect to happen?
All nodes registering successfully
Please provide your cluster manifest.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
creationTimestamp: 2017-12-19T15:32:27Z
name: XXXXXX
spec:
additionalPolicies:
master: |-
XXXXXX
node: |-
XXXXXX
api:
dns: {}
authorization:
alwaysAllow: {}
channel: stable
cloudLabels:
XXX: XXX
XXX: XXX
cloudProvider: aws
configBase: s3://XXXXXX
etcdClusters:
- etcdMembers:
- encryptedVolume: true
instanceGroup: master-eu-west-1a
name: a
name: main
- etcdMembers:
- encryptedVolume: true
instanceGroup: master-eu-west-1a
name: a
name: events
iam:
allowContainerRegistry: true
legacy: false
kubeAPIServer:
runtimeConfig:
autoscaling/v2beta1: "true"
batch/v1beta1: "true"
kubeControllerManager:
horizontalPodAutoscalerDownscaleDelay: 5m0s
horizontalPodAutoscalerSyncPeriod: 15s
horizontalPodAutoscalerUpscaleDelay: 2m0s
horizontalPodAutoscalerUseRestClients: true
kubelet:
enableCustomMetrics: true
kubernetesApiAccess:
- 0.0.0.0/0
kubernetesVersion: 1.8.4
masterPublicName: api.XXXXXX
networkCIDR: 10.0.0.0/16
networkID: XXXXXX
networking:
calico: {}
nonMasqueradeCIDR: 100.64.0.0/10
sshAccess:
- 0.0.0.0/0
subnets:
- cidr: 10.0.32.0/19
name: eu-west-1a
type: Public
zone: eu-west-1a
- cidr: 10.0.64.0/19
name: eu-west-1b
type: Public
zone: eu-west-1b
- cidr: 10.0.96.0/19
name: eu-west-1c
type: Public
zone: eu-west-1c
topology:
dns:
type: Public
masters: public
nodes: public
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: 2017-12-19T15:32:28Z
labels:
kops.k8s.io/cluster: XXXXXX
name: master-eu-west-1a
spec:
image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02
machineType: m3.medium
maxSize: 1
minSize: 1
nodeLabels:
kops.k8s.io/instancegroup: master-eu-west-1a
role: Master
subnets:
- eu-west-1a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: 2017-12-19T15:32:29Z
labels:
kops.k8s.io/cluster: XXXXXX
name: on-demand
spec:
image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02
machineType: c4.2xlarge
maxSize: 50
minSize: 0
role: Node
rootVolumeSize: 100
rootVolumeType: gp2
subnets:
- eu-west-1a
- eu-west-1b
- eu-west-1c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: 2017-12-19T15:32:30Z
labels:
kops.k8s.io/cluster: XXXXXX
name: spot
spec:
image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02
machineType: m3.2xlarge
maxPrice: "0.5"
maxSize: 50
minSize: 0
role: Node
rootVolumeSize: 100
rootVolumeType: gp2
subnets:
- eu-west-1a
- eu-west-1b
- eu-west-1c
Anything else we need to know?
Looking at the kubelet logs on the failing nodes I see these errors:
Unable to update cni config: No networks found in /etc/cni/net.d/
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
I can see the /etc/cni/net.d/ directory is empty, which is the cause of this.
I've also tried with different instance types but it still fails to register the spot nodes.
I'm not 100% sure, but the spot and on-demand instance groups should end up with essentially the same Launch Configuration (apart from the spot price), so I don't see what could be causing the difference.
Can you check if this is the same as issue #4028? Looking at the journal entries for kubelet startup ("journalctl -u kubelet.service", search for a line like "Tag "KubernetesCluster" nor "kubernetes.io/cluster/..." not found; Kubernetes may behave unexpectedly.") should make this clear fairly quickly.
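For example, something like this on one of the failing spot instances (rough sketch):
```
# Look for the missing-tag warning in the kubelet journal.
journalctl -u kubelet.service --no-pager | grep -i KubernetesCluster

# Compare with what EC2 actually reports for the instance at that moment.
aws ec2 describe-tags --region eu-west-1 \
  --filters "Name=resource-id,Values=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
```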
@vainu-arto That's it:
Dec 20 12:11:35 ip-10-0-42-5 kubelet[1295]: E1220 12:11:35.605272 1295 tags.go:94] Tag "KubernetesCluster" nor "kubernetes.io/cluster/..." not found; Kubernetes may behave unexpectedly.
In my case, this only happens with spot instances.
Actually, doing systemctl restart kubelet on the instance fixes the issue and the node is registered immediately. I guess the kubelet service is starting before the instance is tagged?
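For the record, the manual fix is just this on the affected instance, and the node then shows up within seconds:
```
# On the spot instance that failed to register:
sudo systemctl restart kubelet

# Back on the workstation:
kubectl get nodes -o wide
```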
I have the issue reported upstream here: https://github.com/kubernetes/kubernetes/issues/57382
If you find any additional information that could help with solving it, go ahead and add it there.
I have the same issue running with flannel networking, but I don't think this is a networking issue.
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5", GitCommit:"cce11c6a185279d037023e02ac5249e14daa22bf", GitTreeState:"clean", BuildDate:"2017-12-07T18:09:00Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T05:17:43Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
I checked the kube-controller-manager logs, and I think the errors below are caused by the kubelet on the node not finding the KubernetesCluster or kubernetes.io/cluster/... tag (as mentioned in kubernetes/kubernetes#57382), so no PodCIDR ends up set on the target node because the node never becomes healthy.
E1221 21:20:37.004437 6 actual_state_of_world.go:483] Failed to set statusUpdateNeeded to needed true because nodeName="ip-10-200-76-95.ec2.internal" does not exist
E1221 21:20:37.004449 6 actual_state_of_world.go:497] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true because nodeName="ip-10-200-76-95.ec2.internal" does not exist
I1221 21:20:37.027012 6 ttl_controller.go:271] Changed ttl annotation for node ip-10-200-76-95.ec2.internal to 0 seconds
I1221 21:20:37.028326 6 range_allocator.go:168] Node ip-10-200-76-95.ec2.internal is already in a process of CIDR assignment.
E1221 21:20:37.039252 6 range_allocator.go:252] Failed to update node ip-10-200-76-95.ec2.internal PodCIDR to 100.96.27.0/24 (9 retries left): Operation cannot be fulfilled on nodes "ip-10-200-76-95.ec2.internal": the object has been modified; please apply your changes to the latest version and try again
I1221 21:20:37.057993 6 range_allocator.go:249] Set node ip-10-200-76-95.ec2.internal PodCIDR to 100.96.27.0/24
I1221 21:20:39.769836 6 node_controller.go:585] Controller observed a new Node: "ip-10-200-76-95.ec2.internal"
I1221 21:20:39.769863 6 controller_utils.go:237] Recording Registered Node ip-10-200-76-95.ec2.internal in Controller event message for node ip-10-200-76-95.ec2.internal
W1221 21:20:39.769995 6 node_controller.go:916] Missing timestamp for Node ip-10-200-76-95.ec2.internal. Assuming now as a timestamp.
I1221 21:20:39.770451 6 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-200-76-95.ec2.internal", UID:"c9999cce-e694-11e7-a26e-0a40f934faf2", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node ip-10-200-76-95.ec2.internal event: Registered Node ip-10-200-76-95.ec2.internal in Controller
I1221 21:20:39.912185 6 node_controller.go:721] Deleting node (no longer present in cloud provider): ip-10-200-76-95.ec2.internal
I1221 21:20:39.912205 6 controller_utils.go:237] Recording Deleting Node ip-10-200-76-95.ec2.internal because it's not present according to cloud provider event message for node ip-10-200-76-95.ec2.internal
I1221 21:20:39.912759 6 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-200-76-95.ec2.internal", UID:"c9999cce-e694-11e7-a26e-0a40f934faf2", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'DeletingNode' Node ip-10-200-76-95.ec2.internal event: Deleting Node ip-10-200-76-95.ec2.internal because it's not present according to cloud provider
I1221 21:20:44.912610 6 node_controller.go:597] Controller observed a Node deletion: ip-10-200-76-95.ec2.internal
I1221 21:20:44.912635 6 controller_utils.go:237] Recording Removing Node ip-10-200-76-95.ec2.internal from Controller event message for node ip-10-200-76-95.ec2.internal
I1221 21:20:44.913045 6 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-200-76-95.ec2.internal", UID:"c9999cce-e694-11e7-a26e-0a40f934faf2", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RemovingNode' Node ip-10-200-76-95.ec2.internal event: Removing Node ip-10-200-76-95.ec2.internal from Controller
The error below is probably caused by the one above: no networks are found in /etc/cni/net.d/ because kube-controller-manager couldn't apply the PodCIDR to the node.
Dec 21 21:30:46 ip-10-200-76-95 kubelet[1224]: W1221 21:30:46.345153 1224 cni.go:196] Unable to update cni config: No networks found in /etc/cni/net.d/
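For reference, this is roughly how I pulled those logs (a sketch; on kops masters the components should also log to files under /var/log):
```
# From a workstation: find the controller-manager pod and filter its logs for the node.
CM_POD=$(kubectl -n kube-system get pods -o name | grep kube-controller-manager | head -n1)
kubectl -n kube-system logs "$CM_POD" | grep ip-10-200-76-95

# Or directly on the master:
grep ip-10-200-76-95 /var/log/kube-controller-manager.log
```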
I ended up with a workaround until the kubelet issue is resolved. Add these hooks to the cluster spec:
```
spec:
  hooks:
  - before:
    manifest: |
      Type=oneshot
      ExecStart=/bin/bash -c "apt-get update && apt-get install -y jq"
  - before:
    manifest: |
      Type=oneshot
      ExecStart=/bin/bash -c "while [[ $(aws --region $(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region) ec2 describe-tags --filters "Name=resource-id,Values=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)" "Name=key,Values=KubernetesCluster" | wc -l) -lt 4 ]]; do sleep 10; done"
```
It just makes the instance wait until the KubernetesCluster tag is present.
duplicated #3605
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
@david92rl to make your hook work I had to change it:
ExecStart=/bin/bash -c "while [[ $(aws --region $(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region) ec2 describe-tags --filters "Name=resource-id,Values=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)" "Name=key,Values=KubernetesCluster" | wc -l) -lt 4 ]]; do sleep 10; done"
to
ExecStart=/bin/bash -c 'while [[ $(aws --region $(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region) ec2 describe-tags --filters "Name=resource-id,Values=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)" "Name=key,Values=KubernetesCluster" | wc -l) -lt 4 ]]; do sleep 10; done'
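The original line breaks because the inner double quotes around the --filters arguments terminate the outer double-quoted string in the unit file; single-quoting the whole command keeps them intact. For anyone who prefers it readable, a rough standalone equivalent of the check (just a sketch, not the exact hook):
```
#!/bin/bash
# Wait until EC2 reports the KubernetesCluster tag on this instance.
region=$(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

until aws --region "$region" ec2 describe-tags \
        --filters "Name=resource-id,Values=${instance_id}" "Name=key,Values=KubernetesCluster" \
        --output text | grep -q KubernetesCluster; do
  sleep 10
done
```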
@kayano This workaround should not be needed anymore; the bug has been fixed in kubelet in the current releases of 1.8, 1.9, and 1.10.
/close