What happened:
Error:
Ready False Wed, 04 Nov 2020 10:56:25 +0000 Wed, 04 Nov 2020 10:48:23 +0000 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Yesterday evening I set ASG to zero.
This morning I set ASG to 4.
kubectl get nodes reports nodes as NotReady
kubectl describe node REDACTED
Name: REDACTED
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.large
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=REDACTED
failure-domain.beta.kubernetes.io/zone=REDACTED
kubernetes.io/arch=amd64
kubernetes.io/hostname=REDACTED
kubernetes.io/os=linux
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 04 Nov 2020 10:48:23 +0000
Taints: node.kubernetes.io/not-ready:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 04 Nov 2020 10:56:25 +0000 Wed, 04 Nov 2020 10:48:23 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 04 Nov 2020 10:56:25 +0000 Wed, 04 Nov 2020 10:48:23 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 04 Nov 2020 10:56:25 +0000 Wed, 04 Nov 2020 10:48:23 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Wed, 04 Nov 2020 10:56:25 +0000 Wed, 04 Nov 2020 10:48:23 +0000 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Addresses:
InternalIP: REDACTED
ExternalIP: REDACTED
Hostname: REDACTED.compute.internal
InternalDNS: REDACTED.compute.internal
ExternalDNS: REDACTED.compute.amazonaws.com
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8063660Ki
pods: 35
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 1930m
ephemeral-storage: 18242267924
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7305900Ki
pods: 35
System Info:
Machine ID: REDACTED
System UUID: REDACTED
Boot ID: REDACTED
Kernel Version: 4.14.198-152.320.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.6
Kubelet Version: v1.15.11-eks-bf8eea
Kube-Proxy Version: v1.15.11-eks-bf8eea
ProviderID: aws:///REDACTED/i-REDACTED
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 8m48s kubelet, REDACTED.compute.internal Starting kubelet.
Normal NodeHasSufficientMemory 8m48s (x2 over 8m48s) kubelet, REDACTED.compute.internal Node REDACTED.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 8m48s (x2 over 8m48s) kubelet, REDACTED.compute.internal Node REDACTED.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 8m48s (x2 over 8m48s) kubelet, REDACTED.compute.internal Node REDACTED.compute.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 8m48s kubelet, REDACTED.compute.internal Updated Node Allocatable limit across pods
CNI was running: amazon-k8s-cni:v1.6.3
After seeing this error I upgraded it to: amazon-k8s-cni-init:v1.7.5 amazon-k8s-cni:v1.7.5
CNI Log attached. eks_i-REDACTED_2020-11-04_1111-UTC_0.6.2_REDACTED.zip
What you expected to happen: Nodes to join EKS cluster
How to reproduce it (as minimally and precisely as possible): That's difficult to answer.
Anything else we need to know?:
amazon-eks-node-1.15-v20201007 ami-0af730da10ac8b0b7 and amazon-eks-node-1.15-v20200814 ami-04cc6ec46d6dbc4fa
Environment:
kubectl version:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-16T00:04:31Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-065dce", GitCommit:"065dcecfcd2a91bd68a17ee0b5e895088430bd05", GitTreeState:"clean", BuildDate:"2020-07-16T01:44:47Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
amazon-k8s-cni:v1.6.3
amazon-k8s-cni-init:v1.7.5
amazon-k8s-cni:v1.7.5
cat /etc/os-release:
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
uname -a:
Linux REDACTED.compute.internal 4.14.198-152.320.amzn2.x86_64 #1 SMP Wed Sep 23 23:57:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Now I'm thinking that Kubeflow might have something to do with it: https://github.com/kubeflow/kubeflow/issues/5247.
I logged this with Kubeflow https://github.com/kubeflow/kubeflow/issues/5381
Hi @tretos53
I see that the IPAMD logs were not created, so most likely IPAMD hasn't even started. The 10-aws.conflist config file (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/entrypoint.sh#L114) is also missing from the logs, so it looks like even copying the config file failed. Can you please open a support case so that we can debug this further? Can you also provide the kubectl logs for aws-node so we can verify whether IPAMD started?
Thank you!
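To confirm on a node whether any CNI config has been written, one option is to list the directory that kubelet scans. Below is a minimal Python sketch, not part of the CNI tooling. It assumes the default /etc/cni/net.d directory and assumes that files ending in .conf, .conflist, or .json count as network configs; the function name find_cni_configs is hypothetical.

```python
from pathlib import Path

# Extensions assumed to be treated as CNI network configs (an assumption;
# verify against the kubelet version you are running).
CNI_CONFIG_SUFFIXES = {".conf", ".conflist", ".json"}


def find_cni_configs(conf_dir="/etc/cni/net.d"):
    """Return the sorted names of CNI config files in conf_dir, or [] if none."""
    directory = Path(conf_dir)
    if not directory.is_dir():
        # A missing directory is reported the same way as an empty one.
        return []
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.suffix in CNI_CONFIG_SUFFIXES
    )
```

Running find_cni_configs() on an affected node and getting an empty list would be consistent with aws-node never having copied 10-aws.conflist into place.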
Hi,
None of the pods start so I can't get any logs. I only attached one node to make things simple but all the nodes fail to get Ready.
I will try to open a support call.
kubectl get pods -A
No resources found
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-xx-x-x-xx.x-xxxx-x.compute.internal NotReady <none> 22h v1.15.11-eks-bf8eea
I am also experiencing this issue without Kubeflow. Notably, this is happening on new NodeGroups in a cluster, and when trying to create an entirely new cluster via eksctl. While I found the same log message in our cluster logs (network plugin is not ready: cni config uninitialized), I also found another message which may provide more insight: network plugin is not ready: cni config uninitialized, CSINode is not yet initialized, missing node capacity for resources: ephemeral-storage. This is interesting because the new nodegroups were created using the same configuration as all our other nodegroups. I have also opened a support ticket with AWS.
Hi @jayanthvn,
Can I do anything to fix this? I only have billing support, no technical support.
Hi @tretos53
Can you please share the kubectl logs of aws-node (kubectl logs aws-node-9hrfc -n kube-system)? From the logs you shared, I see that neither 10-aws.conflist nor ipamd.log was created, and kubelet seems to be complaining that 10-aws.conflist cannot be found. The kubectl logs should show whether IPAMD failed to start.
Nov 04 10:48:24 REDACTED.compute.internal kubelet[4073]: I1104 10:48:24.091283 4073 reconciler.go:150] Reconciler: start to sync state
Nov 04 10:48:28 REDACTED.compute.internal kubelet[4073]: W1104 10:48:28.723750 4073 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Nov 04 10:48:28 REDACTED.compute.internal kubelet[4073]: E1104 10:48:28.967433 4073 kubelet.go:2179] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Nov 04 10:48:33 REDACTED.compute.internal kubelet[4073]: W1104 10:48:33.723969 4073 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Nov 04 10:48:33 REDACTED.compute.internal kubelet[4073]: E1104 10:48:33.979992 4073 kubelet.go:2179] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Nov 04 10:48:38 REDACTED.compute.internal kubelet[4073]: W1104 10:48:38.724205 4073 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Nov 04 10:48:38 REDACTED.compute.internal kubelet[4073]: E1104 10:48:38.990025 4073 kubelet.go:2179] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Nov 04 10:48:43 REDACTED.compute.internal kubelet[4073]: W1104 10:48:43.724379 4073 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
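To pull only the CNI-related lines out of a longer kubelet journal like the excerpt above, a small filter can help. This is a rough Python sketch that assumes each line carries a klog-style payload (severity letter, date, time, PID, source file and line number, then the message), as the lines above do; the names KLOG_LINE and cni_problems are made up for illustration.

```python
import re

# Matches the klog-style part of a journal line, for example:
# "W1104 10:48:28.723750 4073 cni.go:213] Unable to update cni config: ..."
KLOG_LINE = re.compile(
    r"(?P<severity>[IWEF])\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ "
    r"(?P<source>[\w.]+:\d+)\] (?P<message>.*)$"
)


def cni_problems(log_lines):
    """Yield (severity, source, message) for kubelet lines mentioning the CNI config."""
    for line in log_lines:
        match = KLOG_LINE.search(line)
        if match and "cni config" in match.group("message").lower():
            yield (
                match.group("severity"),
                match.group("source"),
                match.group("message"),
            )
```

Feeding it the output of journalctl for the kubelet unit, split into lines, would leave just the repeating cni.go warnings and kubelet.go errors, which makes it easier to see whether the config ever appears.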
Update: the source of my issue was that eksctl failed to assign any of the default permissions to the nodegroup's IAM role, which it previously had done. It does not sound like our issues are related anymore.
Hi @jayanthvn
No pods are running as all nodes are NotReady. I can't get any logs. I can enable logs on EKS cluster. Will that help?
➜ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-REDACTED. REDACTED.compute.internal NotReady
ip-REDACTED. REDACTED.compute.internal NotReady
ip-REDACTED. REDACTED.compute.internal NotReady
ip-REDACTED. REDACTED.compute.internal NotReady
ip-REDACTED. REDACTED.compute.internal NotReady
➜ kubectl get pods -A
No resources found
Hi @tretos53
Can you please email me (address redacted) your cluster ARN? I can see if I can get any logs. Yes, enabling logs on the EKS cluster will definitely help.
Thank you.
In case this helps anyone, we had a similar problem launching new clusters when aws-node began using the new CNI v1.7.5 (some time late last week). We use our own pod security policies, and it seems 1.7.5 requires the NET_ADMIN capability. We didn't need this with CNI 1.6.3.
@gillbee You're right. Starting with v1.7.*, we removed privileged: true and set securityContext to add only the NET_ADMIN capability (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/config/v1.7/aws-k8s-cni.yaml#L177-L180), which wasn't the case with v1.6.* (https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.6.3/config/v1.6/aws-k8s-cni.yaml#L132-L133). This makes the aws-node pod run with fewer privileges than before.
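For clusters with custom pod security policies, a quick check of whether a given policy would let a pod add NET_ADMIN might look like the following. This is a simplified Python sketch under stated assumptions: it looks only at the allowedCapabilities, defaultAddCapabilities, and requiredDropCapabilities fields of a policy's spec, ignores everything else the admission controller evaluates, and the function name psp_allows_net_admin is hypothetical.

```python
def psp_allows_net_admin(psp_spec):
    """Return True if a PodSecurityPolicy spec (as a dict) appears to permit adding NET_ADMIN.

    Simplified: only the three capability fields are considered.
    """
    allowed = set(psp_spec.get("allowedCapabilities") or [])
    default_add = set(psp_spec.get("defaultAddCapabilities") or [])
    required_drop = set(psp_spec.get("requiredDropCapabilities") or [])

    # A capability that must be dropped cannot also be added.
    if "NET_ADMIN" in required_drop:
        return False
    # "*" in allowedCapabilities permits every capability.
    return "*" in allowed or "NET_ADMIN" in (allowed | default_add)
```

The spec dict could come from the JSON output of kubectl get psp for the policy bound to the aws-node service account; a False result would match the behaviour described above for v1.7.5.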
Hi @tretos53
I was able to check your cluster from the ARN you provided and none of the pods [aws-node, kube-proxy, core-dns] are running as you mentioned. This doesn't look like a CNI issue, I am following up with internal team. Will update you once I get some info.
Thanks.
Hi @tretos53
The daemon set failed to create its pods on your cluster:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 3m (x1353 over 15d) daemonset-controller Error creating: Internal error occurred:
failed calling webhook "inferenceservice.kfserving-webhook-server.pod-mutator":
Post https://kfserving-webhook-server-service.kubeflow.svc:443/mutate-pods?
timeout=30s: no endpoints available for service "kfserving-webhook-server-service"
Hi,
OK, this is a Kubeflow daemon set; that doesn't explain why none of the nodes can join the cluster.
If anything, the nodes not being Ready is probably the reason why this daemon set can't be created.
All nodes are in the NotReady state with the error below:
Ready False Thu, 19 Nov 2020 16:38:40 +0000 Mon, 09 Nov 2020 08:58:49 +0000 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Hi @tretos53
The error is from the aws-node DS. Looks like there is a mutating webhook which is not available. Can you please try deleting the webhook? Nodes have joined the cluster but they won't be ready until CNI is ready.
kubectl describe ds/aws-node -n kube-system
Name: aws-node
Selector: k8s-app=aws-node
Node-Selector: <none>
Labels: k8s-app=aws-node
Annotations: deprecated.daemonset.template.generation: 2
kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"k8s-app":"aws-node"},"name":"aws-node","namespace":"kub...
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: k8s-app=aws-node
Service Account: aws-node
Init Containers:
aws-vpc-cni-init:
Image: 602401143452.dkr.ecr.eu-west-2.amazonaws.com/amazon-k8s-cni-init:v1.7.5
Port: <none>
Host Port: <none>
Environment:
DISABLE_TCP_EARLY_DEMUX: false
Mounts:
/host/opt/cni/bin from cni-bin-dir (rw)
Containers:
aws-node:
Image: 602401143452.dkr.ecr.eu-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5
Port: 61678/TCP
Host Port: 61678/TCP
Requests:
cpu: 10m
Liveness: exec [/app/grpc-health-probe -addr=:50051] delay=60s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [/app/grpc-health-probe -addr=:50051] delay=1s timeout=1s period=10s #success=1 #failure=3
Environment:
ADDITIONAL_ENI_TAGS: {}
AWS_VPC_CNI_NODE_PORT_SUPPORT: true
AWS_VPC_ENI_MTU: 9001
AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER: false
AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: false
AWS_VPC_K8S_CNI_EXTERNALSNAT: false
AWS_VPC_K8S_CNI_LOGLEVEL: DEBUG
AWS_VPC_K8S_CNI_LOG_FILE: /host/var/log/aws-routed-eni/ipamd.log
AWS_VPC_K8S_CNI_RANDOMIZESNAT: prng
AWS_VPC_K8S_CNI_VETHPREFIX: eni
AWS_VPC_K8S_PLUGIN_LOG_FILE: /var/log/aws-routed-eni/plugin.log
AWS_VPC_K8S_PLUGIN_LOG_LEVEL: DEBUG
DISABLE_INTROSPECTION: false
DISABLE_METRICS: false
ENABLE_POD_ENI: false
MY_NODE_NAME: (v1:spec.nodeName)
WARM_ENI_TARGET: 1
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/host/var/log/aws-routed-eni from log-dir (rw)
/run/xtables.lock from xtables-lock (rw)
/var/run/aws-node from run-dir (rw)
/var/run/dockershim.sock from dockershim (rw)
Volumes:
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
dockershim:
Type: HostPath (bare host directory volume)
Path: /var/run/dockershim.sock
HostPathType:
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType:
log-dir:
Type: HostPath (bare host directory volume)
Path: /var/log/aws-routed-eni
HostPathType: DirectoryOrCreate
run-dir:
Type: HostPath (bare host directory volume)
Path: /var/run/aws-node
HostPathType: DirectoryOrCreate
Priority Class Name: system-node-critical
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 3m (x1353 over 15d) daemonset-controller Error creating: Internal error occurred: failed calling webhook "inferenceservice.kfserving-webhook-server.pod-mutator": Post https://kfserving-webhook-server-service.kubeflow.svc:443/mutate-pods?timeout=30s: no endpoints available for service "kfserving-webhook-server-service"
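The FailedCreate event above names both the mutating webhook and the service it could not reach. To extract the two names from such an event message programmatically, a sketch along these lines could be used; it assumes the message wording shown above, and WEBHOOK_FAILURE and blocking_webhook are illustrative names.

```python
import re

# Captures the webhook name and the service that has no endpoints.
# DOTALL lets the pattern span messages that are wrapped over several lines.
WEBHOOK_FAILURE = re.compile(
    r'failed calling webhook "(?P<webhook>[^"]+)".*?'
    r'no endpoints available for service "(?P<service>[^"]+)"',
    re.DOTALL,
)


def blocking_webhook(event_message):
    """Return (webhook_name, service_name) if the event reports an unreachable webhook, else None."""
    match = WEBHOOK_FAILURE.search(event_message)
    if match is None:
        return None
    return match.group("webhook"), match.group("service")
```

With the webhook name in hand, kubectl get mutatingwebhookconfigurations lists the configurations to inspect. The circular dependency is visible here: the webhook's backing pods cannot run until a node is Ready, and no node becomes Ready until the aws-node pods, which the webhook is blocking, are created.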
Thank you. I'll check.
Adam
Hi there,
We have the same issue with our brand new Private EKS cluster (v 1.18)
A node does not reach the Ready state due to
runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
The aws-node pod is in the Running state, but it is constantly restarted every 2-3 minutes with the following errors:
Successfully assigned kube-system/aws-node-85279 to ip-10-98-77-41.ec2.internal
Pulling image "602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1"
Successfully pulled image "602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1"
Created container aws-vpc-cni-init
Started container aws-vpc-cni-init
Successfully pulled image "602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1"
Pulling image "602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1"
Created container aws-node
Started container aws-node
Readiness probe failed: {"level":"info","ts":"2020-11-21T10:29:30.590Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Readiness probe failed: {"level":"info","ts":"2020-11-21T10:29:40.602Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Readiness probe failed: {"level":"info","ts":"2020-11-21T10:29:50.591Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Readiness probe failed: {"level":"info","ts":"2020-11-21T10:30:00.597Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Container logs for amazon-k8s-cni
{"level":"info","ts":"2020-11-21T10:29:25.981Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2020-11-21T10:29:25.998Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2020-11-21T10:29:26.000Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
Kubelet logs show that:
Nov 21 10:34:38 ip-10-98-77-41.ec2.internal kubelet[3820]: W1121 10:34:38.585425 3820 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
Nov 21 10:34:40 ip-10-98-77-41.ec2.internal kubelet[3820]: E1121 10:34:40.142766 3820 kubelet.go:2195] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Here is the log collection:
eks_i-0b47c308f2650abb6_2020-11-21_1032-UTC_0.6.2.tar.gz
I've found that if I manually create the file _/etc/cni/net.d/10-aws.conflist_
with the following config:
{
"cniVersion": "0.3.1",
"name": "aws-cni",
"plugins": [
{
"name": "aws-cni",
"type": "aws-cni",
"vethPrefix": "eni",
"mtu": "9001",
"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
"pluginLogLevel": "Debug"
},
{
"type": "portmap",
"capabilities": {"portMappings": true},
"snat": true
}
]
}
The node immediately goes UP.
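When writing this file by hand, it is easy to introduce a JSON typo that would leave kubelet in the same state. A small structural check can be sketched as follows; it assumes only the basic shape of a CNI network configuration list (string cniVersion and name, plus a non-empty plugins list whose entries each have a type) and does not validate plugin-specific fields. The function name check_conflist is hypothetical.

```python
import json


def check_conflist(text):
    """Return a list of structural problems found in a CNI conflist; [] means none found."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]

    problems = []
    for key in ("cniVersion", "name"):
        if not isinstance(config.get(key), str) or not config[key]:
            problems.append(f"missing or empty string field: {key}")

    plugins = config.get("plugins")
    if not isinstance(plugins, list) or not plugins:
        problems.append("plugins must be a non-empty list")
        return problems
    for index, plugin in enumerate(plugins):
        if not isinstance(plugin, dict) or not plugin.get("type"):
            problems.append(f"plugins[{index}] has no type")
    return problems
```

Passing it the contents of /etc/cni/net.d/10-aws.conflist before relying on the file gives a quick sanity check. Note that creating the file manually only works around the symptom; it does not explain why aws-node failed to write it.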
Why is this file not created automatically?
At first I thought it was related to the Custom CNI settings, but I have now created a new cluster with just three subnets and
have done nothing related to Custom CNI networking (no changes to the _aws-node_ DaemonSet).
I have found the reason for my failure.
For anyone who faces the same issue:
I used _eksctl_, and in the input file I had iam.withOIDC=true.
After I recreated the cluster without this setting, everything started to work correctly.
According to https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-cni-walkthrough.html,
it seems that AWS_ROLE_ARN must be added to the aws-node daemonset.
However, I have not had a chance to check this, since it is not required in my case at this point.
I deleted Kubeflow. That fixed it.
Will probably break again when I deploy Kubeflow again...