We've been plaqued for a while (at least through 1.5.x, maybe even before) with the CNI dying (not sure how) and never coming back up. This sort of issue causes a node to continue accepting workloads of which can't provide any value because networking is broken.
We also have #1054 open presently which represents successful restarts and trying to understand why we're restarting in the first place.
For one of the nodes which this happened, we cordoned and blocked CA from touching it for observation. Any information following relates to the aws-node instance running on that node, aws-node-mq5bk.
We had case 7079142701 open with AWS where they suggested #1019 (issue #1011) would fix the problem but couldn't assist beyond that. We've also reached out to our TAM to setup a meeting with your CNI team and he's failed to respond in over a week.
Therefore, I'm writing this ticket to agglomerate available information from what I'm seeing because I don't quite think this problem is like the others (see list at bottom). Please @mogren, you're my only hope!
I'd really like to verify with engineers that our problem is indeed solved with #1019. We've been told before by support to "just upgradeâ„¢" before and still saw issues.
Logs K8s provides (kubectl logs):
Starting IPAM daemon in the background ... ok.
ERROR: logging before flag.Parse: E0624 20:52:52.156133 6 memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1: <nil>
Checking for IPAM connectivity ... failed.
Timed out waiting for IPAM daemon to start:
Support Person found errors in logs (provided via eks-logs-collector.sh)
(ipamd.log):
- {"level":"error","ts":"2020-06-04T13:37:58.660Z","caller":"ipamd/ipamd.go:962","msg":"Failed to retrieve interfaces data from instance metadata RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: dial tcp 169.254.169.254:80: connect: connection refused"}
- {"level":"error","ts":"2020-06-04T19:07:03.780Z","caller":"aws-k8s-agent/main.go:30","msg":"Initialization failure: ipamd init: failed to retrieve attached ENIs info"}
(plugin.log):
- {"level":"error","ts":"2020-06-09T04:16:03.253Z","caller":"routed-eni-cni-plugin/cni.go:252","msg":"Error received from DelNetwork gRPC call for pod edge-platform-gateway-66f6864f48-5wbvn namespace default sandbox c7bda6671561a7bfa978017c7f66b440df7991a2efe15b8ebc2ff3bf9c6a26b8: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused\""}
A unique readiness probe from what I think is around the time the container crashed and ceased to come back (~Jun 3/4):
Events from the Pod kube-system/aws-node-mq5bk
1 Unhealthy: Readiness probe failed: runtime: failed to create new OS thread (have 8 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime stack:
runtime.throw(0x5653e61c1951, 0x9)
/usr/local/go/src/runtime/panic.go:774 +0x74
runtime.newosproc(0xc000111500)
/usr/local/go/src/runtime/os_linux.go:153 +0x1bc
runtime.newm1(0xc000111500)
/usr/local/go/src/runtime/proc.go:1853 +0xde
runtime.newm(0x0, 0xc000030500)
/usr/local/go/src/runtime/proc.go:1832 +0x93
runtime.startm(0xc000030500, 0xc000111100)
/usr/local/go/src/runtime/proc.go:1969 +0x12c
runtime.handoffp(0xc000030500)
/usr/local/go/src/runtime/proc.go:1996 +0x54
runtime.entersyscallblock_handoff()
/usr/local/go/src/runtime/proc.go:2937 +0x32
runtime.systemstack(0x0)
/usr/local/go/src/runtime/asm_amd64.s:370 +0x63
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1146
goroutine 23 [syscall]:
runtime.notetsleepg(0x5653e6964660, 0x3b5d979a, 0x0)
/usr/local/go/src/runtime/lock_futex.go:227 +0x38 fp=0xc00004cf60 sp=0xc00004cf30 pc=0x5653e5da1658
runtime.timerproc(0x5653e6964640)
/usr/local/go/src/runtime/time.go:311 +0x2f5 fp=0xc00004cfd8 sp=0xc00004cf60 pc=0x5653e5de2565
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00004cfe0 sp=0xc00004cfd8 pc=0x5653e5df1821
created by runtime.(*timersBucket).addtimerLocked
/usr/local/go/src/runtime/time.go:169 +0x110
goroutine 1 [select]:
google.golang.org/grpc.(*ClientConn).WaitForStateChange(0xc000110e00, 0x5653e6643e60, 0xc00008e840, 0x0, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:502 +0x103
google.golang.org/grpc.DialContext(0x5653e6643e60, 0xc00008e840, 0x7ffff42f3510, 0x6, 0xc000090800, 0x3, 0x4, 0x0, 0x0, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:322 +0xca2
main.main()
/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/grpc-health-probe/main.go:99 +0x344
goroutine 20 [syscall]:
os/signal.signal_recv(0x0)
/usr/local/go/src/runtime/sigqueue.go:147 +0x9e
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:23 +0x24
created by os/signal.init.0
/usr/local/go/src/os/signal/signal_unix.go:29 +0x43
goroutine 22 [chan receive]:
main.main.func1(0xc00008e780, 0xc000095980)
/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/grpc-health-probe/main.go:79 +0x43
created by main.main
/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/grpc-health-probe/main.go:78 +0x11d
goroutine 24 [runnable]:
google.golang.org/grpc.(*ccBalancerWrapper).watcher(0xc000090940)
/go/pkg/mod/google.golang.org/[email protected]/balancer_conn_wrappers.go:67
created by google.golang.org/grpc.newCCBalancerWrapper
/go/pkg/mod/google.golang.org/[email protected]/balancer_conn_wrappers.go:60 +0x16f
goroutine 25 [chan receive]:
net.(netFD).connect.func1(0xc000092900, 0xc000092960, 0xc0000829b0, 0xc000104380)
/usr/local/go/src/net/fd_unix.go:117 +0x51
net.(netFD).connect(0xc000104380, 0x5653e6643e60, 0xc00008ea20, 0x0, 0x0, 0x5653e663b8a0, 0xc0000b66c0, 0x5653e663b8a0, 0xc0000261c0, 0x0, ...)
/usr/local/go/src/net/fd_unix.go:173 +0x42e
net.(netFD).dial(0xc000104380, 0x5653e6643e60, 0xc00008ea20, 0x5653e6646160, 0x0, 0x5653e6646160, 0xc00011e6f0, 0x0, 0x0, 0x0)
/usr/local/go/src/net/sock_posix.go:149 +0x101
net.socket(0x5653e6643e60, 0xc00008ea20, 0x5653e61bd657, 0x3, 0x2, 0x1, 0x0, 0x0, 0x5653e6646160, 0x0, ...)
/usr/local/go/src/net/sock_posix.go:70 +0x1c2
net.internetSocket(0x5653e6643e60, 0xc00008ea20, 0x5653e61bd657, 0x3, 0x5653e6646160, 0x0, 0x5653e6646160, 0xc00011e6f0, 0x1, 0x0, ...)
/usr/local/go/src/net/ipsock_posix.go:141 +0x143
net.(sysDialer).doDialTCP(0xc000104300, 0x5653e6643e60, 0xc00008ea20, 0x0, 0xc00011e6f0, 0x5653e659e440, 0x5653e697ce80, 0x0)
/usr/local/go/src/net/tcpsock_posix.go:65 +0xc4
net.(sysDialer).dialTCP(0xc000104300, 0x5653e6643e60, 0xc00008ea20, 0x0, 0xc00011e6f0, 0xc000095bc0, 0x10, 0x10)
/usr/local/go/src/net/tcpsock_posix.go:61 +0xd9
net.(sysDialer).dialSingle(0xc000104300, 0x5653e6643e60, 0xc0000
Wed Jun 03 2020 09:29:22 GMT-0400 (EDT)
I then see this liveness probe failure about 20 seconds after:
1 Unhealthy: Liveness probe failed: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "process_linux.go:101: executing setns process caused \"exit status 1\"": unknown
Here's more Container details in its current state:
Containers:
aws-node:
Container ID: docker://b1fc0367d03d6172a225b9d34018d8532e49aa2437a4b608a655487519b712e3
Image: 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.6.1
Image ID: docker-pullable://602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni@sha256:d50a182475c5ee6c18c3b81b01aa649367f30fb0dc60f7a619dcdbf45e10b3a3
Port: 61678/TCP
Host Port: 61678/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 24 Jun 2020 16:52:52 -0400
Finished: Wed, 24 Jun 2020 16:53:28 -0400
Ready: False
Restart Count: 5203
Requests:
cpu: 10m
Liveness: exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
Environment:
AWS_VPC_K8S_CNI_RANDOMIZESNAT: prng
MINIMUM_IP_TARGET: 15
WARM_IP_TARGET: 7
AWS_VPC_K8S_CNI_LOGLEVEL: DEBUG
AWS_VPC_K8S_CNI_VETHPREFIX: eni
AWS_VPC_ENI_MTU: 9001
MY_NODE_NAME: (v1:spec.nodeName)
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/host/var/log from log-dir (rw)
/var/run/docker.sock from dockersock (rw)
/var/run/dockershim.sock from dockershim (rw)
/var/run/secrets/kubernetes.io/serviceaccount from aws-node-token-mkxkm (ro)
Other suggested issues:
This is curious indeed:
- {"level":"error","ts":"2020-06-04T13:37:58.660Z","caller":"ipamd/ipamd.go:962","msg":"Failed to retrieve interfaces data from instance metadata RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: dial tcp 169.254.169.254:80: connect: connection refused"}
Basically, the instance metadata service is refusing the call. It could again be because of throttling, but this time on the metadata and not EC2. The CNI does not call instance metadata very often, on average around once every 5 seconds. Is there some other service or pod on the node that does a lot more frequent calls?
Also:
1 Unhealthy: Readiness probe failed: runtime: failed to create new OS thread (have 8 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
Suggests that the node has a lot of processes running, or for some reason have very low limits set. Is the load on these nodes very high?
This is curious indeed:
- {"level":"error","ts":"2020-06-04T13:37:58.660Z","caller":"ipamd/ipamd.go:962","msg":"Failed to retrieve interfaces data from instance metadata RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: dial tcp 169.254.169.254:80: connect: connection refused"}Basically, the instance metadata service is refusing the call. It could again be because of throttling, but this time on the metadata and not EC2. The CNI does not call instance metadata very often, on average around once every 5 seconds. Is there some other service or pod on the node that does a lot more frequent calls?
I don't think there's anything else on-node that would be using that metadata service. The DaemonSets we have are nginx, datadog, consul and nodelocaldns, and of course kube-proxy. We do leverage Cluster Autoscaler (single instance Deployment) as well which would call the API around ASGs but I don't think that should interfere here.
This is also what Support mentioned and even had me curl the metadata service which worked perfectly every time...
Also:
1 Unhealthy: Readiness probe failed: runtime: failed to create new OS thread (have 8 already; errno=11) runtime: may need to increase max user processes (ulimit -u) fatal error: newosprocSuggests that the node has a lot of processes running, or for some reason have very low limits set. Is the load on these nodes very high?
If you feel strongly enough about this being part of the issue, I can do more research on the state of the node when the CNI restarted, but it's still in this restarting state. I still have this node available so please feel free to give me a slew of tests you'd like to see done. I feel like this has to do with a long-term change that's happened somehow. Additionally, the node has been drained and cordoned so there's just the DaemonSets running on it, hence less load while still having the CNI in a restarting state.
Looks like I have similar. In /var/logs/aws-routed-eni/ipamd.log I see:
{"level":"error","ts":"2020-07-31T09:47:45.014Z","caller":"ipamd/ipamd.go:322","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-01f111154 eni-0822227]: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 111....111"}
despite I do have both ec2:DescribeNetworkInterfaces and sts:AssumeRoleWithWebIdentity explicitly attached to node role. Please note "caused by: [nothing here]".
@InAnimaTe Sorry for being late here. if your still having the issue we can jump on a call and debug the issue.
@InAnimaTe Did you upload your logs somewhere? I tried to find the link but couldn't find it.
Thanks for responding @SaranBalaji90 !!
I did not upload logs, just the excerpts I've shared. The logs are in our support tickets so I could pull them out if needbe.
We haven't upgraded but the issue is much less of a problem now. I would be interested in possibly getting on a call to talk through this. However, I'm not entirely sure it makes sense if we're aware that newer versions fix this?
Reopened #486 to discuss further on this.
Reopened #486 to discuss further on this.
Sounds good. I have not yet tested if this still occurs on newer CNI versions and have no real way to replicate it. Seems to be completely random, and hasn't happened to us in a little while anyways.
Synced up with @InAnimaTe offline too. Just to update everyone here, we haven't made any changes to ipamd to fix this issue. Hopefully driving #486 to completion will help us to find the actual issue here.
FWIW, I'm running on EKS 1.17 and amazon-k8s-cni:v1.6.3. When I attempt to roll my cni daemonset (for example changing the deamonset node tags), I've frequently seen this issue:
Copying portmap binary ... Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... failed.
Timed out waiting for IPAM daemon to start:
The only solution I've found has been to recycle the node which makes this super painful.
@nigelellis Can you also paste the ipamd logs?
@SaranBalaji90 how do I get the ipamd logs?
@nigelellis https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni is the relevant doc.
If you have access to the node, running sudo bash /opt/cni/bin/aws-cni-support.sh should give you an archive with all relevant logs. (Note, you might want to send this through AWS support or email, depending on the size of the log file)
Thanks, I'll see if I can reproduce it tomorrow.
...
On Thu, Sep 03, 2020 at 5:20 PM, Claes Mogren < [email protected] > wrote:
@ nigelellis ( https://github.com/nigelellis ) https:/ / docs. aws. amazon.
com/ eks/ latest/ userguide/ troubleshooting. html#troubleshoot-cni (
https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni
) is the relevant doc.If you have access to the node, running sudo bash / opt/ cni/ bin/ aws-cni-support.
sh ( http://opt/cni/bin/aws-cni-support.sh ) should give you an archive
with all relevant logs. (Note, you might want to send this through AWS
support or email, depending on the size of the log file)—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub (
https://github.com/aws/amazon-vpc-cni-k8s/issues/1055#issuecomment-686829092
) , or unsubscribe (
https://github.com/notifications/unsubscribe-auth/AC4UMHJJYSPEBLOTYYM222TSEAXFZANCNFSM4OHMIZ6A
).
I was able to get a repro - final log line is Getting running pod sandboxes from "unix:///var/run/dockershim.sock":-
[ipamd.log]
{"level":"debug","ts":"2020-09-04T00:25:30.883Z","caller":"awsutils/awsutils.go:253","msg":"Found subnet-id: subnet-0b64fcb84bf6c844a "}
{"level":"debug","ts":"2020-09-04T00:25:30.883Z","caller":"awsutils/awsutils.go:253","msg":"Found vpc-ipv4-cidr-block: 10.10.0.0/16 "}
{"level":"debug","ts":"2020-09-04T00:25:30.884Z","caller":"awsutils/awsutils.go:253","msg":"Found VPC CIDR: 10.10.0.0/16"}
{"level":"debug","ts":"2020-09-04T00:25:30.884Z","caller":"ipamd/ipamd.go:308","msg":"Start node init"}
{"level":"info","ts":"2020-09-04T00:25:30.884Z","caller":"wait/wait.go:133","msg":"Will attempt to clean up AWS CNI leaked ENIs after waiting 2m21s."}
{"level":"debug","ts":"2020-09-04T00:25:30.884Z","caller":"awsutils/awsutils.go:948","msg":"Total number of interfaces found: 2 "}
{"level":"debug","ts":"2020-09-04T00:25:30.884Z","caller":"awsutils/awsutils.go:443","msg":"Found ENI MAC address: 0a:ab:ee:cb:ac:d3"}
{"level":"debug","ts":"2020-09-04T00:25:30.885Z","caller":"awsutils/awsutils.go:457","msg":"Using device number 0 for primary eni: eni-00719958e493c64e0"}
{"level":"debug","ts":"2020-09-04T00:25:30.885Z","caller":"awsutils/awsutils.go:443","msg":"Found ENI: eni-00719958e493c64e0, MAC 0a:ab:ee:cb:ac:d3, device 0"}
{"level":"debug","ts":"2020-09-04T00:25:30.886Z","caller":"awsutils/awsutils.go:463","msg":"Found CIDR 10.10.192.0/19 for ENI 0a:ab:ee:cb:ac:d3"}
{"level":"debug","ts":"2020-09-04T00:25:30.886Z","caller":"awsutils/awsutils.go:463","msg":"Found IP addresses [10.10.196.9 10.10.216.43 10.10.197.8 10.10.203.104 10.10.198.142 10.10.209.14 10.10.220.239 10.10.198.111 10.10.211.130 10.10.200.228 10.10.207.120 10.10.209.152 10.10.205.254 10.10.205.191 10.10.192.253] on ENI 0a:ab:ee:cb:ac:d3"}
{"level":"debug","ts":"2020-09-04T00:25:30.886Z","caller":"awsutils/awsutils.go:443","msg":"Found ENI MAC address: 0a:f0:80:9d:5c:8d"}
{"level":"debug","ts":"2020-09-04T00:25:30.887Z","caller":"awsutils/awsutils.go:443","msg":"Found ENI: eni-09d5b0a5b7cd21df3, MAC 0a:f0:80:9d:5c:8d, device 2"}
{"level":"debug","ts":"2020-09-04T00:25:30.888Z","caller":"awsutils/awsutils.go:463","msg":"Found CIDR 10.10.192.0/19 for ENI 0a:f0:80:9d:5c:8d"}
{"level":"debug","ts":"2020-09-04T00:25:30.888Z","caller":"awsutils/awsutils.go:463","msg":"Found IP addresses [10.10.222.18 10.10.218.168 10.10.195.238 10.10.218.12 10.10.203.45 10.10.215.99 10.10.192.96 10.10.209.192 10.10.199.128 10.10.221.199 10.10.217.229 10.10.210.158 10.10.205.94 10.10.214.159 10.10.198.16] on ENI 0a:f0:80:9d:5c:8d"}
{"level":"debug","ts":"2020-09-04T00:25:31.116Z","caller":"ipamd/ipamd.go:308","msg":"DescribeAllENIs success: ENIs: 2, tagged: 1"}
{"level":"info","ts":"2020-09-04T00:25:31.116Z","caller":"ipamd/ipamd.go:354","msg":"Setting up host network... "}
{"level":"debug","ts":"2020-09-04T00:25:31.116Z","caller":"networkutils/network.go:236","msg":"Trying to find primary interface that has mac : 0a:ab:ee:cb:ac:d3"}
{"level":"debug","ts":"2020-09-04T00:25:31.116Z","caller":"networkutils/network.go:236","msg":"Discovered interface: lo, mac: "}
{"level":"debug","ts":"2020-09-04T00:25:31.117Z","caller":"networkutils/network.go:236","msg":"Discovered interface: eth0, mac: 0a:ab:ee:cb:ac:d3"}
{"level":"info","ts":"2020-09-04T00:25:31.117Z","caller":"networkutils/network.go:236","msg":"Discovered primary interface: eth0"}
{"level":"debug","ts":"2020-09-04T00:25:31.117Z","caller":"ipamd/ipamd.go:354","msg":"Setting RPF for primary interface: net/ipv4/conf/eth0/rp_filter"}
{"level":"debug","ts":"2020-09-04T00:25:31.117Z","caller":"networkutils/network.go:263","msg":"Found the Link that uses mac address 0a:ab:ee:cb:ac:d3 and its index is 2 (attempt 1/5)"}
{"level":"debug","ts":"2020-09-04T00:25:31.118Z","caller":"networkutils/network.go:314","msg":"Setup Host Network: loading existing iptables nat SNAT exclusion rules"}
{"level":"debug","ts":"2020-09-04T00:25:31.125Z","caller":"networkutils/network.go:314","msg":"host network setup: found potentially stale SNAT rule for chain AWS-SNAT-CHAIN-0: [-N AWS-SNAT-CHAIN-0]"}
{"level":"debug","ts":"2020-09-04T00:25:31.125Z","caller":"networkutils/network.go:314","msg":"host network setup: found potentially stale SNAT rule for chain AWS-SNAT-CHAIN-0: [-A AWS-SNAT-CHAIN-0 ! -d 10.10.0.0/16 -m comment --comment AWS SNAT CHAIN -j AWS-SNAT-CHAIN-1]"}
{"level":"debug","ts":"2020-09-04T00:25:31.127Z","caller":"networkutils/network.go:314","msg":"host network setup: found potentially stale SNAT rule for chain AWS-SNAT-CHAIN-1: [-N AWS-SNAT-CHAIN-1]"}
{"level":"debug","ts":"2020-09-04T00:25:31.127Z","caller":"networkutils/network.go:314","msg":"host network setup: found potentially stale SNAT rule for chain AWS-SNAT-CHAIN-1: [-A AWS-SNAT-CHAIN-1 -m comment --comment AWS, SNAT -m addrtype ! --dst-type LOCAL -j SNAT --to-source 10.10.196.9 --random]"}
{"level":"debug","ts":"2020-09-04T00:25:31.127Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: iptables -N AWS-SNAT-CHAIN-0 -t nat"}
{"level":"debug","ts":"2020-09-04T00:25:31.129Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: iptables -N AWS-SNAT-CHAIN-1 -t nat"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: iptables -A POSTROUTING -m comment --comment \"AWS SNAT CHAIN\" -j AWS-SNAT-CHAIN-0"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: iptables -A AWS-SNAT-CHAIN-0 ! -d {10.10.0.0/16 %!s(bool=false)} -t nat -j AWS-SNAT-CHAIN-1"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: synchronising SNAT stale rules"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: stale rule found: nat/AWS-SNAT-CHAIN-0 rule [0] AWS-SNAT-CHAIN-0"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: active rule found: nat/AWS-SNAT-CHAIN-0 rule [1] AWS-SNAT-CHAIN-0"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: stale rule found: nat/AWS-SNAT-CHAIN-1 rule [0] AWS-SNAT-CHAIN-1"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"Setup Host Network: active rule found: nat/AWS-SNAT-CHAIN-1 rule [1] AWS-SNAT-CHAIN-1"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"iptableRules: [nat/POSTROUTING rule first SNAT rules for non-VPC outbound traffic nat/AWS-SNAT-CHAIN-0 rule [0] AWS-SNAT-CHAIN nat/AWS-SNAT-CHAIN-1 rule last SNAT rule for non-VPC outbound traffic nat/AWS-SNAT-CHAIN-0 rule [0] AWS-SNAT-CHAIN-0 nat/AWS-SNAT-CHAIN-1 rule [0] AWS-SNAT-CHAIN-1]"}
{"level":"debug","ts":"2020-09-04T00:25:31.130Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : first SNAT rules for non-VPC outbound traffic"}
{"level":"debug","ts":"2020-09-04T00:25:31.132Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : [0] AWS-SNAT-CHAIN"}
{"level":"debug","ts":"2020-09-04T00:25:31.133Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : last SNAT rule for non-VPC outbound traffic"}
{"level":"debug","ts":"2020-09-04T00:25:31.135Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : [0] AWS-SNAT-CHAIN-0"}
{"level":"debug","ts":"2020-09-04T00:25:31.136Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : [0] AWS-SNAT-CHAIN-1"}
{"level":"debug","ts":"2020-09-04T00:25:31.137Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : connmark for primary ENI"}
{"level":"debug","ts":"2020-09-04T00:25:31.138Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : connmark restore for primary ENI"}
{"level":"debug","ts":"2020-09-04T00:25:31.139Z","caller":"ipamd/ipamd.go:354","msg":"execute iptable rule : rule for primary address 10.10.196.9"}
{"level":"debug","ts":"2020-09-04T00:25:31.140Z","caller":"ipamd/ipamd.go:308","msg":"Discovered ENI eni-00719958e493c64e0, trying to set it up"}
{"level":"debug","ts":"2020-09-04T00:25:31.140Z","caller":"ipamd/ipamd.go:742","msg":"DataStore Add an ENI eni-00719958e493c64e0"}
{"level":"info","ts":"2020-09-04T00:25:31.140Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.216.43 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.197.8 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.203.104 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.198.142 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.209.14 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.220.239 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.198.111 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.211.130 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.200.228 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.207.120 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.209.152 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.205.254 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.205.191 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-00719958e493c64e0)'s IP 10.10.192.253 to datastore"}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:756","msg":"IP Address Pool stats: total: 14, assigned: 0"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:308","msg":"ENI eni-00719958e493c64e0 set up."}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:308","msg":"Discovered ENI eni-09d5b0a5b7cd21df3, trying to set it up"}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"ipamd/ipamd.go:742","msg":"DataStore Add an ENI eni-09d5b0a5b7cd21df3"}
{"level":"info","ts":"2020-09-04T00:25:31.141Z","caller":"networkutils/network.go:679","msg":"Setting up network for an ENI with IP address 10.10.222.18, MAC address 0a:f0:80:9d:5c:8d, CIDR 10.10.192.0/19 and route table 2"}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"networkutils/network.go:692","msg":"Found the Link that uses mac address 0a:f0:80:9d:5c:8d and its index is 4 (attempt 1/5)"}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"networkutils/network.go:679","msg":"Setting up ENI's primary IP 10.10.222.18"}
{"level":"debug","ts":"2020-09-04T00:25:31.141Z","caller":"networkutils/network.go:679","msg":"Deleting existing IP address 10.10.222.18/19 eth1"}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"networkutils/network.go:679","msg":"Adding IP address 10.10.222.18/19"}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"networkutils/network.go:679","msg":"Setting up ENI's default gateway 10.10.192.1"}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"retry/retry.go:69","msg":"Successfully added/replaced route to be 10.10.192.1/0"}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"retry/retry.go:69","msg":"Successfully added/replaced route to be 0.0.0.0/0"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.218.168 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.195.238 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.218.12 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.203.45 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.215.99 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.192.96 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.209.192 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.199.128 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.221.199 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.217.229 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.210.158 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.205.94 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.214.159 to datastore"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:768","msg":"Added ENI(eni-09d5b0a5b7cd21df3)'s IP 10.10.198.16 to datastore"}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:756","msg":"IP Address Pool stats: total: 28, assigned: 0"}
{"level":"info","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:308","msg":"ENI eni-09d5b0a5b7cd21df3 set up."}
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:439","msg":"Getting running pod sandboxes from \"unix:///var/run/dockershim.sock\""}
The pod logs returned:
Copying portmap binary ... Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... failed.
Timed out waiting for IPAM daemon to start:
I was tailing the ipamd log and the container exits with:
2020-09-03 17:26:06,800 - ERROR: command terminated with exit code 137
Here's the podspec:
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/limit-ranger: 'LimitRanger plugin set: memory request for container
aws-node; cpu, memory limit for container aws-node'
kubernetes.io/psp: eks.privileged
creationTimestamp: "2020-09-02T20:54:17Z"
generateName: aws-node-
labels:
controller-revision-hash: 88769d4f9
k8s-app: aws-node
pod-template-generation: "9"
name: aws-node-569fm
namespace: kube-system
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: DaemonSet
name: aws-node
uid: 87a32608-ed1a-11e8-8d8d-026b59889896
resourceVersion: "183343173"
selfLink: /api/v1/namespaces/kube-system/pods/aws-node-569fm
uid: 5a4e21da-b7dd-4f0f-a5fa-2c3039a31a04
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- ip-10-10-196-9.us-west-2.compute.internal
automountServiceAccountToken: true
containers:
- env:
- name: AWS_VPC_K8S_CNI_LOGLEVEL
value: DEBUG
- name: MY_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.4
imagePullPolicy: Always
name: aws-node
ports:
- containerPort: 61678
hostPort: 61678
name: metrics
protocol: TCP
resources:
limits:
cpu: 300m
memory: 300Mi
requests:
cpu: 10m
memory: 300Mi
securityContext:
allowPrivilegeEscalation: true
privileged: true
readOnlyRootFilesystem: false
runAsGroup: 0
runAsNonRoot: false
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /host/opt/cni/bin
mountPropagation: None
name: cni-bin-dir
- mountPath: /host/etc/cni/net.d
mountPropagation: None
name: cni-net-dir
- mountPath: /host/var/log
mountPropagation: None
name: log-dir
- mountPath: /var/run/docker.sock
mountPropagation: None
name: dockersock
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: aws-node-token-vw5dm
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostNetwork: true
nodeName: ip-10-10-196-9.us-west-2.compute.internal
priority: 2000001000
priorityClassName: system-node-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: aws-node
serviceAccountName: aws-node
shareProcessNamespace: false
terminationGracePeriodSeconds: 30
tolerations:
- operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/disk-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/pid-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/unschedulable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/network-unavailable
operator: Exists
volumes:
- hostPath:
path: /opt/cni/bin
type: ""
name: cni-bin-dir
- hostPath:
path: /etc/cni/net.d
type: ""
name: cni-net-dir
- hostPath:
path: /var/log
type: ""
name: log-dir
- hostPath:
path: /var/run/docker.sock
type: ""
name: dockersock
- name: aws-node-token-vw5dm
secret:
defaultMode: 420
secretName: aws-node-token-vw5dm
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2020-09-02T20:54:17Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2020-09-04T00:31:35Z"
message: 'containers with unready status: [aws-node]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2020-09-04T00:31:35Z"
message: 'containers with unready status: [aws-node]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2020-09-02T20:54:17Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://3fd7469cd1288c1ca964f50730d8d489692c046094d2f05c81615f07de4caec7
image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.4
imageID: docker-pullable://602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni@sha256:e01133675b0ac9857392593aaaf62e56dd8a28c1bf6a23eac34cd577a9c2da20
lastState:
terminated:
containerID: docker://3fd7469cd1288c1ca964f50730d8d489692c046094d2f05c81615f07de4caec7
exitCode: 1
finishedAt: "2020-09-04T00:31:35Z"
reason: Error
startedAt: "2020-09-04T00:30:59Z"
name: aws-node
ready: false
restartCount: 7
started: false
state:
waiting:
message: back-off 5m0s restarting failed container=aws-node pod=aws-node-569fm_kube-system(5a4e21da-b7dd-4f0f-a5fa-2c3039a31a04)
reason: CrashLoopBackOff
hostIP: 10.10.196.9
phase: Running
podIP: 10.10.196.9
podIPs:
- ip: 10.10.196.9
qosClass: Burstable
startTime: "2020-09-02T20:54:17Z"
To reproduce this issue:
amazon-k8s-cni:v1.6.4 and created a fresh node in the ASGaws-node pod spec via. kubectl edit pod aws-node-xyz and bumped the image tag version from amazon-k8s-cni:v1.6.3 to amazon-k8s-cni:v1.6.4. FWIW I don't think the version makes any differencekubectl -n kube-system exec -it aws-node-XXX-- tail -f /host/var/log/aws-routed-eni/ipamd.log | tee ipamd.log
I'm running EKS kubelet v1.17.9-eks-4c6976
Thanks a lot @nigelellis, this should help us reproduce the issue.
Is this the final line in the logs?
{"level":"debug","ts":"2020-09-04T00:25:31.183Z","caller":"ipamd/ipamd.go:439","msg":"Getting running pod sandboxes from \"unix:///var/run/dockershim.sock\""}
So reading from the docker shim is hanging, locking up ipamd is that correct? What does docker version say?
@mogren yes, that's the last line in the log before the pod terminates with error 137. My read is it dies while reading from Docker.
I'm running the stock AWS image amazon-k8s-cni:v1.6.4. The kubelet is running EKS v1.17.9-eks-4c6976.
Looks like it's running Docker docker://19.3.6 (found via. kubectl get node | grep containerVersion)
Does that help?
@mogren any idea on when this issue might be resolved? LMK if I can help with more logs, etc. Thanks.
@nigelellis Hey, sorry for the late reply on this. In the pod spec you pated, I don't see unix:///var/run/dockershim.sock being mounted into the aws-node pod. The docker.sock is a docker specific API that was only used on CNI v1.5x and earlier.
The missing lines are aws-k8s-cni.yaml#L141-L142 and aws-k8s-cni.yaml#L156-L158.
This was mentioned in the release notes for v1.6.0 and the next 3 releases when updating from earlier versions. Sorry for not noticing this right away. Could you please try adding /var/run/dockershim.sock?
Thank you, I'll try this out tomorrow.
...
On Sun, Sep 13, 2020 at 3:18 PM, Claes Mogren < [email protected] > wrote:
@ nigelellis ( https://github.com/nigelellis ) Hey, sorry for the late
reply on this. In the pod spec you pated, I don't see unix:///var/run/dockershim.sock
being mounted into the aws-node pod. The docker.sock is a docker specific
API that was only used on CNI v1.5x and earlier.The missing lines are aws-k8s-cni. yaml#L141-L142 (
https://github.com/aws/amazon-vpc-cni-k8s/blob/master/config/v1.6/aws-k8s-cni.yaml#L141-L142
) and aws-k8s-cni. yaml#L156-L158 (
https://github.com/aws/amazon-vpc-cni-k8s/blob/master/config/v1.6/aws-k8s-cni.yaml#L156-L158
).This was mentioned in the release notes for v1. 6. 0 (
https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.6.0 ) and the
next 3 releases when updating from earlier versions. Sorry for not
noticing this right away. Could you please try adding /var/run/dockershim.sock
?—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub (
https://github.com/aws/amazon-vpc-cni-k8s/issues/1055#issuecomment-691733364
) , or unsubscribe (
https://github.com/notifications/unsubscribe-auth/AC4UMHKL74A3RC6RLRNHJB3SFVALBANCNFSM4OHMIZ6A
).
@mogren - I can confirm updating the daemonset to mount dockershim.sock addressed the crashlooping issue we experienced when rolling the aws-node pods. I was able to bump to 1.6.4 without any problems. Thank you for the help, very much appreciated -- a good lesson for me to validate all config/manifest changes when bumping version.
Thanks a lot for confirming @nigelellis!
There will be a v1.7.2 out very soon with a lot of fixes. The v1.7.x versions now have an init-container as well, so the config has changed a lot since v1.6.x
@mogren will 1.7 be recommended for EKS 1.17? I still see 1.6.3 on the EKS upgrade guide.
@nigelellis Yes, as soon as we make it the default for new clusters. 🙂
@InAnimaTe Are you still seeing this issue? Could you please try using v1.7.3?
@InAnimaTe Are you still seeing this issue? Could you please try using v1.7.3?
Hey @mogren so we have not seen this anymore in the past couple months in any of our environments. We are planning to upgrade to 1.7.2/3 in the next few weeks so I'll be looking out for these issues but please don't keep this open on my account. I can reopen if I see clear evidence of this happening again.
Thanks!
Great, thanks a lot @InAnimaTe, and please open a new ticket for if you see any issue!