Containers-roadmap: [EKS]: Support for Arm Nodes

Created on 24 Apr 2019 · 40 comments · Source: aws/containers-roadmap

Amazon EKS now supports Arm processor EC2 A1 instances as a developer preview. You can now run containers using EC2 A1 instances on a Kubernetes cluster that is managed by Amazon EKS.

Learn more and get started here: https://github.com/aws/containers-roadmap/tree/master/preview-programs/

Learn more about Amazon A1 instances: https://aws.amazon.com/ec2/instance-types/a1/

Please leave feedback and comments on the preview using this ticket.

All 40 comments

Thank you for bringing in Arm support! I tried out the article. Here's my feedback.

  1. After step 6, I do see the Arm64 node become READY. However, it took several tries for the pod aws-node-arm-d788f to start, and even then it keeps crashing and restarting. Here are the events of that pod.

    Events:
      Type     Reason     Age               From                                                  Message
      ----     ------     ----              ----                                                  -------
      Normal   Scheduled  9m                default-scheduler                                     Successfully assigned kube-system/aws-node-arm-d788f to ip-172-31-36-179.us-west-2.compute.internal
      Normal   Created    6m (x4 over 8m)   kubelet, ip-172-31-36-179.us-west-2.compute.internal  Created container
      Normal   Started    6m (x4 over 8m)   kubelet, ip-172-31-36-179.us-west-2.compute.internal  Started container
      Normal   Pulling    5m (x5 over 9m)   kubelet, ip-172-31-36-179.us-west-2.compute.internal  pulling image "940911992744.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-arm64:v1.3.3"
      Normal   Pulled     5m (x5 over 9m)   kubelet, ip-172-31-36-179.us-west-2.compute.internal  Successfully pulled image "940911992744.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-arm64:v1.3.3"
      Warning  BackOff    4m (x12 over 7m)  kubelet, ip-172-31-36-179.us-west-2.compute.internal  Back-off restarting failed container
    

    I uploaded the whole description to S3: https://s3-us-west-2.amazonaws.com/public.apex.ai/aws-node-arm64-pod.txt

  2. Didn't find anything wrong with the Arm node itself. Here's the description: https://s3-us-west-2.amazonaws.com/public.apex.ai/arm64-node.txt

  3. I tried to start an Ubuntu pod on this Arm64 node (using a nodeSelector; a minimal example manifest is sketched after this list), but it failed to start. Events show these:

    Events:
      Type     Reason                  Age                From                                                  Message
      ----     ------                  ----               ----                                                  -------
      Normal   Scheduled               19m                default-scheduler                                     Successfully assigned default/ubuntu-arm64-sample-7dc4c76c4f-fkl4r to ip-172-31-36-179.us-west-2.compute.internal
      Warning  FailedCreatePodSandBox  19m                kubelet, ip-172-31-36-179.us-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "70b6cef67df071b8a488bdbf49cda1ffae41e809c149e6c47108d61e784f15c3" network for pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r": NetworkPlugin cni failed to set up pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "70b6cef67df071b8a488bdbf49cda1ffae41e809c149e6c47108d61e784f15c3" network for pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r": NetworkPlugin cni failed to teardown pod "ubuntu-arm64-sample-7dc4c76c4f-fkl4r_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
      Normal   SandboxChanged          4m (x71 over 19m)  kubelet, ip-172-31-36-179.us-west-2.compute.internal  Pod sandbox changed, it will be killed and re-created.
    

    Full description can be found here: https://s3-us-west-2.amazonaws.com/public.apex.ai/aws-ubuntu.txt
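For reference, a minimal sketch of the kind of arm64-targeted manifest described in item 3. The names, image, and exact label key are illustrative (1.12-era clusters label nodes with beta.kubernetes.io/arch; newer clusters use kubernetes.io/arch):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ubuntu-arm64-sample
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ubuntu-arm64-sample
  template:
    metadata:
      labels:
        app: ubuntu-arm64-sample
    spec:
      nodeSelector:
        beta.kubernetes.io/arch: arm64   # kubernetes.io/arch on newer clusters
      containers:
        - name: ubuntu
          image: ubuntu:18.04            # Docker Hub ubuntu images are multi-arch
          command: ["sleep", "infinity"]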

Hello,

My node is not becoming Ready. It looks like an issue with ECR.

I'm using EKS 1.12.

NAME                                                          READY   STATUS              RESTARTS   AGE
aws-node-arm-cd7x6                                            0/1     ContainerCreating   0          24m

kubectl describe pod aws-node-arm-cd7x6

...
Warning FailedCreatePodSandBox 41s (x119 over 25m) kubelet, ip-xx-xx-xx-xx.us-east-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = Error response from daemon: pull access denied for 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause-arm64, repository does not exist or may require 'docker login'
...

@pablomfc have you checked the IAM role for the node? The node role could be missing the AmazonEC2ContainerRegistryPowerUser policy, or something similar that allows ECR access for pulling the image.
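A quick way to check and, if needed, attach the read-only ECR policy (the role name below is a placeholder):

# List what is currently attached to the node instance role.
aws iam list-attached-role-policies --role-name <node-instance-role>
# Attach the managed read-only ECR policy if it is missing.
aws iam attach-role-policy \
  --role-name <node-instance-role> \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly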

Thanks for the suggestion @tabern. Actually I use "AmazonEC2ContainerRegistryReadOnly". This A1 instance node is running alongside other amd64 nodes and shares the same IAM role.

I figured out the problem!

As I'm using an existing cluster to run the A1 instance, I forgot to include the BootstrapArguments --pause-container-account 940911992744 for bootstrap.sh.
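For context, a sketch of what the user data looks like with that flag included, assuming the standard EKS AMI bootstrap script location (/etc/eks/bootstrap.sh) and an illustrative cluster name:

#!/bin/bash
set -o xtrace
# Point kubelet at the preview account's eks/pause-arm64 image
# (flag name as described above for the EKS AMI bootstrap.sh).
/etc/eks/bootstrap.sh my-arm-cluster --pause-container-account 940911992744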

After that I found another problem. The repository 940911992744.dkr.ecr.us-east-2.amazonaws.com/eks/pause-arm64 is absent in region us-east-2 (where my cluster is located).

Looking at https://github.com/awslabs/amazon-eks-ami/blob/16bd0311c069f4b70a10205211b41845e59259d7/files/bootstrap.sh#L198 I realized that
I needed to update the file /etc/systemd/system/kubelet.service.d/10-kubelet-args.conf
on the node from:

[Service]
Environment='KUBELET_ARGS=--node-ip=xx.xx.xx.xx --pod-infra-container-image=940911992744.dkr.ecr.us-east-2.amazonaws.com/eks/pause-arm64:3.1'

to

[Service]
Environment='KUBELET_ARGS=--node-ip=xx.xx.xx.xx --pod-infra-container-image=940911992744.dkr.ecr.us-west-2.amazonaws.com/eks/pause-arm64:3.1'

systemctl daemon-reload
systemctl restart kubelet

For a quick fix I put this after the bootstrap.sh call in the user_data:

INTERNAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)

cat <<EOF > /etc/systemd/system/kubelet.service.d/10-kubelet-args.conf
[Service]
Environment='KUBELET_ARGS=--node-ip=$INTERNAL_IP --pod-infra-container-image=940911992744.dkr.ecr.us-west-2.amazonaws.com/eks/pause-arm64:3.1'
EOF

systemctl daemon-reload
systemctl restart kubelet

I had the same issue as @pablomfc with a cluster in us-east-1. Applying the changes described fixed the issue for the aws-node-arm pods.

However, kube-proxy does not work, and just goes into a crash cycle, seemingly due to it not being compiled for ARM:

$ kubectl logs -f kube-proxy-glxk4
standard_init_linux.go:190: exec user process caused "exec format error"

The container image is pulling from 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.12.6. I've also tried changing the account to 940911992744 but the image does not exist there.

Anyone have any success getting kube-proxy to work?

I was able to get kube-proxy to work on the cluster by executing the following:

kubectl patch pod kube-proxy-glxk4 -p '{"spec":{"containers":[{"image":"k8s.gcr.io/kube-proxy:v1.12.6","name":"kube-proxy"}]}}'

This uses the kube-proxy image from gcr.io rather than Amazon's. The one from gcr.io has multiarch support, unlike AWS's.

This command only patches the specific pod, but you should be able to apply it to the entire DaemonSet if desired. I'm just not sure what modifications AWS has made to the kube-proxy image, so it might not be entirely safe to do that.
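If you do want to roll it out to the whole DaemonSet, something like this should work (assuming both the DaemonSet and its container are named kube-proxy, which matches the EKS defaults):

# Update the image for the kube-proxy container across the DaemonSet.
kubectl -n kube-system set image daemonset/kube-proxy \
  kube-proxy=k8s.gcr.io/kube-proxy:v1.12.6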

@scraton both coredns and kube-proxy default to the amd64 versions in a default EKS cluster so they will both need to be patched. We build our container images from the upstream code-base without local modifications so you should not be missing anything. I'm in the process of publishing updated container images for ARM as well as doc updates for the beta that should resolve this issue. I'll post back here once those are live.

Note that because ECR does not yet support multi-architecture images I'll be posting them to repos with a -arm64 suffix, much the way we distribute the CNI image.

@left4taco Hi, I'm encountering the same issue as you. Did you resolve it?

I checked ipamd.log on the Arm64 node, and it says "Failed to create client: error communicating with apiserver: Get https://10.100.0.1:443/version?timeout=32s".

The Amd64 node works fine (it can connect to the API server), but the Arm64 node cannot. Both nodes are in the same VPC (and the same security group) but not the same subnet (Availability Zone).

Any idea? Thanks!
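One way to narrow this down is to test the service VIP directly from the Arm64 node. This is only a diagnostic sketch; the VIP and port are the ones from the ipamd log above:

# Run on the arm64 node: try the kubernetes service VIP that ipamd fails to reach.
curl -vk https://10.100.0.1:443/version
# The VIP only reaches the control plane if kube-proxy has programmed iptables rules for it.
iptables-save | grep 10.100.0.1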

@zhouziyang

I just gave it a try yesterday and it still didn't work out.

This time I can see the node became Ready, but the aws-node pod running on the aarch64 machine keeps crashing. The CNI plugin cannot assign a secondary IP to the machine.

I made a DNAT rule for the API server (DNAT to its public IP), and the aws-node-arm pods worked (but it introduced network issues across nodes). I even built a CNI 1.5.0 Docker image, but that still didn't work out. I think (as you said) the root cause is related to the network connection between the VPC and the k8s cluster. Still investigating.
@tabern any idea on this issue? Or maybe an incorrect config during EKS setup? Thanks!

@zhouziyang I don't get it. The aarch64 node has exactly the same VPC and security group as the x86_64 node, and it has the correct arch of CNI 1.3.3. How come it cannot establish a connection to the control plane?

@left4taco it seems iptables rules are missing on the arm64 node. I restored the iptables rules from an x86_64 node, and it seems to have worked!
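For anyone following along, roughly what that restore looks like. It copies the other node's rules verbatim (including node-specific entries), so treat it purely as a sketch of the workaround described above, not a recommended fix:

# On a healthy x86_64 node:
iptables-save > /tmp/k8s-iptables.rules
# Copy the file over (scp or similar), then on the arm64 node:
iptables-restore < /tmp/k8s-iptables.rules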

@tabern May I ask what the progress is on adding official ARM64 support to EKS? Thanks. Can't wait to see this feature.

Hi @left4taco we're working on solving these issues and updating the preview so it doesn't take so much manual work to get going! Stay tuned.

Here are the issues I've identified from the conversation here; let me know if I'm missing any @left4taco @zhouziyang @scraton @pablomfc

  • [ ] Add repo for pause container in all regions
  • [ ] Patch kube-proxy for ARM
  • [ ] Patch CoreDNS for ARM
  • [ ] Add missing iptables rules to ARM node bootstrap

Any news on getting an AMI for 1.13/1.14?

Hello

The aws-node pods on my A1 instances are stuck in a cycle: Running -> Error -> CrashLoopBackOff

kubectl get pods -n kube-system

aws-node-arm-5rbt7 0/1 CrashLoopBackOff 147 13h

aws-node-arm-pwtv4 0/1 CrashLoopBackOff 146 13h

aws-node-arm-t2pvv 1/1 Running 147 13h

Is this a known issue? I followed the current instructions exactly

Hi,

Great to see Arm support on AWS.

I've just followed the guide for using A1 EKS and have come across an issue when deploying the Redis example.

The pods get stuck on creation and kubectl events show this:

Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "b07c2b2d3ed1ecb8f554a8ec0da7081b570f055e7b8047b4a0da231cbda35dc9" network for pod "redis-slave-kxdps": NetworkPlugin cni failed to set up pod "redis-slave-kxdps_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "b07c2b2d3ed1ecb8f554a8ec0da7081b570f055e7b8047b4a0da231cbda35dc9" network for pod "redis-slave-kxdps": NetworkPlugin cni failed to teardown pod "redis-slave-kxdps_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

Colleagues have had similar issues. Has this been tested recently and confirmed to still be working?

Thanks

Hey everyone, a few updates here.

  1. We just updated the ARM preview to include support for the AMI SSM parameters _and_ the new M6g Graviton instances! Check it out.
  2. We have a new issue template and labels specifically for the ARM preview. PLEASE use this template to create any additional issues for questions or bugs so we can track them and mark them resolved. We'll continue to keep this issue open to track our GA deliverable.

-Nate

@rverma-jm Not yet; we're tracking that work in https://github.com/bottlerocket-os/bottlerocket/issues/468.

Is 1.15 supported right now? I see official AMIs available:
[screenshot of the available 1.15 Arm AMIs]
The docs still say 1.15 is not supported: https://docs.aws.amazon.com/eks/latest/userguide/arm-support.html

I'm going to try anyways :D

It actually worked. I used 1.15 YAMLs where available (only kube-proxy for now) and 1.14 for everything else:
https://github.com/aws/containers-roadmap/tree/master/preview-programs/eks-arm-preview

M6g EC2 instances powered by Graviton2 processors went GA today - it would be useful to publish EKS-optimized AMIs for Arm alongside the AMIs for x86-64. These offer significant cost-optimization opportunities for AWS customers.

@otterley today the ARM preview supports M6g instances! Check it out.

We are also working to make this support generally available to all customers and will update this ticket when we launch.

@tabern Thanks for the pointer. Instructions for EKS 1.15/1.16 appear to be missing - is this because the components aren't available (I know the AMI is available), or do the details just need to be updated?

Are mixed x86_64 and arm64 clusters supported?

As ECR now supports manifest lists (#505), it would be awesome to push the EKS Helper Images as multi-arch images.
Then mixed clusters would be super easy.
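For illustration, pushing a manifest list with the Docker CLI looks roughly like this. The registry and tags are placeholders, and it assumes the per-architecture images are already pushed:

# Assemble a manifest list that references the existing per-arch images.
docker manifest create <registry>/kube-proxy:v1.16.8 \
  <registry>/kube-proxy:v1.16.8-amd64 \
  <registry>/kube-proxy:v1.16.8-arm64
# Push the manifest list so one tag serves both architectures.
docker manifest push <registry>/kube-proxy:v1.16.8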

Thanks for bringing support for ARM in EKS. This works well. I am now trying to add Container Insights to monitor an ARM cluster. I followed the instructions in https://docs.aws.amazon.com/eks/latest/userguide/arm-support.html to create a new cluster, and then followed https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-quickstart.html to try and set up Container Insights.

Unfortunately, the CloudWatch agent and Fluentd agent pods don't seem to be starting up - they get into a CrashLoopBackOff. The events in the pod seem to indicate that the image being used is amazon/cloudwatch-agent:1.231221.0. At least on Docker Hub, this seems to be compiled for x64 and not ARM. Any chance there's an image for this and for Fluentd built for the ARM architecture?

Thanks!
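One way to check which architectures a Docker Hub image actually provides (purely a diagnostic sketch):

# Verbose inspection lists a platform/architecture per manifest entry;
# an image built only for x86_64 will show just amd64.
docker manifest inspect -v amazon/cloudwatch-agent:1.231221.0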

@mikestef9 Is there a ticket tracking mixed support of x86_64 and arm64?

@ckdarby this is that ticket – as part of GA support for ARM, we are leveraging the recently launched ECR multi-arch feature to allow for heterogeneous clusters.

@mikestef9 I should have been more specific: mixed managed node groups.

@ckdarby We will be launching Arm support for managed node groups as part of the GA launch. Node groups will still be a single instance type though, so you can have heterogeneous clusters with multiple node groups. Are you asking for a single managed node group with multiple instance types?

@mikestef9 Thanks for the update, excited to see GA come :)

Are you asking for a single managed node group with multiple instance types?

Nope, just multiple managed groups with some ARM and some not.

So while we are waiting for official support, at the time of writing it is possible to create a mixed-mode EKS cluster, where some node groups run the amd64 architecture and some arm64.

It's pretty straightforward to do by broadly following the AWS guide, with the following exceptions:

  • (obviously) ignore the first and last considerations
  • download and save the CoreDNS, kube-proxy and aws-node manifests from the Enable ARM support section first
  • modify the names so as not to collide with the existing amd64 deployments/daemonsets in kube-system
  • deploy the modified manifests

For example, the kube-proxy-arm64 manifest:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  labels:
    k8s-app: kube-proxy-arm64
    eks.amazonaws.com/component: kube-proxy
  name: kube-proxy-arm64
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-proxy-arm64
...

Making these changes will ensure that when the Graviton arm64 nodes join the EKS cluster, they have containers of the correct architecture deployed to them and become Ready.
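Note that for the two copies of each DaemonSet to coexist, each needs to be pinned to its own architecture. One way to do that is a node affinity on the architecture label, roughly like the following added under spec.template.spec of the arm64 copy (the existing amd64 DaemonSets would get the mirror-image affinity with values: amd64):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/arch        # beta.kubernetes.io/arch on older clusters
              operator: In
              values:
                - arm64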

Amazon EKS support for Arm-based instances is now generally available! See the launch blog and EKS documentation for more details.

Notable updates with general availability include:

  • Support for ARMv8.2 architecture (64-bit) and AWS Graviton2 based instances.
  • Multi-architecture images for the core add-ons in EKS clusters (CoreDNS, kube-proxy, VPC CNI).
  • End-to-end multi-architecture support (see below for details).
  • Mixed x86 and Arm node groups within a cluster are now supported.
  • Managed node groups support.
  • eksctl support (a minimal example follows below).
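For example, creating an Arm managed node group with eksctl now looks roughly like this (cluster and node group names are illustrative):

# Requires a GA-era eksctl release with Graviton/managed node group support.
eksctl create nodegroup \
  --cluster my-cluster \
  --name arm64-ng \
  --node-type m6g.large \
  --nodes 2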

I upgraded to EKS 1.17 to leverage the support for the ARM architecture (r6g.large instances). The instance joins the cluster, but when I run "kubectl get nodes", all instances have a clear status except the ARM instance, which shows "Unknown" status and "Unknown" name.

I'm trying to add an ARM node group to an existing EKS cluster that has a non-ARM node group.
After creating it, though, I get this error:
[screenshot of the error]

I also tried creating unmanaged node groups using eksctl and still faced the same error.
The ARM node is in the NotReady state.
