Kubespray: Gaps when porting to AArch64

Created on 29 Mar 2018 · 18 Comments · Source: kubernetes-sigs/kubespray

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
FEATURE REQUEST

Environment:

  • Cloud provider or hardware configuration:
    Hardware
  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 4.12.0-221-arm64 aarch64
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
  • Version of Ansible (ansible --version):
ansible 2.3.1.0
  config file = /xxx/ansible.cfg
  configured module search path = [u'./library']
  python version = 2.7.5 (default, Aug 25 2017, 09:08:42) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]

Kubespray version (commit) (git rev-parse --short HEAD): Latest master branch

Anything else we need to know:

I've found several gaps when porting Kubespray to AArch64.

@xd007 has submitted several PRs addressing all of the issues below.

1. Image Arch:

  • [x] All the gcr.io images used in Kubespray are explicitly pinned with an -amd64 suffix, so they can only run on x86_64. #2104
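
For illustration, the fix amounts to templating the arch suffix instead of hardcoding it. A minimal sketch, assuming a per-node image_arch variable (variable and image names here are illustrative, not necessarily the exact ones from #2104):

    # illustrative download-role default: derive the suffix per node
    image_arch: "{{ host_architecture | default('amd64') }}"
    # an image reference then becomes arch-aware instead of amd64-only
    pod_infra_image_repo: "gcr.io/google_containers/pause-{{ image_arch }}"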

2. Docker Incompatibilities

  • [x] The docker-engine package has no corresponding build on aarch64; use the docker package instead. #2103
  • [x] On AArch64, Docker's default cgroup driver is systemd instead of cgroupfs. Kubelet should be configured to use systemd as its cgroup driver as well, to keep it consistent with Docker. #2168
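
A sketch of the second item: check which driver Docker actually uses on the node, then make kubelet match it (kubelet_cgroup_driver is assumed here as the controlling variable):

    # verify on the node:  docker info 2>/dev/null | grep -i 'cgroup driver'
    # then pin kubelet to the same driver, e.g. in group_vars:
    kubelet_cgroup_driver: systemd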

3. Etcd start-up

  • [ ] Arm64 support in etcd is still experimental, so the extra environment variable ETCD_UNSUPPORTED_ARCH has to be set. #2118
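
A sketch of what #2118 needs to do, assuming etcd reads its environment from a file such as /etc/etcd.env (the task shape and path are assumptions, not the exact change):

    - name: Allow etcd to start on arm64
      lineinfile:
        path: /etc/etcd.env
        line: "ETCD_UNSUPPORTED_ARCH=arm64"
      when: ansible_architecture == "aarch64"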

/cc @mattymo @rsmitty


All 18 comments

Thank you @xd007, @dixudx! That's awesome!

A few things:
I would use set_fact to detect the architecture of each node. Instead of being manual, the detection would then be automatic per node (fewer configuration errors); see the sketch below.
I think the nodes are automatically labeled with the arch, but if not we probably should do it (hybrid clusters): built-in-node-labels
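
A minimal sketch of that set_fact suggestion, using Ansible's built-in architecture fact (illustrative, not the merged implementation):

    - name: Detect the image arch per node instead of setting it manually
      set_fact:
        image_arch: "{{ 'arm64' if ansible_architecture == 'aarch64' else 'amd64' }}"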

For the etcd PR: it uses a different way to switch between image tags. I would like it to be consistent with the other images.

Let me know if you need help or more info

> I think the nodes are automatically labeled with the arch, but if not we probably should do it (hybrid clusters): built-in-node-labels

@ant31 Those labels are applied automatically by kubelet.

> I would like it to be consistent with the other images.

Setting such an environment variable ETCD_UNSUPPORTED_ARCH for other architectures, like amd64 or ppc64le, does no harm, but it seems misleading. Do you want to do it like this?

For etcd, the checkSupportArch startup function rejects architectures other than amd64/ppc64le unless the variable is set:

    // etcd's checkSupportArch (in etcdmain), lightly reassembled for context:
    func checkSupportArch() {
        // TODO qualify arm64
        if runtime.GOARCH == "amd64" || runtime.GOARCH == "ppc64le" {
            return
        }
        // unsupported arch only configured via environment variable
        // so unset here to not parse through flag
        defer os.Unsetenv("ETCD_UNSUPPORTED_ARCH")
        if env, ok := os.LookupEnv("ETCD_UNSUPPORTED_ARCH"); ok && env == runtime.GOARCH {
            fmt.Printf("running etcd on unsupported architecture %q since ETCD_UNSUPPORTED_ARCH is set\n", env)
            return
        }
        // otherwise etcd logs an error and exits
    }

I'm not saying to drop your approach of adding the env var ETCD_UNSUPPORTED_ARCH, just to be consistent with the other images in how tags are chosen.

> ...adding the env var ETCD_UNSUPPORTED_ARCH, just to be consistent with the other images in how tags are chosen.

@ant31 Yeah, I know. Right, that would seem more native and elegant.

But currently I can't find a better way to handle this case.

@dixudx I've proposed a solution in #3140, could you review it please?
The two commits to check are https://github.com/kubernetes-incubator/kubespray/pull/3140/commits/6de0076f817d312d5305fbc18656326d3f3897a0 and 5c47d8a6cba10e90a8b57d37d20438248e641ece.

Looks like #3140 was merged. @ant31, are there other open tasks, or does this now work?

I'm just going through this now with fresh hosts, and it looks like a few more changes are required for cluster.yml to succeed:

  1. {{ image_arch }} added to the download role defaults - PR #3975
  2. Deleting ubuntu-bionic.yml in https://github.com/kubernetes-sigs/kubespray/issues/3972 - PR https://github.com/kubernetes-sigs/kubespray/pull/3974
  3. Image SHAs added for arm64 binaries, perhaps extending this pattern? Open to suggestions, happy to PR. I've just put local overrides in a .yml for now (see the sketch after this list).
  4. Bumping the Calico version to 3.2.x, where arm64 was enabled, or better yet to 3.3.x, where the Calico Quay repositories seem to offer a v3.3.x-{amd64|arm64} naming convention to suit #3140 from above. EDIT: This could also be a TODO comment for arm users to be aware of.
    EDIT v2: It seems part of #3140 was reverted due to Calico breaking changes in the 3.2 and 3.3 releases. So right now I'm trying Weave (used before, without Kubespray). Will come back to Calico.
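
For item 3, a local override could look like the following (the dictionary layout is an assumption modeled on Kubespray's checksum variables; the version key and digest are placeholders, not real values):

    # group_vars override, illustrative only
    kubeadm_checksums:
      arm64:
        v1.13.0: "<sha256 of the arm64 kubeadm binary>"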

@vielmetti and @Miouge1 are working on https://github.com/WorksOnArm/cluster/issues/127 relevant to this.

@kskewes Calico version bump has already been done to v3.4, and I've opened #4176 to handle the arm64 checksums.

As far as I tested, this is enough to get Kubespray with Calico up on arm64. Other CNI plugins will need more work, but Calico is a good start since it's the Kubespray default.

Nice work!
I'm still running Weave but can rebase and try Calico.

Good one on getting all those SHAs in there.
Changed to Calico 3.4 after a cluster reset.
All running great, with MetalLB (deployed separately) advertising LoadBalancer routes via BGP.

I did further tests around arm64 support:

  • ran into a problem with a mixed x86 and arm64 cluster, see PR #4253 (and the scheduling sketch below)
  • AFAIK flannel has no Docker image for arm64
  • Tests of Weave on arm64 look promising
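
For mixed clusters, one stopgap is to pin workloads that only ship amd64 images onto matching nodes using the arch label kubelet applies automatically (mentioned earlier in this thread); the label key below is the beta one from this Kubernetes era:

    # Pod spec fragment, illustrative: schedule only onto amd64 nodes
    nodeSelector:
      beta.kubernetes.io/arch: amd64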

Here is a related Flannel issue for arm64: https://github.com/coreos/flannel/issues/663

And a request to address the images for Flannel: https://github.com/coreos/flannel-cni/issues/10, with this PR: https://github.com/coreos/flannel-cni/pull/13

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

It appears that we're still stuck on this with Flannel.

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
