Kubespray: Track the gaps when porting to ARM (arm7l)

Created on 22 Feb 2019 · 28 Comments · Source: kubernetes-sigs/kubespray

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
FEATURE REQUEST

Environment:

  • Cloud provider or hardware configuration:
    Hardware
  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 4.14.78-150 armv7l
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Version of Ansible (ansible --version):
    ansible 2.7.2
    python version = 2.7.10 (default, Oct 6 2017, 22:29:07) [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)]

Kubespray version (commit) (git rev-parse --short HEAD): Latest master branch

Anything else we need to know:

At the moment it is impossible to install Kubespray on pure arm7l hardware.
Most of the containers do not provide binaries/images for non-amd64 architectures.

The aim of this ticket is to make it possible to use ARM hardware as a pool of worker nodes alongside amd64 masters/nodes.
I've found several gaps when trying to install Kubespray on arm7l devices:

  1. Checksums aren't available for arm, only for amd64 and arm64
  2. [x] add checksums for the main components - hyperkube, kubeadm and cni_binary #4261
  3. Add NodeSelector to some manifests
  4. [ ] find all the places where only amd64 containers are available (tiller, dashboard, dnsautoscaler)
  5. Overlay network support: provide per-architecture daemonsets gated by NodeSelector
  6. [ ] flannel: could be deployed to all architectures (arm, arm64, amd64, etc.)
  7. [ ] calico: could be run on amd64 and arm64
  8. Etcd: etcd does not provide binaries for arm32. Until there is one, ARM nodes can't act as master nodes.
  9. [ ] build etcd for arm7l and create a container
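
Items 3 and 5 above come down to an arch-based nodeSelector on each per-architecture manifest. A minimal sketch of what that gating looks like (the DaemonSet name, labels, and image tag are illustrative, not Kubespray's actual manifests; at the time the arch label was beta.kubernetes.io/arch):

```yaml
# Illustrative only: one DaemonSet per architecture, gated by nodeSelector.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: flannel-amd64          # hypothetical name; a sibling flannel-arm would exist too
spec:
  selector:
    matchLabels:
      app: flannel
      arch: amd64
  template:
    metadata:
      labels:
        app: flannel
        arch: amd64
    spec:
      nodeSelector:
        beta.kubernetes.io/arch: amd64        # schedules only onto amd64 nodes
      containers:
      - name: flannel
        image: quay.io/coreos/flannel:v0.11.0-amd64   # arch-specific image tag
```

The same pattern would gate amd64-only components such as tiller or the dashboard so they never land on ARM nodes.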

related issues: #4261 #4065

help wanted kind/feature lifecycle/rotten

All 28 comments

Also the default pause container is invalid (( plus the image name has an 'n' in it for arm7l, which breaks the systemd unit ))

pod_infra_image_repo: k8s.gcr.io/pause
pod_infra_image_tag: "3.1"

works for me

what are the production use cases for running workloads on arm7?
It's already too complex to support both amd64 and arm64; I would limit the options to real production use cases.

IoT on BBB devices is my use case, for example

(( BeagleBone Black ))

what kind of load are you running on those machines? why kubernetes?

A quite simple load: reading from serial UARTs and sending the data to a message queue.

Why kubernetes?

Because I'm familiar with it, and it handles health checks, deployments, and the upgrade process nicely. It also manages secrets and monitoring well. With other solutions, such as consul + ansible + docker, I'd have to build my own verification that a deployment completed successfully, plus gradual rollout. Maybe I could also use Spinnaker or something like that, though I'm not familiar with it, and I already run k8s for the rest of the infrastructure.

The k8s downside on edge devices is CPU usage: it's around 15% CPU time just on kubelet, even after tuning various housekeeping/node-status frequency parameters. Most of the time is spent in syscalls, the runtime, and some JSON decoding.
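
For reference, the tuning mentioned above was along these lines (flag names per the kubelet of that era; the exact values here are illustrative, and some flags have since been deprecated or moved into the kubelet config file, so verify against your version):

```shell
# Illustrative kubelet flags to reduce housekeeping/status traffic on edge
# nodes; both intervals defaulted to 10s.
kubelet \
  --node-status-update-frequency=30s \
  --housekeeping-interval=30s \
  ...
```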

EDIT: This is an AM335x 1 GHz ARM® Cortex-A8, and I see a 30% branch misprediction rate system-wide (and a similar rate for kubelet)... so yeah, not the best processor in the world, nor the most powerful.

According to the CoreOS docs:

etcd has known issues on 32-bit systems due to a bug in the Go runtime

etcd-io doesn't provide any 32-bit ARM binaries because of a Go language issue. Otherwise we could have downloaded the tarballs and checksums from coreos/etcd (https://github.com/coreos/etcd/releases/download/).
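
Adding arm checksums to Kubespray (gap 1 above) amounts to recording the SHA-256 of each release tarball and verifying it at download time. A minimal sketch of the verification step (the function name is mine, not Kubespray's; Kubespray pins one such value per component and architecture, which is what #4261 added):

```shell
#!/bin/sh
# Sketch: check a downloaded artifact against a pinned SHA-256 checksum,
# the way Kubespray pins checksums per (component, arch) pair.
verify_sha256() {
  file="$1"
  expected="$2"
  actual=$(sha256sum "$file" | awk '{print $1}')   # first field is the digest
  [ "$actual" = "$expected" ]
}
```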

/kind feature
/help

@Miouge1:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/kind feature
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I ran into this issue today. I have an ESXi home lab with a limited amount of RAM, so I thought it would be nice to run masters/etcd on ARM and minions on VMs. etcd seems to be the only blocker.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@rich-nahra well etcd works on arm64, but not on arm32. I've seen lots of people running k3s on arm32 (Raspberry Pi), which uses sqlite instead of etcd, so I guess that's a possible workaround for test labs.

@lwolf and @nmiculinic I would consider "etcd for arm32" out of scope of Kubespray, is there anything else we can do to make life easier for Kubespray on ARM 32bits?

@Miouge1 I agree that etcd is probably out of scope. I recently migrated my arm32 cluster to k3s.

I need to check if it's still relevant, but last time I checked this one was still an issue: 2. Add NodeSelector to some manifests.
It's about gating deployments to specific node types when a container exists only for a specific arch, like gating helm/tiller to amd64 only.

There's no calico release for arm (32-bit) either; is there a workaround for that too? (I have a separate 64-bit etcd node to get around the lack of a 32-bit etcd.)

The only CNI that works on arm32 at the moment is flannel.
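
In Kubespray terms, that means selecting flannel as the network plugin in the cluster group vars (the variable is Kubespray's kube_network_plugin; the inventory path below is the usual layout, adjust to yours):

```yaml
# inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml
kube_network_plugin: flannel
```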


Well, arm is supported; there are just no builds in the official repo, and it's flagged as unstable:

https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/supported-platform.md

For example Ubuntu has it in the official repo:

https://packages.ubuntu.com/search?keywords=etcd

But it seems to require an experimental flag to be set. It's unstable due to a bug in Go that has existed for 9 years now. Funnily enough, it recently picked up traction and a prototype was implemented to fix this bug for 32-bit systems:

https://github.com/golang/go/issues/599

That said, we're only talking about armhf here, i.e. armv7+ with a hardware floating-point unit; other architectures are likely not possible (?)

That includes the Raspberry Pi 3 Pre-B or the Odroid XU4 platform (e.g. Odroid HC1)

Anyway, it runs for me on Ubuntu Bionic Beaver 18.04 on the Odroid HC1, with the mentioned flag set in the systemd service:

doh@node1:~$ etcd
2020-01-26 00:35:41.336453 W | etcdmain: running etcd on unsupported architecture "arm" since ETCD_UNSUPPORTED_ARCH is set
2020-01-26 00:35:41.338147 W | pkg/flags: unrecognized environment variable ETCD_UNSUPPORTED_ARCH=arm
2020-01-26 00:35:41.338300 I | etcdmain: etcd Version: 3.2.17
2020-01-26 00:35:41.338387 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
2020-01-26 00:35:41.338470 I | etcdmain: Go Version: go1.10
2020-01-26 00:35:41.338552 I | etcdmain: Go OS/Arch: linux/arm
2020-01-26 00:35:41.338636 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2020-01-26 00:35:41.338740 W | etcdmain: no data-dir provided, using default data-dir ./default.etcd
2020-01-26 00:35:41.341755 C | etcdmain: listen tcp 127.0.0.1:2380: bind: address already in use
doh@node1:~$
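
The flag in question is the ETCD_UNSUPPORTED_ARCH environment variable; with systemd it can be set via a drop-in like this (the drop-in path is illustrative):

```ini
# /etc/systemd/system/etcd.service.d/override.conf
[Service]
Environment=ETCD_UNSUPPORTED_ARCH=arm
```

Note the "unrecognized environment variable" warning in the output above: etcd warns about the variable but still honors it and starts on arm.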

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close


/reopen

@hadrien-toma: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen


/reopen

@lwolf: Reopened this issue.

In response to this:

/reopen


Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

