Kind: ARM64 CI

Created on 19 Dec 2018 · 58 comments · Source: kubernetes-sigs/kind

Per discussion in the #kind Slack channel, we should set up some CI with OpenLab to get kind working on arm64. xref #166

@dims was able to get arm64 working, but we'll need to set up CI to keep it working once that goes in, as the maintainers do not otherwise have access to suitable ARM machines to test on.

/assign
/kind feature
/priority important-longterm


All 58 comments

@lubinsz could you help with this?

Note that we will need to fix #166 first; however, that is very doable. Dims previously made a quick patch that worked, but we haven't PRed anything yet.

@BenTheElder @dixudx
I see.
At the least, this involves a multi-arch image issue.
Let me apply for internal legal approval for this project first ...

@lubinsz see my previous patch in https://github.com/kubernetes-sigs/kind/issues/166#issuecomment-448766016

https://github.com/WorksOnArm/cluster/issues/154 gives us access to Packet hardware.
How would we like it configured?

  • Running k8s?
    This would allow us to designate this cluster to run kind jobs and fit into sig-testing's usual approach to testing.

  • Running Docker only?
    We would need to SSH in, run kind + tests, then clean up.

/cc @devaii

I think running Docker / SSH only is the most well-understood path currently; we can treat these machines similarly to a node or cAdvisor e2e job and put credentials in Prow to access them.

Long term it might be interesting to be able to run prowjobs on these machines directly, but that would require more work to maintain the cluster, and we would need to figure out how to handle distributing other credentials.
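
For concreteness, here's a minimal sketch of what that SSH-driven flow might look like, assuming a hypothetical arm64 host named `arm64-worker` with Docker and the kind binary already installed (the host name and test step are placeholders, not the actual job config):

```sh
#!/usr/bin/env bash
# Hypothetical SSH-driven CI flow: create a kind cluster on a remote arm64 box,
# run tests, then clean up. Host name and credentials are placeholders.
set -euo pipefail

HOST="arm64-worker"   # placeholder; real credentials would come from Prow secrets

ssh "${HOST}" bash -s <<'EOF'
set -euo pipefail
kind create cluster --name ci-arm64
# ... run e2e / smoke tests against the cluster here ...
kind delete cluster --name ci-arm64
EOF
```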

When trying to run kind build, we note that docker-ce and friends are not available directly from the same repos:

E: Version '18.06.*' for 'docker-ce' was not found
The command '/bin/sh -c curl -fsSL "https://download.docker.com/linux/$(. /etc/os-release; echo "$ID")/gpg" | apt-key add -     && apt-key fingerprint 0EBFCD88     && ARCH="${ARCH}" add-apt-repository         "deb [arch=${ARCH}] https://download.docker.com/linux/$(. /etc/os-release; echo "$ID") $(lsb_release -cs) stable"     && clean-install "docker-ce=${DOCKER_VERSION}"' returned a non-zero code: 100
ERRO[22:13:41] Docker build Failed! exit status 100         
Error: build failed: exit status 100
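
For reference, the failing step is the Docker apt repository setup; a likely fix (my assumption, not the actual patch) is to derive the repo architecture from the host instead of a hard-coded amd64, roughly:

```sh
# Sketch: use the host architecture for the Docker apt repo instead of hard-coding amd64.
ARCH="$(dpkg --print-architecture)"   # prints "arm64" on this machine
add-apt-repository \
  "deb [arch=${ARCH}] https://download.docker.com/linux/$(. /etc/os-release; echo "$ID") $(lsb_release -cs) stable"
apt-get update
apt-get install -y 'docker-ce=18.06.*'   # same version pin as the Dockerfile above
```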

Probably need some debugging:

root@kind:~# kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.13.3) 🖼 
ERRO[22:14:59] machine-id-setup error: exit status 1        
 ✗ [control-plane] Creating node container 📦 
Error: failed to create cluster: machine-id-setup error: exit status 1

Version check:

root@kind:~# docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.1
 Git commit:        e68fc7a
 Built:             Fri Jan 25 14:35:17 2019
 OS/Arch:           linux/arm64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.1
  Git commit:       e68fc7a
  Built:            Thu Jan 24 10:49:48 2019
  OS/Arch:          linux/arm64
  Experimental:     false

It looks like the issue is due to the ARCH variable in the base image being hard-coded to amd64.

See: https://github.com/kubernetes-sigs/kind/blob/master/images/base/Dockerfile#L29
It looks like the ARCH variable is used later for the CNI plugin tarball, as well.

I am working on a patch.

Yeah, there are a bunch of places marked TODO for handling this because I wasn't sure where / how to plumb it through. I think using runtime.GOARCH should be fine. Dims's previous patch is here: https://github.com/kubernetes-sigs/kind/issues/188#issuecomment-453514226
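
As a rough shell-side illustration of that plumbing (runtime.GOARCH would be the Go equivalent; this is a sketch, not the actual patch), the build could map the machine type to the Docker/Go arch name once and reuse it:

```sh
# Sketch: derive the target arch name once and reuse it for the Docker repo
# and the CNI plugins tarball instead of assuming amd64.
case "$(uname -m)" in
  x86_64)        ARCH=amd64 ;;
  aarch64|arm64) ARCH=arm64 ;;
  ppc64le)       ARCH=ppc64le ;;
  *) echo "unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac
echo "building base image for ${ARCH}"
```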

@BenTheElder unfortunately the paste with the patch has expired

we still need CI, https://github.com/kubernetes-sigs/kind/pull/358 works well!

Thanks to @ZhengZhenyu and other awesome folks at OpenLab (https://github.com/theopenlab), we now have a functional kind-on-ARM CI!!!

Please see:
http://status.openlabtesting.org/builds?job_name=kind-integration-test-arm64

Now that the jobs have been running successfully for a few days, we would like to know how the kind community would like to handle the testgrid reporting. I believe there are two options:

  • push to an OpenLab-owned GCS bucket
  • push to a kind-owned GCS bucket

I believe the second option is possible but would require setting up a user/auth account for OpenLab; the first would be for OpenLab to resolve, either by using the existing bucket we use for cloud-provider-openstack or by setting up a new one.

I could be wrong but we are ready to get the reporting to the proper place so the community can work as expected on any issues surfaced.

cc @BenTheElder - please see the question from Melvin ^^

Either works! See also https://github.com/kubernetes/test-infra/tree/master/testgrid/conformance; we can set up a GCS bucket if we don't want to use any existing ones.

[sorry for the huge delay, this slipped through my inbox :(]

No problem @BenTheElder, totally understand. We'll create a new one just to keep things separated, since that is possible.

Re-reading through this...
I created gs://k8s-conformance-kind-arm64-openlab in case we need it. @mrhillsman @dims shoot me an email if we do and I'll coordinate the service account credentials there :sweat_smile:
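
For what it's worth, the upload side is usually just a service-account activation plus a gsutil copy; a rough sketch, assuming a key file handed to the job via an environment variable (`GOOGLE_CREDENTIALS` here is illustrative) and the prow-style logs/<job>/<build> layout, which should be double-checked against the testgrid conformance docs linked above:

```sh
# Sketch: authenticate with the shared service account and push results to the bucket.
gcloud auth activate-service-account --key-file="${GOOGLE_CREDENTIALS}"  # key file path is job-specific
BUILD_ID="$(date +%s)"
GCS_DIR="gs://k8s-conformance-kind-arm64-openlab/logs/kind-integration-test-arm64/${BUILD_ID}"
gsutil cp _artifacts/junit*.xml "${GCS_DIR}/artifacts/"
```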

ack @BenTheElder
/cc @dims

Hi @mrhillsman, @dims and @BenTheElder, I added an issue on the OpenLab side to track this job: https://github.com/theopenlab/openlab/issues/257

Hi @BenTheElder, we have now finished adding the GCP account, but our job fails to set up a k8s cluster with master kind and master K8S. It was previously fine, but it seems to have been failing to build for a few weeks. I tried to debug it and solved some problems, but it still won't come up; could you help take a look? I uploaded the full logs to our OpenLab log server:
https://logs.openlabtesting.org/logs/periodic-6/18/github.com/kubernetes-sigs/kind/master/kind-integration-test-arm64/728b70b/
and here is the script I used for the job:
https://github.com/theopenlab/openlab-zuul-jobs/blob/master/playbooks/kind-integration-test-arm64/run.yaml

Thanks a lot

seems a problem with the internal images and containerd 🤔

Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.335923     440 remote_image.go:113] PullImage "k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284" from image service failed: rpc error: code = Unknown desc = failed to resolve image "k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284": no available registry endpoint: k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284 not found
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.336147     440 kuberuntime_image.go:51] Pull image "k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284" failed: rpc error: code = Unknown desc = failed to resolve image "k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284": no available registry endpoint: k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284 not found
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.336603     440 kuberuntime_manager.go:775] container start failed: ErrImagePull: rpc error: code = Unknown desc = failed to resolve image "k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284": no available registry endpoint: k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284 not found
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.336788     440 pod_workers.go:190] Error syncing pod f09c4fe814efa82a9c97906695a40f30 ("kube-scheduler-kind-kubetest-control-plane_kube-system(f09c4fe814efa82a9c97906695a40f30)"), skipping: failed to "StartContainer" for "kube-scheduler" with ErrImagePull: "rpc error: code = Unknown desc = failed to resolve image \"k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284\": no available registry endpoint: k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284 not found"
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.408287     440 kubelet.go:2248] node "kind-kubetest-control-plane" not found
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.429569     440 reflector.go:125] pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://172.17.0.5:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dkind-kubetest-control-plane&limit=500&resourceVersion=0: dial tcp 172.17.0.5:6443: connect: connection refused
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.508929     440 kubelet.go:2248] node "kind-kubetest-control-plane" not found
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.609433     440 kubelet.go:2248] node "kind-kubetest-control-plane" not found
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.629499     440 reflector.go:125] pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://172.17.0.5:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.17.0.5:6443: connect: connection refused
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.711873     440 kubelet.go:2248] node "kind-kubetest-control-plane" not found
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.812356     440 kubelet.go:2248] node "kind-kubetest-control-plane" not found
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.828919     440 reflector.go:125] pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://172.17.0.5:6443/api/v1/nodes?fieldSelector=metadata.name%3Dkind-kubetest-control-plane&limit=500&resourceVersion=0: dial tcp 172.17.0.5:6443: connect: connection refused
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.887205     440 remote_runtime.go:200] CreateContainer in sandbox "a4e678b343c741029425c2141082e3d71ec40606af2c8f8c2d191c3aa1bdaaaa" from runtime service failed: rpc error: code = Unknown desc = failed to create containerd container: error unpacking image: failed to resolve rootfs: content digest sha256:a6e2bdd03a1214512221080f1f8f4aaf183a8b98d4a391d222aa50f34fbc20e3: not found
Jun 06 06:47:03 kind-kubetest-control-plane kubelet[440]: E0606 06:47:03.887459     440 kuberuntime_manager.go:775] container start failed: CreateContainerError: failed to create containerd container: error unpacking image: failed to resolve rootfs: content digest sha256:a6e2bdd03a1214512221080f1f8f4aaf183a8b98d4a391d222aa50f34fbc20e3: not found

https://logs.openlabtesting.org/logs/periodic-6/18/github.com/kubernetes-sigs/kind/master/kind-integration-test-arm64/728b70b/logs/kubernetes/kind-kubetest-control-plane/containerd.log

I think that the node build is not creating some of the images; in my environment:

kubeadm config images list
k8s.gcr.io/kube-apiserver:v1.14.2
k8s.gcr.io/kube-controller-manager:v1.14.2
k8s.gcr.io/kube-scheduler:v1.14.2
k8s.gcr.io/kube-proxy:v1.14.2
k8s.gcr.io/pause:3.1
k8s.gcr.io/etcd:3.3.10
k8s.gcr.io/coredns:1.3.1

however, I can't see the kube-* images in the logs; maybe I'm missing something? (A quick way to double-check what ended up inside the node is sketched after the log below.)

time="06:32:55" level=debug msg="Running: /usr/bin/docker [docker exec --privileged kind-build-811a5dd6-2d89-48b2-80d3-012d80828821 kubeadm config images list --kubernetes-version v1.16.0-alpha.0.868+ef7808fec57284]"
Pulling: k8s.gcr.io/pause:3.1
time="06:32:55" level=info msg="Pulling image: k8s.gcr.io/pause:3.1 ..."
time="06:32:55" level=debug msg="Running: /usr/bin/docker [docker pull k8s.gcr.io/pause:3.1]"
time="06:33:27" level=debug msg="Running: /usr/bin/docker [docker save -o /tmp/kind-node-image717940973/bits/images/4.tar k8s.gcr.io/pause:3.1]"
Pulling: k8s.gcr.io/etcd:3.3.10
time="06:33:27" level=info msg="Pulling image: k8s.gcr.io/etcd:3.3.10 ..."
time="06:33:27" level=debug msg="Running: /usr/bin/docker [docker pull k8s.gcr.io/etcd:3.3.10]"
time="06:33:49" level=debug msg="Running: /usr/bin/docker [docker save -o /tmp/kind-node-image717940973/bits/images/5.tar k8s.gcr.io/etcd:3.3.10]"
Pulling: k8s.gcr.io/coredns:1.3.1
time="06:34:24" level=info msg="Pulling image: k8s.gcr.io/coredns:1.3.1 ..."
time="06:34:24" level=debug msg="Running: /usr/bin/docker [docker pull k8s.gcr.io/coredns:1.3.1]"
time="06:34:28" level=debug msg="Running: /usr/bin/docker [docker save -o /tmp/kind-node-image717940973/bits/images/6.tar k8s.gcr.io/coredns:1.3.1]"
Pulling: kindest/kindnetd:0.1.0
time="06:34:47" level=info msg="Pulling image: kindest/kindnetd:0.1.0 ..."
time="06:34:47" level=debug msg="Running: /usr/bin/docker [docker pull kindest/kindnetd:0.1.0]"
time="06:34:52" level=debug msg="Running: /usr/bin/docker [docker save -o /tmp/kind-node-image717940973/bits/images/7.tar kindest/kindnetd:0.1.0]"
Pulling: k8s.gcr.io/ip-masq-agent:v2.4.1
time="06:34:53" level=info msg="Pulling image: k8s.gcr.io/ip-masq-agent:v2.4.1 ..."
time="06:34:53" level=debug msg="Running: /usr/bin/docker [docker pull k8s.gcr.io/ip-masq-agent:v2.4.1]"
time="06:35:01" level=debug msg="Running: /usr/bin/docker [docker save -o /tmp/kind-node-image717940973/bits/images/8.tar k8s.gcr.io/ip-masq-agent:v2.4.1]"
time="06:35:14" level=debug msg="Running: /usr/bin/docker [docker exec --privileged kind-build-811a5dd6-2d89-48b2-80d3-012d80828821 mkdir -p /kind/images]"
time="06:35:14" level=debug msg="Running: /usr/bin/docker [docker exec --privileged kind-build-811a5dd6-2d89-48b2-80d3-012d80828821 mv /build/bits/images/4.tar /build/bits/images/5.tar /build/bits/images/6.tar /build/bits/images/7.tar /build/bits/images/8.tar /kind/images]"
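
One quick way to double-check what actually ended up inside the node is to list containerd's images from the running node container; a sketch (the container name matches the failing control-plane node in the logs above, and k8s.io is the namespace CRI uses):

```sh
# Sketch: list the images containerd has loaded inside the kind node container.
docker exec kind-kubetest-control-plane ctr --namespace k8s.io images ls | grep kube-
```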

@ZhengZhenyu it seems the last job was successful https://logs.openlabtesting.org/builds?project=kubernetes-sigs/kind, am I looking in the right place?

@aojea Thanks a lot for the reply. I guess there could be some misconfiguration in the job workflow so that it reports success while no test is actually running; I will check it later. But yes, the K8S cluster has not come up for a few weeks. I've tried a lot of combinations of kind and k8s and they didn't work. We used to have a successfully running test workflow for about a month, but we recently changed our OpenLab deployment, so I cannot provide the previous run logs.

@aojea As you can see from the test scripts, I copied all files from the _artifacts folder; I guess the kube-* logs should be in there if they were generated?

And as seen in:
https://logs.openlabtesting.org/logs/periodic-6/18/github.com/kubernetes-sigs/kind/master/kind-integration-test-arm64/728b70b/job-output.txt.gz

2019-06-06 06:36:33.026854 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:33.024811073Z" level=info msg="ImageCreate event &ImageCreate{Name:sha256:9730ba13d51885429e8a0544daef7ebb527fbaa8387b8b6f8458029f5ebac9de,Labels:map[string]string{io.cri-containerd.image: managed,},}"
2019-06-06 06:36:33.028068 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:33.026746014Z" level=info msg="ImageUpdate event &ImageUpdate{Name:docker.io/kindest/kindnetd:0.1.0,Labels:map[string]string{io.cri-containerd.image: managed,},}"
2019-06-06 06:36:35.791139 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:35.789568652Z" level=info msg="ImageCreate event &ImageCreate{Name:k8s.gcr.io/coredns:1.3.1,Labels:map[string]string{},}"
2019-06-06 06:36:38.405538 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:38.403412019Z" level=info msg="ImageCreate event &ImageCreate{Name:sha256:7e8edeee9a1e73cdd4a1209eaa12aee15933456c7b6c0eb7d6758c8e1a078d0a,Labels:map[string]string{io.cri-containerd.image: managed,},}"
2019-06-06 06:36:38.408607 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:38.407053352Z" level=info msg="ImageCreate event &ImageCreate{Name:k8s.gcr.io/ip-masq-agent:v2.4.1,Labels:map[string]string{},}"
2019-06-06 06:36:39.297464 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:39.295437592Z" level=error msg="(*service).Write failed" error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-repositories locked: unavailable" ref=tar-repositories total=137
2019-06-06 06:36:39.436158 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:39.434494073Z" level=info msg="ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/coredns:1.3.1,Labels:map[string]string{io.cri-containerd.image: managed,},}"
2019-06-06 06:36:39.438426 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:39.436674882Z" level=info msg="ImageCreate event &ImageCreate{Name:sha256:62e7d8e75a3fe2e9097d3c9fde8a5d22593b60db56e2e7584a5000c00f1815a7,Labels:map[string]string{io.cri-containerd.image: managed,},}"
2019-06-06 06:36:39.440870 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:39.439220695Z" level=info msg="ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284,Labels:map[string]string{},}"
2019-06-06 06:36:39.910683 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:39.909068338Z" level=info msg="ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/ip-masq-agent:v2.4.1,Labels:map[string]string{io.cri-containerd.image: managed,},}"
2019-06-06 06:36:39.912988 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:39.911339351Z" level=info msg="ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-controller-manager:v1.16.0-alpha.0.868_ef7808fec57284,Labels:map[string]string{},}"
2019-06-06 06:36:39.989654 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:39.987856531Z" level=info msg="ImageCreate event &ImageCreate{Name:sha256:e13edde249eb7dea3d2718a3d5b1209580a11351c9e3bbeacda4dcb8210b52a9,Labels:map[string]string{io.cri-containerd.image: managed,},}"
2019-06-06 06:36:39.991741 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:39.989892306Z" level=info msg="ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.868_ef7808fec57284,Labels:map[string]string{io.cri-containerd.image: managed,},}"
2019-06-06 06:36:39.993624 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:39.991824576Z" level=info msg="ImageCreate event &ImageCreate{Name:sha256:a6e2bdd03a1214512221080f1f8f4aaf183a8b98d4a391d222aa50f34fbc20e3,Labels:map[string]string{io.cri-containerd.image: managed,},}"
2019-06-06 06:36:39.995324 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:39.993830849Z" level=info msg="ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-proxy:v1.16.0-alpha.0.868_ef7808fec57284,Labels:map[string]string{},}"
2019-06-06 06:36:40.033304 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:40.032145947Z" level=info msg="Stop CRI service"
2019-06-06 06:36:40.033913 | ubuntu-xenial-arm64 | time="2019-06-06T06:36:40.032630064Z" level=info msg="Stop CRI service"

It seems some of the kube-* images are missing, and kube-proxy and kube-scheduler don't show as managed in the log.

@ZhengZhenyu how can I check if this works? https://github.com/theopenlab/openlab-zuul-jobs/pull/550
I think the best thing moving forward is to mimic the prow jobs.

FWIW, the snippet in https://github.com/kubernetes-sigs/kind/issues/188#issuecomment-499420602 looks like build output, which will have some scary-looking errors that can currently be ignored; if kind build node-image exits 0, it should be fine.

@BenTheElder yeah, it seems the build operation is OK, but if we only test the build, how are we supposed to add results to testgrid?

https://github.com/kubernetes/test-infra/pull/13273/ has been merged so the CI itself is done

@aojea Sorry for the delay, checking

@aojea the error seems to be:

INFO: Call stack for the definition of repository 'containerregistry' which is a http_archive (rule definition at /root/.cache/bazel/_bazel_root/e96b51b7d54f19bc30c74f61e708a712/external/bazel_tools/tools/build_defs/repo/http.bzl:237:16):
2019-08-05 18:15:47.172462 | ubuntu-xenial-arm64 | - /root/.cache/bazel/_bazel_root/e96b51b7d54f19bc30c74f61e708a712/external/io_bazel_rules_docker/repositories/repositories.bzl:78:9
2019-08-05 18:15:47.172814 | ubuntu-xenial-arm64 | - /home/zuul/src/k8s.io/kubernetes/WORKSPACE:48:1
2019-08-05 18:15:47.195743 | ubuntu-xenial-arm64 | ERROR: An error occurred during the fetch of repository 'containerregistry':
2019-08-05 18:15:47.197022 | ubuntu-xenial-arm64 | java.io.IOException: Error downloading [https://github.com/google/containerregistry/archive/v0.0.34.tar.gz] to /root/.cache/bazel/_bazel_root/e96b51b7d54f19bc30c74f61e708a712/external/containerregistry/v0.0.34.tar.gz: connect timed out
2019-08-05 18:15:47.251361 | ubuntu-xenial-arm64 | ERROR: /home/zuul/src/k8s.io/kubernetes/build/BUILD:63:2: //build:kube-scheduler-internal depends on @containerregistry//:digester in repository @containerregistry which failed to fetch. no such package '@containerregistry//': java.io.IOException: Error downloading [https://github.com/google/containerregistry/archive/v0.0.34.tar.gz] to /root/.cache/bazel/_bazel_root/e96b51b7d54f19bc30c74f61e708a712/external/containerregistry/v0.0.34.tar.gz: connect timed out
2019-08-05 18:15:47.330461 | ubuntu-xenial-arm64 | ERROR: Analysis of target '//build:docker-artifacts' failed; build aborted: no such package '@containerregistry//': java.io.IOException: Error downloading [https://github.com/google/containerregistry/archive/v0.0.34.tar.gz] to /root/.cache/bazel/_bazel_root/e96b51b7d54f19bc30c74f61e708a712/external/containerregistry/v0.0.34.tar.gz: connect timed out

Do you think this is related to the k8s branch, or is it related to bazel itself?

:thinking: I can download this file https://github.com/google/containerregistry/archive/v0.0.34.tar.gz, however the message says the connection timed out

@ZhengZhenyu the logs are showing a lot of errors related to bazel, and I know that upstream has a hard time every time bazel releases a new version. https://logs.openlabtesting.org/logs/periodic-6/18/github.com/kubernetes-sigs/kind/master/kind-integration-test-arm64/d1a63fc/job-output.txt.gz

2019-08-06 06:19:38.429994 | ubuntu-xenial-arm64 | sequence element must be a string (got 'bool'). See https://github.com/bazelbuild/bazel/issues/7802 for information about --incompatible_string_join_requires_strings.

2019-08-06 06:19:39.889926 | ubuntu-xenial-arm64 | Incompatible flag --incompatible_require_ctx_in_configure_features has been flipped, and the mandatory parameter 'ctx' of cc_common.configure_features is missing. Please add 'ctx' as a named parameter. See https://github.com/bazelbuild/bazel/issues/7793 for details.

@ZhengZhenyu I can't check which bazel version is used for the jobs; is it pinned to a specific version?
@BenTheElder you are the bazel expert, what's the recommended version of bazel?
Is there a relation between bazel releases and kubernetes releases?

@aojea yes, the bazel version is now pinned to 0.28.1; it used to be pinned to 0.23, and we upgraded it a few weeks ago because the envoy project required it. Could that be the problem?

Yeah, definitely. Can we try another version only for that job? I guess it has to be <= 0.26.

@aojea I will have to discuss with the team whether we can have a dedicated worker for kind.

@ZhengZhenyu we can install our own version of bazel for the kind job https://docs.bazel.build/versions/master/install-ubuntu.html#install-with-installer-ubuntu , no need for a dedicated worker.

The --user flag installs Bazel to the $HOME/bin directory on your system and sets the .bazelrc path to $HOME/.bazelrc. Use the --help command to see additional installation options.

EDIT

Er, it seems we have to compile it from source; anyway, the point is we can use our own version for this job :sweat_smile:. What do you think?
https://docs.bazel.build/versions/master/install-compile-source.html#compiling-from-source
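
Building bazel from source on arm64 roughly follows that doc's bootstrap flow; a sketch (the version and JDK flag are assumptions and may need adjusting for the host):

```sh
# Sketch: bootstrap bazel from the release dist archive on an arm64 host.
BAZEL_VERSION="0.26.1"   # illustrative; use whatever version the k8s build expects
curl -fsSLO "https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-dist.zip"
mkdir bazel-src && unzip -q "bazel-${BAZEL_VERSION}-dist.zip" -d bazel-src && cd bazel-src
env EXTRA_BAZEL_ARGS="--host_javabase=@local_jdk//:jdk" bash ./compile.sh
sudo cp output/bazel /usr/local/bin/bazel   # overwrite any preinstalled newer version
```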

@aojea yes, there is no arm64 release, so we have to build from source, which takes hours; we pre-built it and put it in the worker image. I've checked with my team and the old version may still be there, so I could simply copy it onto the PATH to overwrite the newer version as the first step in our kind job.

@aojea Oops, checking on the server, the old version got deleted, so I have to rebuild it. This might take a few hours, so I guess it cannot be done during my office hours today.

@ZhengZhenyu Please check if this version works for you; it is the version I'm using in my local tests: https://www.dropbox.com/s/0gygwih4256974k/bazel-0.26.1-aarch64?dl=0

Build label: 0.26.1- (@non-git)
Build target: bazel-out/aarch64-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar

```sh
$ file /users/aojea/bazel/output/bazel
/users/aojea/bazel/output/bazel: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-, for GNU/Linux 3.7.0, BuildID[sha1]=a424820c1608c335533704ac6cecfef062005a34, not stripped
```

```sh
$ md5sum /users/aojea/bazel/output/bazel
73050b4be346f4086ba6f05b5a0c62a1  /users/aojea/bazel/output/bazel
```

@aojea Hmm, are you sure? 0.26 does not seem OK:

+ kind build node-image --base-image kindest/base:latest --type=bazel --kube-root=/home/zuul/src/k8s.io/kubernetes
2019-08-07 03:04:48.421646 | ubuntu-xenial-arm64 | Starting local Bazel server and connecting to it...
2019-08-07 03:04:58.516265 | ubuntu-xenial-arm64 | Loading:
2019-08-07 03:04:58.527968 | ubuntu-xenial-arm64 | Loading: 0 packages loaded
2019-08-07 03:04:59.535464 | ubuntu-xenial-arm64 | Loading: 0 packages loaded
2019-08-07 03:05:00.545164 | ubuntu-xenial-arm64 | Loading: 0 packages loaded
2019-08-07 03:05:02.538017 | ubuntu-xenial-arm64 | Loading: 0 packages loaded
2019-08-07 03:05:03.539170 | ubuntu-xenial-arm64 | Loading: 0 packages loaded
2019-08-07 03:05:04.539531 | ubuntu-xenial-arm64 | Loading: 0 packages loaded
2019-08-07 03:05:05.579609 | ubuntu-xenial-arm64 | Loading: 0 packages loaded
2019-08-07 03:05:06.849602 | ubuntu-xenial-arm64 | Loading: 0 packages loaded
2019-08-07 03:05:06.850068 | ubuntu-xenial-arm64 | currently loading: build ... (4 packages)
2019-08-07 03:05:07.866957 | ubuntu-xenial-arm64 | Analyzing: 4 targets (4 packages loaded, 0 targets configured)
2019-08-07 03:05:09.435783 | ubuntu-xenial-arm64 | Analyzing: 4 targets (17 packages loaded, 31 targets configured)
2019-08-07 03:05:11.236828 | ubuntu-xenial-arm64 | Analyzing: 4 targets (18 packages loaded, 31 targets configured)
2019-08-07 03:05:13.445792 | ubuntu-xenial-arm64 | Analyzing: 4 targets (18 packages loaded, 31 targets configured)
2019-08-07 03:05:15.845550 | ubuntu-xenial-arm64 | Analyzing: 4 targets (18 packages loaded, 31 targets configured)
2019-08-07 03:05:18.655987 | ubuntu-xenial-arm64 | Analyzing: 4 targets (18 packages loaded, 31 targets configured)
2019-08-07 03:05:28.225680 | ubuntu-xenial-arm64 | Analyzing: 4 targets (18 packages loaded, 31 targets configured)
2019-08-07 03:05:32.819114 | ubuntu-xenial-arm64 | Analyzing: 4 targets (176 packages loaded, 2572 targets configured)
2019-08-07 03:05:34.893333 | ubuntu-xenial-arm64 | INFO: SHA256 (https://codeload.github.com/golang/tools/zip/bf090417da8b6150dcfe96795325f5aa78fff718) = 11629171a39a1cb4d426760005be6f7cb9b4182e4cb2756b7f1c5c2b6ae869fe
2019-08-07 03:05:34.980855 | ubuntu-xenial-arm64 | DEBUG: Rule 'debian-iptables-arm64' indicated that a canonical reproducible form can be obtained by modifying arguments digest = "sha256:1a63fdd216fe7b84561d40ab1ebaa0daae1fc73e4232a6caffbd8353d9a14cea"
2019-08-07 03:05:35.093497 | ubuntu-xenial-arm64 | DEBUG: Rule 'debian-base-arm64' indicated that a canonical reproducible form can be obtained by modifying arguments digest = "sha256:17be039c7035bd0897d954c51914ad41cd7e2b0b7c170b3d89ed021833df2fb1"
2019-08-07 03:05:38.109812 | ubuntu-xenial-arm64 | Analyzing: 4 targets (825 packages loaded, 8718 targets configured)
2019-08-07 03:05:40.007761 | ubuntu-xenial-arm64 | DEBUG: Rule 'org_golang_x_tools' indicated that a canonical reproducible form can be obtained by modifying arguments sha256 = "11629171a39a1cb4d426760005be6f7cb9b4182e4cb2756b7f1c5c2b6ae869fe"
2019-08-07 03:05:44.975087 | ubuntu-xenial-arm64 | Analyzing: 4 targets (1619 packages loaded, 13612 targets configured)
2019-08-07 03:05:50.481029 | ubuntu-xenial-arm64 | ERROR: /home/zuul/src/k8s.io/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/sets/BUILD:25:1: in _go_genrule rule //staging/src/k8s.io/apimachinery/pkg/util/sets:set-gen:
2019-08-07 03:05:50.481413 | ubuntu-xenial-arm64 | Traceback (most recent call last):
2019-08-07 03:05:50.482013 | ubuntu-xenial-arm64 | File "/home/zuul/src/k8s.io/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/sets/BUILD", line 25
2019-08-07 03:05:50.482273 | ubuntu-xenial-arm64 | _go_genrule(name = 'set-gen')
2019-08-07 03:05:50.483085 | ubuntu-xenial-arm64 | File "/root/.cache/bazel/_bazel_root/e96b51b7d54f19bc30c74f61e708a712/external/io_k8s_repo_infra/defs/go.bzl", line 37, in _go_genrule_impl
2019-08-07 03:05:50.483316 | ubuntu-xenial-arm64 | all_srcs += dep.files
2019-08-07 03:05:50.484437 | ubuntu-xenial-arm64 | + operator on a depset is forbidden. See https://docs.bazel.build/versions/master/skylark/depsets.html for recommendations. Use --incompatible_depset_union=false to temporarily disable this check.
2019-08-07 03:05:50.782848 | ubuntu-xenial-arm64 | ERROR: Analysis of target '//build:docker-artifacts' failed; build aborted: Analysis of target '//staging/src/k8s.io/apimachinery/pkg/util/sets:set-gen' failed; build aborted
2019-08-07 03:05:50.828674 | ubuntu-xenial-arm64 | INFO: Elapsed time: 62.451s
2019-08-07 03:05:50.829031 | ubuntu-xenial-arm64 | INFO: 0 processes.
2019-08-07 03:05:50.836660 | ubuntu-xenial-arm64 | FAILED: Build did NOT complete successfully (1979 packages loaded, 17048 targets configured)
2019-08-07 03:05:50.845262 | ubuntu-xenial-arm64 | FAILED: Build did NOT complete successfully (1979 packages loaded, 17048 targets configured)
2019-08-07 03:05:50.857241 | ubuntu-xenial-arm64 | time="03:05:50" level=error msg="Failed to build Kubernetes: exit status 1"
2019-08-07 03:05:50.857738 | ubuntu-xenial-arm64 | Error: error building node image: failed to build kubernetes: exit status 1

I'm building again for 0.24

@aojea Hi, sorry for the delay. I've tried several versions and finally rolled back to 0.23.2 and tested manually (no log will be updated); you should be able to see the results after the next periodic run.

@ZhengZhenyu you did it, it is now building the cluster.
However, it seems the testgrid config has changed and I can't find the dashboard to check the errors; I will try to check tomorrow.

@ZhengZhenyu the e2e tests are running, but the job is failing to upload the results because the script seems to need python >= 3.6 while the node has python 3.5:

2019-08-12 20:36:02.654884 | ubuntu-xenial-arm64 |   File "/usr/lib/python3.5/subprocess.py", line 693, in run
2019-08-12 20:36:02.655245 | ubuntu-xenial-arm64 |     with Popen(*popenargs, **kwargs) as process:
2019-08-12 20:36:02.655688 | ubuntu-xenial-arm64 | TypeError: __init__() got an unexpected keyword argument 'encoding'

The encoding argument is not present in python 3.5

Changed in version 3.6: Added encoding and errors parameters

Is it possible to use python 3.6 or newer?
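
If upgrading the image is painful, one possible workaround (just a sketch) is for the job to prefer a newer interpreter when one is available:

```sh
# Sketch: pick the newest available python3 for the upload step.
PYTHON="$(command -v python3.7 || command -v python3.6 || command -v python3)"
"${PYTHON}" --version
# then invoke the test-infra upload script with "${PYTHON}" instead of the default python3
```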

@aojea sure, I will try

We hit another problem; it seems the account is no longer valid.

2019-08-13 20:35:25.925443 | ubuntu-xenial-arm64 | WARNING: [kind-arm64-openlab-logs@k8s-federated-conformance.iam.gserviceaccount.com] appears to be a service account. Service account tokens cannot be revoked, but they will expire automatically. To prevent use of the service account token earlier than the expiration, revoke the parent service account or service account key.
2019-08-13 20:35:25.930826 | ubuntu-xenial-arm64 | Revoked credentials:
2019-08-13 20:35:25.931313 | ubuntu-xenial-arm64 |  - kind-arm64-openlab-logs@k8s-federated-conformance.iam.gserviceaccount.com
2019-08-13 20:35:26.083693 | ubuntu-xenial-arm64 | Traceback (most recent call last):
2019-08-13 20:35:26.084662 | ubuntu-xenial-arm64 |   File "upload_e2e.py", line 328, in <module>
2019-08-13 20:35:26.085125 | ubuntu-xenial-arm64 |     main(sys.argv[1:])
2019-08-13 20:35:26.085801 | ubuntu-xenial-arm64 |   File "upload_e2e.py", line 318, in main
2019-08-13 20:35:26.086858 | ubuntu-xenial-arm64 |     upload_string(gcs_dir+'/started.json', started_json, args.dry_run)
2019-08-13 20:35:26.087320 | ubuntu-xenial-arm64 |   File "upload_e2e.py", line 175, in upload_string
2019-08-13 20:35:26.087583 | ubuntu-xenial-arm64 |     proc.communicate(input=text)
2019-08-13 20:35:26.088014 | ubuntu-xenial-arm64 |   File "/usr/lib/python3.6/subprocess.py", line 848, in communicate
2019-08-13 20:35:26.088260 | ubuntu-xenial-arm64 |     self._stdin_write(input)
2019-08-13 20:35:26.088730 | ubuntu-xenial-arm64 |   File "/usr/lib/python3.6/subprocess.py", line 801, in _stdin_write
2019-08-13 20:35:26.088972 | ubuntu-xenial-arm64 |     self.stdin.write(input)
2019-08-13 20:35:26.089336 | ubuntu-xenial-arm64 | TypeError: a bytes-like object is required, not 'str'
2019-08-13 20:35:26.089645 | ubuntu-xenial-arm64 | Run: ['gcloud', 'auth', 'revoke']

@dims @ZhengZhenyu do you have an idea what the problem with the service account could be? ^^

These are the logs https://logs.openlabtesting.org/logs/periodic-6/18/github.com/kubernetes-sigs/kind/master/kind-integration-test-arm64/230250b/

@aojea Hi, sorry for the delay. We also had a similar problem in the cloud-provider-openstack job, and my colleague checked yesterday; it turns out to be an incorrect use of subprocess, and the stdin handling should probably be removed.

@aojea Hi, it seems the job succeeded once on 8/20 and I can see the results both on testgrid and OpenLab:
https://logs.openlabtesting.org/logs/periodic-6/18/github.com/kubernetes-sigs/kind/master/kind-integration-test-arm64/4631b06/
https://k8s-testgrid.appspot.com/conformance-kind#kind,%20v1.14%20(dev,%20ARM64)
And the tests actually seem to be running for the first time.

But then the job starts to fail again.

@aojea Hmm, it seems something is wrong setting up the env again: the tests did not run, and thus nothing could be uploaded.

@ZhengZhenyu the e2e.sh script changed around that time, but I don't know if one of those changes broke the OpenLab CI. It seems the containers for the kubernetes components are not able to spawn, i.e. the kubelet fails and there are no logs for the kube-apiserver, ...

https://github.com/kubernetes-sigs/kind/commit/97b044c5d1fe869af8d9e9052d1c218f590fd3c4#diff-d9fa0450190d60ba133fb92282a94725

I've sent a PR to try to align the CI job with the new changes to e2e.sh, and we can iterate from there.

https://github.com/theopenlab/openlab-zuul-jobs/pull/625

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
