Cloud-on-k8s: GKE E2E tests fail for a go dependency issue

Created on 18 Feb 2020 · 8 Comments · Source: elastic/cloud-on-k8s

build github.com/elastic/cloud-on-k8s/test/e2e/cmd: cannot load github.com/Masterminds/sprig: open /go/pkg/mod/cache/download/github.com/!masterminds/sprig/@v/v2.22.0+incompatible.lock: permission denied

>test

sebgl

Most helpful comment

You guessed it right! I have some information on the conditions under which chgrp -R 0 /go && chmod -R g=u /go becomes a problem.

The eck-vanilla job fails too, so the problem is not caused by several CI jobs working in parallel with the same go modules cache. It seems instead to be related to the kubernetes version (as for the job eck-versions-k8s). Looking at the 2 jobs, the issue seems to occur from k8s 1.14 onward.

I managed to isolate the problem:

Create 2 k8s clusters, one in version < 1.14 and one in version >= 1.14:

> kind create cluster --image kindest/node:v1.12.10 --name kind-12
> kind load --name kind-12 docker-image eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2

> kind create cluster --image kindest/node:v1.15.3 --name kind-15
> kind load --name kind-15 docker-image eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2

And deploy this job:

apiVersion: batch/v1
kind: Job
metadata:
  name: eck-e2e-debug
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        fsGroup: 101
        runAsUser: 101
        runAsGroup: 101
      containers:
        - name: e2e
          image: eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2
          command: [ "go", "build", "test/e2e/cmd/main.go" ]
      restartPolicy: Never

Or just a pod to exec into it and run some commands manually:

apiVersion: v1
kind: Pod
metadata:
  name: e2e
spec:
  securityContext:
    fsGroup: 101
    runAsGroup: 101
    runAsNonRoot: true
    runAsUser: 101
  containers:
  - name: main
    image: eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2
    command: [ "sh", "-c", "sleep 1h" ]

In 1.12, it's ok. The current user (uid=101) can read and write files on the filesystem of the container.
In 1.15, it fails. The current user can't write files and can read only a few of them.

Let's compare the output of the id command:

[1.12] > id
[1.12] uid=101 gid=0(root) groups=0(root),101(ssh)

[1.15] > id
[1.15] uid=101 gid=101(ssh) groups=101(ssh)

I looked for elements related to the security context in the 1.14 release notes.
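Since k8s 1.14, the RunAsGroup feature is in beta and enabled by default. You can set the runAsGroup field in a pod’s securityContext to specify the primary group ID of all processes that will run in the pod. On older clusters the feature gate is off by default, so runAsGroup: 101 is ignored and the process keeps gid=0, which matches the id output above.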

This combined with 43ef707 seems to me to be the cause of the problem.
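
To make the reasoning above concrete, here is a minimal Go sketch of the classic Unix write-permission check applied to the two id outputs. The identity and file types and the canWrite function are hypothetical names for illustration, and the root:root ownership with mode 0775 is an assumption about what /go looks like after chgrp -R 0 /go && chmod -R g=u /go, not something measured in the image.

```go
package main

import "fmt"

// identity models the output of the `id` command inside the container.
type identity struct {
	uid    int
	gid    int
	groups []int // supplementary groups
}

// file models the ownership and permission bits of a file or directory.
type file struct {
	ownerUID int
	groupGID int
	mode     uint32
}

// inGroup reports whether gid is the primary group or one of the
// supplementary groups of the identity.
func (id identity) inGroup(gid int) bool {
	if id.gid == gid {
		return true
	}
	for _, g := range id.groups {
		if g == gid {
			return true
		}
	}
	return false
}

// canWrite applies the usual Unix rule: root bypasses the check, otherwise
// exactly one permission class (owner, then group, then other) is consulted.
func canWrite(id identity, f file) bool {
	switch {
	case id.uid == 0:
		return true
	case id.uid == f.ownerUID:
		return f.mode&0o200 != 0
	case id.inGroup(f.groupGID):
		return f.mode&0o020 != 0
	default:
		return f.mode&0o002 != 0
	}
}

func main() {
	// Assumption: /go is owned by root:root and group bits mirror user bits.
	modCache := file{ownerUID: 0, groupGID: 0, mode: 0o775}

	k8s112 := identity{uid: 101, gid: 0, groups: []int{0, 101}} // id output on 1.12
	k8s115 := identity{uid: 101, gid: 101, groups: []int{101}}  // id output on 1.15

	fmt.Println("1.12 can write:", canWrite(k8s112, modCache))
	fmt.Println("1.15 can write:", canWrite(k8s115, modCache))
}
```

Under these assumptions the 1.12 identity falls into the group class and gets write access, while the 1.15 identity falls into the other class and does not, which is consistent with the permission denied errors from go build.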

thbkrkr on 19 Feb 2020

❤3 🎉3 👍3

All 8 comments

The dependency on sprig showing up as +incompatible just means that this version does not use go modules.
Looking at their github page, version 3+ is compatible with go modules, and we are using v2+.

I don't think this is related to the error we get.

sebgl on 18 Feb 2020

I cannot reproduce the error locally after running go clean --modcache first to re-download all dependencies.

I'm wondering if it could be related to several CI jobs working in parallel with the same go modules cache?

sebgl on 18 Feb 2020

We get different somewhat-related errors for other modules as well:

22:49:37  go: writing stat cache: open /go/pkg/mod/cache/download/k8s.io/api/@v/v0.17.3.info570321898.tmp: permission denied

22:49:37  go: writing stat cache: open /go/pkg/mod/cache/download/k8s.io/client-go/@v/v11.0.0+incompatible.info683262243.tmp: permission denied

sebgl on 18 Feb 2020

It seems to fail since build 209. I'm wondering if it could be related to us changing the Jenkins worker disk size?

sebgl on 18 Feb 2020

Looking at https://devops-ci.elastic.co/blue/organizations/jenkins/cloud-on-k8s-versions-gke/detail/cloud-on-k8s-versions-gke/216/pipeline/35, I don't understand why some go mod logs are interleaved with the e2e runner logs.
Edit: the interleaved go mod logs are probably from within the E2E container running on the K8s env?

sebgl on 18 Feb 2020

@charith-elastic noticed it could come from this change: https://github.com/elastic/cloud-on-k8s/blob/cda077a249cb828f933b0267ed59ebfc03e06005/test/e2e/Dockerfile#L21

However it does not explain why the tests sometimes run correctly, sometimes don't.
When they don't, logs reveal the Pod has to fetch go dependencies when running go tests.
When they run correctly, logs reveal the Pod does not have to fetch go dependencies.
Both are running with the same image: eu.gcr.io/****/eck-operator--e2e-tests:f63396e9.

It seems we are building and pushing that same docker image from every single E2E job. Is it possible that it includes the dependencies in some cases, but not in others? Concurrent builds producing different images under the same tag would explain this.

sebgl on 18 Feb 2020

Commit 43ef707 changed the group of the filesystem of the e2e tests Docker image to root because this is what the OpenShift Container Platform Guidelines recommend when creating Docker images. The reason is that on OpenShift the container user is always a member of the root group.

A simple fix is to reproduce this behaviour in non-OCP environments using spec.securityContext.runAsGroup:

diff --git a/config/e2e/batch_job.yaml b/config/e2e/batch_job.yaml
index 97c8ff99..07f78c11 100644
--- a/config/e2e/batch_job.yaml
+++ b/config/e2e/batch_job.yaml
@@ -43,7 +43,7 @@ spec:
 {{ if not .Context.OcpCluster }}
         fsGroup: 101
         runAsUser: 101
-        runAsGroup: 101
+        runAsGroup: 0
 {{ end }}

We could also revert the chgrp -R 0 /go && chmod -R g=u /go, which is not really necessary, but then we would no longer follow the OCP guidelines.

I thought we could create a real user/group with useradd, set the GOPATH to its HOME, etc., but the OCP guidelines recommend specifying the user by its numeric uid:

Lastly, the final USER declaration in the Dockerfile should specify the user ID (numeric value) and not the user name. This allows OpenShift Container Platform to validate the authority the image is attempting to run with and prevent running images that are trying to run as root, because running containers as a privileged user exposes potential security holes.

thbkrkr on 21 Feb 2020
