Cloud-on-k8s: GKE E2E tests fail for a go dependency issue

Created on 18 Feb 2020 · 8 Comments · Source: elastic/cloud-on-k8s

build github.com/elastic/cloud-on-k8s/test/e2e/cmd: cannot load github.com/Masterminds/sprig: open /go/pkg/mod/cache/download/github.com/!masterminds/sprig/@v/v2.22.0+incompatible.lock: permission denied
>test

All 8 comments

The dependency on sprig showing up as +incompatible just means it's not using Go modules.
According to their GitHub page, version 3+ is compatible with Go modules, and we are using v2.

I don't think this is related to the error we get.

I cannot reproduce the error locally after running go clean --modcache first to re-download all dependencies.

I'm wondering if it could be related to several CI jobs working in parallel with the same go modules cache?

We get different somewhat-related errors for other modules as well:

22:49:37  go: writing stat cache: open /go/pkg/mod/cache/download/k8s.io/api/@v/v0.17.3.info570321898.tmp: permission denied
22:49:37  go: writing stat cache: open /go/pkg/mod/cache/download/k8s.io/client-go/@v/v11.0.0+incompatible.info683262243.tmp: permission denied
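
To see at a glance which files are affected across a long CI log, the failing paths can be pulled out of the "permission denied" lines. This is only a rough sketch: denied_paths is a hypothetical helper name, and it assumes the log lines keep the "open <path>: permission denied" shape shown above.

```shell
#!/bin/sh
# List the distinct paths that a Go build log reports as "permission denied".
# Reads the log on stdin. denied_paths is a hypothetical helper, not part of
# the repository or of the Go toolchain.
denied_paths() {
    # Keep only the path between "open " and ": permission denied".
    sed -n 's|.*open \(/[^:]*\): permission denied.*|\1|p' | sort -u
}

# Example with the two log lines quoted above.
denied_paths <<'EOF'
22:49:37  go: writing stat cache: open /go/pkg/mod/cache/download/k8s.io/api/@v/v0.17.3.info570321898.tmp: permission denied
22:49:37  go: writing stat cache: open /go/pkg/mod/cache/download/k8s.io/client-go/@v/v11.0.0+incompatible.info683262243.tmp: permission denied
EOF
```

Every path it prints here sits under /go/pkg/mod/cache/download, which suggests the whole module cache is unwritable rather than a single module being broken.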

It seems to fail since build 209. I'm wondering if it could be related to us changing the Jenkins worker disk size?

Looking at https://devops-ci.elastic.co/blue/organizations/jenkins/cloud-on-k8s-versions-gke/detail/cloud-on-k8s-versions-gke/216/pipeline/35, I don't understand why some go mod logs are interleaved with the e2e runner logs.
Edit: the interleaved go mod logs are probably from within the E2E container running on the K8s env?

@charith-elastic noticed it could come from this change: https://github.com/elastic/cloud-on-k8s/blob/cda077a249cb828f933b0267ed59ebfc03e06005/test/e2e/Dockerfile#L21

However it does not explain why the tests sometimes run correctly, sometimes don't.
When they don't, logs reveal the Pod has to fetch go dependencies when running go tests.
When they run correctly, logs reveal the Pod does not have to fetch go dependencies.
Both are running with the same image: eu.gcr.io/****/eck-operator--e2e-tests:f63396e9.

It seems we build and push that same Docker image from every single E2E job. Is it possible that it includes the dependencies in some cases but not in others? That would explain how concurrent builds could lead to different images.

You guessed it right! I have some information on the conditions under which chgrp -R 0 /go && chmod -R g=u /go becomes a problem.

The eck-vanilla job fails too. This indicates the failure is not caused by several CI jobs working in parallel with the same Go modules cache. Instead, it seems to be related to the Kubernetes version (as for the eck-versions-k8s job). Comparing the 2 jobs, the issue appears to start with k8s 1.14.

I managed to isolate the problem:

Create 2 k8s clusters, one running a version < 1.14 and one running a version >= 1.14:

> kind create cluster --image kindest/node:v1.12.10 --name kind-12
> kind load --name kind-12 docker-image eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2

> kind create cluster --image kindest/node:v1.15.3 --name kind-15
> kind load --name kind-15 docker-image eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2

And deploy this job:

apiVersion: batch/v1
kind: Job
metadata:
  name: eck-e2e-debug
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        fsGroup: 101
        runAsUser: 101
        runAsGroup: 101
      containers:
        - name: e2e
          image: eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2
          command: [ "go", "build", "test/e2e/cmd/main.go" ]
      restartPolicy: Never

Or just a pod to exec into it and run some commands manually:

apiVersion: v1
kind: Pod
metadata:
  name: e2e
spec:
  securityContext:
    fsGroup: 101
    runAsGroup: 101
    runAsNonRoot: true
    runAsUser: 101
  containers:
  - name: main
    image: eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2
    command: [ "sh", "-c", "sleep 1h" ]

In 1.12, it's ok. The current user (uid=101) can read and write files on the filesystem of the container.
In 1.15, it fails. The current user can't write files and can read only a few of them.

Let's compare the output of the id command:

[1.12] > id
[1.12] uid=101 gid=0(root) groups=0(root),101(ssh)

[1.15] > id
[1.15] uid=101 gid=101(ssh) groups=101(ssh)
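
The same comparison can be run from inside the debug pod with a small script. This is a minimal sketch: report_access is a hypothetical helper name, and the default directory is taken from the path in the error message.

```shell
#!/bin/sh
# Print the identity the current process runs with and whether it can write
# to a directory. report_access is a hypothetical helper for this sketch.
report_access() {
    dir="$1"
    echo "uid=$(id -u) gid=$(id -g) groups=$(id -G)"
    if [ -w "$dir" ]; then
        echo "writable: $dir"
    else
        echo "not writable: $dir"
    fi
}

# Default to the module cache directory seen in the error message.
report_access "${1:-/go/pkg/mod/cache/download}"
```

Running it in both clusters should show the primary gid and the write access side by side, which makes the difference between 1.12 and 1.15 easier to spot.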

I looked for elements related to the security context in the 1.14 release notes.
Since k8s 1.14, the RunAsGroup feature is now in beta and enabled by default. You can set the runAsGroup field in a pod's securityContext to specify the primary group ID of all processes that will run in the pod.

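This combined with 43ef707 seems to me to be the cause of the problem: once runAsGroup: 101 is honored, the process no longer has root (gid 0) among its groups, so the group permissions that chgrp -R 0 /go && chmod -R g=u /go set up on /go no longer apply to it.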

Commit 43ef707 changed the group of the filesystem of the e2e tests Docker image to root because that is how the OpenShift Container Platform Guidelines recommend building Docker images. The reason is that, on OCP, the container user is always a member of the root group.
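
To make the effect of chmod -R g=u concrete, here is a small sketch that runs it on a throwaway directory instead of /go. It assumes GNU or BusyBox stat (for -c '%a'), and it leaves out the chgrp -R 0 step because that needs root or membership in group 0.

```shell
#!/bin/sh
# Show what `chmod -R g=u` does, using a temporary directory instead of /go.
# Assumes GNU or BusyBox stat, where -c '%a' prints the octal mode.
demo_dir=$(mktemp -d)
touch "$demo_dir/module.lock"
chmod 600 "$demo_dir/module.lock"      # owner: rw, group: no access
mode_before=$(stat -c '%a' "$demo_dir/module.lock")

chmod -R g=u "$demo_dir"               # copy the owner bits to the group bits
mode_after=$(stat -c '%a' "$demo_dir/module.lock")

echo "before: $mode_before, after: $mode_after"
rm -rf "$demo_dir"
```

The group ends up with the same access as the owner, so any process that has the file's group (0 in the image) among its groups can write to it, whatever its uid is.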

A simple fix is to reproduce this behaviour in non-OCP environments using spec.securityContext.runAsGroup:

diff --git a/config/e2e/batch_job.yaml b/config/e2e/batch_job.yaml
index 97c8ff99..07f78c11 100644
--- a/config/e2e/batch_job.yaml
+++ b/config/e2e/batch_job.yaml
@@ -43,7 +43,7 @@ spec:
 {{ if not .Context.OcpCluster }}
         fsGroup: 101
         runAsUser: 101
-        runAsGroup: 101
+        runAsGroup: 0
 {{ end }}

We could also revert the chgrp -R 0 /go && chmod -R g=u /go, which is not really necessary, but then we would no longer follow the OCP guidelines.

I thought we could create a real user/group with useradd, set the GOPATH to its HOME, etc., but the OCP guidelines recommend specifying the user by its numeric uid:

Lastly, the final USER declaration in the Dockerfile should specify the user ID (numeric value) and not the user name. This allows OpenShift Container Platform to validate the authority the image is attempting to run with and prevent running images that are trying to run as root, because running containers as a privileged user exposes potential security holes.
