build github.com/elastic/cloud-on-k8s/test/e2e/cmd: cannot load github.com/Masterminds/sprig: open /go/pkg/mod/cache/download/github.com/!masterminds/sprig/@v/v2.22.0+incompatible.lock: permission denied
The sprig dependency showing up as +incompatible just means that this version does not use Go modules.
According to their GitHub page, version 3+ is compatible with Go modules, and we are using v2.
I don't think this is related to the error we get.
I cannot reproduce the error locally after running go clean --modcache first to re-download all dependencies.
I'm wondering if it could be related to several CI jobs working in parallel with the same go modules cache?
We get different somewhat-related errors for other modules as well:
22:49:37 go: writing stat cache: open /go/pkg/mod/cache/download/k8s.io/api/@v/v0.17.3.info570321898.tmp: permission denied
22:49:37 go: writing stat cache: open /go/pkg/mod/cache/download/k8s.io/client-go/@v/v11.0.0+incompatible.info683262243.tmp: permission denied
It seems to fail since build 209. I'm wondering if it could be related to us changing the Jenkins worker disk size?
Looking at https://devops-ci.elastic.co/blue/organizations/jenkins/cloud-on-k8s-versions-gke/detail/cloud-on-k8s-versions-gke/216/pipeline/35, I don't understand why some go mod logs are interleaved with the e2e runner logs.
Edit: the interleaved go mod logs are probably from within the E2E container running on the K8s env?
@charith-elastic noticed it could come from this change: https://github.com/elastic/cloud-on-k8s/blob/cda077a249cb828f933b0267ed59ebfc03e06005/test/e2e/Dockerfile#L21
However it does not explain why the tests sometimes run correctly, sometimes don't.
When they don't, logs reveal the Pod has to fetch go dependencies when running go tests.
When they run correctly, logs reveal the Pod does not have to fetch go dependencies.
Both are running with the same image: eu.gcr.io/****/eck-operator--e2e-tests:f63396e9.
It seems we are building and pushing that same docker image from every single E2E job. Is it possible it includes dependencies in some cases, but not others? Which would explain concurrent builds leading to different images?
You guessed it right! I have some information about the conditions under which chgrp -R 0 /go && chmod -R g=u /go becomes a problem.
The eck-vanilla job fails too, so the failure is not caused by several CI jobs working in parallel with the same go modules cache. Instead, it seems to be related to the Kubernetes version (as for the job eck-versions-k8s). Looking at the 2 jobs, the issue seems to occur from k8s 1.14 onwards.
I managed to isolate the problem:
Create 2 k8s clusters, one in version < 1.14 and one in version >= 1.14:
> kind create cluster --image kindest/node:v1.12.10 --name kind-12
> kind load --name kind-12 docker-image eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2
> kind create cluster --image kindest/node:v1.15.3 --name kind-15
> kind load --name kind-15 docker-image eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2
And deploy this job:
apiVersion: batch/v1
kind: Job
metadata:
  name: eck-e2e-debug
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        fsGroup: 101
        runAsUser: 101
        runAsGroup: 101
      containers:
      - name: e2e
        image: eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2
        command: [ "go", "build", "test/e2e/cmd/main.go" ]
      restartPolicy: Never
Or just a pod to exec into it and run some commands manually:
apiVersion: v1
kind: Pod
metadata:
  name: e2e
spec:
  securityContext:
    fsGroup: 101
    runAsGroup: 101
    runAsNonRoot: true
    runAsUser: 101
  containers:
  - name: main
    image: eu.gcr.io/elastic-cloud-dev/eck-operator--e2e-tests:cda077a2
    command: [ "sh", "-c", "sleep 1h" ]
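Once inside the pod (for example with kubectl exec -it e2e -- sh), a few commands are enough to see what the process is allowed to do. The following is only a sketch: probe_dir is a hypothetical helper name, and /go/pkg/mod is assumed to be the module cache path because it appears in the error messages above.

```shell
# probe_dir DIR
# Hypothetical helper, not part of the repository.
# Prints the identity of the current process, the numeric owner/group and
# mode of DIR, and whether a file can actually be created inside DIR.
probe_dir() {
  dir=${1:-/go/pkg/mod}   # assumed default, taken from the error messages
  id
  ls -ldn "$dir"
  probe="$dir/.write-probe.$$"
  # The redirection runs in a subshell so that a failure cannot abort the caller.
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    echo "write: ok"
  else
    echo "write: denied"
  fi
}

# Inside the pod:
#   probe_dir /go/pkg/mod
```

Running the same probe in both clusters makes the difference visible without having to wait for a full go build.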
In 1.12, it's ok. The current user (uid=101) can read and write files on the filesystem of the container.
In 1.15, it fails. The current user can't write files and can read only a few of them.
Let's compare the output of the id command:
[1.12] > id
[1.12] uid=101 gid=0(root) groups=0(root),101(ssh)
[1.15] > id
[1.15] uid=101 gid=101(ssh) groups=101(ssh)
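To make explicit why the group list matters, here is a rough sketch of the classic Unix permission check applied to the two identities above. The function name can_write is hypothetical, and the file ownership (root:root) and mode (775) are assumptions chosen for illustration, not values read from the image. ACLs and the root bypass are ignored.

```shell
# can_write UID "GIDS" FILE_UID FILE_GID MODE
# Hypothetical helper: prints "yes" if the basic owner/group/other check
# would allow writing, "no" otherwise.
# GIDS is a space-separated list of primary and supplementary group IDs.
# MODE is an octal string such as 775.
can_write() {
  cw_uid=$1 cw_gids=$2 cw_fuid=$3 cw_fgid=$4 cw_mode=$5
  if [ "$cw_uid" -eq "$cw_fuid" ]; then
    # The process owns the file: only the owner bits apply.
    cw_bits=$(( (0$cw_mode >> 6) & 7 ))
  else
    # Default to the "other" bits, switch to the group bits on a group match.
    cw_bits=$(( 0$cw_mode & 7 ))
    for cw_g in $cw_gids; do
      if [ "$cw_g" -eq "$cw_fgid" ]; then
        cw_bits=$(( (0$cw_mode >> 3) & 7 ))
        break
      fi
    done
  fi
  if [ $(( cw_bits & 2 )) -ne 0 ]; then echo yes; else echo no; fi
}

# Identity seen on 1.12 (groups 0 and 101) against a root:root file, mode 775:
can_write 101 "0 101" 0 0 775   # prints "yes"
# Identity seen on 1.15 (group 101 only) against the same file:
can_write 101 "101" 0 0 775     # prints "no"
```

With gid 0 in the group list the group bits apply, otherwise the process falls back to the "other" bits, which chmod g=u never touches.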
I looked for elements related to the security context in the 1.14 release notes.
Since k8s 1.14, the RunAsGroup feature is now in beta and enabled by default. You can set the runAsGroup field in a pod's securityContext to specify the primary group ID of all processes that will run in the pod.
This combined with 43ef707 seems to me to be the cause of the problem.
Changing the group of the e2e tests Docker image filesystem to root was introduced in 43ef707 because the OpenShift Container Platform guidelines recommend building Docker images that way: on OCP, the container user is always a member of the root group.
A simple fix is to reproduce this behaviour in non-OCP environments using spec.securityContext.runAsGroup:
diff --git a/config/e2e/batch_job.yaml b/config/e2e/batch_job.yaml
index 97c8ff99..07f78c11 100644
--- a/config/e2e/batch_job.yaml
+++ b/config/e2e/batch_job.yaml
@@ -43,7 +43,7 @@ spec:
         {{ if not .Context.OcpCluster }}
         fsGroup: 101
         runAsUser: 101
-        runAsGroup: 101
+        runAsGroup: 0
         {{ end }}
We could also revert the chgrp -R 0 /go && chmod -R g=u /go, which is not strictly necessary, but then we would no longer follow the OCP guidelines.
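For reference, the effect of the chmod g=u part can be checked on a scratch file. This is only an illustrative sketch: mode_after_g_equals_u is a hypothetical helper name, and the modes used are examples rather than the actual modes found under /go.

```shell
# mode_after_g_equals_u MODE
# Hypothetical helper: creates a scratch file with the given octal MODE,
# applies `chmod g=u`, and prints the resulting symbolic permissions
# (the first 10 characters of `ls -l`).
mode_after_g_equals_u() {
  f=$(mktemp) || return 1
  chmod "$1" "$f"
  chmod g=u "$f"      # copy the owner bits to the group bits
  ls -l "$f" | cut -c1-10
  rm -f "$f"
}

mode_after_g_equals_u 600   # prints -rw-rw----
mode_after_g_equals_u 644   # prints -rw-rw-r--
```

The group bits end up equal to the owner bits while the "other" bits stay as they were, so a process that is neither the owner nor in group 0 gains nothing from the change.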
I thought we could create a real user/group with useradd, set the GOPATH to its HOME, etc., but the OCP guidelines recommend specifying a numeric UID:
Lastly, the final USER declaration in the Dockerfile should specify the user ID (numeric value) and not the user name. This allows OpenShift Container Platform to validate the authority the image is attempting to run with and prevent running images that are trying to run as root, because running containers as a privileged user exposes potential security holes.