Pipeline: Pipeline nightly build is broken

Created on 3 Jun 2020 · 16 comments · Source: tektoncd/pipeline

Expected Behavior

Pipeline nightly build works

Actual Behavior

Pipeline nightly build is broken.
Building the base image fails with:

fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
v3.12.0-30-g01407813ee [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-29-gb310a5f576 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]
OK: 12726 distinct packages available
(1/1) Upgrading alpine-baselayout (3.2.0-r6 -> 3.2.0-r7)
Executing alpine-baselayout-3.2.0-r7.pre-upgrade
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..data': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/token': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/namespace': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/namespace': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/ca.crt': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/token': Read-only file system
Executing alpine-baselayout-3.2.0-r7.post-upgrade
ERROR: alpine-baselayout-3.2.0-r7: failed to rename var/.apk.f752bb51c942c7b3b4e0cf24875e21be9cdcd4595d8db384 to var/run.
Executing busybox-1.31.1-r16.trigger
1 error; 27 MiB in 25 packages
error building image: error building stage: failed to execute command: waiting for process to exit: exit status 1

Steps to Reproduce the Problem

  1. https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/pipeline-release-nightly-tqgdd

Additional Info

The image is based on alpine.
It used to track latest; it's now pinned to 3.12, which is the version that was used in the last working run: https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/pipeline-release-nightly-w5xcr

The only visible difference in the run log is the following. In the successful run:

fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
v3.12.0-3-gc43b21255b [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-1-g9465f17ea9 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]

while in the failing run:

fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
v3.12.0-30-g01407813ee [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-29-gb310a5f576 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]

All 16 comments

The v3.12.0 tag (https://gitlab.alpinelinux.org/alpine/aports/-/tags/v3.12.0) was released 4 days ago, which is when the last successful nightly happened.
The build on the day before ran on v3.11.0.
We had one successful build on v3.12.0 and then it started failing.

Read-only file system makes me think it's either a node or an image problem :thinking:

I can reproduce this on my cluster running Tekton v0.12.0

  1. Apply Task YAML: https://gist.github.com/dibyom/038c9ae01fff69606976971cdb6c4102
  2. Create svc account/secret: https://github.com/tektoncd/pipeline/tree/master/tekton#service-account-and-secrets
  3. Run Task:
 tkn task start \
   --param=imageRegistry=${IMAGE_REGISTRY} \
   --serviceaccount=release-right-meow \
   --inputresource=source=tekton-pipelines-git \
   --outputresource=builtBaseImage=base-image \
   publish-tekton-pipelines

OK: 12726 distinct packages available
(1/1) Upgrading alpine-baselayout (3.2.0-r6 -> 3.2.0-r7)
Executing alpine-baselayout-3.2.0-r7.pre-upgrade
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..data': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/token': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/namespace': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/namespace': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/ca.crt': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/token': Read-only file system

It looks like something is trying to remove a mounted secret:

      volumeMounts:
...
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-g2t44
        readOnly: true

btw looks like this might be a duplicate of https://github.com/tektoncd/pipeline/issues/2726; looks like pinning didn't fix it :S

Here's a log from a recent successful run, where this "pre-upgrade" step doesn't seem to be getting involved:

fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/11) Installing ca-certificates (20191127-r2)
(2/11) Installing nghttp2-libs (1.40.0-r0)
(3/11) Installing libcurl (7.69.1-r0)
(4/11) Installing expat (2.2.9-r1)
(5/11) Installing pcre2 (10.35-r0)
(6/11) Installing git (2.26.2-r0)
(7/11) Installing openssh-keygen (8.3_p1-r0)
(8/11) Installing ncurses-terminfo-base (6.2_p20200523-r0)
(9/11) Installing ncurses-libs (6.2_p20200523-r0)
(10/11) Installing libedit (20191231.3.1-r0)
(11/11) Installing openssh-client (8.3_p1-r0)
Executing busybox-1.31.1-r16.trigger
Executing ca-certificates-20191127-r2.trigger
OK: 27 MiB in 25 packages
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
v3.12.0-3-gc43b21255b [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-1-g9465f17ea9 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]
OK: 12725 distinct packages available
OK: 27 MiB in 25 packages

It's interesting that the successful log references these versions:

v3.12.0-3-gc43b21255b [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-1-g9465f17ea9 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]

In the failed log we have these versions:

v3.12.0-30-g01407813ee [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-29-gb310a5f576 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]
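Those version strings appear to be `git describe` output from the aports repository: the nearest tag, the number of commits since it, and the abbreviated commit hash, so `v3.12.0-30-g01407813ee` would mean 30 commits on top of the v3.12.0 tag. A self-contained illustration of the format in a throwaway repo (names and messages are made up):

```shell
# Build a scratch repo to show the tag-N-g<hash> format apk reports.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m "release"
git tag v3.12.0
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m "fix"
git describe --tags   # prints something like: v3.12.0-1-g<hash>
```

So the "pinned" v3.12 repository had moved 27+ commits between the successful and failing runs.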

It seems like the error might be coming from:

(1/1) Upgrading alpine-baselayout (3.2.0-r6 -> 3.2.0-r7)
Executing alpine-baselayout-3.2.0-r7.pre-upgrade

maybe that's not actually something we want to upgrade?

we're trying to upgrade all packages:

https://github.com/tektoncd/pipeline/blob/4f670ce6b53a13962b1eab19a3b047b3707e2bde/images/Dockerfile#L3-L5
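The referenced Dockerfile lines aren't quoted in the thread; judging from the log above they presumably amount to an unconditional upgrade of every installed package, something like this sketch (not the exact file):

```dockerfile
FROM alpine:3.12
# refresh the APKINDEX and upgrade everything already installed; this is
# what drags in the new alpine-baselayout and runs its pre-upgrade script
RUN apk update && apk upgrade
```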

https://git.alpinelinux.org/aports/tree/main/alpine-baselayout/alpine-baselayout.pre-upgrade

# migrate /var/run directory to /run
if [ -d /var/run ]; then
    cp -a /var/run/* /run 2>/dev/null
    rm -rf /var/run
    ln -s ../run /var/run
fi
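So when /var/run is still a real directory, the script copies its contents into /run, deletes /var/run, and replaces it with a symlink; the `rm -rf` is exactly the call that hits the read-only secret mount. The happy path can be replayed against a scratch root instead of the real filesystem (paths and the app.pid file are illustrative):

```shell
# Replay alpine-baselayout's /var/run -> /run migration in a temp dir.
root=$(mktemp -d)
mkdir -p "$root/run" "$root/var/run"
echo pid > "$root/var/run/app.pid"

if [ -d "$root/var/run" ]; then
    cp -a "$root/var/run/"* "$root/run" 2>/dev/null
    rm -rf "$root/var/run"       # the step that dies on a read-only mount
    ln -s ../run "$root/var/run"
fi

readlink "$root/var/run"         # prints: ../run
cat "$root/var/run/app.pid"      # prints: pid (resolved through the new symlink)
```

With a read-only mount anywhere under /var/run, the `rm -rf` fails partway, apk reports the error, and the migration never completes.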

wut

I'm not sure what's going on but I recommend building from older checkouts of pipelines and seeing if this was caused by a change that was introduced in the pipelines repo; if so we can use a binary search to find the problem.

Just realized that to make 0.13 I'll need to fix this - and I'm build cop tomorrow anyway, so no time like the present :D

Okay so I was able to run kaniko locally and reproduce this, more or less, by introducing the slightly contrived step of mounting a read-only file into /var/run:

docker run \
  -v `pwd`:/workspace/go/src/github.com/tektoncd/pipeline \
  -v `pwd`/SECRET.json:/var/run/secrets/SECRET.json:ro \
  -e GOOGLE_APPLICATION_CREDENTIALS=/workspace/go/src/github.com/tektoncd/pipeline/SECRET.json \
  gcr.io/kaniko-project/executor:v0.17.1 \
  --dockerfile=/workspace/go/src/github.com/tektoncd/pipeline/images/Dockerfile \
  --destination=gcr.io/christiewilson-catfactory/pipeline-release-test \
  --context=/workspace/go/src/github.com/tektoncd/pipeline

I got this error:

(1/2) Upgrading alpine-baselayout (3.2.0-r6 -> 3.2.0-r7)
Executing alpine-baselayout-3.2.0-r7.pre-upgrade
rm: can't remove '/var/run/secrets/SECRET.json': Resource busy

I then pinned to 3.11 and it built just fine.
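For reference, the pin is just a matter of the base-image tag; a sketch, assuming the FROM line lives in images/Dockerfile:

```dockerfile
# pinning here avoids the alpine-baselayout 3.2.0-r6 -> r7 upgrade
# seen in the failing log, so apk never runs the migration script
FROM alpine:3.11
```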

It seems like pinning to 3.12 isn't working b/c 3.12 is a moving target; even since my comment above (https://github.com/tektoncd/pipeline/issues/2738#issuecomment-639061198) I'm seeing a different version being used when repro-ing:

v3.12.0-43-gfe7417f5c2 [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-44-g288d7f5e51 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]

I'm gonna pin to 3.11 and put a bit more time in to see if I can figure out why this has only started happening and if I should report it somewhere.

I think this is a bizarre collision of kaniko behaviour and alpine relying on /var/run being a symlink to /run so I opened https://github.com/GoogleContainerTools/kaniko/issues/1297

I think our options are:

  1. keep the alpine image pinned (and hope this never starts being a problem for 3.11 - I still don't understand why a script committed in 2017 is only causing this problem now)
  2. fix the problem in kaniko
  3. build with something other than kaniko

I think it's just a perfect storm of conditions that _could've_ happened in any prior alpine release, but by chance didn't.

the base images for alpine 3.12 don't have the latest alpine-baselayout for their release yet, and so anything that's trying to build + upgrade from them with a read-only mount anywhere in /var/run/* (and I wager anywhere in /run/* too!) will throw its hands up.

as soon as the alpine base images include that package upgrade, this issue will mostly disappear until the next perfect storm. :D

ahh makes sense @joshsleeper ! thanks for explaining :D do you happen to know how one could track this kind of thing (e.g. are there release notes somewhere that mention this?) np if not, thanks anyway for the info


for all the things people, including myself, love about Alpine Linux, I think it's a fairly small crew running that ship so their announcement processes aren't extensive. They have a "Latest Development" feed on their homepage that tracks package updates (which is really just a feed of commits), but I think that's about it?

https://www.alpinelinux.org/

I think it's mostly a side-effect of the fact that by design alpine doesn't maintain package _history_ really, so generally the only correct version of alpine packages to be using is the latest. if there are upgrades, you're supposed to have them full stop.

when a major release is being cut (e.g. 3.11, 3.12, etc.) they commit to pin to specific package versions (let's say something like python3.7, which might be 3.7.0 at time of release), but as bug and security fixes roll out they'll replace the python3.7 with 3.7.1 and 3.7.2, at which point there is no longer a way to explicitly install 3.7.0 in that release of alpine.

hopefully that's more helpful than man-splain-y. I've just had to dig into this before at my own company to understand why we had various odd issues with alpine that we never had with other distros.

It's worth noting that I think this _probably_ isn't a kaniko issue really so much as it's primarily _exposed_ by kaniko. k8s is what mounts those secrets there, and it just so happens that people aren't running apk upgrade in many contexts other than an image build!

I'm still undecided on if this should be fixed by alpine or k8s, but I'm guessing it'll end up being alpine since other distros aren't having similar issues... that I know of.

I think we've successfully worked around this, and it seems like GoogleContainerTools/kaniko#1297 probably won't have a solution for a while. Considering this resolved!
