Kaniko: Kaniko-Built Image Cannot Be Pulled from Repo Due To Symlink Error

Created on 7 Feb 2020 · 18 Comments · Source: GoogleContainerTools/kaniko

Actual behavior
An image built with kaniko and pushed to a repository is not extracted correctly on pull, whereas the same image built with Docker and pushed to a repository is extracted correctly on pull.

Expected behavior
Image built with kaniko should be extracted correctly when pulled from a repository.

To Reproduce
Steps to reproduce the behavior:

1. Build the container using the Dockerfile described below and push it to a container registry (can be any).
2. Pull the image using a "docker pull" command.
3. An error is generated on pull.

The output below is from two runs of the pull command: first pulling the Docker-built version, then pulling the kaniko-built version.

RESULTS OF PULL WHEN DOCKER-BUILT:

➜ docker pull myrepo/dihi-health-intelligence/mlaas/mlaas-ui:363168a0
363168a0: Pulling from dihi-health-intelligence/mlaas/mlaas-ui
d318c91bf2a8: Pull complete
e78b80ba2df3: Pull complete
57e393e12bdc: Pull complete
4cd4d3e2db54: Pull complete
4d2e313894a7: Pull complete
Digest: sha256:5a8c044582b7551bdedf91b62472523a84d364b169e3136619ca8f991bd7004d
Status: Downloaded newer image for myrepo/dihi-health-intelligence/mlaas/mlaas-ui:363168a0
myrepo/dihi-health-intelligence/mlaas/mlaas-ui:363168a0

RESULTS OF PULL WHEN KANIKO-BUILT:

➜ docker pull myrepo/dihi-health-intelligence/mlaas/mlaas-ui:0ae48f4b
0ae48f4b: Pulling from dihi-health-intelligence/mlaas/mlaas-ui
d318c91bf2a8: Pull complete
4ac345e3547d: Extracting [==================================================>] 353.7MB/353.7MB
b2a1d0b62606: Download complete
db78db57c5e7: Download complete
4c4383ef3686: Download complete
failed to register layer: Error processing tar file(exit status 1): symlink libBrokenLocale-2.30.so /lib64/libBrokenLocale.so.1: no such file or directory

Two users are currently reporting this problem. One of the users who is currently affected has provided us with the following information:

"[12:22 PM] SB
i get the same error. i'm looking into whether it has something to do with the part of the dockerfile that the error seems to be pointing at (making symlinks to /proc/self/fd/1 and 2)

[12:24 PM] SB
that fixed the scan. so something changed with either kaniko or the kaniko-build tagged runners so that '/proc/self/fd/1' and '/proc/self/fd/2' don't seem to exist

[12:26 PM] SB
though it works during the build and fails on pull, which makes me think that something in kaniko that describes the layer that contains /proc/self/fd isn't describing it sufficiently for the client to recreate that layer on pull"

Additional Information

Dockerfile
Please provide either the Dockerfile you're trying to build or one that can reproduce this error.

Our Dockerfile:

FROM fedora:31
ARG IMAGE_NAME='nodejs'
LABEL name=${IMAGE_NAME}
ARG NODEJS_VERSION=v12.14.1
ARG CI_COMMIT_SHA=unspecified
LABEL git_commit=${CI_COMMIT_SHA}
ARG CI_PROJECT_URL
LABEL git_repository_url=${CI_PROJECT_URL}

ENV APP_HOME=/opt/app-root/src

WORKDIR /tmp
RUN yum install -y findutils readline-devel gcc gcc-c++ make zlib-devel xz openssl openssl-devel git patch \
&& curl -L https://nodejs.org/dist/${NODEJS_VERSION}/node-${NODEJS_VERSION}-linux-x64.tar.xz > node-${NODEJS_VERSION}-linux-x64.tar.xz \
&& tar -Jxvf node-${NODEJS_VERSION}-linux-x64.tar.xz \
&& cd node-${NODEJS_VERSION}-linux-x64 \
&& mv bin/* /usr/local/bin \
&& mv include/* /usr/local/include/ \
&& mv lib/* /usr/local/lib \
&& cd /tmp \
&& rm -rf node-${NODEJS_VERSION}-linux-x64 node-${NODEJS_VERSION}-linux-x64.tar.xz

WORKDIR ${APP_HOME}
ENV PATH /${APP_HOME}/node_modules/.bin:$PATH
COPY package.json /${APP_HOME}/package.json
RUN npm install \
&& npm install react-scripts

Build Context
Please provide or clearly describe any files needed to build the Dockerfile (ADD/COPY commands)
Our Package.json (copied to image via COPY command):

{
  "name": "workdir",
  "version": "0.1.0",
  "private": true,
  "dependencies": {
    "@testing-library/jest-dom": "^4.2.4",
    "@testing-library/react": "^9.4.0",
    "@testing-library/user-event": "^7.2.1",
    "axios": "^0.19.2",
    "react": "^16.12.0",
    "react-dom": "^16.12.0",
    "react-router-dom": "^5.1.2",
    "react-scripts": "3.3.1"
  },
  "scripts": {
    "start": "react-scripts start",
    "build": "react-scripts build",
    "test": "react-scripts test",
    "eject": "react-scripts eject"
  },
  "eslintConfig": {
    "extends": "react-app"
  },
  "browserslist": {
    "production": [
      ">0.2%",
      "not dead",
      "not op_mini all"
    ],
    "development": [
      "last 1 chrome version",
      "last 1 firefox version",
      "last 1 safari version"
    ]
  },
  "devDependencies": {
    "jest-localstorage-mock": "^2.4.0"
  }
}

Kaniko Image (fully qualified with digest)
gcr.io/kaniko-project/executor:debug-v0.17.1

Triage Notes for the Maintainers

| Description | Yes/No |
|----------------|---------------|
| Please check if this is a new feature you are proposing | [ ] |
| Please check if the build works in docker but not in kaniko | [X] |
| Please check if this error is seen when you use --cache flag | [ ] |
| Please check if your dockerfile is a multistage dockerfile | [ ] |

area/filesystems, fixed-needs-verfication, in progress, kind/bug, priority/p0

vguaglione

👍1

Most helpful comment

@simeonberkley @cvgw I successfully tested debug-a1af057f997316bfb1c4d2d82719d78481a02a79 with caching turned on, using the sample Dockerfile provided in this thread. @tejal29 Yes, cache invalidation would be a nice feature here; otherwise there will likely be some confusion within our user base. Thank you both for helping with this issue.

vguaglione on 28 Feb 2020

👍2 🎉1

All 18 comments

downgrading to debug-v0.16.0 fixed my issue (symlinking to /proc/self/fd/1 and 2). let me know if it would help for me to try later versions. i first had trouble with whatever 'debug' was pointed at two days ago.

simeonberkley on 7 Feb 2020

@cvgw Based on Simeon's comment, we will downgrade to the debug version of v0.16.0 until latest is fixed.

vguaglione on 8 Feb 2020

I suspect this may be related to the issue in #1039

cvgw on 8 Feb 2020

@vguaglione would you mind giving tag 1039-fix-test or debug-1039-fix-test a try? I think that might resolve the issue, but I'd appreciate if you could give it a try. Thanks!

cvgw on 8 Feb 2020

I did some testing and it doesn't look like 1039-fix-test fixes this issue. I do believe that the issues are related.

My theory is that https://github.com/GoogleContainerTools/kaniko/blob/8d9e6b8ea54274f73517f11c113c13cd03d26349/pkg/snapshot/snapshot.go#L183

causes the resolved link to also be added to the layer (i.e. adding /usr/lib/.build-id/c0/b98a... and /lib64/libBrokenLocale.so.1). In this case /lib64/libBrokenLocale.so.1 is not itself a link; instead, its parent directory /lib64 is a link to /usr/lib64.

My belief is that this causes the links (i.e. /lib64/libBrokenLocale.so.1) to be directly added to the tar which breaks the existing link /lib64 => /usr/lib64.

Removing the code at https://github.com/GoogleContainerTools/kaniko/blob/8d9e6b8ea54274f73517f11c113c13cd03d26349/pkg/snapshot/snapshot.go#L183 appears to fix the issues, which would be in line with the proposed theory, as only the resolved path and not the link path would be added.

If this is indeed the cause, I think the fix would be to check whether a given path is itself a link or whether one of its ancestors is actually the link.

I did some quick checking, and on Linux trying to create a link for a file whose ancestor is already a link causes an error, which I think adds weight to the theory. In our case we are writing the links to the layer tar, so the error isn't hit until the filesystem tries to resolve the layers on pull. e.g.

$ ls /usr/lib/bar
=> /usr/lib/bar/foo.txt
$ ln -s /usr/lib/bar barlink
$ ln -s /usr/lib/bar/foo.txt barlink/foo.txt
=> ERROR

cvgw on 9 Feb 2020
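
The check proposed in the comment above (is the path itself a link, or is one of its ancestors the link?) could look roughly like the following. This is a minimal Python sketch for POSIX paths, not kaniko's Go implementation; the function name `first_symlinked_ancestor` is hypothetical.

```python
import os


def first_symlinked_ancestor(path):
    """Return the topmost ancestor directory of `path` that is a symlink.

    Returns None when no ancestor is a symlink. The final component of
    `path` is deliberately not inspected; use os.path.islink(path) for that.
    Hypothetical helper for illustration only (assumes POSIX-style paths).
    """
    # abspath normalizes the path but does not resolve symlinks.
    parts = os.path.abspath(path).split(os.sep)
    current = os.sep
    # parts[0] is the empty string before the leading separator;
    # parts[-1] is the final component, which is skipped.
    for part in parts[1:-1]:
        current = os.path.join(current, part)
        if os.path.islink(current):
            return current
    return None
```

On a layout where /lib64 is a symlink to /usr/lib64, as described in the comment, calling `first_symlinked_ancestor("/lib64/libBrokenLocale.so.1")` would be expected to return `/lib64`, which tells the caller that the entry lives under a linked directory rather than being a link itself.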

@cvgw Sorry, wasn't able to get back to this over the weekend. Feel free to add @simeonberkley when you need for us to verify any changes.

vguaglione on 10 Feb 2020

👍1

> @cvgw Sorry, wasn't able to get back to this over the weekend. Feel free to add @simeonberkley when you need for us to verify any changes.

No worries @vguaglione. I didn't expect you to respond over the weekend; I just happened to have some free time to test it myself.

cvgw on 10 Feb 2020

👍1

@cvgw

> My theory is that https://github.com/GoogleContainerTools/kaniko/blob/8d9e6b8ea54274f73517f11c113c13cd03d26349/pkg/snapshot/snapshot.go#L183
> causes the resolved link to also be added to the layer (i.e. adding /usr/lib/.build-id/c0/b98a... and /lib64/libBrokenLocale.so.1). In this case /lib64/libBrokenLocale.so.1 is not itself a link; instead, its parent directory /lib64 is a link to /usr/lib64.
> My belief is that this causes the links (i.e. /lib64/libBrokenLocale.so.1) to be directly added to the tar which breaks the existing link /lib64 => /usr/lib64.

Removing fileWithSymlinks might break #915 i.e. copying symlinks across multistage docker.

- We need to add only the target files to the tar.
- We need to save the target file to /kaniko/{stg_idx} for later use.

e.g. in case where

FROM busybox AS foo
RUN echo "hello" > /tmp/target.txt
# layer B
RUN ln -s /tmp/target.txt /tmp/link

# second stage (implied by COPY --from)
FROM busybox
COPY --from=foo /tmp/link /tmp/copied

We should only add /tmp/link to the layer B.
We should save /tmp/link with contents of "hello" to /kaniko/0/tmp/link

Is my understanding correct?
I think in #971, we saved both /tmp/link and /tmp/target to /kaniko/0 and also added them to layer B

tejal29 on 11 Feb 2020
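
The "save /tmp/link with contents of hello" step described in the comment above can be illustrated with a small Python sketch. It is not kaniko's code: `save_for_later_stage` is a hypothetical helper, and `stage_root` stands in for a per-stage directory such as /kaniko/0. The point it shows is that the link is dereferenced when saving, so a later stage gets the contents rather than a dangling link.

```python
import os
import shutil


def save_for_later_stage(src, stage_root):
    """Copy the contents `src` resolves to into `stage_root`, keeping src's path.

    If `src` is a symlink, the file it points at is read, so the saved copy
    is a regular file holding the target's contents.
    Hypothetical helper for illustration only.
    """
    # Re-root the absolute source path under stage_root.
    dest = os.path.join(stage_root, os.path.abspath(src).lstrip(os.sep))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    # shutil.copyfile follows symlinks by default (follow_symlinks=True).
    shutil.copyfile(src, dest)
    return dest
```

With the example from the comment, saving /tmp/link into a stage directory would produce a regular file at the re-rooted tmp/link path containing "hello", independent of whether /tmp/target.txt is later available.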

We've committed a change which I believe will fix this. If anyone feels like testing, tags a1af057f997316bfb1c4d2d82719d78481a02a79 and debug-a1af057f997316bfb1c4d2d82719d78481a02a79 have the new code.

cvgw on 25 Feb 2020

@cvgw Ok, I went ahead and tested this and it appears to fix the issue. @simeonberkley Can you also run your tests with this updated image and let Cole know if those tests are successful?

vguaglione on 26 Feb 2020

🎉2

@vguaglione, I was having the same error with kaniko:debug
ERROR: Job failed: image pull failed: rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:db37edea0ded56df3b03d6f76eedd9cd1303d1f6e4920588aa28b37c94d47b8f": Error processing tar file(exit status 1): open /bin/nc.traditional: no such file or directory
I can confirm that tag debug-a1af057f997316bfb1c4d2d82719d78481a02a79 fixed the problem. What's the plan to push to the :debug tag? Is there a pull request I could follow?

gfvirga on 26 Feb 2020

👍1

@cvgw Let us know when this fix becomes available via the standard debug tag. Thanks.

vguaglione on 27 Feb 2020

👍1

> @cvgw Ok, I went ahead and tested this and it appears to fix the issue. @simeonberkley Can you also run your tests with this updated image and let Cole know if those tests are successful?

I started using the '--cache' option, which isn't going so well currently. I have a couple of projects doing automated builds overnight, and they fail about half the time so far, but can be re-run after deleting the cache. This was with the 0.16.0 image. I tried out debug-a1af057f997316bfb1c4d2d82719d78481a02a79 and 0.16.0, and they both error out with:
"""
INFO[0074] Found cached layer, extracting to filesystem
error building image: error building stage: failed to execute command: extracting fs from image: removing whiteout .wh.dev: unlinkat //dev/pts/ptmx: operation not permitted
ERROR: Job failed: command terminated with exit code 1
"""

I can switch back to building without cache, which I think will solve my problem, but will make the overnight builds superfluous because they were put in place to bump the cache (ttl 24h) so that we can take advantage of the speed up while keeping things up-to-date (and running a vuln scan).

simeonberkley on 27 Feb 2020

@simeonberkley @cvgw I didn't try the cache option for my tests, FYI. We have tested some of our images with caching turned on in the past, but we haven't tried it with this specific fix version.

vguaglione on 27 Feb 2020

If you have a cache that is persisted between versions you might try clearing it.

That is to say, don't try to re-use a cache from v0.16.0 with v0.17.0

@tejal29 we might want to consider including the commit sha of kaniko in the cache key so that the cache gets invalidated with a new kaniko version.

cvgw on 27 Feb 2020

👍1
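
The suggestion above, to include the kaniko commit SHA in the cache key so that a new version invalidates old cache entries, can be sketched as follows. This is a hypothetical illustration in Python; the thread does not show how kaniko actually computes its cache keys, and the function name and parameters here are assumptions.

```python
import hashlib


def cache_key(builder_version, base_image_digest, commands):
    """Derive a layer cache key that changes whenever the builder version does.

    Hypothetical sketch: `builder_version` would be the builder's commit SHA,
    `base_image_digest` the digest of the base image, and `commands` the
    Dockerfile commands executed so far.
    """
    h = hashlib.sha256()
    for piece in (builder_version, base_image_digest, *commands):
        data = piece.encode("utf-8")
        # Length-prefix each piece so different splits cannot collide.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()
```

Because the version is hashed first, two builds with identical Dockerfile commands but different builder commits produce different keys, so a layer cached by one version is never looked up by another.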

We found the same issue. Clearing all cache between versions helped.
