Actual behavior
A kaniko-built image that has been pushed to a repository fails to extract when pulled, whereas the same image built with docker and pushed to the same repository extracts correctly.
Expected behavior
Image built with kaniko should be extracted correctly when pulled from a repository.
To Reproduce
Steps to reproduce the behavior:
Output provided below is from 2 runs of the pull command, first, to pull the docker-built version, then to pull the kaniko-built version:
RESULTS OF PULL WHEN DOCKER-BUILT:
$ docker pull myrepo/dihi-health-intelligence/mlaas/mlaas-ui:363168a0
363168a0: Pulling from dihi-health-intelligence/mlaas/mlaas-ui
d318c91bf2a8: Pull complete
e78b80ba2df3: Pull complete
57e393e12bdc: Pull complete
4cd4d3e2db54: Pull complete
4d2e313894a7: Pull complete
Digest: sha256:5a8c044582b7551bdedf91b62472523a84d364b169e3136619ca8f991bd7004d
Status: Downloaded newer image for myrepo/dihi-health-intelligence/mlaas/mlaas-ui:363168a0
myrepo/dihi-health-intelligence/mlaas/mlaas-ui:363168a0
RESULTS OF PULL WHEN KANIKO-BUILT:
$ docker pull myrepo/dihi-health-intelligence/mlaas/mlaas-ui:0ae48f4b
0ae48f4b: Pulling from dihi-health-intelligence/mlaas/mlaas-ui
d318c91bf2a8: Pull complete
4ac345e3547d: Extracting [==================================================>] 353.7MB/353.7MB
b2a1d0b62606: Download complete
db78db57c5e7: Download complete
4c4383ef3686: Download complete
failed to register layer: Error processing tar file(exit status 1): symlink libBrokenLocale-2.30.so /lib64/libBrokenLocale.so.1: no such file or directory
Two users are currently reporting this problem. One of them has provided the following information:
"[12:22 PM] SB
i get the same error. i'm looking into whether it has something to do with the part of the dockerfile that the error seems to be pointing at (making symlinks to /proc/self/fd/1 and 2)
[12:24 PM] SB
that fixed the scan. so something changed with either kaniko or the kaniko-build tagged runners so that '/proc/self/fd/1' and '/proc/self/fd/2' don't seem to exist
[12:26 PM] SB
though it works during the build and fails on pull, which makes me think that something in kaniko that describes the layer that contains /proc/self/fd isn't describing it sufficiently for the client to recreate that layer on pull"
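One way to look into the hypothesis above is to list the link entries that a layer tarball actually contains. The sketch below is only an illustration using Python's standard `tarfile` module; the helper names `list_link_entries` and `build_sample_layer` are hypothetical, and the sample archive is built in memory to mimic the entry named in the error message rather than taken from a real kaniko layer.

```python
import io
import tarfile
from typing import List, Tuple


def list_link_entries(layer: tarfile.TarFile) -> List[Tuple[str, str, str]]:
    """Return (kind, name, target) for every symlink or hard link in the archive."""
    entries = []
    for member in layer.getmembers():
        if member.issym():
            entries.append(("symlink", member.name, member.linkname))
        elif member.islnk():
            entries.append(("hardlink", member.name, member.linkname))
    return entries


def build_sample_layer() -> bytes:
    """Build a tiny in-memory tar holding one symlink entry (illustrative data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        link = tarfile.TarInfo("lib64/libBrokenLocale.so.1")
        link.type = tarfile.SYMTYPE
        link.linkname = "libBrokenLocale-2.30.so"
        tar.addfile(link)
    return buf.getvalue()


if __name__ == "__main__":
    with tarfile.open(fileobj=io.BytesIO(build_sample_layer())) as tar:
        for kind, name, target in list_link_entries(tar):
            print(f"{kind}: {name} -> {target}")
```

For a real layer, the same function could be pointed at a tarball opened with `tarfile.open(path)`, where `path` is wherever the layer blob has been saved locally.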
Additional Information
Our Dockerfile:
FROM fedora:31
ARG IMAGE_NAME='nodejs'
LABEL name=${IMAGE_NAME}
ARG NODEJS_VERSION=v12.14.1
ARG CI_COMMIT_SHA=unspecified
LABEL git_commit=${CI_COMMIT_SHA}
ARG CI_PROJECT_URL
LABEL git_repository_url=${CI_PROJECT_URL}
ENV APP_HOME=/opt/app-root/src
WORKDIR /tmp
RUN yum install -y findutils readline-devel gcc gcc-c++ make zlib-devel xz openssl openssl-devel git patch \
&& curl -L https://nodejs.org/dist/${NODEJS_VERSION}/node-${NODEJS_VERSION}-linux-x64.tar.xz > node-${NODEJS_VERSION}-linux-x64.tar.xz \
&& tar -Jxvf node-${NODEJS_VERSION}-linux-x64.tar.xz \
&& cd node-${NODEJS_VERSION}-linux-x64 \
&& mv bin/* /usr/local/bin \
&& mv include/* /usr/local/include/ \
&& mv lib/* /usr/local/lib \
&& cd /tmp \
&& rm -rf node-${NODEJS_VERSION}-linux-x64 node-${NODEJS_VERSION}-linux-x64.tar.xz
WORKDIR ${APP_HOME}
ENV PATH /${APP_HOME}/node_modules/.bin:$PATH
COPY package.json /${APP_HOME}/package.json
RUN npm install \
&& npm install react-scripts
Our package.json:
{
"name": "workdir",
"version": "0.1.0",
"private": true,
"dependencies": {
"@testing-library/jest-dom": "^4.2.4",
"@testing-library/react": "^9.4.0",
"@testing-library/user-event": "^7.2.1",
"axios": "^0.19.2",
"react": "^16.12.0",
"react-dom": "^16.12.0",
"react-router-dom": "^5.1.2",
"react-scripts": "3.3.1"
},
"scripts": {
"start": "react-scripts start",
"build": "react-scripts build",
"test": "react-scripts test",
"eject": "react-scripts eject"
},
"eslintConfig": {
"extends": "react-app"
},
"browserslist": {
"production": [
">0.2%",
"not dead",
"not op_mini all"
],
"development": [
"last 1 chrome version",
"last 1 firefox version",
"last 1 safari version"
]
},
"devDependencies": {
"jest-localstorage-mock": "^2.4.0"
}
}
Kaniko Image (fully qualified with digest)
gcr.io/kaniko-project/executor:debug-v0.17.1
Triage Notes for the Maintainers
| Description | Yes/No |
|----------------|---------------|
| Please check if this a new feature you are proposing | |
| ... --cache flag | |
downgrading to debug-v0.16.0 fixed my issue (symlinking to /proc/self/fd/1 and 2). let me know if it would help for me to try later versions. i first had trouble with whatever 'debug' was pointed at two days ago.
@cvgw Based on Simeon's comment, we will downgrade to the debug version of v0.16.0 until latest is fixed.
I suspect this may be related to the issue in #1039
@vguaglione would you mind giving tag 1039-fix-test or debug-1039-fix-test a try? I think that might resolve the issue, but I'd appreciate confirmation. Thanks!
I did some testing and it doesn't look like 1039-fix-test fixes this issue. I do believe that the issues are related.
My theory is that https://github.com/GoogleContainerTools/kaniko/blob/8d9e6b8ea54274f73517f11c113c13cd03d26349/pkg/snapshot/snapshot.go#L183 causes the resolved link to also be added to the layer (i.e. adding /usr/lib/.build-id/c0/b98a... and /lib64/libBrokenLocale.so.1). In this case /lib64/libBrokenLocale.so.1 is not itself a link; instead, its parent directory /lib64 is a link to /usr/lib64.
My belief is that this causes the link paths (i.e. /lib64/libBrokenLocale.so.1) to be added directly to the tar, which breaks the existing link /lib64 => /usr/lib64.
Removing the code at https://github.com/GoogleContainerTools/kaniko/blob/8d9e6b8ea54274f73517f11c113c13cd03d26349/pkg/snapshot/snapshot.go#L183 appears to fix the issue, which would be in line with the proposed theory, as only the resolved path and not the link path would be added.
If this is indeed the cause, I think the fix would be to check whether a given path is itself a link or whether one of its ancestors is actually the link.
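The check described above could look roughly like the following. This is a minimal Python sketch of the idea, not kaniko's Go implementation; the function name `first_symlinked_ancestor`, the `root` parameter, and the demo tree are all assumptions made for illustration, and the path is assumed to live under `root` on a POSIX filesystem.

```python
import os
import tempfile
from typing import Optional


def first_symlinked_ancestor(path: str, root: str = os.sep) -> Optional[str]:
    """Return the first ancestor directory of `path` (below `root`) that is a symlink.

    Returns None when no ancestor is a symlink. `path` itself is not inspected;
    use os.path.islink(path) for that case.
    """
    root = os.path.abspath(root)
    # abspath normalizes the path but does not resolve symlinks.
    rel = os.path.relpath(os.path.abspath(path), root)
    current = root
    # Walk every component except the last one (the entry itself).
    for part in rel.split(os.sep)[:-1]:
        current = os.path.join(current, part)
        if os.path.islink(current):
            return current
    return None


def build_demo_tree() -> str:
    """Create <root>/usr/lib64/libfoo.so.1 and a link <root>/lib64 -> usr/lib64."""
    root = os.path.realpath(tempfile.mkdtemp())
    os.makedirs(os.path.join(root, "usr", "lib64"))
    open(os.path.join(root, "usr", "lib64", "libfoo.so.1"), "w").close()
    os.symlink("usr/lib64", os.path.join(root, "lib64"))
    return root


if __name__ == "__main__":
    root = build_demo_tree()
    print(first_symlinked_ancestor(os.path.join(root, "lib64", "libfoo.so.1"), root))
    print(first_symlinked_ancestor(os.path.join(root, "usr", "lib64", "libfoo.so.1"), root))
```

In the demo tree, the path under `lib64` is reached through a linked ancestor, while the path under `usr/lib64` is not, which is the distinction the proposed fix would need to make.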
I did some quick checking, and on Linux trying to create a link for a file whose ancestor is already a link causes an error, which I think adds weight to the theory. In our case we are writing the links to the layer tar, so the error isn't hit until the filesystem tries to resolve the layers on pull. E.g.
$ ls /usr/lib/bar
=> /usr/lib/bar/foo.txt
$ ln -s /usr/lib/bar barlink
$ ln -s /usr/lib/bar/foo.txt barlink/foo.txt
=> ERROR
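The same experiment can be reproduced in a throwaway directory. The sketch below is an illustration only: it assumes a POSIX filesystem, and the function name `symlink_through_linked_parent` is hypothetical. It mirrors the shell steps above with `bar`, `barlink`, and `foo.txt` created under a temporary root instead of /usr/lib.

```python
import os
import tempfile


def symlink_through_linked_parent() -> str:
    """Try to create a symlink whose parent directory is itself a symlink.

    Returns the name of the outcome so callers can inspect it.
    """
    with tempfile.TemporaryDirectory() as root:
        real_dir = os.path.join(root, "bar")
        os.mkdir(real_dir)
        target = os.path.join(real_dir, "foo.txt")
        with open(target, "w") as handle:
            handle.write("hello\n")

        # barlink -> bar
        link_dir = os.path.join(root, "barlink")
        os.symlink(real_dir, link_dir)

        try:
            # barlink/foo.txt resolves to bar/foo.txt, which already exists.
            os.symlink(target, os.path.join(link_dir, "foo.txt"))
        except FileExistsError:
            return "FileExistsError"
        return "created"


if __name__ == "__main__":
    print(symlink_through_linked_parent())
```

Because `barlink/foo.txt` resolves to the existing `bar/foo.txt`, the second `os.symlink` call cannot create a new entry there.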
@cvgw Sorry, wasn't able to get back to this over the weekend. Feel free to add @simeonberkley when you need for us to verify any changes.
No worries @vguaglione. I didn't expect you to respond over the weekend; I just happened to have some free time to test it myself.
@cvgw
My theory is that https://github.com/GoogleContainerTools/kaniko/blob/8d9e6b8ea54274f73517f11c113c13cd03d26349/pkg/snapshot/snapshot.go#L183 causes the resolved link to also be added to the layer (i.e. adding /usr/lib/.build-id/c0/b98a... and /lib64/libBrokenLocale.so.1). In this case /lib64/libBrokenLocale.so.1 is not itself a link; instead, its parent directory /lib64 is a link to /usr/lib64.
My belief is that this causes the links (i.e. /lib64/libBrokenLocale.so.1) to be directly added to the tar, which breaks the existing link /lib64 => /usr/lib64.
Removing fileWithSymlinks might break #915 i.e. copying symlinks across multistage docker.
e.g. in the case where
FROM busybox AS foo
RUN echo "hello" > /tmp/target.txt
# layer B
RUN ln -s /tmp/target.txt /tmp/link
FROM busybox
COPY --from=foo /tmp/link /tmp/copied
We should only add /tmp/link to the layer B.
We should save /tmp/link with contents of "hello" to /kaniko/0/tmp/link
Is my understanding correct?
I think in #971, we saved both /tmp/link and /tmp/target to /kaniko/0 and also added them to layer B
We've committed a change which I believe will fix this. If anyone feels like testing, tags a1af057f997316bfb1c4d2d82719d78481a02a79 and debug-a1af057f997316bfb1c4d2d82719d78481a02a79 have the new code.
@cvgw Ok, I went ahead and tested this and it appears to fix the issue. @simeonberkley Can you also run your tests with this updated image and let Cole know if those tests are successful?
@vguaglione, I was having the same error with kaniko:debug
ERROR: Job failed: image pull failed: rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:db37edea0ded56df3b03d6f76eedd9cd1303d1f6e4920588aa28b37c94d47b8f": Error processing tar file(exit status 1): open /bin/nc.traditional: no such file or directory
I can confirm that tag debug-a1af057f997316bfb1c4d2d82719d78481a02a79 fixed the problem. What's the plan to push to the :debug tag? Is there a pull request I could follow?
@cvgw Let us know when this fix becomes available via the standard debug tag. Thanks.
I started using the '--cache' option, which isn't going so well currently. I have a couple of projects doing automated builds overnight, and they fail about half the time so far, but can be re-run after deleting the cache. This was with the 0.16.0 image. I tried out debug-a1af057f997316bfb1c4d2d82719d78481a02a79 and 0.16.0, and they both error out with:
"""
INFO[0074] Found cached layer, extracting to filesystem
error building image: error building stage: failed to execute command: extracting fs from image: removing whiteout .wh.dev: unlinkat //dev/pts/ptmx: operation not permitted
ERROR: Job failed: command terminated with exit code 1
"""
I can switch back to building without cache, which I think will solve my problem, but will make the overnight builds superfluous because they were put in place to bump the cache (ttl 24h) so that we can take advantage of the speed up while keeping things up-to-date (and running a vuln scan).
@simeonberkley @cvgw I didn't try the cache option for my tests, FYI. We have tested some of our images with caching turned on in the past, but we haven't tried it with this specific fix version.
If you have a cache that is persisted between versions you might try clearing it.
That is to say, don't try to re-use a cache from v0.16.0 with v0.17.0
@tejal29 we might want to consider including the commit sha of kaniko in the cache key so that the cache gets invalidated with a new kaniko version.
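To illustrate the suggestion, here is a hedged sketch of a cache key that mixes in the builder's commit sha. It is not kaniko's actual cache-key derivation; the function `cache_key` and its parameters (`builder_commit`, `base_digest`, `command`) are hypothetical names chosen for the example.

```python
import hashlib


def cache_key(builder_commit: str, base_digest: str, command: str) -> str:
    """Derive a hex cache key from the builder commit, base image digest, and command.

    Each field is length-prefixed so that different field boundaries cannot
    produce the same byte stream.
    """
    hasher = hashlib.sha256()
    for field in (builder_commit, base_digest, command):
        data = field.encode("utf-8")
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
    return hasher.hexdigest()


if __name__ == "__main__":
    old = cache_key("commit-a", "sha256:base", "RUN npm install")
    new = cache_key("commit-b", "sha256:base", "RUN npm install")
    # A different builder commit yields a different key, so old entries are not reused.
    print(old != new)
```

With a key built this way, layers cached by one builder version would simply miss when looked up by another, instead of being extracted and failing.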
We found the same issue. Clearing all cache between versions helped.
@simeonberkley @cvgw I successfully tested debug-a1af057f997316bfb1c4d2d82719d78481a02a79 with caching turned on, using the sample Dockerfile provided in this thread. @tejal29 Yes, cache invalidation would be a nice feature here; otherwise there will likely be some confusion within our user base. Thank you both for helping with this issue.
Thank you everyone for reporting your findings and helping to test. I'm gonna close this one, but feel free to reopen if needed.