Actual behavior
Hey we鈥檙e having some problems with kaniko 0.19 and python. We have a dockerfile that looks vaguely like this:
FROM python:3.7.4-slim AS service-base
RUN apt-get update \
&& apt-get --assume-yes install curl gpg libcurl4-openssl-dev libssl-dev gcc git build-essential netcat libsnappy-dev
RUN pip install poetry=="1.0.2"
COPY ./_packages /api/_packages
COPY ./config /config
WORKDIR /service/cat
COPY . /service/cat
RUN rm -rf _packages
RUN rm -rf config
RUN \
poetry config virtualenvs.create false && \
poetry install
For reasons independent of us, we can鈥檛 actually re-organize the sourcecode to remove the RUN rm lines.
What we鈥檙e seeing is that if our cache ECR repo is empty, then the image is fine.
We then build an image using the cache, but change one of later layers (for example adding a file to /service/cat)
The build then completes, but upon pulling the image from ECR to a kubelet, we see:
Failed to pull image
"{repo_url}": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): file exists
Expected behavior
Subsequent builds using cache should produce an image that can be ran on kubernetes
To Reproduce
See above
Additional Information
Triage Notes for the Maintainers
| Description | Yes/No |
|----------------|---------------|
| Please check if this a new feature you are proposing |
--cache flag | I can confirm that this issue also happens randomly to me when caching is enabled. It happens in non-python project as well and the only workaround I found so far is to remove every cache image and re-build the image. The error message when pulling the image is different on OSX and on Linux:
ApplyLayer duplicates of file paths not supported stdout: {"layerSize":283880101}ERROR: failed to register layer: Error processing tar file(exit status 1): failed to mknod("/etc/alternatives/pager.1.gz", S_IFCHR, 0): file exists (file may vary)I think it started occurring in v0.17.0 for me, but can't tell for sure.
Thanks @gilbsgilbs I will take a look into this.
seems related/equal to #1158 -
we are also experiencing this issue and we are building our containers like this:
FROM continuumio/miniconda3:latest
COPY ./environment.yml /app/environment.yml
RUN conda env update -n base -f /app/environment.yml && conda clean --all -y -q && rm -rf /opt/conda/pkgs
COPY ./ourproject.a /app/ourproject.a
COPY ./ourproject.b /app/ourproject.b
COPY ./ourproject.c /app/ourproject.c
COPY ./ourproject.d /app/ourproject.d
COPY ./ourproject.e /app/ourproject.e
COPY ./ourproject.f /app/ourproject.f
ARG VERSION=docker.dev
ENV DOCKER_BUILD_VERSION $VERSION
RUN SETUPTOOLS_SCM_PRETEND_VERSION=$DOCKER_BUILD_VERSION pip install -e /app/ourproject.a && SETUPTOOLS_SCM_PRETEND_VERSION=$DOCKER_BUILD_VERSION pip install -e /app/ourproject.b -e /app/ourproject.c -e /app/ourproject.d -e /app/ourproject.e -e /app/ourproject.f
WORKDIR /work
our kaniko is invoked like this in our .gitlab-ci.yml (basically like it is recommended by gitlab)
.kaniko_job_template: &kaniko_job
image:
name: gcr.io/kaniko-project/executor:debug
entrypoint: [""]
timeout: 1h
setup:anaconda docker image:
stage: setup
<<: *kaniko_job
script:
- 'echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json'
- '/kaniko/warmer --cache-dir=/cache --image=continuumio/miniconda3:latest --verbosity=debug || true'
- '/kaniko/executor version'
- '/kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/dockerfiles/conda/Dockerfile --cache=true --cache-dir=/cache --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG-latest --verbosity=debug --cleanup'
failed to register layer: Error processing tar file(exit status 1): failed to mknod("/opt/conda/conda-meta/certifi-2019.11.28-py37_0.json", S_IFCHR, 0): file existsWe're also seeing that corrupted images increase in size by O(duplicate of one of the layers).
Have you managed to work out which docker command the additional layer corresponds to? We have suspected that it has to do with layers that don't change the image size (run rm or run mv ) but haven't managed to isolate the issue yet.
Re version: We have used 0.16 briefly, but haven't noticed this issue until 0.17 and up (we upgraded due to other bugs). We assumed that this particular issue is due to our dockerfiles being a bit "loose" (haven't linted them) so we just tried to upgrade our way out of it, but now we're close to target date for our project and it's the last blocker.
Also for context.
It's not entirely clear to me how to cause this to fail reliably.
I've noticed that given a good image and its layers in cache, sometimes adding/removing a file causes the subsequent one to build or not. Tag/envvar changes don't cause problems as @hochreitercorpuls
pointed out.
This is how we invoke kaniko:
/kaniko/executor "--context" "dir:///garden-build/catproj/" "--dockerfile" "cat.Dockerfile" "--destination" "myaccount.dkr.ecr.eu-west-2.amazonaws.com/garden/cat-27f4aa058f" "--cache=true" "--build-arg" "BASE_IMAGE=myaccount.dkr.ecr.eu-west-2.amazonaws.com/base-image:v-8d42ac5948";
What we're seeing is that even though the broken layer is in the cache registry, kaniko doesn't detect it and doesn't fail to build. Instead pushes an image with that broken layer in place
Also running into the error:
Failed to pull image "<imagename>": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): file exists
Building with image: gcr.io/kaniko-project/executor:debug-v0.18.0
Command:
/kaniko/executor --cache=true \
--context ${WORKSPACE} \
--destination=${imageName} \
--dockerfile ${WORKSPACE}/${dockerfileName} \
--log-format text
Dockerfile:
# https://github.com/ContinuumIO/docker-images/blob/master/miniconda3/debian/Dockerfile
FROM continuumio/miniconda3:4.7.12
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]
USER root
WORKDIR /source/model
COPY ./environment.yml .
RUN conda env update -f environment.yml \
&& conda clean --all
COPY . .
CMD ["python3", "model.py"]
Is there anything else we can provide to help with this?
@swist Sorry with the current situation I have been able to get to this sooner.
I spent a lot of time trying to duplicate this bug at my end with the docker file you provided.
I did the following actions
--cache=true and push to GCR registryenvironment.yml to add a simple dependency. name: stats-1
dependencies:
- numpy
Is there anything else i am missing?
I also deployed a pod with the image produced on GKE and the pod was running fine. I am looking into code now to see what more can i do.
We're pretty sure that adding a dependency is going to invalidate the broken layer - it is going force a reinstall. What we have found has sometimes fails is when you add an extra empty file in the context so that it gets copied in just before the last layer
I'll try to get you a minimal reproducible example, unfortunately the only thing I have currently is my org's code.
According to community slack, something like this should do:
I'm having this issue with a non-multistage build for a python application with --cache=true set and no other cache variables present. The resulting image cannot be pulled:
...
e645b11c20c7: Pull complete
f99c55bbe43f: Pull complete
a601f4aae6a4: Pull complete
0eab66b1bfee: Extracting [==================================================>] 658.4MB/658.4MB
1a62f58fd1c4: Download complete
c885b0b9491c: Download complete
1bb9ef84f671: Download complete
7d64f9e9bf99: Download complete
b09c7a2df20f: Download complete
55407e0fe961: Download complete
failed to register layer: Error processing tar file(exit status 1): failed to mknod("/var/cache/apt/pkgcache.bin", S_IFCHR, 0): file exists
It's uncleared what triggered the issue, though clearing the cache 'fixes' it. if additional information helps please ask and I can try and provide.
Only solution to still use cache layers in a Docker registry is to rollback to 0.16.0 from any version from 0.17.0 to 0.19.0, this is a major regression and should be p0.
Still happens for us. Take any multistage build. Add/remove line that sets ENV=$BUILDARG
@tejal29 do you have any more context as to how this is happening? Were you able to reproduce on your setup?
We're seeing this problem also. It seems to happen whenever we build this Dockerfile:
FROM ubuntu:18.04
MAINTAINER "XXX"
RUN apt-get update
RUN apt-get install -y python3 python3-dev python3-pip
COPY ./ ./app
WORKDIR ./app
RUN pip3 install -r requirements.txt
ENTRYPOINT [ "python3" ]
CMD [ "main.py" ]
Still happens for us. Take any multistage build. Add/remove line that sets
ENV=$BUILDARG@tejal29 do you have any more context as to how this is happening? Were you able to reproduce on your setup?
yes. i was able to reproduce this when i ran the above dockerfile 3rd time.
We are also hitting this problem
c935c89cdb3c: Pull complete
59bb444f4818: Pull complete
f14fac6c6413: Extracting [==================================================>] 467.1MB/467.1MB
2431d9999311: Download complete
image: gcr.io/kaniko-project/executor:debug
command:
/kaniko/executor --context=${context} --dockerfile=${Dockerfile} --destination=${image} --cache=true --cache-repo=${IMAGE_CACHE}
More detail in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/28988
Having previously ran into the same problem (Python based, multi-step Dockerfile), i've tried to use the files here to reproduce the issue but in vain.
@cx1111 - I've tried your Dockerfile (the one based on continuumio/miniconda3:4.7.12) on both v0.18.0 and v0.19.0, with an environment.yml file of a single dependency pytz.
The bug did not reproduce, but something strange I did notice was that in version v0.18.0 - inserting a dummy command just before the CMD (RUN python3 --version), caused the final image size to grow by few hundreds of MB. I'm guessing that's an unrelated bug that was fixed.
Anyway, could you share your environments.yml file?
@rhs - same here, didn't manage to reproduce it. Any chance you can share the requirements.txt file? I tried it with a pretty small set of Python requirements, so I don't know if it might be related.
@swist I have tried to use your Dockerfile with a very basic pyproject.toml file (copied from here), and Kaniko (v0.18.0+) crashes with a panic. This seem to be related to a very recent issue with Go 1.14 based images (like Kaniko), discussed here: https://github.com/golang/go/issues/37436 https://github.com/golang/go/wiki/LinuxKernelSignalVectorBug
I'll created a separate issue here: https://github.com/GoogleContainerTools/kaniko/issues/1202
I also encountered this problem. I'm using Google Cloud Build. It builds and works completely fine with docker build-docker push or if there's no cache. If there is a cache then it fails with following error:
rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): failed to mknod("/tmp/v8-compile-cache-0", S_IFCHR, 0): file exists
Dockerfile:
FROM node:12-slim
WORKDIR /app
EXPOSE 8080
CMD ["node", "app-client.js"]
RUN apt-get update && apt-get install --yes --no-install-recommends \
ca-certificates \
curl \
gnupg \
&& echo "deb http://packages.cloud.google.com/apt gcsfuse-stretch main" \
| tee /etc/apt/sources.list.d/gcsfuse.list \
&& curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
&& apt-get update \
&& apt-get install --yes gcsfuse \
&& apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
COPY package*.json ./
RUN npm ci
COPY . .
cloudbuild.yaml
steps:
- id: Build and push container
name: 'gcr.io/kaniko-project/executor:latest'
args: [
'--dockerfile=Dockerfile',
'--destination=$_IMAGE_NAME:$COMMIT_SHA',
'--destination=$_IMAGE_NAME:$BRANCH_NAME',
'--cache=true',
]
options:
substitutionOption: ALLOW_LOOSE
substitutions:
_IMAGE_NAME: gcr.io/$PROJECT_ID/project_name
_DOCKERFILE_NAME: Dockerfile
_DOCKERFILE_DIR: ''
Finally managed to reproduce it consistently and figured out that it has to do with incorrect whiteout of certain files. I am still not very familiar with the logic in that area, but seeing that it was recently refactored in #1069, perhaps @tejal29 or @cvgw have a better intuition on what may have gone wrong?
To Reproduce:
If you run on certain Linux kernels, you may fail to build it with the latest kaniko image (see issue 1202).
Run make images and build a local image from the source code. That will fix it.
Dockerfile
FROM python:3.7.4-slim AS service-base
RUN apt-get update \
&& apt-get --assume-yes install curl gpg libcurl4-openssl-dev libssl-dev gcc git build-essential netcat libsnappy-dev
RUN pip install poetry=="1.0.2"
COPY ./_packages /api/_packages
COPY ./config /config
WORKDIR /service/cat
COPY . /service/cat
RUN rm -rf _packages
RUN rm -rf config
RUN \
poetry config virtualenvs.create false && \
poetry install
Context:
_packages/
foo.txt
config/
foo.txt
pyproject.toml
pyproject.toml is the only file whose contents are meaningful. The text files are just "hello world".
[tool.poetry]
name = "poetry-demo"
version = "0.1.0"
description = ""
authors = ["S茅bastien Eustace <[email protected]>"]
[tool.poetry.dependencies]
python = "*"
[tool.poetry.dev-dependencies]
pytest = "^3.4"
Build command to GCR:
docker run --rm -it \
--volume "$HOME"/.config/gcloud:/root/.config/gcloud \
--volume $PWD:/build gcr.io/kaniko-project/executor:latest \
--context=/build \
--dockerfile=Dockerfile \
--destination=us.gcr.io/$YOUR_PROJECT/kaniko-bug \
--cache=true \
Build it once, and you'll see an image on GCR with size of ~190MB that can be pulled.
edit pyproject.toml and add a new dependency f.e pytz = "2020.1".
Build it again, and you'll see a new image, this time with size of ~320MB, that fails to be pulled:
ERROR: failed to register layer: Error processing tar file(exit status 1): failed to mknod("/etc/alternatives/pager.1.gz", S_IFCHR, 0): file exists
Some observations:
resolve.go), you can see that at some stage, pager.1.gz is whited-out and appears as .wh.pager.1.gz when files are resolved. IIUC this is what the eventual error says - it tries to create a file that already exists on the lower layers of the overlay FS.snapshot.go, making it skip pager.1.gz: if s.l.MaybeAddWhiteout(path) {
if path != "/etc/alternatives/pager.1.gz" {
logrus.Debugf("Adding whiteout for %s", path)
filesToWhiteOut = append(filesToWhiteOut, path)
}
The produced image is still with the wrong size (320MB, instead of 190MB), but it actually can be pulled normally. Which indicates that the bug that doubled layer's size is still not resolved.
We're continuously bumping into this as well and it makes the cache completely unusable. Is there anything we can do to help prioritise this?
Thanks for all your effort.
I'm experiencing the same issue on 0.19.0. It works fine on a small Node.js image, but on a more complicated PHP image it dies. Trying now to revert to 0.16.0 to see if that will solve it for me.
Hey folks, sorry for the delay. I spent yesterday, debugging this issue and here are my findings.
kaniko/executor -f Dockerfile --cache=true --destination=<your_prj>Note: This does not happen consistently.
After doing some debugging and from error logs, the cases where docker pull <image> failed due
Error processing tar file(exit status 1): failed to XXXXXFILE : file exists
There was 2 entries for the file appeared in the layer twice in the first uncached layer or across layers.
The reason for this, kaniko takes a snapshot of entire FS before running any commands.
However, in case when --cache=true, this caused a bug where if with a dockerfile with 5 layers, the command corresponding to layer3 was changed, Kaniko
I fixed this in PR #1214 by moving the Step 0 i.e. initial snapshot when there is first layer which is not chached.
I verified Dockerfile given by @cx1111 and example provided by @dani29
I have pushed this fix in the following images
docker push gcr.io/kaniko-project/executor:debug-edge
docker push gcr.io/kaniko-project/executor:edge
Can someone give it a try?
I'm experiencing the same issue on 0.19.0. It works fine on a small Node.js image, but on a more complicated PHP image it dies. Trying now to revert to 0.16.0 to see if that will solve it for me.
@nethunter Can you please try:
docker push gcr.io/kaniko-project/executor:debug-edge
docker push gcr.io/kaniko-project/executor:edge
@mitchfriedman Sorry things have been slow. But please try
docker push gcr.io/kaniko-project/executor:debug-edge
docker push gcr.io/kaniko-project/executor:edge
gcr.io/kaniko-project/executor:edge seems to have fixed it for me. (However, I'm not sure how to invalidate the older layers which are still cached.)
@IlyaSemenov, thanks for verifying. Would have to delete the remote cache.
Unfortunately gcr.io/kaniko-project/executor:edge haven't solved the problem for me.
I removed remote cache and switched to edge. After couple of builds I'm getting the same error:
rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): failed to mknod("/tmp/v8-compile-cache-0", S_IFCHR, 0): file exists
@TimShilov are you making code changes before running each build? Also are you by any chance reusing the same kaniko pod?
@tejal29 I was making dummy changes (just whitespace) in one file between commits.
I'm building images with Google Cloud Build so I'm not 100% sure if it gets reused or not but I assume the answer is no.
Just tried one more time just to make sure I didn't missed anything. Removed remote cache, first image worked fine. After second build I got the error again.
Version 0.16.0 works without any problems.
Is there anything else I can do or try to help?
ok. let me try with GCB build. I am trying to find a small package-lock.json to reproduce your issue.
@TimShilov, i tried your dockerfile and it did work for me.
This is what i did:
Can you please check if there is a substantial increase in the size of the 2 images (like almost doubling)?
I am going to do the build one more time!
confirming, i saw the error on 4th build.
@TimShilov I finally got your issue fixed. the condition to detect if initial snapshot should be taked was not complete.
I will do some more testing and give you an image to try soon
@TimShilov , Can you try with this image:
gcr.io/kaniko-project/executor:debug-edge
gcr.io/kaniko-project/executor:edge
@tejal29 Tried again with gcr.io/kaniko-project/executor:edge. I made 4 builds and had no problems so far!
I will keep using edge for a couple of days to see if problem will return but so far seems fixed to me.
Thanks for your help! 馃憤
Thanks @TimShilov for your help verifying this build.
This will in v0.20.0 release
@tejal29 Thanks for fixing this, really appreciate it! I wanted to try using gcr.io/kaniko-project/executor:debug-edge, but it is now failing when I attempt to create a /kaniko/.docker/config.json
$ echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
/busybox/sh: eval: line 147: can't create /kaniko/.docker/config.json: nonexistent directory
Previously I was using gcr.io/kaniko-project/executor:debug-v0.16.0 and there was no nonexistent directory error. Is this an issue specific to edge images, or something has changed between 0.16.0 and edge
@tejal29 Thanks for fixing this, really appreciate it! I wanted to try using
gcr.io/kaniko-project/executor:debug-edge, but it is now failing when I attempt to create a/kaniko/.docker/config.json$ echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json /busybox/sh: eval: line 147: can't create /kaniko/.docker/config.json: nonexistent directoryPreviously I was using
gcr.io/kaniko-project/executor:debug-v0.16.0and there was nononexistent directoryerror. Is this an issue specific toedgeimages, or something has changed between 0.16.0 and edge
It is NOT fixed in debug-v0.20.0
so you'll need a
$ mkdir -p /kaniko/.docker
before your echo. And gitlab needs to change its documentation.
Most helpful comment
We're continuously bumping into this as well and it makes the cache completely unusable. Is there anything we can do to help prioritise this?
Thanks for all your effort.