Kaniko: Repeated builds using cache produce broken images

Created on 27 Mar 2020 · 38Comments · Source: GoogleContainerTools/kaniko

Actual behavior
Hey we’re having some problems with kaniko 0.19 and python. We have a dockerfile that looks vaguely like this:

FROM python:3.7.4-slim AS service-base
RUN apt-get update \
  && apt-get --assume-yes install curl gpg  libcurl4-openssl-dev libssl-dev gcc git build-essential netcat libsnappy-dev

RUN pip install poetry=="1.0.2" 
COPY ./_packages /api/_packages
COPY ./config /config

WORKDIR /service/cat
COPY . /service/cat
RUN rm -rf _packages
RUN rm -rf config

RUN \
  poetry config virtualenvs.create false && \
  poetry install

For reasons independent of us, we can’t actually re-organize the sourcecode to remove the RUN rm lines.
What we’re seeing is that if our cache ECR repo is empty, then the image is fine.
We then build an image using the cache, but change one of later layers (for example adding a file to /service/cat)
The build then completes, but upon pulling the image from ECR to a kubelet, we see:
Failed to pull image

"{repo_url}": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): file exists

Expected behavior
Subsequent builds using cache should produce an image that can be ran on kubernetes

To Reproduce
See above

Additional Information

Dockerfile
See above
Build Context
Set up a poetry project analogous to the issue here https://github.com/python-poetry/poetry/issues/1757
gcr.io/kaniko-project/executor@sha256:0d0e34396f47ec6d5fd75aebb9772147a78d96ed2bbb16ec892bd178efdc8307

Triage Notes for the Maintainers

| Description | Yes/No |
|----------------|---------------|
| Please check if this a new feature you are proposing |

- [ ]

|
| Please check if the build works in docker but not in kaniko |

- [x]

|
| Please check if this error is seen when you use --cache flag |

- [x]

|
| Please check if your dockerfile is a multistage dockerfile |

- [ ]

arecaching

Source

swist

👍14

Most helpful comment

We're continuously bumping into this as well and it makes the cache completely unusable. Is there anything we can do to help prioritise this?

Thanks for all your effort.

mitchfriedman on 23 Apr 2020

👍17

All 38 comments

I can confirm that this issue also happens randomly to me when caching is enabled. It happens in non-python project as well and the only workaround I found so far is to remove every cache image and re-build the image. The error message when pulling the image is different on OSX and on Linux:

Linux: ApplyLayer duplicates of file paths not supported stdout: {"layerSize":283880101}
OSX (described by OP): ERROR: failed to register layer: Error processing tar file(exit status 1): failed to mknod("/etc/alternatives/pager.1.gz", S_IFCHR, 0): file exists (file may vary)

I think it started occurring in v0.17.0 for me, but can't tell for sure.

gilbsgilbs on 27 Mar 2020

Thanks @gilbsgilbs I will take a look into this.

tejal29 on 27 Mar 2020

seems related/equal to #1158 -
we are also experiencing this issue and we are building our containers like this:

FROM continuumio/miniconda3:latest

COPY ./environment.yml /app/environment.yml
RUN conda env update -n base -f /app/environment.yml && conda clean --all -y -q && rm -rf /opt/conda/pkgs

COPY ./ourproject.a /app/ourproject.a
COPY ./ourproject.b /app/ourproject.b
COPY ./ourproject.c /app/ourproject.c
COPY ./ourproject.d /app/ourproject.d
COPY ./ourproject.e /app/ourproject.e
COPY ./ourproject.f /app/ourproject.f

ARG VERSION=docker.dev
ENV DOCKER_BUILD_VERSION $VERSION

RUN SETUPTOOLS_SCM_PRETEND_VERSION=$DOCKER_BUILD_VERSION pip install -e /app/ourproject.a && SETUPTOOLS_SCM_PRETEND_VERSION=$DOCKER_BUILD_VERSION pip install -e /app/ourproject.b -e /app/ourproject.c -e /app/ourproject.d -e /app/ourproject.e -e /app/ourproject.f

WORKDIR /work

our kaniko is invoked like this in our .gitlab-ci.yml (basically like it is recommended by gitlab)

.kaniko_job_template: &kaniko_job
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  timeout: 1h

setup:anaconda docker image:
  stage: setup
  <<: *kaniko_job
  script:
    - 'echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json'
    - '/kaniko/warmer --cache-dir=/cache --image=continuumio/miniconda3:latest --verbosity=debug || true'
    - '/kaniko/executor version'
    - '/kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/dockerfiles/conda/Dockerfile --cache=true --cache-dir=/cache --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG-latest --verbosity=debug --cleanup'

We use the warmer to prepopulate the cache with the base image. Which works fine.
We used it to prepopulate the cache with an earlier version of the to be built image too - right now it is disabled due to previous debugging efforts.
What we experience with cache being enabled is, that the first build (without any existant cache) works perfectly fine.
Also subsequent builds (with cache enabled) work just as expected, as long as those do not contain any changes within the code - if it's rather adding a tag to an existant image instead of building a new one.
Any subsequent run with a new commit containing code changes in any of the copied and pip-installed project parts leads to a corrupted image with its size doubled due to a corrupted layer (excluding base image size).
Trying to download the "additional" layer from cache repository registry results in an error just like @gilbsgilbs metioned. failed to register layer: Error processing tar file(exit status 1): failed to mknod("/opt/conda/conda-meta/certifi-2019.11.28-py37_0.json", S_IFCHR, 0): file exists

hochreitercorpuls on 28 Mar 2020

We're also seeing that corrupted images increase in size by O(duplicate of one of the layers).

Have you managed to work out which docker command the additional layer corresponds to? We have suspected that it has to do with layers that don't change the image size (run rm or run mv ) but haven't managed to isolate the issue yet.

Re version: We have used 0.16 briefly, but haven't noticed this issue until 0.17 and up (we upgraded due to other bugs). We assumed that this particular issue is due to our dockerfiles being a bit "loose" (haven't linted them) so we just tried to upgrade our way out of it, but now we're close to target date for our project and it's the last blocker.

Also for context.
It's not entirely clear to me how to cause this to fail reliably.
I've noticed that given a good image and its layers in cache, sometimes adding/removing a file causes the subsequent one to build or not. Tag/envvar changes don't cause problems as @hochreitercorpuls
pointed out.

This is how we invoke kaniko:

/kaniko/executor "--context" "dir:///garden-build/catproj/" "--dockerfile" "cat.Dockerfile" "--destination" "myaccount.dkr.ecr.eu-west-2.amazonaws.com/garden/cat-27f4aa058f" "--cache=true" "--build-arg" "BASE_IMAGE=myaccount.dkr.ecr.eu-west-2.amazonaws.com/base-image:v-8d42ac5948";

What we're seeing is that even though the broken layer is in the cache registry, kaniko doesn't detect it and doesn't fail to build. Instead pushes an image with that broken layer in place

swist on 28 Mar 2020

Also running into the error:

Failed to pull image "<imagename>": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): file exists

Building with image: gcr.io/kaniko-project/executor:debug-v0.18.0

Command:

/kaniko/executor --cache=true \
                 --context ${WORKSPACE} \
                 --destination=${imageName} \
                 --dockerfile ${WORKSPACE}/${dockerfileName} \
                 --log-format text

Dockerfile:

# https://github.com/ContinuumIO/docker-images/blob/master/miniconda3/debian/Dockerfile
FROM continuumio/miniconda3:4.7.12

SHELL ["/bin/bash", "-euo", "pipefail", "-c"]

USER root

WORKDIR /source/model

COPY ./environment.yml .

RUN conda env update -f environment.yml \
 && conda clean --all

COPY . .

CMD ["python3", "model.py"]

cx1111 on 30 Mar 2020

Is there anything else we can provide to help with this?

swist on 1 Apr 2020

@swist Sorry with the current situation I have been able to get to this sooner.
I spent a lot of time trying to duplicate this bug at my end with the docker file you provided.
I did the following actions

Build a kaniko image using --cache=true and push to GCR registry
I modified the environment.yml to add a simple dependency.

name: stats-1
dependencies:
  - numpy

Re-building the image. I verified the size of the cached layers and did not notice any duplication.
Added a python file in to the source context and rebuild. No changes in the size of cached layer or the image produced.

Is there anything else i am missing?

I also deployed a pod with the image produced on GKE and the pod was running fine. I am looking into code now to see what more can i do.

tejal29 on 2 Apr 2020

We're pretty sure that adding a dependency is going to invalidate the broken layer - it is going force a reinstall. What we have found has sometimes fails is when you add an extra empty file in the context so that it gets copied in just before the last layer

I'll try to get you a minimal reproducible example, unfortunately the only thing I have currently is my org's code.

swist on 2 Apr 2020

According to community slack, something like this should do:

run the same multistage build twice where the first non-target stage results in the same files

swist on 2 Apr 2020

I'm having this issue with a non-multistage build for a python application with --cache=true set and no other cache variables present. The resulting image cannot be pulled:

...
e645b11c20c7: Pull complete 
f99c55bbe43f: Pull complete 
a601f4aae6a4: Pull complete 
0eab66b1bfee: Extracting [==================================================>]  658.4MB/658.4MB
1a62f58fd1c4: Download complete 
c885b0b9491c: Download complete 
1bb9ef84f671: Download complete 
7d64f9e9bf99: Download complete 
b09c7a2df20f: Download complete 
55407e0fe961: Download complete 
failed to register layer: Error processing tar file(exit status 1): failed to mknod("/var/cache/apt/pkgcache.bin", S_IFCHR, 0): file exists

It's uncleared what triggered the issue, though clearing the cache 'fixes' it. if additional information helps please ask and I can try and provide.

jeffbehl on 6 Apr 2020

Only solution to still use cache layers in a Docker registry is to rollback to 0.16.0 from any version from 0.17.0 to 0.19.0, this is a major regression and should be p0.

Misteur-Z on 7 Apr 2020

Still happens for us. Take any multistage build. Add/remove line that sets ENV=$BUILDARG

@tejal29 do you have any more context as to how this is happening? Were you able to reproduce on your setup?

swist on 8 Apr 2020

We're seeing this problem also. It seems to happen whenever we build this Dockerfile:

FROM ubuntu:18.04
MAINTAINER "XXX"
RUN apt-get update
RUN apt-get install -y python3 python3-dev python3-pip
COPY ./ ./app
WORKDIR ./app
RUN pip3 install -r requirements.txt
ENTRYPOINT [ "python3" ]
CMD [ "main.py" ]

rhs on 8 Apr 2020

Still happens for us. Take any multistage build. Add/remove line that sets ENV=$BUILDARG

@tejal29 do you have any more context as to how this is happening? Were you able to reproduce on your setup?

yes. i was able to reproduce this when i ran the above dockerfile 3rd time.

tejal29 on 14 Apr 2020

We are also hitting this problem

c935c89cdb3c: Pull complete
59bb444f4818: Pull complete
f14fac6c6413: Extracting [==================================================>]  467.1MB/467.1MB
2431d9999311: Download complete

image: gcr.io/kaniko-project/executor:debug

command:
/kaniko/executor --context=${context} --dockerfile=${Dockerfile} --destination=${image} --cache=true --cache-repo=${IMAGE_CACHE}

More detail in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/28988

caalberts on 15 Apr 2020

👍1

Having previously ran into the same problem (Python based, multi-step Dockerfile), i've tried to use the files here to reproduce the issue but in vain.

@cx1111 - I've tried your Dockerfile (the one based on continuumio/miniconda3:4.7.12) on both v0.18.0 and v0.19.0, with an environment.yml file of a single dependency pytz.
The bug did not reproduce, but something strange I did notice was that in version v0.18.0 - inserting a dummy command just before the CMD (RUN python3 --version), caused the final image size to grow by few hundreds of MB. I'm guessing that's an unrelated bug that was fixed.
Anyway, could you share your environments.yml file?

@rhs - same here, didn't manage to reproduce it. Any chance you can share the requirements.txt file? I tried it with a pretty small set of Python requirements, so I don't know if it might be related.

dani29 on 18 Apr 2020

@swist I have tried to use your Dockerfile with a very basic pyproject.toml file (copied from here), and Kaniko (v0.18.0+) crashes with a panic. This seem to be related to a very recent issue with Go 1.14 based images (like Kaniko), discussed here: https://github.com/golang/go/issues/37436 https://github.com/golang/go/wiki/LinuxKernelSignalVectorBug

I'll created a separate issue here: https://github.com/GoogleContainerTools/kaniko/issues/1202

dani29 on 18 Apr 2020

I also encountered this problem. I'm using Google Cloud Build. It builds and works completely fine with docker build-docker push or if there's no cache. If there is a cache then it fails with following error:

rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): failed to mknod("/tmp/v8-compile-cache-0", S_IFCHR, 0): file exists

Dockerfile:

FROM node:12-slim
WORKDIR /app
EXPOSE 8080
CMD ["node", "app-client.js"]
RUN apt-get update && apt-get install --yes --no-install-recommends \
    ca-certificates \
    curl \
    gnupg \
  && echo "deb http://packages.cloud.google.com/apt gcsfuse-stretch main" \
    | tee /etc/apt/sources.list.d/gcsfuse.list \
  && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
  && apt-get update \
  && apt-get install --yes gcsfuse \
  && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
COPY package*.json ./
RUN npm ci
COPY . .

cloudbuild.yaml

steps:
  - id: Build and push container
    name: 'gcr.io/kaniko-project/executor:latest'
    args: [
      '--dockerfile=Dockerfile',
      '--destination=$_IMAGE_NAME:$COMMIT_SHA',
      '--destination=$_IMAGE_NAME:$BRANCH_NAME',
      '--cache=true',
    ]

options:
  substitutionOption: ALLOW_LOOSE
substitutions:
  _IMAGE_NAME: gcr.io/$PROJECT_ID/project_name
  _DOCKERFILE_NAME: Dockerfile
  _DOCKERFILE_DIR: ''

TimShilov on 18 Apr 2020

Finally managed to reproduce it consistently and figured out that it has to do with incorrect whiteout of certain files. I am still not very familiar with the logic in that area, but seeing that it was recently refactored in #1069, perhaps @tejal29 or @cvgw have a better intuition on what may have gone wrong?

To Reproduce:

If you run on certain Linux kernels, you may fail to build it with the latest kaniko image (see issue 1202).
Run make images and build a local image from the source code. That will fix it.

Dockerfile

FROM python:3.7.4-slim AS service-base
RUN apt-get update \
  && apt-get --assume-yes install curl gpg  libcurl4-openssl-dev libssl-dev gcc git build-essential netcat libsnappy-dev

RUN pip install poetry=="1.0.2"
COPY ./_packages /api/_packages
COPY ./config /config

WORKDIR /service/cat
COPY . /service/cat
RUN rm -rf _packages
RUN rm -rf config

RUN \
  poetry config virtualenvs.create false && \
  poetry install

Context:

_packages/
    foo.txt
config/
    foo.txt
pyproject.toml

pyproject.toml is the only file whose contents are meaningful. The text files are just "hello world".

[tool.poetry]
name = "poetry-demo"
version = "0.1.0"
description = ""
authors = ["Sébastien Eustace <[email protected]>"]

[tool.poetry.dependencies]
python = "*"

[tool.poetry.dev-dependencies]
pytest = "^3.4"

Build command to GCR:

docker run --rm -it \
 --volume "$HOME"/.config/gcloud:/root/.config/gcloud \
 --volume $PWD:/build gcr.io/kaniko-project/executor:latest \
 --context=/build \
 --dockerfile=Dockerfile \
 --destination=us.gcr.io/$YOUR_PROJECT/kaniko-bug \
 --cache=true \

Build it once, and you'll see an image on GCR with size of ~190MB that can be pulled.
edit pyproject.toml and add a new dependency f.e pytz = "2020.1".
Build it again, and you'll see a new image, this time with size of ~320MB, that fails to be pulled:

ERROR: failed to register layer: Error processing tar file(exit status 1): failed to mknod("/etc/alternatives/pager.1.gz", S_IFCHR, 0): file exists

Some observations:

If run this script w/ debug verbosity (I recommend to comment out the noisy log on resolve.go), you can see that at some stage, pager.1.gz is whited-out and appears as .wh.pager.1.gz when files are resolved. IIUC this is what the eventual error says - it tries to create a file that already exists on the lower layers of the overlay FS.
If you add this funny hack in L185 of snapshot.go, making it skip pager.1.gz:

if s.l.MaybeAddWhiteout(path) {
      if path != "/etc/alternatives/pager.1.gz" {
       logrus.Debugf("Adding whiteout for %s", path)
       filesToWhiteOut = append(filesToWhiteOut, path)
}

The produced image is still with the wrong size (320MB, instead of 190MB), but it actually can be pulled normally. Which indicates that the bug that doubled layer's size is still not resolved.

dani29 on 18 Apr 2020

👍4 ❤1

We're continuously bumping into this as well and it makes the cache completely unusable. Is there anything we can do to help prioritise this?

Thanks for all your effort.

mitchfriedman on 23 Apr 2020

👍17

I'm experiencing the same issue on 0.19.0. It works fine on a small Node.js image, but on a more complicated PHP image it dies. Trying now to revert to 0.16.0 to see if that will solve it for me.

nethunter on 24 Apr 2020

Hey folks, sorry for the delay. I spent yesterday, debugging this issue and here are my findings.

Run kaniko/executor -f Dockerfile --cache=true --destination=<your_prj>
Now the cache is populated,
Make a change in your source files or dependencies.
Re-run . In this case, the image size gets doubled and docker image is not pullable.

Note: This does not happen consistently.

After doing some debugging and from error logs, the cases where docker pull <image> failed due

Error processing tar file(exit status 1): failed to  XXXXXFILE : file exists

There was 2 entries for the file appeared in the layer twice in the first uncached layer or across layers.
The reason for this, kaniko takes a snapshot of entire FS before running any commands.
However, in case when --cache=true, this caused a bug where if with a dockerfile with 5 layers, the command corresponding to layer3 was changed, Kaniko

Step 0: will first snapshot the entire FS
Step 1, 2: will pull layers 1 and 2 from cache and extract them
Step 3: execute command 2 and try to find changed files since last snapshot.
In this case, the snapshotter recognizes all files from layer1, layer2 and layer 3 as changed.

I fixed this in PR #1214 by moving the Step 0 i.e. initial snapshot when there is first layer which is not chached.

I verified Dockerfile given by @cx1111 and example provided by @dani29

I have pushed this fix in the following images

docker push gcr.io/kaniko-project/executor:debug-edge
docker push gcr.io/kaniko-project/executor:edge

Can someone give it a try?

tejal29 on 1 May 2020

I'm experiencing the same issue on 0.19.0. It works fine on a small Node.js image, but on a more complicated PHP image it dies. Trying now to revert to 0.16.0 to see if that will solve it for me.

@nethunter Can you please try:

docker push gcr.io/kaniko-project/executor:debug-edge
docker push gcr.io/kaniko-project/executor:edge

@mitchfriedman Sorry things have been slow. But please try

docker push gcr.io/kaniko-project/executor:debug-edge
docker push gcr.io/kaniko-project/executor:edge

tejal29 on 1 May 2020

gcr.io/kaniko-project/executor:edge seems to have fixed it for me. (However, I'm not sure how to invalidate the older layers which are still cached.)

IlyaSemenov on 1 May 2020

@IlyaSemenov, thanks for verifying. Would have to delete the remote cache.

tejal29 on 1 May 2020

Unfortunately gcr.io/kaniko-project/executor:edge haven't solved the problem for me.
I removed remote cache and switched to edge. After couple of builds I'm getting the same error:

rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): failed to mknod("/tmp/v8-compile-cache-0", S_IFCHR, 0): file exists

TimShilov on 1 May 2020

@TimShilov are you making code changes before running each build? Also are you by any chance reusing the same kaniko pod?

tejal29 on 1 May 2020

@tejal29 I was making dummy changes (just whitespace) in one file between commits.
I'm building images with Google Cloud Build so I'm not 100% sure if it gets reused or not but I assume the answer is no.

Just tried one more time just to make sure I didn't missed anything. Removed remote cache, first image worked fine. After second build I got the error again.
Version 0.16.0 works without any problems.

Is there anything else I can do or try to help?

TimShilov on 1 May 2020

ok. let me try with GCB build. I am trying to find a small package-lock.json to reproduce your issue.

tejal29 on 1 May 2020

@TimShilov, i tried your dockerfile and it did work for me.
This is what i did:

Run kaniko build on this sample node js app -> https://github.com/heroku/node-js-sample
Changed a file index.js
Re-ran kaniko build.
I did a docker pull on the 2nd image.

Can you please check if there is a substantial increase in the size of the 2 images (like almost doubling)?

tejal29 on 1 May 2020

I am going to do the build one more time!

tejal29 on 1 May 2020

confirming, i saw the error on 4th build.

tejal29 on 1 May 2020

@TimShilov I finally got your issue fixed. the condition to detect if initial snapshot should be taked was not complete.

In my earlier branch, i took the initial snapshot for first non cached command.
However, for metadata commands this condition would return true. In a dockerfile if you have a metadata command before a cached command, then the code went back to previous case.
Your dockerfile demonstrated this case.
I have the changed the conditions to
First non cached command which
not a metadata manipulation command
not a command which exactly knows which files changed e.g. "COPY"

I will do some more testing and give you an image to try soon

tejal29 on 1 May 2020

@TimShilov , Can you try with this image:

 gcr.io/kaniko-project/executor:debug-edge
gcr.io/kaniko-project/executor:edge

tejal29 on 2 May 2020

@tejal29 Tried again with gcr.io/kaniko-project/executor:edge. I made 4 builds and had no problems so far!
I will keep using edge for a couple of days to see if problem will return but so far seems fixed to me.

Thanks for your help! 👍

TimShilov on 2 May 2020

Thanks @TimShilov for your help verifying this build.
This will in v0.20.0 release

tejal29 on 4 May 2020

@tejal29 Thanks for fixing this, really appreciate it! I wanted to try using gcr.io/kaniko-project/executor:debug-edge, but it is now failing when I attempt to create a /kaniko/.docker/config.json

 $ echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
 /busybox/sh: eval: line 147: can't create /kaniko/.docker/config.json: nonexistent directory

Previously I was using gcr.io/kaniko-project/executor:debug-v0.16.0 and there was no nonexistent directory error. Is this an issue specific to edge images, or something has changed between 0.16.0 and edge

caalberts on 4 May 2020

@tejal29 Thanks for fixing this, really appreciate it! I wanted to try using gcr.io/kaniko-project/executor:debug-edge, but it is now failing when I attempt to create a /kaniko/.docker/config.json
 $ echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
 /busybox/sh: eval: line 147: can't create /kaniko/.docker/config.json: nonexistent directory
Previously I was using gcr.io/kaniko-project/executor:debug-v0.16.0 and there was no nonexistent directory error. Is this an issue specific to edge images, or something has changed between 0.16.0 and edge

It is NOT fixed in debug-v0.20.0
so you'll need a