Kaniko: Number of layers inconsistent with kaniko builds that copy/copy the same files multiple times

Created on 24 Jul 2018  路  18Comments  路  Source: GoogleContainerTools/kaniko

The integration test run for https://github.com/GoogleContainerTools/kaniko/pull/249 failed even though the PR is fixing a typo in a README.

Seems like we either have a flakey test or the test is broken! :O

++ go test
--- FAIL: TestLayers (49.18s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy_reproducible (2.70s)
        integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy_reproducible and gcr.io/kaniko-test/kaniko-dockerfile_test_copy_reproducible: expected 0 but got 1
FAIL
kinbug

All 18 comments

Looking back through kokoro history I can see a bunch of similar failures:

https://github.com/GoogleContainerTools/kaniko/pull/249:

++ go test
--- FAIL: TestLayers (49.18s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy_reproducible (2.70s)
        integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy_reproducible and gcr.io/kaniko-test/kaniko-dockerfile_test_copy_reproducible: expected 0 but got 1

https://github.com/GoogleContainerTools/kaniko/pull/243 (twice, both with same commit):

++ go test
--- FAIL: TestLayers (49.67s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy (2.74s)
        integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy and gcr.io/kaniko-test/kaniko-dockerfile_test_copy: expected 0 but got 1
++ go test
--- FAIL: TestLayers (48.30s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy_reproducible (2.55s)
        integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy_reproducible and gcr.io/kaniko-test/kaniko-dockerfile_test_copy_reproducible: expected 0 but got 1



md5-6e74fdac27b00ea981846ff574dcaf4e



++ go test
--- FAIL: TestLayers (53.94s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy (2.91s)
        integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy and gcr.io/kaniko-test/kaniko-dockerfile_test_copy: expected 0 but got 1



md5-45324bca7dd5163f0645bd26b98945fd



++ go test
--- FAIL: TestLayers (49.92s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy_bucket (2.61s)
        integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy_bucket and gcr.io/kaniko-test/kaniko-dockerfile_test_copy_bucket: expected 0 but got 1

Hmm weird yah I think it might be a flaky test -- does docker always build images with the same number of layers? Only reason I can think of as to why this might be happening.

image

Another example from #256 - I'LL GET TO YOU YET

YOUR DAYS ARE NUMBERED JUST LIKE YOUR LAYERS

In a recent run, this failed twice:

image

Super weird tho b/c the "reproducible" build, which uses the same image (and actually, no other changes) didn't fail :thinking:

But sometimes the reproducible image fails:

image

It's always the images testing copying tho.

Managed to reproduce it locally, so it's not a docker version issue:

image

Interestingly, it looks like it's the kaniko built image that is short a layer:

--- FAIL: TestLayers (59.86s)
    --- FAIL: TestLayers/test_layer_Dockerfile_test_copy_bucket (19.43s)
        integration_test.go:210: Difference in number of layers in each image is 1 but should be 0. us.gcr.io/christiewilson-catfactory/docker-dockerfile_test_copy_bucket has 18 layers and us.gcr.io/christiewilson-catfactory/kaniko-dockerfile_test_copy_bucket has 17 layers

I added some extra debug output and the correct number of layers for that image should be 18

Looking back at one of the images kaniko built which only had 17 layers, when I compare that image to one with 18 layers using container-diff, container-diff reports no differences!! It seems like for some reason two layers are being squashed. Either that or there is a layer that does nothing?

~thought I had reproduced this with some other images~

lol nope, that was a bug I introduced into the test XD

When reproduced on my machine, it looks like the difference is that in the 17 layer case, we get this:

time="2018-07-31T23:21:46Z" level=info msg="cmd: copy [context/b*]"
time="2018-07-31T23:21:46Z" level=info msg="dest: /baz/"
time="2018-07-31T23:21:46Z" level=info msg="Creating directory /baz"
time="2018-07-31T23:21:46Z" level=info msg="Creating directory /baz/bam"
time="2018-07-31T23:21:46Z" level=info msg="Copying file /workspace/context/bar/bam/bat to /baz/bam/bat"
time="2018-07-31T23:21:46Z" level=info msg="Copying file /workspace/context/bar/bat to /baz/bat"
time="2018-07-31T23:21:46Z" level=info msg="Copying file /workspace/context/bar/baz to /baz/baz"
time="2018-07-31T23:21:46Z" level=info msg="Taking snapshot of files [/baz/ /baz/bam /baz/bam/bat /baz/bat /baz/baz]..."
time="2018-07-31T23:21:46Z" level=info msg="No files were changed, appending empty layer to config."
time="2018-07-31T23:21:46Z" level=info msg="cmd: copy [context/foo context/bar/ba?]"

And in the 18 layer case, we get:

time="2018-07-31T23:22:05Z" level=info msg="cmd: copy [context/b*]"
time="2018-07-31T23:22:05Z" level=info msg="dest: /baz/"
time="2018-07-31T23:22:05Z" level=info msg="Creating directory /baz"
time="2018-07-31T23:22:05Z" level=info msg="Creating directory /baz/bam"
time="2018-07-31T23:22:05Z" level=info msg="Copying file /workspace/context/bar/bam/bat to /baz/bam/bat"
time="2018-07-31T23:22:05Z" level=info msg="Copying file /workspace/context/bar/bat to /baz/bat"
time="2018-07-31T23:22:05Z" level=info msg="Copying file /workspace/context/bar/baz to /baz/baz"
time="2018-07-31T23:22:05Z" level=info msg="Taking snapshot of files [/baz/ /baz/bam /baz/bam/bat /baz/bat /baz/baz]..."
time="2018-07-31T23:22:05Z" level=info msg="cmd: copy [context/foo context/bar/ba?]"

Hmm so I guess sometimes kaniko thinks files have changed and sometimes not -- It's probably because this command and this command are doing the same thing (files probably shouldn't have changed after running the second command which I guess is why the Docker images have 17 layers as well)

I'd probably take a look and make sure snapshotting is happening correctly, here is where kaniko decides if a file has changed and should be added!

Thanks @priyawadhwa ! I think that MaybeAdd function you linked to might be the culprit :thinking:

I've found a way to reproduce this pretty reliably, with this Dockerfile:

FROM alpine@sha256:5ce5f501c457015c4b91f91a15ac69157d9b06f1a75cf9107bf2b62e0843983a
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/

The weirdest thing is that it's not consistent, the number of layers missing varies, e.g. between these two runs with the above Dockerfile:

--- FAIL: TestLayers (18.52s)
    --- FAIL: TestLayers/test_layer_Dockerfile_test_copy_fail (18.52s)
        integration_test.go:220: Difference in number of layers in each image is 10 but should be 0. Image 1: Image: [us.gcr.io/christiewilson-catfactory/docker-dockerfile_test_copy_fail] Digest: [50ab99551cba99eb8b5a825198a050eff931b5b0128cd4b4588e75c8c49a626a] Number of Layers: [15], Image 2: Image: [us.gcr.io/christiewilson-catfactory/kaniko-dockerfile_test_copy_fail] Digest: [adab18e80865003107b1b251913ec026ba300579f559919f2468b072507b13fa] Number of Layers: [5]
FAIL
--- FAIL: TestLayers (17.29s)
    --- FAIL: TestLayers/test_layer_Dockerfile_test_copy_fail (17.29s)
        integration_test.go:220: Difference in number of layers in each image is 1 but should be 0. Image 1: Image: [us.gcr.io/christiewilson-catfactory/docker-dockerfile_test_copy_fail] Digest: [50ab99551cba99eb8b5a825198a050eff931b5b0128cd4b4588e75c8c49a626a] Number of Layers: [15], Image 2: Image: [us.gcr.io/christiewilson-catfactory/kaniko-dockerfile_test_copy_fail] Digest: [f002e2b9f606e631bb3c3282ce97be2f50f7e88c03a4f087bbdf50799f08db40] Number of Layers: [14]
FAIL

It seems like copying a dir multiple times is the problem.

I also tried copying just one file multiple times:

FROM alpine@sha256:5ce5f501c457015c4b91f91a15ac69157d9b06f1a75cf9107bf2b62e0843983a
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo

And similarly, the number of layers was different (b/c MaybeAdd would have correctly determined no new layer should be added I guess) ~but it was _consistent_ whereas in the dir case it's inconsistent.~
I tried the same thing again with more COPY statements and it turns out the results were just as inconsistent, it just doesn't happen as often.

@priyawadhwa does the log message "No files were changed, appending empty layer to config" mean that an actual layer should be added in that case? That doesn't seem to be the behavior (no layer is added from what I can tell)

does the log message "No files were changed, appending empty layer to config" mean that an actual layer should be added in that case? That doesn't seem to be the behavior (no layer is added from what I can tell)

Yah that's correct, that log message means that an actual layer won't be added, but an "empty layer" will be added to the config just to show that a command was run.

Okay so it looks like what is happening is:

  1. When we check if we should add a file to a layer
  2. We call MaybeAdd
  3. Which, if the file has already been added, by default compares the hashes of the files using the files' mtimes as an input

It turns out that the mtime can lag. At first I thought this was a docker thing, or specific to the base images we are using, but I was also able to reproduce it on my ubuntu host with (thanks @dlorenc for this one liner):

for i in $(seq 100); do echo "hey $i" > foo2 && stat foo2 | grep Modify; done;

Sometimes the mtime will lag. This means the contents of the file may have actually changed, but the mtime will make it look like the files are the same. This behaviour is not consistent, which is why the number of layers kaniko was building would vary.

However if you introduce a call to sync before each stat, the mtime seems to be consistently updated:

for i in $(seq 100); do echo "hey $i" > foo2 && sync && stat foo2 | grep Modify; done;

This does make the complete for loop take ~1 order of magnitude longer, however if we call this once per layer, I think we're looking at only < 100 ms extra time per layer.

Unfortunately even after adding a sync before checking hashes, it looks like the issue is still happening! This might have something to do with the fact that sync may return before the writes are actually completed (https://linux.die.net/man/2/sync).

Other ideas for how to fix this:

  1. Add more sync calls :sweat_smile: e.g. when kaniko starts up, and/or before the tests run (maybe all the test setup is causing a lot of data to get flushed when sync is called)
  2. Radical idea: always include files that are COPY/ADD-ed: the way COPY is implemented the file will _always_ be truncated and re-written, even if it's different, so we should really be included a layer for this every time (unless we want to start looking at file contents which I'm guessing we don't b/c it's much too slow)

What do you think @priyawadhwa ?

From the latest failures in #273:

time="2018-08-02T23:53:10Z" level=info msg="Copying file /workspace/context/foo to /foo"
time="2018-08-02T23:53:10Z" level=info msg="Taking snapshot of files [/foo]..."
time="2018-08-02T23:53:10Z" level=info msg="File / Mode drwxr-xr-x ModTime 2018-08-02 23:53:10.775057552 +0000 UTC IsRegular %!s(bool=false)\n"
time="2018-08-02T23:53:10Z" level=info msg="File /foo Mode -rwxrwxr-x ModTime 2018-08-02 23:53:10.843057552 +0000 UTC IsRegular %!s(bool=true)\n"
time="2018-08-02T23:53:10Z" level=info msg="cmd: copy [context/foo]"
time="2018-08-02T23:53:10Z" level=info msg="dest: /foo"
time="2018-08-02T23:53:10Z" level=info msg="Copying file /workspace/context/foo to /foo"
time="2018-08-02T23:53:10Z" level=info msg="Taking snapshot of files [/foo]..."
time="2018-08-02T23:53:10Z" level=info msg="File / Mode drwxr-xr-x ModTime 2018-08-02 23:53:10.775057552 +0000 UTC IsRegular %!s(bool=false)\n"
time="2018-08-02T23:53:10Z" level=info msg="File /foo Mode -rwxrwxr-x ModTime 2018-08-02 23:53:10.843057552 +0000 UTC IsRegular %!s(bool=true)\n"
time="2018-08-02T23:53:10Z" level=info msg="No files were changed, appending empty layer to config."

The modtime is the same :''''(

Talked with @priyawadhwa and we're gonna try out option 2!

As discussed let's go for option 2!! :smiley:

Aw yeah, eventually consistent github commenting :sunglasses:

Was this page helpful?
0 / 5 - 0 ratings