What happened:
We occasionally see failures loading images with kind load. From a rough grep, this is impacting roughly 3% of our PRs. Note that each PR runs ~20 tests and loads ~10 images, and may be rerun many times due to test failures, new commits, etc. So this probably works out to kind load failing on only about 0.003% of individual loads, I guess?
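For illustration, a rough back-of-envelope version of that estimate; the per-run numbers come from above, while the rerun count is only a guess chosen to roughly match the quoted figure:

```sh
# Back-of-envelope estimate; the rerun count is an assumption, not a measured number.
awk 'BEGIN {
  pr_failure_rate = 0.03      # ~3% of PRs hit a failed load
  loads_per_run   = 20 * 10   # ~20 tests, each loading ~10 images
  reruns_per_pr   = 5         # rough guess at reruns from flakes / new commits
  per_load = pr_failure_rate / (loads_per_run * reruns_per_pr)
  printf "estimated per-load failure rate: %.3f%%\n", per_load * 100
}'
```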
What you expected to happen:
Ideally, kind load would be more robust and not hit these errors. But if that is not feasible, it would be nice to have better logging/error messages, and maybe retries? I'm not sure, as I don't yet understand the root cause.
How to reproduce it (as minimally and precisely as possible):
It's a very intermittent failure, so I'm not sure we can reproduce it easily. I can, however, point you to a bunch of logs:
We do everything with loglevel=debug and dump the kind logs in Artifacts, so hopefully everything is there. I didn't really look through the logs much since I don't know what to look for, but I'm happy to look deeper if I'm pointed in the right direction.
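For reference, a sketch of how such a log dump might be collected with the 0.5.x CLI; the cluster name and output path are placeholders, not the actual CI configuration:

```sh
# Hypothetical cluster name and artifacts path.
kind export logs ./artifacts/kind-logs --name test-cluster --loglevel debug
```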
Anything else we need to know?:
As mentioned, this is pretty rare. A 99.999% pass rate is pretty solid, so I wouldn't be too disappointed if nothing can be done here.
Environment:
kind version: 0.5.1
kubectl version: running kind on GKE 1.13; I think all of these are spinning up 1.15 clusters
docker info: 18.06.1
/etc/os-release: COS

> But if that is not feasible, it would be nice to have better logging/error messages, and maybe retries? I'm not sure, as I don't yet understand the root cause.
Retries is a good idea to explore, though ideally this doesn't fail :/
Logging is substantially more powerful in HEAD, -v 1 or greater will result in a stack trace being logged on failure, along with the command output if the failure's cause is executing a command.
Have we experimented yet with combining the 10 images with docker save ... and then using kind load image-archive instead of kind load docker-image?
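For concreteness, a sketch of that suggestion with placeholder image and cluster names; the -v flag only applies to newer kind (HEAD at the time), so it would be dropped on v0.5.x:

```sh
# Bundle the images into one archive, then load the archive once per cluster.
docker save -o /tmp/ci-images.tar registry.example/app-a:dev registry.example/app-b:dev

# -v 1 enables the richer failure logging described above (newer kind only).
kind load image-archive /tmp/ci-images.tar --name test-cluster -v 1
```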
I will try the save-everything-to-one-tar approach this week and see if things improve. Thanks!
The containerd log file contains several errors about failing to load the images.
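If someone wants to repeat that check, something like the following should surface those entries; the artifacts path is a placeholder, and the per-node containerd.log layout is what kind export logs typically produces:

```sh
# Assuming logs were collected with `kind export logs ./artifacts/kind-logs`;
# each node directory in the bundle typically contains a containerd.log.
grep -iE "error|failed" ./artifacts/kind-logs/*/containerd.log
```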
Maybe loading 10 images into containerd in parallel is too much and we should cap the parallelism?
Seems plausible, I'll drop the parallelism a bit and see what happens.
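A sketch of what dropping the parallelism entirely looks like, loading the images one at a time; image and cluster names are placeholders:

```sh
# Load images serially instead of all at once.
for img in registry.example/app-a:dev registry.example/app-b:dev; do
  kind load docker-image "$img" --name test-cluster
done
```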
It's also possible we've picked up a containerd fix with the new containerd infra; we should have much more recent containerd builds going forward (e.g. in HEAD of kind we're on the latest stable release + backports).
I switched to loading images 1 at a time instead of in parallel, and at the same time our testing load roughly doubled due to an incoming release. Load failures seem about the same if not a little bit worse.
Sounds like the next step would be to try the newer versions of kind. I was planning to wait for v0.6.0, would you suggest we just switch to master now?
If you pin a particular commit, master as of this moment is probably a good choice. I'm intending to get v0.6.0 out soon-ish, but possibly not fast enough to resolve this.
How horrendous would it be to add a retry? It would flake to being slower, but possibly succeed instead of failing outright.
considering a similar trade-off for #949
A retry is a good option and seems a worthwhile tradeoff. I'll try that out, and if I see it again I'll update to some commit on master. Thanks!
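For reference, a minimal sketch of the kind of retry wrapper being discussed, with placeholder image and cluster names; this is not the actual CI change described below:

```sh
# Attempt each load up to 3 times before giving up.
load_with_retry() {
  local img=$1 attempt
  for attempt in 1 2 3; do
    kind load docker-image "$img" --name test-cluster && return 0
    echo "kind load failed for ${img} (attempt ${attempt}), retrying..." >&2
    sleep 5
  done
  return 1
}

load_with_retry registry.example/app-a:dev
```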
Quick update: 6 days ago we added retries. I think the logic in my change is wrong and it doesn't always retry, but anecdotally things seem to have improved. We still have not updated past v0.5.1.
So I think for now this is mostly mitigated.
Actually, I just realized that in the one test where I have seen an error, the retry was broken. So I think I have never seen it fail with 3 retries.
ACK, thanks!
I'll send you a ping when 0.6 is ready and out; I'm continuing to focus on identifying any stability weak points and eliminating them. I think the last known one now is log export issues, which I'll tackle shortly. We have a lot more CI signal and a better ability to ship the latest containerd improvements now, which should help.
Forgot to ping when this came out in the KubeCon chaos :/
Any sign of this with 0.6 images?
We're keeping up to date with containerd's latest 1.3.X versions now as we develop kind.
I have not seen any issues loading images in at least a month, but we do have retries now that may be masking errors.
Going to close this for now, but keeping an eye out for signs of this.