Kind: Very infrequent "failed to load image: exit status 1" errors

Created on 4 Oct 2019 · 15 comments · Source: kubernetes-sigs/kind

What happened:
We occasionally see issues loading kind images. From a rough grep, I think this is impacting roughly 3% of our PRs. Note that each PR runs ~20 tests and loads ~10 images, and may be rerun many times due to test failures, new commits, etc. So this number likely means kind load itself is only failing around 0.003% of the time, I guess?

What you expected to happen:

Ideally, kind load would be more robust and not experience these errors. But if that is not feasible, it might be nice to have better logging/error messages, or possibly retries. I'm not sure, as I don't yet understand the root cause.

How to reproduce it (as minimally and precisely as possible):

These are very intermittent failures, so I am not sure we can reproduce them easily. I can, however, point you to a bunch of logs:

We do everything with loglevel=debug and dump the kind logs into Artifacts, so hopefully everything is there. I didn't really look through the logs much as I don't know what to look for, but I'm happy to dig deeper if pointed in the right direction.
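For reference, a minimal sketch of how a CI job might capture these logs, assuming the v0.5.x CLI (the `--loglevel` flag and `kind export logs`); the cluster name and output directory are placeholders:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Create the cluster with verbose logging (v0.5.x exposed --loglevel;
# newer releases use -v <level> instead).
kind create cluster --name ci-cluster --loglevel debug

# ... run tests, kind load images, etc. ...

# Export node, kubelet, and containerd logs into the CI artifacts directory.
kind export logs ./artifacts/kind --name ci-cluster
```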

Anything else we need to know?:

As mentioned, this is pretty rare. A 99.999% pass rate is pretty solid, so I wouldn't be too disappointed if nothing could be done here.

Environment:

  • kind version: (use kind version): 0.5.1
  • Kubernetes version: (use kubectl version): Running Kind on GKE 1.13. I think all of these are spinning up 1.15 clusters
  • Docker version: (use docker info): 18.06.1
  • OS (e.g. from /etc/os-release): COS
Labels: kind/bug

All 15 comments

But if that is not feasible, it might be nice to have better logging/error messages, or possibly retries. I'm not sure, as I don't yet understand the root cause.

Retries are a good idea to explore, though ideally this wouldn't fail :/

Logging is substantially more powerful at HEAD: -v 1 or greater will result in a stack trace being logged on failure, along with the command output if the failure's cause is executing a command.

Have we experimented yet with combining the 10 images with docker save ... and then using kind load image-archive instead of kind load docker-image?
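For illustration, a minimal sketch of that combined-archive approach; the image names are hypothetical placeholders for the ~10 images the PR jobs load:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Save all images into a single archive instead of loading them one by one.
# (image names are hypothetical placeholders)
docker save -o images.tar \
  registry.example.com/app-a:dev \
  registry.example.com/app-b:dev \
  registry.example.com/app-c:dev

# Load the combined archive into the kind cluster in one shot.
kind load image-archive images.tar
```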

I will try the save-everything-to-one-tar approach this week and see if things improve. Thanks!


The containerd log contains several errors about failing to load the images:

https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/17569/e2e-simpleTests_istio/1326/artifacts/kind/istio-testing-control-plane/containerd.log

Maybe loading 10 images into containerd in parallel is too much and we should cap the maximum?

Seems plausible, I'll drop the parallelism a bit and see what happens.
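As an aside, a minimal sketch of capping load parallelism from the CI side, assuming xargs with -P support; the cap of 3 and the image names are placeholders:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Load at most 3 images into the kind cluster at a time instead of all 10 at once.
# (image names are hypothetical placeholders)
printf '%s\n' \
  registry.example.com/app-a:dev \
  registry.example.com/app-b:dev \
  registry.example.com/app-c:dev |
  xargs -n1 -P3 kind load docker-image
```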

it's also possible we may have picked up a containerd fix with the new containerd infra; we should have much more recent containerd builds going forward (e.g. at HEAD of kind we're on the latest stable release + backports)

I switched to loading images one at a time instead of in parallel, and at the same time our testing load roughly doubled due to an incoming release. Load failures seem about the same, if not a little worse.

Sounds like the next step would be to try a newer version of kind. I was planning to wait for v0.6.0; would you suggest we just switch to master now?

if you pin a particular commit, master as of this moment is probably a good choice. I'm intending to get v0.6.0 out soon-ish, but possibly not fast enough to resolve this.

how horrendous would it be to add a retry? it would degrade into being slower, but possibly succeed instead of failing outright?

considering a similar trade-off for #949

Retry is a good option and seems like a worthwhile tradeoff. I'll try that out, and if I see it again I'll update to some commit on master. Thanks!
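For reference, a minimal retry-wrapper sketch of the sort a CI script could use around kind load; the attempt count, sleep, and image name are arbitrary placeholders, not what was actually committed:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Retry `kind load` a few times before giving up, trading speed for reliability.
load_with_retry() {
  local image="$1"
  local attempts=3
  for ((i = 1; i <= attempts; i++)); do
    if kind load docker-image "${image}"; then
      return 0
    fi
    echo "kind load failed for ${image} (attempt ${i}/${attempts}), retrying..." >&2
    sleep 5
  done
  return 1
}

load_with_retry registry.example.com/app-a:dev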

Quick update: 6 days ago we added retries. I think the logic in my change is wrong and it doesn't always retry, but anecdotally things seem to have improved. We still have not updated past v0.5.1.

So I think for now this is mostly mitigated

Actually, I just realized that in the one test where I have seen an error, the retry was broken. So I think I have never seen it fail with 3 retries.

ACK, thanks!
I'll send you a ping when 0.6 is ready and out; we're continuing to focus on identifying any stability weak points and eliminating them. I think the last known one now is log export issues, which I'll tackle shortly. We have a lot more CI signal and a better ability to ship the latest containerd improvements now, which should help.

Forgot to ping with this coming out in the KubeCon chaos :/

Any sign of this with 0.6 images?
We're keeping up to date with containerd's latest 1.3.X versions now as we develop kind.

I have not seen any issues loading images in at least a month, but we do have retries now that may be masking errors.

Going to close this for now, but keeping an eye out for signs of this.
