What happened:
We occasionally see failures loading images with kind load. From a rough grep, this is impacting roughly 3% of our PRs. Note that each PR runs ~20 tests and loads ~10 images, and may be rerun many times due to test failures, new commits, etc. So this probably works out to kind load failing on only about 0.003% of individual loads, I guess?
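For illustration, a rough back-of-envelope version of that estimate; the per-run numbers come from above, while the rerun count is only a guess chosen to roughly match the quoted figure:

```sh
# Back-of-envelope estimate; the rerun count is an assumption, not a measured number.
awk 'BEGIN {
  pr_failure_rate = 0.03      # ~3% of PRs hit a failed load
  loads_per_run   = 20 * 10   # ~20 tests, each loading ~10 images
  reruns_per_pr   = 5         # rough guess at reruns from flakes / new commits
  per_load = pr_failure_rate / (loads_per_run * reruns_per_pr)
  printf "estimated per-load failure rate: %.3f%%\n", per_load * 100
}'
```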
What you expected to happen:
Ideally, kind load would be more robust and not hit these errors. But if that is not feasible, it would be nice to have better logging/error messages, and maybe retries? I'm not sure, as I don't yet understand the root cause.
How to reproduce it (as minimally and precisely as possible):
It's a very intermittent failure, so I'm not sure we can reproduce it easily. I can, however, point you to a bunch of logs:
We do everything with loglevel=debug and dump the kind logs in Artifacts, so hopefully everything is there. I didn't really look through the logs much since I don't know what to look for, but I'm happy to look deeper if I'm pointed in the right direction.
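For reference, a sketch of how such a log dump might be collected with the 0.5.x CLI; the cluster name and output path are placeholders, not the actual CI configuration:

```sh
# Hypothetical cluster name and artifacts path.
kind export logs ./artifacts/kind-logs --name test-cluster --loglevel debug
```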
Anything else we need to know?:
As mentioned, this is pretty rare. A 99.999% pass rate is pretty solid, so I wouldn't be too disappointed if nothing can be done here.
Environment:
kind version: 0.5.1
kubectl version: running kind on GKE 1.13; I think all of these are spinning up 1.15 clusters
docker info: 18.06.1
/etc/os-release: COS

> But if that is not feasible, it would be nice to have better logging/error messages, and maybe retries? I'm not sure, as I don't yet understand the root cause.
Retries is a good idea to explore, though ideally this doesn't fail :/
Logging is substantially more powerful in HEAD, -v 1 or greater will result in a stack trace being logged on failure, along with the command output if the failure's cause is executing a command.
Have we experimented yet with combining the 10 images with docker save ... and then using kind load image-archive instead of kind load docker-image?
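For concreteness, a sketch of that suggestion with placeholder image and cluster names; the -v flag only applies to newer kind (HEAD at the time), so it would be dropped on v0.5.x:

```sh
# Bundle the images into one archive, then load the archive once per cluster.
docker save -o /tmp/ci-images.tar registry.example/app-a:dev registry.example/app-b:dev

# -v 1 enables the richer failure logging described above (newer kind only).
kind load image-archive /tmp/ci-images.tar --name test-cluster -v 1
```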
I will try the save-everything-to-one-tar approach this week and see if things improve. Thanks!
The containerd log file contains several errors about failing to load the images.
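If someone wants to repeat that check, something like the following should surface those entries; the artifacts path is a placeholder, and the per-node containerd.log layout is what kind export logs typically produces:

```sh
# Assuming logs were collected with `kind export logs ./artifacts/kind-logs`;
# each node directory in the bundle typically contains a containerd.log.
grep -iE "error|failed" ./artifacts/kind-logs/*/containerd.log
```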
Maybe loading 10 images into containerd in parallel is too much and we should cap the parallelism?
Seems plausible, I'll drop the parallelism a bit and see what happens.
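A sketch of what dropping the parallelism entirely looks like, loading the images one at a time; image and cluster names are placeholders:

```sh
# Load images serially instead of all at once.
for img in registry.example/app-a:dev registry.example/app-b:dev; do
  kind load docker-image "$img" --name test-cluster
done
```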
It's also possible we've picked up a containerd fix with the new containerd infra; we should have much more recent containerd builds going forward (e.g. in HEAD of kind we're on the latest stable release + backports).
I switched to loading images 1 at a time instead of in parallel, and at the same time our testing load roughly doubled due to an incoming release. Load failures seem about the same if not a little bit worse.
Sounds like the next step would be to try the newer versions of kind. I was planning to wait for v0.6.0, would you suggest we just switch to master now?
If you pin a particular commit, master as of this moment is probably a good choice. I'm intending to get v0.6.0 out soon-ish, but possibly not fast enough to resolve this.
How horrendous would it be to add a retry? It would flake to being slower, but possibly succeed instead of failing outright.
considering a similar trade-off for #949
A retry is a good option and seems a worthwhile tradeoff. I'll try that out, and if I see it again I'll update to some commit on master. Thanks!
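For reference, a minimal sketch of the kind of retry wrapper being discussed, with placeholder image and cluster names; this is not the actual CI change described below:

```sh
# Attempt each load up to 3 times before giving up.
load_with_retry() {
  local img=$1 attempt
  for attempt in 1 2 3; do
    kind load docker-image "$img" --name test-cluster && return 0
    echo "kind load failed for ${img} (attempt ${attempt}), retrying..." >&2
    sleep 5
  done
  return 1
}

load_with_retry registry.example/app-a:dev
```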
Quick update: 6 days ago we added retries. I think the logic in my change is wrong and it doesn't always retry, but anecdotally things seem to have improved. We still have not updated past v0.5.1.
So I think for now this is mostly mitigated.
Actually, I just realized that in the one test where I have seen an error, the retry was broken. So I think I have never seen it fail with 3 retries.
ACK, thanks!
I'll send you a ping when 0.6 is ready and out; I'm continuing to focus on identifying any stability weak points and eliminating them. I think the last known one now is log export issues, which I'll tackle shortly. We have a lot more CI signal and a better ability to ship the latest containerd improvements now, which should help.
Forgot to ping when this came out in the KubeCon chaos :/
Any sign of this with 0.6 images?
We're keeping up to date with containerd's latest 1.3.X versions now as we develop kind.
I have not seen any issues loading images in at least a month, but we do have retries now that may be masking errors.
Going to close this for now, but keeping an eye out for signs of this.