Pipelines: Pipeline steps that use alpine image can become stuck when DockerHub is overloaded: ImagePullBackOff: Back-off pulling image "alpine"

Created on 28 Jun 2020 · 18 Comments · Source: kubeflow/pipelines

What steps did you take:

I'm running a pipeline I made where the first pipeline step loads from a cache. Approximately every other run, the step hangs.

What happened:

The step displays errors that oscillate between these three messages:

This step is in Pending state with this message: ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: error parsing HTTP 429 response body: invalid character 'T' looking for beginning of value: "Too Many Requests (HAP429).\n"
This step is in Pending state with this message: ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: unauthorized: authentication required
This step is in Pending state with this message: ImagePullBackOff: Back-off pulling image "alpine"

Environment:

Kubeflow pipelines standalone v0.5 on prem

KFP version: 0.5

KFP SDK version: 0.5.1

/area backend

area/execution_cache priority/p1 status/triaged

All 18 comments

This is pretty strange. The cached steps just use the official alpine container image and exit immediately (see the Pod spec).
I have not experienced this error before.
Searching online, this seems to be an overall DockerHub issue. Maybe they're experiencing increasing load right now. https://github.com/docker/hub-feedback/issues/1907

This does not seem to be caused by caching (the only role of caching here is using the official alpine image).

As a workaround, maybe you can set up some on-prem image caching or mirroring so that DockerHub failures do not affect your workloads.

Gotcha, how would I do that?

I found a couple of links that look relevant:
https://cloud.google.com/container-registry/docs/using-dockerhub-mirroring
https://docs.docker.com/registry/recipes/mirror/
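
For nodes that use Docker as the container runtime, the mirror recipe linked above comes down to listing the mirror under `registry-mirrors` in the daemon's `daemon.json`. A minimal sketch of that edit, assuming a hypothetical mirror URL (`https://mirror.example.internal`) and a hypothetical helper name; the Docker daemon has to be restarted afterwards, and other runtimes are configured differently:

```python
import json


def add_registry_mirror(daemon_config, mirror_url):
    """Return a copy of a daemon.json dict with mirror_url listed as a registry mirror."""
    updated = dict(daemon_config)
    # Keep any mirrors that are already configured and avoid duplicates.
    mirrors = list(updated.get("registry-mirrors", []))
    if mirror_url not in mirrors:
        mirrors.append(mirror_url)
    updated["registry-mirrors"] = mirrors
    return updated


if __name__ == "__main__":
    # Hypothetical mirror address; replace with the on-prem mirror's URL.
    print(json.dumps(add_registry_mirror({}, "https://mirror.example.internal"), indent=2))
```

The printed JSON is what would go into `/etc/docker/daemon.json` on each node, merged with whatever settings are already there.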

P.S. The issue will probably go away soon as DockerHub recovers. Please tell us if it remains an issue.

How would I point the caching mechanism to use an image from our on-prem private docker image repo?

@Ark-kun shall we make the caching image configurable?

> @Ark-kun shall we make the caching image configurable?

Great idea!
We can use env variable for that.

This would be great. I'm seeing drastic slowdowns in my pipeline runs because of this. Some seem to just be hanging forever... any chance y'all could provide a timeline for this feature?

> I'm seeing drastic slowdowns in my pipeline runs because of this.

Can you try contacting the cluster administrator and asking why the official images cannot be accessed? You're probably having the same issue with other official images like python.

P.S. You can easily disable caching by setting some_task.execution_options.caching_strategy.max_cache_staleness = "P0D" on your first/root tasks.
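
A minimal sketch of that setting, wrapped in a small helper so it can be applied to several root tasks. The helper name `disable_caching` is hypothetical; it only assumes the task object exposes `execution_options.caching_strategy.max_cache_staleness`, as in the line above:

```python
def disable_caching(task):
    """Set the task's maximum cache staleness to zero so cached results are never reused."""
    # "P0D" is an ISO 8601 duration of zero days.
    task.execution_options.caching_strategy.max_cache_staleness = "P0D"
    return task
```

Inside a pipeline function this might be used as `disable_caching(load_step)`, where `load_step` stands for whatever task the pipeline created for its first step.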

Well for most things I definitely want to pull from the cache, since the step in question is lengthy. So I won't want to disable the cache.

Is this issue related to the fact that anonymous dockerhub pulls are rate-limited?

https://docs.docker.com/docker-hub/download-rate-limit/
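
One way to check whether a node is hitting the limit is to read the rate-limit headers that Docker Hub returns for a test image. The sketch below assumes the token endpoint, the `ratelimitpreview/test` repository, and the `ratelimit-limit` / `ratelimit-remaining` header names as I understand them from the page linked above; please verify them there. It has to run from the affected node's network, since anonymous limits are counted per IP.

```python
import json
import urllib.request

# Endpoints as described in Docker's rate-limit documentation (treat as assumptions).
TOKEN_URL = ("https://auth.docker.io/token"
             "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull")
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"


def parse_ratelimit(value):
    """Parse a header value such as "100;w=21600" into (count, window_seconds)."""
    count, _, rest = value.partition(";")
    window = None
    for part in rest.split(";"):
        key, _, val = part.strip().partition("=")
        if key == "w" and val:
            window = int(val)
    return int(count), window


def fetch_ratelimit():
    """Return the anonymous pull limit and remaining pulls for this IP (needs network)."""
    with urllib.request.urlopen(TOKEN_URL) as resp:
        token = json.load(resp)["token"]
    req = urllib.request.Request(
        MANIFEST_URL, method="HEAD", headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(req) as resp:
        headers = resp.headers
        return {
            name: parse_ratelimit(headers[name])
            for name in ("ratelimit-limit", "ratelimit-remaining")
            if headers[name] is not None
        }
```

If `ratelimit-remaining` is at or near zero, the HTTP 429 errors in the original report are most likely the rate limit rather than a DockerHub outage.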

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

This is still an issue for me, running KFP 1.0.4 standalone. Did anything come of the idea of using env vars to configure the caching image?

/reopen

@JakeTheWise: Reopened this issue.

@Ark-kun I think this is of much higher priority because of docker hub rate limiting

We should change the default image now

I had the same problem.
Kubeflow's cached steps use the official alpine image, so Docker Hub's new rate-limit policy causes a toomanyrequests error.

I solved it as follows:

  1. Sign up for a Docker Hub account and create an imagePullSecret from that account (create-an-imagepullsecret).
  2. option a) Patch the default-editor ServiceAccount in your working namespace to add the imagePullSecret (add-image-pull-secret-to-service-account).
    option b) Or set the imagePullSecret in the pipeline code (set_image_pull_secrets).

I think this problem is inevitable as long as Kubeflow uses the alpine image and the Kubeflow admin doesn't set a default imagePullSecret.
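
Steps 1 and 2a above can be sketched as two Kubernetes objects built in Python. This is only an illustration: the secret name `dockerhub-cred`, the namespace, and the credentials are hypothetical placeholders, and the helper names are mine. The Secret uses the standard `kubernetes.io/dockerconfigjson` type.

```python
import base64
import json


def dockerhub_pull_secret(name, namespace, username, token,
                          registry="https://index.docker.io/v1/"):
    """Build a kubernetes.io/dockerconfigjson Secret manifest as a dict."""
    # The "auth" field is base64("username:password").
    auth = base64.b64encode(f"{username}:{token}".encode()).decode()
    docker_config = {
        "auths": {registry: {"username": username, "password": token, "auth": auth}}
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {
            # Secret data values must themselves be base64-encoded.
            ".dockerconfigjson": base64.b64encode(
                json.dumps(docker_config).encode()
            ).decode()
        },
    }


def service_account_patch(secret_name):
    """Patch body that attaches the pull secret to a ServiceAccount."""
    return {"imagePullSecrets": [{"name": secret_name}]}


if __name__ == "__main__":
    # Placeholder values; substitute a real namespace and Docker Hub credentials.
    secret = dockerhub_pull_secret("dockerhub-cred", "kubeflow", "example-user", "example-token")
    print(json.dumps(secret, indent=2))
    print(json.dumps(service_account_patch("dockerhub-cred")))
```

The first document can be saved to a file and applied with `kubectl apply -f`. The second can be passed to `kubectl patch serviceaccount default-editor -n <namespace> -p '<json>'`.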

assign
