Origin: Timeout when pulling Docker images taking more than 1 minute to extract

Created on 27 Feb 2017 · 23 comments · Source: openshift/origin

Version

$ oc version
oc v1.4.1+3f9807a
kubernetes v1.4.0+776c994

OpenShift/Kubernetes fails to pull images whose layers take more than one minute to extract.

$ oc get events -w
Pod                                                   Normal    Scheduled           {default-scheduler }             Successfully assigned gitlab-ee-1-3jso0 to oonodedev-001
Pod                     spec.containers{gitlab-ee}    Normal    Pulling             {kubelet oonodedev-001}   pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff"
Pod                   Warning   FailedSync   {kubelet oonodedev-001}   Error syncing pod, skipping: failed to "StartContainer" for "gitlab-ee" with ErrImagePull: "net/http: request canceled"
Pod       spec.containers{gitlab-ee}   Warning   Failed    {kubelet oonodedev-001}   Failed to pull image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff": net/http: request canceled

and in the Origin logs:

Feb 24 15:21:45 oonodedev-001 origin-node[20126] kube_docker_client.go:313] Cancel pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff" because of no progress for 1m0s, latest progress: "ac990a380700: Extracting [==================================================>] 288.7 MB/288.7 MB"

The last layer of this particular image (i.e. gitlab/gitlab-ee:8.16.4-ee.0) takes several minutes to extract, so with the default timeout of 1 minute the pull never goes through. A plain docker pull works fine.

The one-minute value seems to come from defaultImagePullingStuckTimeout (ref. https://github.com/kubernetes/kubernetes/blob/v1.4.0/pkg/kubelet/dockertools/kube_docker_client.go#L81), which is hardcoded and can't be changed. I also see this has been changed in Kubernetes 1.6, where the value looks to be customizable.
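The mechanism behind that log line is essentially a watchdog: the kubelet cancels the pull if no new progress arrives within the deadline. A minimal illustrative sketch in shell (this is not kubelet code; run_with_progress_deadline is made up for this example):

```shell
#!/usr/bin/env bash
# Illustrative sketch only: cancel a command when its output makes no
# progress for DEADLINE seconds, the same idea as the kubelet's
# defaultImagePullingStuckTimeout.
run_with_progress_deadline() {
  # usage: run_with_progress_deadline SECONDS command [args...]
  local deadline=$1; shift
  "$@" | while :; do
    if IFS= read -r -t "$deadline" line; then
      printf '%s\n' "$line"   # a new line of output counts as progress
    else
      rc=$?
      # read exits with status > 128 on timeout, 1 on normal EOF
      if [ "$rc" -gt 128 ]; then
        printf 'cancel: no progress for %ss\n' "$deadline"
      fi
      break
    fi
  done
}
```

With a one-minute deadline, a layer whose extraction emits nothing new for over a minute gets cancelled even though the pull is perfectly healthy, which is exactly the failure reported above.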

Could you suggest a possible workaround for the time being? If not, could we increase the default timeout (to something like 10 minutes) and backport it to Origin 1.4 and Origin 1.5?

component/kubernetes kind/bug priority/P2

Most helpful comment

@alikhajeh1 @bbrfkr @rickbliss @yanhongwang

For Origin 3.6 you can set image-pull-progress-deadline to a meaningful value (e.g. 10m) in the KubeletArguments section of the node-config.yaml of all your nodes.

This is working for us.

All 23 comments

@derekwaynecarr setting to p1 for triage in case we need to pull this in to 1.5 before the close.

cc @mfojtik @legionus

For reference, the PR change for Kubernetes 1.6 is here:
https://github.com/kubernetes/kubernetes/pull/36887

For the 1.4 and 1.5 releases, changing the default timeout from 1 to 10 minutes may help this situation but would potentially hurt other situations. Is it possible for the large image to be pre-pulled to your nodes instead in the interim?
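One way to pre-pull (a hypothetical helper, not something from this thread; node names and SSH access are assumptions about your environment) is to run docker pull directly on each node, since a plain docker pull is not subject to the kubelet's progress deadline:

```shell
#!/usr/bin/env bash
# Hypothetical helper: pre-pull an image on a list of nodes over SSH.
# A plain docker pull on the node bypasses the kubelet's one-minute
# progress deadline, so slow extraction is not a problem.
prepull() {
  # usage: prepull IMAGE NODE [NODE...]
  local image=$1; shift
  local node
  for node in "$@"; do
    ssh "$node" docker pull "$image"
  done
}

# Example (node name taken from the events above):
# prepull 'gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff' oonodedev-001
```

Once the image is already on the node, the kubelet pull completes immediately and never hits the deadline.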

Well, it is not only happening with this particular image.

We have seen this issue already in two images (the gitlab image mentioned above and one generated using S2I for one of our users), so I am afraid we could see this again in the future.

The same thing happens to me with all images larger than 150 MB or with more than 5 layers (for example, the official tomcat image). So we can't increase this timeout?

I am facing the same issue, and the error seems to be random with Kubernetes 1.6. Here is what I observed; an explanation is appreciated:

  1. Kubernetes needs to pull 5 images from the internet, and only the postgres one failed. I can only fix it by deleting and recreating the pod manually; it becomes Running right away.
  2. Kubernetes needs to pull 10+ images from a local private registry, and only one image failed. However, Kubernetes retried later and succeeded without issue.

Error syncing pod, skipping: failed to "StartContainer" for "postgresql" with ErrImagePull: "net/http: request canceled"

24m 24m 1 kubelet, master0 spec.containers{postgresql} Normal BackOff Back-off pulling image "sameersbn/postgresql:9.6-2"
24m 24m 1 kubelet, master0 Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "postgresql" with ImagePullBackOff: "Back-off pulling image \"sameersbn/postgresql:9.6-2\""

+1

+1
Happens to me on 1.6.2 using kops

From what I was able to strace, I saw that the pull is stuck on FUTEX_WAIT, so some other process is deadlocking it.

+1
Happens to me on 1.5.7 using kops.
I am getting ErrImagePull: "net/http: request canceled" when it tries to get the image from AWS ECR.
Any ideas, guys?

I think (but I'm not sure) it happened to me because I only had one node. Again, I'm not sure, because I deleted and rebuilt my cluster yesterday.

Changing the EC2 nodes from t2.medium to m3.large fixed the problem.

Also ran into this issue with the GitLab image.

Is there an option to customize the timeout in Origin 3.6? I couldn't find anything in the docs for it, but maybe I was searching for the wrong things.

+1

+1

+1

@alikhajeh1 @bbrfkr @rickbliss @yanhongwang

For Origin 3.6 you can set image-pull-progress-deadline to a meaningful value (e.g. 10m) in the KubeletArguments section of the node-config.yaml of all your nodes.

This is working for us.

Actually, I am happy to close the issue now that this is configurable in Origin 3.6.

@AlbertoPeon so in KubeletArguments , we set image-pull-progress-deadline=10m?

@xqianwang
Yes. We can set the parameter image-pull-progress-deadline in /etc/origin/node/node-config.yaml as follows:

kubeletArguments:
  image-pull-progress-deadline:
  - "10m"

This description works fine in my OpenShift Origin environment.
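To confirm the setting is in place before restarting (the config path and service name below are the defaults on a standard Origin install and may differ on yours):

```shell
# Default node-config path on a standard install; adjust if yours differs.
CFG=${CFG:-/etc/origin/node/node-config.yaml}
grep -A2 'image-pull-progress-deadline' "$CFG" || echo "deadline not set in $CFG"
# Then restart the node service so the kubelet picks up the change, e.g.:
# systemctl restart origin-node
```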

@bbrfkr Thanks a lot!

Does this still have to be set manually? I'm seeing this in 3.7 and was wondering if it exists as a configurable in the Ansible inventory.

Yes, you can set it in openshift_node_kubelet_args. Note this has to be JSON-formatted, so something like:

openshift_node_kubelet_args='{"image-pull-progress-deadline":["10m"]}'