Origin: Timeout when pulling Docker images taking more than 1 minute to extract

Created on 27 Feb 2017 · 23 comments · Source: openshift/origin

Version

$ oc version
oc v1.4.1+3f9807a
kubernetes v1.4.0+776c994

OpenShift/Kubernetes fails to pull images whose layers take more than one minute to extract.

$ oc get events -w
Pod                                                   Normal    Scheduled           {default-scheduler }             Successfully assigned gitlab-ee-1-3jso0 to oonodedev-001
Pod                     spec.containers{gitlab-ee}    Normal    Pulling             {kubelet oonodedev-001}   pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff"
Pod                   Warning   FailedSync   {kubelet oonodedev-001}   Error syncing pod, skipping: failed to "StartContainer" for "gitlab-ee" with ErrImagePull: "net/http: request canceled"
Pod       spec.containers{gitlab-ee}   Warning   Failed    {kubelet oonodedev-001}   Failed to pull image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff": net/http: request canceled

and in the Origin logs:

Feb 24 15:21:45 oonodedev-001 origin-node[20126] kube_docker_client.go:313] Cancel pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff" because of no progress for 1m0s, latest progress: "ac990a380700: Extracting [==================================================>] 288.7 MB/288.7 MB"

The last layer of this particular image (i.e. gitlab/gitlab-ee:8.16.4-ee.0) takes several minutes to extract, so with the default timeout of 1 minute the pull never goes through. A plain docker pull works fine.

The one-minute value seems to come from defaultImagePullingStuckTimeout (ref. https://github.com/kubernetes/kubernetes/blob/v1.4.0/pkg/kubelet/dockertools/kube_docker_client.go#L81), which is hardcoded and can't be changed. I also see this has been changed in Kubernetes 1.6, where the value looks to be customizable.
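The mechanism behind that log line is essentially a watchdog: the kubelet cancels the pull if no new progress arrives within the deadline. A minimal illustrative sketch in shell (this is not kubelet code; run_with_progress_deadline is made up for this example):

```shell
#!/usr/bin/env bash
# Illustrative sketch only: cancel a command when its output makes no
# progress for DEADLINE seconds, the same idea as the kubelet's
# defaultImagePullingStuckTimeout.
run_with_progress_deadline() {
  # usage: run_with_progress_deadline SECONDS command [args...]
  local deadline=$1; shift
  "$@" | while :; do
    if IFS= read -r -t "$deadline" line; then
      printf '%s\n' "$line"   # a new line of output counts as progress
    else
      rc=$?
      # read exits with status > 128 on timeout, 1 on normal EOF
      if [ "$rc" -gt 128 ]; then
        printf 'cancel: no progress for %ss\n' "$deadline"
      fi
      break
    fi
  done
}
```

With a one-minute deadline, a layer whose extraction emits nothing new for over a minute gets cancelled even though the pull is perfectly healthy, which is exactly the failure reported above.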

Could you suggest a possible workaround for the time being? If not, could we increase the default timeout (to something like 10 minutes) and backport it to Origin 1.4 and Origin 1.5?

component/kubernetes kind/bug priority/P2

Most helpful comment

@alikhajeh1 @bbrfkr @rickbliss @yanhongwang

For Origin 3.6 you can set image-pull-progress-deadline to a meaningful value (e.g. 10m) in the KubeletArguments section of the node-config.yaml of all your nodes.

This is working for us.

All 23 comments

@derekwaynecarr setting to p1 for triage in case we need to pull this in to 1.5 before the close.

cc @mfojtik @legionus

For reference, the PR change for Kubernetes 1.6 is here:
https://github.com/kubernetes/kubernetes/pull/36887

For the 1.4 and 1.5 releases, changing the default timeout from 1 to 10 minutes may help this situation but would potentially hurt other situations. Is it possible for the large image to be pre-pulled to your nodes instead in the interim?
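One way to pre-pull (a hypothetical helper, not something from this thread; node names and SSH access are assumptions about your environment) is to run docker pull directly on each node, since a plain docker pull is not subject to the kubelet's progress deadline:

```shell
#!/usr/bin/env bash
# Hypothetical helper: pre-pull an image on a list of nodes over SSH.
# A plain docker pull on the node bypasses the kubelet's one-minute
# progress deadline, so slow extraction is not a problem.
prepull() {
  # usage: prepull IMAGE NODE [NODE...]
  local image=$1; shift
  local node
  for node in "$@"; do
    ssh "$node" docker pull "$image"
  done
}

# Example (node name taken from the events above):
# prepull 'gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff' oonodedev-001
```

Once the image is already on the node, the kubelet pull completes immediately and never hits the deadline.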

Well, it is not only happening with this particular image.

We have seen this issue already in two images (the gitlab image mentioned above and one generated using S2I for one of our users), so I am afraid we could see this again in the future.

The same thing happens to me with all images larger than 150 MB or with more than 5 layers (for example, the official tomcat image). So we can't increase this timeout?

I am facing the same issue, and the error seems to be random with Kubernetes 1.6. Here is what I observed; an explanation is appreciated:

  1. Kubernetes needs to pull 5 images from the internet, and only the postgres one failed. I can only fix it by deleting and recreating the pod manually; it becomes Running right away.
  2. Kubernetes needs to pull 10+ images from a local private registry, and only one image failed. However, Kubernetes retried later and succeeded without issue.

Error syncing pod, skipping: failed to "StartContainer" for "postgresql" with ErrImagePull: "net/http: request canceled"

24m 24m 1 kubelet, master0 spec.containers{postgresql} Normal BackOff Back-off pulling image "sameersbn/postgresql:9.6-2"
24m 24m 1 kubelet, master0 Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "postgresql" with ImagePullBackOff: "Back-off pulling image \"sameersbn/postgresql:9.6-2\""

+1

+1
Happens to me on 1.6.2 using kops

From what I was able to strace, I saw that the pull is stuck on FUTEX_WAIT, so some other process is deadlocking it.

+1
Happens to me on 1.5.7 using kops.
I am getting ErrImagePull: "net/http: request canceled" when it tries to get the image from AWS ECR.
Any ideas, guys?

I think (but I'm not sure) it happened to me because I only had one node. Again, I'm not sure, because I deleted and rebuilt my cluster yesterday.

Changing the EC2 nodes from t2.medium to m3.large fixed the problem.

Also ran into this issue with the GitLab image.

Is there an option to customize the timeout in Origin 3.6? I couldn't find anything in the docs for it, but maybe I was searching for the wrong things.

+1

+1

+1

@alikhajeh1 @bbrfkr @rickbliss @yanhongwang

For Origin 3.6 you can set image-pull-progress-deadline to a meaningful value (e.g. 10m) in the KubeletArguments section of the node-config.yaml of all your nodes.

This is working for us.

Actually, I am happy to close the issue now that this is configurable in Origin 3.6.

@AlbertoPeon so in KubeletArguments , we set image-pull-progress-deadline=10m?

@xqianwang
Yes. We can set the parameter image-pull-progress-deadline in /etc/origin/node/node-config.yaml as follows:

kubeletArguments:
  image-pull-progress-deadline:
  - "10m"

This description works fine in my OpenShift Origin environment.
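To confirm the setting is in place before restarting (the config path and service name below are the defaults on a standard Origin install and may differ on yours):

```shell
# Default node-config path on a standard install; adjust if yours differs.
CFG=${CFG:-/etc/origin/node/node-config.yaml}
grep -A2 'image-pull-progress-deadline' "$CFG" || echo "deadline not set in $CFG"
# Then restart the node service so the kubelet picks up the change, e.g.:
# systemctl restart origin-node
```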

@bbrfkr Thanks a lot!

Does this still have to be set manually? I'm seeing this in 3.7 and was wondering if it exists as a configurable in the Ansible inventory.

Yes, you can set it in openshift_node_kubelet_args. Note this has to be JSON-formatted, so something like:

openshift_node_kubelet_args='{"image-pull-progress-deadline":["10m"]}'