$ oc version
oc v1.4.1+3f9807a
kubernetes v1.4.0+776c994
OpenShift/Kubernetes fails to pull images whose layers take more than one minute to extract.
$ oc get events -w
Pod Normal Scheduled {default-scheduler } Successfully assigned gitlab-ee-1-3jso0 to oonodedev-001
Pod spec.containers{gitlab-ee} Normal Pulling {kubelet oonodedev-001} pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff"
Pod Warning FailedSync {kubelet oonodedev-001} Error syncing pod, skipping: failed to "StartContainer" for "gitlab-ee" with ErrImagePull: "net/http: request canceled"
Pod spec.containers{gitlab-ee} Warning Failed {kubelet oonodedev-001} Failed to pull image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff": net/http: request canceled
and in the Origin logs:
Feb 24 15:21:45 oonodedev-001 origin-node[20126] kube_docker_client.go:313] Cancel pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff" because of no progress for 1m0s, latest progress: "ac990a380700: Extracting [==================================================>] 288.7 MB/288.7 MB"
The last layer of this particular image (i.e. gitlab/gitlab-ee:8.16.4-ee.0) takes several minutes to extract, and with the default timeout of 1 minute the pull never goes through. A plain docker pull works fine.
The one-minute value seems to come from defaultImagePullingStuckTimeout (ref. https://github.com/kubernetes/kubernetes/blob/v1.4.0/pkg/kubelet/dockertools/kube_docker_client.go#L81), which is hardcoded and can't be changed. I also see this has been reworked in Kubernetes 1.6, where the value looks to be customizable.
Could you suggest a possible workaround for the time being? If not, could we increase the default timeout (to something like 10 minutes) and backport it to Origin 1.4 and Origin 1.5?
@derekwaynecarr setting to p1 for triage in case we need to pull this in to 1.5 before the close.
cc @mfojtik @legionus
For reference, the PR change for Kubernetes 1.6 is here:
https://github.com/kubernetes/kubernetes/pull/36887
For the 1.4 and 1.5 releases, changing the default timeout from 1 to 10 minutes may help this situation but would potentially hurt other situations. Is it possible for the large image to be pre-pulled to your nodes instead in the interim?
Well, it is not only happening with this particular image.
We have seen this issue already in two images (the gitlab image mentioned above and one generated using S2I for one of our users), so I am afraid we could see this again in the future.
Same happening to me with all images larger than 150 MB or with more than 5 layers (for example the official tomcat image). So we can't increase this timeout?
I am facing the same issue, and it seems this error is random with Kubernetes 1.6. Here is what I observed; an explanation is appreciated:
Error syncing pod, skipping: failed to "StartContainer" for "postgresql" with ErrImagePull: "net/http: request canceled"
24m 24m 1 kubelet, master0 spec.containers{postgresql} Normal BackOff Back-off pulling image "sameersbn/postgresql:9.6-2"
24m 24m 1 kubelet, master0 Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "postgresql" with ImagePullBackOff: "Back-off pulling image \"sameersbn/postgresql:9.6-2\""
+1
+1
Happens to me on 1.6.2 using kops
From what I was able to strace, I saw that the pull is stuck on FUTEX_WAIT, so some other process is deadlocking it.
+1
Happens to me on 1.5.7 using kops.
I am getting ErrImagePull: "net/http: request canceled" when it tries to get the image from AWS ECR.
Any ideas guys?
I think (but I'm not sure) it happened to me because I only had 1 node; I can't confirm, because I deleted and rebuilt my cluster yesterday.
Changing the EC2 nodes from t2.medium to m3.large fixed the problem.
Also ran into this issue with the GitLab image.
Is there an option to customize the timeout in Origin 3.6? I couldn't find anything in the docs, but maybe I was searching for the wrong things.
+1
+1
+1
@alikhajeh1 @bbrfkr @rickbliss @yanhongwang
For Origin 3.6 you can set image-pull-progress-deadline to a meaningful value (e.g. 10m) in the kubeletArguments section of the node-config.yaml on all your nodes.
This is working for us.
Actually, I am happy to close the issue now that this is configurable in Origin 3.6.
@AlbertoPeon so in kubeletArguments, we set image-pull-progress-deadline=10m?
@xqianwang
Yes. We can set the parameter image-pull-progress-deadline in /etc/origin/node/node-config.yaml as follows:

kubeletArguments:
  image-pull-progress-deadline:
  - "10m"

This works fine in my OpenShift Origin environment.
@bbrfkr Thanks a lot!
Does this still have to be set manually? I'm seeing this in 3.7 and was wondering if it exists as a configurable in the ansible inventory.
Yes, you can set it in openshift_node_kubelet_args. Note this has to be JSON-formatted, so something like:

openshift_node_kubelet_args='{"image-pull-progress-deadline":["10m"]}'