Our deployment test suite has started flaking more frequently because deployer pods need more time to become ready. We don't have much useful data, since all of our flakes surface as test timeouts (the tests are time-bounded).
See:
https://github.com/openshift/origin/pull/11001
and related flakes:
https://github.com/openshift/origin/issues/10951
https://github.com/openshift/origin/issues/11008
https://github.com/openshift/origin/issues/10989
cc: @derekwaynecarr @smarterclayton @mfojtik
Pods appear to be scheduled, but in the cases we've seen there are long delays between starting the infra container and the remaining containers.
There was a similar BZ reported in kube 1.3 when running density tests from 0-100 pods on AWS.
@timothysc's team was going to investigate that issue to root cause. This looks the same. At the time, we suspected there were global locking issues in openshift-sdn.
/cc @eparis
We did actually manage to eliminate the SDN as a suspect, however, before we go down that route again....
After some investigation today (when SDN was not in use) it may have been correlated to throttling on secret retrievals. In a local environment (bone stock origin 1.4.0-alpha.0) I was able to easily reproduce long delays when multiple pods are being scheduled. Calls to docker appeared to be fast, but the kubelet itself was reporting kubelet_docker_operation_latencies in the tens of seconds even for the 50th percentile (the 90th and 99th were only a few seconds higher).
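For anyone trying to confirm this locally, the kubelet exposes those latency summaries on its metrics endpoint. A minimal sketch, assuming the kubelet read-only port (10255) is enabled and that the metric family is named kubelet_docker_operations_latency_microseconds in this version; adjust if either differs:
```sh
# Dump the kubelet's docker-operation latency summaries (values are in microseconds).
# Port and metric name are assumptions for this kubelet era.
curl -s http://localhost:10255/metrics \
  | grep '^kubelet_docker_operations_latency_microseconds' \
  | grep 'quantile="0.5"'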
Here is the BZ in question: https://bugzilla.redhat.com/show_bug.cgi?id=1343196
It's not the scheduler.
Added the test-flake label because other flake issues have been closed/duped in favor of this one.
I see these errors in a separate failed extended image test:
E0920 17:47:28.496490 16730 kubelet.go:1816] Unable to mount volumes for pod "mongodb-1-deploy_extended-test-mongodb-replica-a41h3-f5b9t(49f979b4-7f7b-11e6-a225-0e03779dc447)": timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]; skipping pod
E0920 17:47:28.496507 16730 pod_workers.go:184] Error syncing pod 49f979b4-7f7b-11e6-a225-0e03779dc447, skipping: timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]
I0920 17:47:28.496820 16730 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"extended-test-mongodb-replica-a41h3-f5b9t", Name:"mongodb-1-deploy", UID:"49f979b4-7f7b-11e6-a225-0e03779dc447", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]
I0920 17:47:28.496861 16730 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"extended-test-mongodb-replica-a41h3-f5b9t", Name:"mongodb-1-deploy", UID:"49f979b4-7f7b-11e6-a225-0e03779dc447", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "mongodb-1-deploy_extended-test-mongodb-replica-a41h3-f5b9t(49f979b4-7f7b-11e6-a225-0e03779dc447)": timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]
which seem to confirm the theory that secrets are taking too long
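For future reference, a quick way to check whether the secret volume mounts are the slow step is to look at the pod's events. A rough sketch, reusing the namespace and pod name from the log above:
```sh
# Look for FailedMount / FailedSync events on the deployer pod (names taken from the log above).
oc get events -n extended-test-mongodb-replica-a41h3-f5b9t | grep -E 'FailedMount|FailedSync'
oc describe pod mongodb-1-deploy -n extended-test-mongodb-replica-a41h3-f5b9t
```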
reassign to tstclair as he was planning to root cause during 1.5.
We have a fair amount of info that this is not secrets (in most cases), and that Docker is taking extended periods of time to respond to create calls.
@kargakis got:
/data/src/github.com/openshift/origin/test/extended/deployments/deployments.go:715
Expected an error to have occurred. Got:
/data/src/github.com/openshift/origin/test/extended/deployments/deployments.go:696
is there no timeout we can bump to unblock the merge queue? or disable some of these tests? we're totally blocked by this, almost nothing is merging.
Please keep this issue up to date with fixes.
I switched my docker graph driver from devicemapper to overlay (which also required that I disable docker's selinux support), and the timings were significantly better. I'd say it went from taking 20-40 seconds in between starting the infra and actual containers (on average) to no more than 5 seconds, with an average of probably 1-2.
Note this is not statistically significant, as I only ran the overlay test once. But it's definitely something to investigate (possible contention somewhere in the devicemapper graph driver).
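For reference, this is roughly what the switch looked like, sketched for a RHEL/Fedora-style host where docker's options live in /etc/sysconfig/docker; the file location and option names are assumptions about that setup, and on these docker versions overlay requires disabling docker's SELinux support:
```sh
# Switch the docker graph driver from devicemapper to overlay (sketch).
# In /etc/sysconfig/docker (or however dockerd is launched), set something like:
#   OPTIONS='--storage-driver=overlay'    # and drop --selinux-enabled
# then restart and verify:
systemctl restart docker
docker info | grep 'Storage Driver'
```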
Bumping to p0, this is still blocking merges across the cluster. We need a simple fix in the short term that allows us to stop flaking.
If we have to increase timeouts on certain tests let's do it, but I want the flakes gone.
This test (https://github.com/openshift/origin/blob/8ce6de44e16f506a921de75d1e63b0d2ea49195d/test/extended/deployments/deployments.go#L382) fails on my machine consistently, the problem being that one of the pods of the deployment (https://github.com/openshift/origin/blob/8ce6de44e16f506a921de75d1e63b0d2ea49195d/test/extended/testdata/deployment-simple.yaml) fails the readiness checks. I've tried changing timeouts on readiness and in tests, and sooner or later one of the pods will start failing the readiness checks. Not sure what to check more, but I'll be debugging more... Posting here just to notify about the progress...
Try to use a different fixture that is not strict about readiness (the current test requires both pods to become ready) since this test doesn't test readiness.
The only time I can get this test to always pass is to change the readiness check from an HTTP probe to a TCP probe; not sure if that's desirable.
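To make the change concrete, a sketch of how the probe could be swapped on the fixture's deployment config; the dc name and port are placeholders for the deployment-simple fixture, and the availability of oc set probe with these flags in this origin version is an assumption:
```sh
# Swap the readiness probe from httpGet to a TCP check (sketch; dc name and port are placeholders).
oc set probe dc/deployment-simple --readiness --open-tcp=8080
```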
I don't understand why that would make a difference?
I don't understand why that would make a difference?
The reason the httpGet probes are failing is usually timeouts on the GET, but even significantly increasing those timeouts (up to 10s) didn't help. Frankly, I have no idea why the two differ; maybe there's some bug in the probes... Will check it out...
Is it because the container is starting and failing and Get actually exposes the app failure?
@mfojtik @soltysh bump. Status here?
I think the original issue description (post 1.4 rebase pod startup times have increased) is not happening because of the rebase. It's because of the upgrade from docker 1.9 to 1.10. Right?
Yup, that's what we've identified. I haven't seen anything else in there.
@ncdc and we should now run all CI on docker 1.12 right? so this should no longer happen.
I vote for closing this issue, and reopening if the problem bites us again.
@mfojtik that will only be true if docker 1.12 is faster than 1.10
@ncdc I smell it is not: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance/8031/testReport/junit/(root)/Extended/_k8s_io__Services_should_serve_multiport_endpoints_from_pods__Conformance_/
@danmcp @smarterclayton maybe we need 1 jenkins job that tests using overlay instead of devicemapper, for comparison
@ncdc @danmcp @smarterclayton or an option to the job where you can choose the storage driver
Choosing is not good, it has to be permanent, i.e. one job running on devicemapper, the other on overlay.
Most deployment flakes are related to this. @mfojtik @smarterclayton not sure if we should extend the deployment timeout any more. To be honest, I would prefer using the default deployment timeout (10m) and be done with this at the expense of some tests becoming slower.
The aggressive timeout doesn't seem to help us. Is there any evidence of that?
@smarterclayton https://github.com/openshift/origin/issues/11685 the deployer pod succeeds around 5min after it started. The deployment pod was scaled up from the start but didn't transition to Ready until ~5min later. We end up failing the test.
Why did it take 5 min? That's completely unexpected.
@rhvgoyal Please take a look at this. Thanks.
This is so high level that I have no idea. So far none of the data suggests that it is a storage issue. If this is a problem, please narrow it down.
@rhvgoyal we have a doc (I need to get a link to it) showing that the devicemapper timings are slower when you go from docker 1.9 to 1.10. I apologize for not having this data handy right now. I believe @soltysh has it somewhere. We'll get it to you as soon as we can.
I'll try to run 1.12 tests and add them to that document. @rhvgoyal where can I get 1.12 binaries/package for F24?
I'm running a combination of http://koji.fedoraproject.org/koji/taskinfo?taskID=16262294 and http://koji.fedoraproject.org/koji/buildinfo?buildID=812817 (but you probably could update the latter to 1.12.3-2).
I have these packages installed:
Given the rate of flakes, we need to do something here.
Probably increasing a few timeouts.
Agreed, I'm seeing this in half of the failures in #11916. The only question is what can we do? I doubt switching to OverlayFS is an option?
It's not.
Increasing a few timeouts seems reasonable. I'll dig into the failures and see which are the most frequent ones and where it makes sense.
This is not a 1.4 blocker, because the longer times are not introduced by origin code but rather by the newer docker version. This document contains a detailed log of the tests performed with different docker versions vs. origin 1.4 and 1.3 vs. different storage drivers (overlayfs and devicemapper).
@jwhonce @rhvgoyal bump - please let us know if there's anything we can do to help you debug this
I've updated the doc with docker 1.12 tests against origin 1.4 and 1.5 (devicemapper only). It looks like origin 1.4 with docker 1.12 is an additional second slower than with 1.10, and origin 1.5 is yet another second slower (time in parentheses is the median):
| | Origin 1.4 + Docker 1.10 | Origin 1.4 + Docker 1.12 | Origin 1.5 + Docker 1.10 | Origin 1.5 + Docker 1.12 |
|---|---|---|---|---|
| CreateContainer | 10.6s (12.9s) | 11.3s (14.1s) | 11.7s (14.8s) | 12.1s (15.3s) |
| runContainer | 12.8s (15.2s) | 11.7s (14.5s) | 12.4s (15.4s) | 12.5s (15.7s) |
EDIT: added origin 1.5 + docker 1.10 times
The doc was updated as well.
/cc @mffiedler ...
still seeing issues from this: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance/9556/
(https://github.com/openshift/origin/issues/9076 was duped to this issue)
@kargakis are we closing in on a solution to this?
sorry, question should be directed to @jwhonce and @soltysh
and if not i'd like to reopen https://github.com/openshift/origin/issues/11649 (which was closed as dupe of this) so a workaround can be put in place because it's cropping up a lot lately.
@bparees for now, increasing timeouts for serial re-appearances is the only solution :/
@soltysh ok, then i don't think we should be duping specific test case flakes to this issue, we need to fix each flake. I'm going to reopen #11649.
Yes, please, it'll help us find those cases.
Latency metrics for node 172.18.13.129
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:2m1.879512s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:1m37.203206s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pleg_relist_latency_microseconds Quantile:0.99 Latency:50.928686s}
Dec 21 15:28:31.743: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:48.959779s}
Dec 21 15:28:31.743: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:48.823987s}
Dec 21 15:28:31.743: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:46.16127s}
Dec 21 15:28:31.743: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:45.374806s}
Dec 21 15:28:31.743: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:45.374806s}
Dec 21 15:28:31.743: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:41.035892s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.99 Latency:40.465985s}
Dec 21 15:28:31.743: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.9 Latency:39.089519s}
Dec 21 15:28:31.743: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.99 Latency:36.979561s}
Dec 21 15:28:31.743: INFO: {Operation:list_containers Method:docker_operations_latency_microseconds Quantile:0.99 Latency:31.90826s}
Dec 21 15:28:31.743: INFO: {Operation:GetPods Method:container_manager_latency_microseconds Quantile:0.99 Latency:31.907478s}
Dec 21 15:28:31.743: INFO: {Operation:inspect_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:25.352457s}
Dec 21 15:28:31.743: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:13.037115s}
Wow.
This is a docker problem. I'm seeing 1/4 conformance runs fail because of this, and start latencies in excess of 2-3 minutes during those runs. Something is broken in Docker, possibly related to the parallel builds hang.
Raising this to P0 - @jwhonce if you can make sure the appropriate investigations are taking place. This would be a production blocker for many use cases (loaded node takes many minutes to start some pods).
There was at least one container that took 4 minutes to start:
Latency metrics for node ci-pr65-ig-n-x0v5.c.openshift-gce-devel-ci.internal
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:4m0.426184s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:2m33.107112s}
Jan 6 00:09:38.454: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:2m15.985041s}
Jan 6 00:09:38.454: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.9 Latency:2m3.638843s}
Jan 6 00:09:38.454: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.99 Latency:2m3.638843s}
Jan 6 00:09:38.454: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:1m36.1392s}
Jan 6 00:09:38.454: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:1m26.463562s}
Jan 6 00:09:38.454: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:49.979557s}
Jan 6 00:09:38.454: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.99 Latency:47.685318s}
Jan 6 00:09:38.454: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:37.203829s}
Jan 6 00:09:38.454: INFO: {Operation:create_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:31.249116s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pleg_relist_latency_microseconds Quantile:0.99 Latency:30.389475s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.99 Latency:24.973853s}
Jan 6 00:09:38.454: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:21.276118s}
Jan 6 00:09:38.454: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:19.079327s}
Jan 6 00:09:38.454: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.9 Latency:17.403305s}
Jan 6 00:09:38.454: INFO: {Operation:create_container Method:docker_operations_latency_microseconds Quantile:0.9 Latency:12.668864s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.9 Latency:12.609544s}
Jan 6 00:09:38.454: INFO: {Operation:remove_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:11.227787s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.5 Latency:10.568633s}
Jan 6 00:09:38.454: INFO: {Operation:list_containers Method:docker_operations_latency_microseconds Quantile:0.99 Latency:10.550709s}
@ncdc Do you have a guess at the number of containers being tracked by the docker daemon when you start seeing the excessive latencies? $(docker info |grep 'Containers:') Thanks.
Running containers are printed in the junit failures. It's generally between 10-30 containers running, including pause containers, with some CPU spikes. In the ones I was seeing, it was a 2 core system so I have to assume that docker was under CPU pressure (probably not memory).
@jwhonce running oc run latency-test --image gcr.io/google_containers/pause:2.0 --replicas=30 will give you those failing containers; it's from the linked doc: https://docs.google.com/document/d/1AaNZTggal-OUjgJah7FV4mWYNS-in5cHTBLf0BDIkh4/edit
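A sketch of how that reproducer can be watched end to end; the run=latency-test label selector and the fields grepped out of the describe output are assumptions about this oc/kubectl era:
```sh
# Start the pause pods and watch how long they take to go Running (sketch).
oc run latency-test --image gcr.io/google_containers/pause:2.0 --replicas=30
oc get pods -l run=latency-test -w
# Afterwards, compare per-pod scheduled vs. started times:
oc describe pods -l run=latency-test | grep -E 'Start Time|Started'
```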
My concern now is that we've moved from "things are 50% slower" to "something is hanging / pausing and timing out", where the former applies to simple containers and the latter is complex things running at the same time (like builds happening at the same time).
I'm finally gathering metrics data from the GCE cluster runs (so that we have prometheus data that can at least tell us if we cross certain thresholds). I'm not sure, however, that the default 0.99 quantile is going to capture outliers - we don't really know whether the 3m is an outlier (the entire system goes into IO inversion and one container stops for 3 minutes) or whether it's better distributed. Hoping to get that with the latest metrics: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_gce/332/ (has links to the .metrics files from each node that are the prometheus captures).
I did increase the GCE SSD sizes, which gives us a small bump in IOPS quota, but we're still seeing some throttles.
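Once those .metrics captures are downloaded, something like the following should surface the worst docker-operation latencies per node; the metric name is from memory and may differ by kubelet version:
```sh
# Find the highest p99 docker-operation latencies across the captured node metrics (sketch).
grep 'kubelet_docker_operations_latency_microseconds' node-*.metrics \
  | grep 'quantile="0.99"' \
  | sort -k2 -rn | head
```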
@vikaslaad and @mffiedler have started a new experiment today and we should have data soon-ish.
This may be https://github.com/openshift/origin/issues/12784 instead (replying to the flake reported at https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/48/).
Even though 2m is an example of this problem
Here's an example of where we waited a very long time and there were no running pods - this would have to be an incredibly slow layer creation to be the docker lock issue:
https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_gce/870/testReport/junit/(root)/Extended/deploymentconfigs_should_respect_image_stream_tag_reference_policy__Conformance_/
I don't see more than a few pods running on any node.
I did a run on overlayFS tonight. I still see:
Mar 6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.5 Latency:2m2.536474s}
Mar 6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.99 Latency:2m2.536474s}
Mar 6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.9 Latency:2m2.536474s}
Mar 6 22:42:09.205: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:51.995394s}
Mar 6 22:42:09.205: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:30.130964s}
Mar 6 22:42:09.205: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:25.470511s}
Mar 6 22:42:09.205: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:23.547013s}
Mar 6 22:42:09.205: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:18.910722s}
Mar 6 22:42:09.205: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:17.537977s}
I'm still seeing IO limits being poked (consistent with devicemapper, actually, although that could just be the tests running faster or docker pulls still being the bulk of the workload).
@ncdc @sjenning @derekwaynecarr @jwhonce I should be able to give you repeatable clusters with overlay tomorrow - what do we need to do to verify that this is the CreateLayer previously fingered as the culprit, vs something more fundamental (say, the Docker daemon being saturated and the kubelet just giving up)? This is looking more like a kubelet / docker interaction.
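One low-level way to separate layer creation from the rest of container start, using only the docker CLI. This is only a sketch of the kind of probe that could help confirm or rule out CreateLayer, and the image is just an example:
```sh
# Time container-layer creation separately from container start (sketch).
docker pull docker.io/centos:7
time docker create --name layer-test docker.io/centos:7 sleep 60
time docker start layer-test
docker rm -f layer-test
```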
xref: https://bugzilla.redhat.com/show_bug.cgi?id=1422511
We are hitting cases where docker image listing takes > 1 min to complete; the machine is not actually memory-starved in the instances where this occurs.
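A trivial way to capture that on an affected node while the tests are running, as a sketch:
```sh
# Time the slow daemon-side listing calls directly (sketch).
time docker images > /dev/null
time docker ps -a > /dev/null
```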
This p0 has been open for 10 months, and inactive for 4. Is it actionable? If not, is it really a p0?
@liggitt I leave the priority to you to set/change. As for news on the issue, I finally have a reproducer that uses only docker and am working with the performance team to further isolate the cause of the slowdowns.
Ops are updating their monitoring: https://trello.com/c/cB9lX5sb. The suspected cause is volume BurstBalance being exhausted.
@jwhonce could you give more details on the reproduction of the issue and is there any update on this?
@xThomo The reproducer ran docker run in multiple threads, which showed that the BurstBalance on the volumes was exhausted. The Ops team is now implementing monitoring of that metric to determine whether that is the root cause for these issues, or if there are other gating factors.
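A sketch of the general shape of such a reproducer (not the actual one), plus a way to watch EBS BurstBalance while it runs; the volume id, region defaults, and image are placeholders:
```sh
# Hit the docker daemon from many clients at once (rough sketch of the described reproducer).
for i in $(seq 1 20); do
  docker run -d --name burst-$i docker.io/centos:7 sleep 300 &
done
wait

# Check the EBS BurstBalance metric for the node's volume over the last half hour
# (volume id is a placeholder; BurstBalance is reported as a percentage for gp2 volumes).
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time "$(date -u -d '30 minutes ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Minimum
```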
All other issues are separate now. Closing.