Our deployment test suite has started flaking more frequently because deployer pods need more time to become ready. We don't have much useful data, since all of our flakes surface as test timeouts (the tests are time-bounded).
See:
https://github.com/openshift/origin/pull/11001
and related flakes:
https://github.com/openshift/origin/issues/10951
https://github.com/openshift/origin/issues/11008
https://github.com/openshift/origin/issues/10989
cc: @derekwaynecarr @smarterclayton @mfojtik
Pods appear to be scheduled, but in the cases we've seen there are long delays between starting the infra container and the remaining containers.
There was a similar BZ reported in kube 1.3 when running density tests from 0-100 pods on AWS.
@timothysc's team was going to investigate that issue to root cause. This looks the same. At the time, we suspected there were global locking issues in openshift-sdn.
/cc @eparis
We did actually manage to eliminate the SDN as a suspect, however, before we go down that route again....
After some investigation today (when SDN was not in use) it may have been correlated to throttling on secret retrievals. In a local environment (bone stock origin 1.4.0-alpha.0) I was able to easily reproduce long delays when multiple pods are being scheduled. Calls to docker appeared to be fast, but the kubelet itself was reporting kubelet_docker_operation_latencies in the tens of seconds even for the 50th percentile (the 90th and 99th were only a few seconds higher).
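For anyone trying to confirm this locally, the kubelet exposes those latency summaries on its metrics endpoint. A minimal sketch, assuming the kubelet read-only port (10255) is enabled and that the metric family is named kubelet_docker_operations_latency_microseconds in this version; adjust if either differs:
```sh
# Dump the kubelet's docker-operation latency summaries (values are in microseconds).
# Port and metric name are assumptions for this kubelet era.
curl -s http://localhost:10255/metrics \
  | grep '^kubelet_docker_operations_latency_microseconds' \
  | grep 'quantile="0.5"'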
Here is the BZ in question: https://bugzilla.redhat.com/show_bug.cgi?id=1343196
It's not the scheduler.
Added the test-flake label because other flake issues have been closed/duped in favor of this one.
I see these errors in a separate failed extended image test:
E0920 17:47:28.496490 16730 kubelet.go:1816] Unable to mount volumes for pod "mongodb-1-deploy_extended-test-mongodb-replica-a41h3-f5b9t(49f979b4-7f7b-11e6-a225-0e03779dc447)": timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]; skipping pod
E0920 17:47:28.496507 16730 pod_workers.go:184] Error syncing pod 49f979b4-7f7b-11e6-a225-0e03779dc447, skipping: timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]
I0920 17:47:28.496820 16730 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"extended-test-mongodb-replica-a41h3-f5b9t", Name:"mongodb-1-deploy", UID:"49f979b4-7f7b-11e6-a225-0e03779dc447", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]
I0920 17:47:28.496861 16730 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"extended-test-mongodb-replica-a41h3-f5b9t", Name:"mongodb-1-deploy", UID:"49f979b4-7f7b-11e6-a225-0e03779dc447", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "mongodb-1-deploy_extended-test-mongodb-replica-a41h3-f5b9t(49f979b4-7f7b-11e6-a225-0e03779dc447)": timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]
which seem to confirm the theory that secrets are taking too long
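For future reference, a quick way to check whether the secret volume mounts are the slow step is to look at the pod's events. A rough sketch, reusing the namespace and pod name from the log above:
```sh
# Look for FailedMount / FailedSync events on the deployer pod (names taken from the log above).
oc get events -n extended-test-mongodb-replica-a41h3-f5b9t | grep -E 'FailedMount|FailedSync'
oc describe pod mongodb-1-deploy -n extended-test-mongodb-replica-a41h3-f5b9t
```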
reassign to tstclair as he was planning to root cause during 1.5.
We have a fair amount of info that this is not secrets (in most cases), and that Docker is taking extended periods of time to respond to create calls.
@kargakis got:
/data/src/github.com/openshift/origin/test/extended/deployments/deployments.go:715
Expected an error to have occurred. Got:
/data/src/github.com/openshift/origin/test/extended/deployments/deployments.go:696
is there no timeout we can bump to unblock the merge queue? or disable some of these tests? we're totally blocked by this, almost nothing is merging.
Please keep this issue up to date with fixes.
I switched my docker graph driver from devicemapper to overlay (which also required that I disable docker's selinux support), and the timings were significantly better. I'd say it went from taking 20-40 seconds in between starting the infra and actual containers (on average) to no more than 5 seconds, with an average of probably 1-2.
Note this is not statistically significant, as I only ran the overlay test once. But it's definitely something to investigate (possible contention somewhere in the devicemapper graph driver).
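For reference, this is roughly what the switch looked like, sketched for a RHEL/Fedora-style host where docker's options live in /etc/sysconfig/docker; the file location and option names are assumptions about that setup, and on these docker versions overlay requires disabling docker's SELinux support:
```sh
# Switch the docker graph driver from devicemapper to overlay (sketch).
# In /etc/sysconfig/docker (or however dockerd is launched), set something like:
#   OPTIONS='--storage-driver=overlay'    # and drop --selinux-enabled
# then restart and verify:
systemctl restart docker
docker info | grep 'Storage Driver'
```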
Bumping to p0, this is still blocking merges across the cluster. We need a simple fix in the short term that allows us to stop flaking.
If we have to increase timeouts on certain tests let's do it, but I want the flakes gone.
This test (https://github.com/openshift/origin/blob/8ce6de44e16f506a921de75d1e63b0d2ea49195d/test/extended/deployments/deployments.go#L382) fails on my machine consistently, the problem being that one of the pods of the deployment (https://github.com/openshift/origin/blob/8ce6de44e16f506a921de75d1e63b0d2ea49195d/test/extended/testdata/deployment-simple.yaml) fails the readiness checks. I've tried changing timeouts on readiness and in tests, and sooner or later one of the pods will start failing the readiness checks. Not sure what to check more, but I'll be debugging more... Posting here just to notify about the progress...
Try to use a different fixture that is not strict about readiness (the current test requires both pods to become ready) since this test doesn't test readiness.
The only time I can get this test to always pass is to change the readiness check from an HTTP probe to a TCP probe; not sure if that's desirable.
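To make the change concrete, a sketch of how the probe could be swapped on the fixture's deployment config; the dc name and port are placeholders for the deployment-simple fixture, and the availability of oc set probe with these flags in this origin version is an assumption:
```sh
# Swap the readiness probe from httpGet to a TCP check (sketch; dc name and port are placeholders).
oc set probe dc/deployment-simple --readiness --open-tcp=8080
```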
I don't understand why that would make a difference?
I don't understand why that would make a difference?
The reason the httpGet probes are failing is usually timeouts on the GET, but even significantly increasing those timeouts (up to 10s) didn't help. Frankly, I have no idea why the two differ; maybe there's some bug in the probes... Will check it out...
Is it because the container is starting and failing and Get actually exposes the app failure?
@mfojtik @soltysh bump. Status here?
I think the original issue description (post 1.4 rebase pod startup times have increased) is not happening because of the rebase. It's because of the upgrade from docker 1.9 to 1.10. Right?
Yup, that's what we've identified. I haven't seen anything else in there.
@ncdc and we should now run all CI on docker 1.12 right? so this should no longer happen.
I vote for closing this issue, and reopening if the problem bites us again.
@mfojtik that will only be true if docker 1.12 is faster than 1.10
@ncdc I smell it is not: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance/8031/testReport/junit/(root)/Extended/_k8s_io__Services_should_serve_multiport_endpoints_from_pods__Conformance_/
@danmcp @smarterclayton maybe we need 1 jenkins job that tests using overlay instead of devicemapper, for comparison
@ncdc @danmcp @smarterclayton or an option to the job where you can choose the storage driver
Choosing is not good, it has to be permanent, i.e. one job running on devicemapper, the other on overlay.
Most deployment flakes are related to this. @mfojtik @smarterclayton not sure if we should extend the deployment timeout any more. To be honest, I would prefer using the default deployment timeout (10m) and be done with this at the expense of some tests becoming slower.
The aggressive timeout doesn't seem to help us. Is there any evidence of that?
@smarterclayton https://github.com/openshift/origin/issues/11685 the deployer pod succeeds around 5min after it started. The deployment pod was scaled up from the start but didn't transition to Ready until ~5min later. We end up failing the test.
Why did it take 5 min? That's completely unexpected.
@rhvgoyal Please take a look at this. Thanks.
This is so high level that I have no idea. So far none of the data suggests that it is a storage issue. If this is a problem, please narrow it down.
@rhvgoyal we have a doc (I need to get a link to it) showing that the devicemapper timings are slower when you go from docker 1.9 to 1.10. I apologize for not having this data handy right now. I believe @soltysh has it somewhere. We'll get it to you as soon as we can.
I'll try to run 1.12 tests and add them to that document. @rhvgoyal where can I get 1.12 binaries/package for F24?
I'm running a combination of http://koji.fedoraproject.org/koji/taskinfo?taskID=16262294 and http://koji.fedoraproject.org/koji/buildinfo?buildID=812817 (but you probably could update the latter to 1.12.3-2).
I have these packages installed:
Given the rate of flakes, we need to do something here.
Probably increasing a few timeouts.
Agreed, I'm seeing this in half of the failures in #11916. The only question is what can we do? I doubt switching to OverlayFS is an option?
It's not.
Increasing a few timeouts seems reasonable. I'll dig into the failures and see which are the most frequent ones and where it makes sense.
This is not a 1.4 blocker, because the longer times are not introduced by origin code but rather by the newer docker version. This document contains a detailed log of the tests performed with different docker versions vs. origin 1.4 and 1.3 vs. different storage drivers (overlayfs and devicemapper).
@jwhonce @rhvgoyal bump - please let us know if there's anything we can do to help you debug this
I've updated the doc with docker 1.12 tests against origin 1.4 and 1.5 (devicemapper only). It looks like origin 1.4 with docker 1.12 is an additional second slower than with 1.10, and origin 1.5 is yet another second slower (time in parentheses is the median):
| | Origin 1.4 + Docker 1.10 | Origin 1.4 + Docker 1.12 | Origin 1.5 + Docker 1.10 | Origin 1.5 + Docker 1.12 |
|---|---|---|---|---|
| CreateContainer | 10.6s (12.9s) | 11.3s (14.1s) | 11.7s (14.8s) | 12.1s (15.3s) |
| runContainer | 12.8s (15.2s) | 11.7s (14.5s) | 12.4s (15.4s) | 12.5s (15.7s) |
EDIT: added origin 1.5 + docker 1.10 times
The doc was updated as well.
/cc @mffiedler ...
still seeing issues from this: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance/9556/
(https://github.com/openshift/origin/issues/9076 was duped to this issue)
@kargakis are we closing in on a solution to this?
sorry, question should be directed to @jwhonce and @soltysh
and if not i'd like to reopen https://github.com/openshift/origin/issues/11649 (which was closed as dupe of this) so a workaround can be put in place because it's cropping up a lot lately.
@bparees for now, increasing timeouts for serial re-appearances is the only solution :/
@soltysh ok, then i don't think we should be duping specific test case flakes to this issue, we need to fix each flake. I'm going to reopen #11649.
Yes, please, it'll help us find those cases.
Latency metrics for node 172.18.13.129
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:2m1.879512s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:1m37.203206s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pleg_relist_latency_microseconds Quantile:0.99 Latency:50.928686s}
Dec 21 15:28:31.743: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:48.959779s}
Dec 21 15:28:31.743: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:48.823987s}
Dec 21 15:28:31.743: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:46.16127s}
Dec 21 15:28:31.743: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:45.374806s}
Dec 21 15:28:31.743: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:45.374806s}
Dec 21 15:28:31.743: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:41.035892s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.99 Latency:40.465985s}
Dec 21 15:28:31.743: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.9 Latency:39.089519s}
Dec 21 15:28:31.743: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.99 Latency:36.979561s}
Dec 21 15:28:31.743: INFO: {Operation:list_containers Method:docker_operations_latency_microseconds Quantile:0.99 Latency:31.90826s}
Dec 21 15:28:31.743: INFO: {Operation:GetPods Method:container_manager_latency_microseconds Quantile:0.99 Latency:31.907478s}
Dec 21 15:28:31.743: INFO: {Operation:inspect_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:25.352457s}
Dec 21 15:28:31.743: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:13.037115s}
Wow.
This is a docker problem. I'm seeing 1/4 conformance runs fail because of this, and start latencies in excess of 2-3 minutes during those runs. Something is broken in Docker, possibly related to the parallel builds hang.
Raising this to P0 - @jwhonce if you can make sure the appropriate investigations are taking place. This would be a production blocker for many use cases (loaded node takes many minutes to start some pods).
There was at least one container that took 4 minutes to start:
Latency metrics for node ci-pr65-ig-n-x0v5.c.openshift-gce-devel-ci.internal
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:4m0.426184s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:2m33.107112s}
Jan 6 00:09:38.454: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:2m15.985041s}
Jan 6 00:09:38.454: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.9 Latency:2m3.638843s}
Jan 6 00:09:38.454: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.99 Latency:2m3.638843s}
Jan 6 00:09:38.454: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:1m36.1392s}
Jan 6 00:09:38.454: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:1m26.463562s}
Jan 6 00:09:38.454: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:49.979557s}
Jan 6 00:09:38.454: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.99 Latency:47.685318s}
Jan 6 00:09:38.454: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:37.203829s}
Jan 6 00:09:38.454: INFO: {Operation:create_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:31.249116s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pleg_relist_latency_microseconds Quantile:0.99 Latency:30.389475s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.99 Latency:24.973853s}
Jan 6 00:09:38.454: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:21.276118s}
Jan 6 00:09:38.454: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:19.079327s}
Jan 6 00:09:38.454: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.9 Latency:17.403305s}
Jan 6 00:09:38.454: INFO: {Operation:create_container Method:docker_operations_latency_microseconds Quantile:0.9 Latency:12.668864s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.9 Latency:12.609544s}
Jan 6 00:09:38.454: INFO: {Operation:remove_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:11.227787s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.5 Latency:10.568633s}
Jan 6 00:09:38.454: INFO: {Operation:list_containers Method:docker_operations_latency_microseconds Quantile:0.99 Latency:10.550709s}
@ncdc Do you have a guess at the number of containers being tracked by the docker daemon when you start seeing the excessive latencies? $(docker info |grep 'Containers:') Thanks.
Running containers are printed in the junit failures. It's generally between 10-30 containers running, including pause containers, with some CPU spikes. In the ones I was seeing, it was a 2 core system so I have to assume that docker was under CPU pressure (probably not memory).
@jwhonce running oc run latency-test --image gcr.io/google_containers/pause:2.0 --replicas=30 will give you those failing containers; it's from the linked doc: https://docs.google.com/document/d/1AaNZTggal-OUjgJah7FV4mWYNS-in5cHTBLf0BDIkh4/edit
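A sketch of how that reproducer can be watched end to end; the run=latency-test label selector and the fields grepped out of the describe output are assumptions about this oc/kubectl era:
```sh
# Start the pause pods and watch how long they take to go Running (sketch).
oc run latency-test --image gcr.io/google_containers/pause:2.0 --replicas=30
oc get pods -l run=latency-test -w
# Afterwards, compare per-pod scheduled vs. started times:
oc describe pods -l run=latency-test | grep -E 'Start Time|Started'
```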
My concern now is that we've moved from "things are 50% slower" to "something is hanging / pausing and timing out", where the former applies to simple containers and the latter is complex things running at the same time (like builds happening at the same time).
I'm finally gathering metrics data from the GCE cluster runs (so that we have prometheus data that can at least tell us if we cross certain thresholds). I'm not sure, however, that the default 0.99 quantile is going to capture outliers - we don't really know whether the 3m is an outlier (the entire system goes into IO inversion and one container stops for 3 minutes) or whether it's better distributed. Hoping to get that with the latest metrics: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_gce/332/ (has links to the .metrics files from each node that are the prometheus captures).
I did increase the GCE SSD sizes, which gives us a small bump in IOPS quota, but we're still seeing some throttles.
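Once those .metrics captures are downloaded, something like the following should surface the worst docker-operation latencies per node; the metric name is from memory and may differ by kubelet version:
```sh
# Find the highest p99 docker-operation latencies across the captured node metrics (sketch).
grep 'kubelet_docker_operations_latency_microseconds' node-*.metrics \
  | grep 'quantile="0.99"' \
  | sort -k2 -rn | head
```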
@vikaslaad and @mffiedler have started a new experiment today and we should have data soon-ish.
This may be https://github.com/openshift/origin/issues/12784 instead (replying to the flake reported at https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/48/).
Even though 2m is an example of this problem
Here's an example of where we waited a very long time and there were no running pods - this would have to be an incredibly slow layer creation to be the docker lock issue:
https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_gce/870/testReport/junit/(root)/Extended/deploymentconfigs_should_respect_image_stream_tag_reference_policy__Conformance_/
I don't see more than a few pods running on any node.
I did a run on overlayFS tonight. I still see:
Mar 6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.5 Latency:2m2.536474s}
Mar 6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.99 Latency:2m2.536474s}
Mar 6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.9 Latency:2m2.536474s}
Mar 6 22:42:09.205: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:51.995394s}
Mar 6 22:42:09.205: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:30.130964s}
Mar 6 22:42:09.205: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:25.470511s}
Mar 6 22:42:09.205: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:23.547013s}
Mar 6 22:42:09.205: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:18.910722s}
Mar 6 22:42:09.205: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:17.537977s}
I'm still seeing IO limits being poked (consistent with devicemapper, actually, although that could just be the tests running faster or docker pulls still being the bulk of the workload).
@ncdc @sjenning @derekwaynecarr @jwhonce I should be able to give you repeatable clusters with overlay tomorrow - what do we need to do to verify that this is the CreateLayer previously fingered as the culprit, vs something more fundamental (say, the Docker daemon being saturated and the kubelet just giving up)? This is looking more like a kubelet / docker interaction.
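One low-level way to separate layer creation from the rest of container start, using only the docker CLI. This is only a sketch of the kind of probe that could help confirm or rule out CreateLayer, and the image is just an example:
```sh
# Time container-layer creation separately from container start (sketch).
docker pull docker.io/centos:7
time docker create --name layer-test docker.io/centos:7 sleep 60
time docker start layer-test
docker rm -f layer-test
```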
xref: https://bugzilla.redhat.com/show_bug.cgi?id=1422511
We are hitting cases where docker image listing takes > 1 min to complete; the machine is not actually memory-starved in the instances where this occurs.
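A trivial way to capture that on an affected node while the tests are running, as a sketch:
```sh
# Time the slow daemon-side listing calls directly (sketch).
time docker images > /dev/null
time docker ps -a > /dev/null
```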
This p0 has been open for 10 months, and inactive for 4. Is it actionable? If not, is it really a p0?
@liggitt I leave the priority to you to set/change. As for news on the issue, I finally have a reproducer that uses only docker and am working with the performance team to further isolate the cause of the slowdowns.
Ops are updating their monitoring: https://trello.com/c/cB9lX5sb. The suspected cause is volume BurstBalance being exhausted.
@jwhonce could you give more details on the reproduction of the issue and is there any update on this?
@xThomo The reproducer ran docker run in multiple threads, which showed that the BurstBalance on the volumes was exhausted. The Ops team is now implementing monitoring of that metric to determine whether that is the root cause for these issues, or if there are other gating factors.
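A sketch of the general shape of such a reproducer (not the actual one), plus a way to watch EBS BurstBalance while it runs; the volume id, region defaults, and image are placeholders:
```sh
# Hit the docker daemon from many clients at once (rough sketch of the described reproducer).
for i in $(seq 1 20); do
  docker run -d --name burst-$i docker.io/centos:7 sleep 300 &
done
wait

# Check the EBS BurstBalance metric for the node's volume over the last half hour
# (volume id is a placeholder; BurstBalance is reported as a percentage for gp2 volumes).
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time "$(date -u -d '30 minutes ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Minimum
```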
All other issues are separate now. Closing.