Origin: High latency with resource="imagestreamimports" and verb="POST"

Created on 16 Nov 2018 · 22 comments · Source: openshift/origin

With the default monitoring from cluster-monitoring-operator, the alert KubeAPILatencyHigh periodically reports high latency on the API server.
The resource "imagestreamimports" is hit periodically because I have imagestreams with "importPolicy.scheduled" = true.

Version
oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://s-openshift.mycompany.com:443
openshift v3.11.0+6c2b013-59
kubernetes v1.11.0+d4cacc0
Steps To Reproduce
  1. Install Openshift 3.11 with Ansible
  2. Verify that openshift-monitoring is installed with default configuration
  3. Create some imagestreams with importPolicy.scheduled = true
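For reference, an imagestream that reproduces this looks roughly like the sketch below (the name, namespace, and image are placeholders, not from the report):

```yaml
# Hypothetical imagestream; any scheduled tag import will do.
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: nodejs
  namespace: my-project
spec:
  tags:
  - name: "10"
    from:
      kind: DockerImage
      name: docker.io/centos/nodejs-10-centos7:latest
    importPolicy:
      scheduled: true   # causes periodic POSTs to imagestreamimports
```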
Current Result

API call latency is quite high, 1-4 seconds.
[screenshot: Prometheus time-series graph of the API latency]

Expected Result

Fast API calls with low latency.

Additional Info

Maybe related to #14264.
I noticed these messages in the API server log:

I1116 10:15:42.621803       1 trace.go:76] Trace[2046733054]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:15:41.486368579 +0000 UTC m=+80018.770131016) (total time: 1.135397696s):
Trace[2046733054]: [1.13489078s] [1.134427655s] Object stored in database
I1116 10:19:28.517513       1 trace.go:76] Trace[1052075296]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:19:26.48801308 +0000 UTC m=+80243.771775530) (total time: 2.029461307s):
Trace[1052075296]: [2.028804306s] [2.028362867s] Object stored in database
I1116 10:26:59.096277       1 trace.go:76] Trace[1356838264]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:26:56.48712211 +0000 UTC m=+80693.770884485) (total time: 2.609073903s):
Trace[1356838264]: [2.608495955s] [2.607996341s] Object stored in database
I1116 10:34:28.377486       1 trace.go:76] Trace[1656023571]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:34:26.485752391 +0000 UTC m=+81143.769514793) (total time: 1.891700389s):
Trace[1656023571]: [1.891163987s] [1.890827358s] Object stored in database
I1116 10:42:01.529790       1 trace.go:76] Trace[40799593]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:41:56.677180492 +0000 UTC m=+81593.960942873) (total time: 4.852573087s):
Trace[40799593]: [4.851734298s] [4.851280253s] Object stored in database
I1116 10:45:42.642385       1 trace.go:76] Trace[556365896]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:45:41.48695107 +0000 UTC m=+81818.770713475) (total time: 1.155388409s):
Trace[556365896]: [1.154500425s] [1.154029628s] Object stored in database
Labels: component/image, lifecycle/rotten

Most helpful comment

Just adding a voice here - the alarm began to trigger for us when we configured the image streams for scheduled updates for the official OpenShift images in the openshift namespace, on two different clusters.

All 22 comments

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

/remove-lifecycle stale

This API call goes to remote registries, and in general we can't guarantee that it will be fast.

+1

@dmage Can you explain a little bit more?
Some thoughts:
Setting importPolicy.scheduled: true in an imagestream's configuration enables a periodic sync with the remote image registry. I think some kind of scheduler checks for this configuration flag; it should then ask the remote registry for updated information and, once all the information is in RAM, push it to the API server in one request, which updates the API objects.

@Reamer we have the controller "openshift.io/image-import" that goes through all image streams that have "importPolicy.scheduled: true" and creates ImageStreamImport objects. The image API has a special handler for creating ImageStreamImport objects: it gets the list of remote registries from the ImageStreamImport object and from the image stream, fetches manifests from the remote registries, and writes information about new images back to the image stream.
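The object the controller creates might look like the following sketch (names are illustrative; only the shape of the ImageStreamImport spec is the point):

```yaml
# Sketch of an ImageStreamImport the "openshift.io/image-import" controller
# would POST; the API handler synchronously fetches manifests for it.
apiVersion: image.openshift.io/v1
kind: ImageStreamImport
metadata:
  name: nodejs
  namespace: my-project
spec:
  import: true          # write the results back to the image stream
  images:
  - from:
      kind: DockerImage
      name: docker.io/centos/nodejs-10-centos7:latest
    importPolicy:
      scheduled: true
```

Because the handler contacts the remote registry before the POST returns, the request latency includes the registry round-trip, which is why these requests routinely exceed one second.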

I disabled the alert rule in the Alertmanager configuration because it's too noisy. Maybe we should exclude this specific API endpoint from the alert rule in general, because I think nearly everyone who uses Docker Hub as a remote registry will have this problem.
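One way to quiet the alert without editing the rule itself is a scoped Alertmanager silence; a sketch using amtool (the URL is a placeholder, and the matcher labels are assumed to match those on the firing alert):

```shell
amtool silence add \
  --alertmanager.url=http://alertmanager-main.openshift-monitoring.svc:9093 \
  --comment="imagestreamimports POSTs hit remote registries; latency is expected" \
  alertname=KubeAPILatencyHigh resource=imagestreamimports verb=POST
```

This keeps KubeAPILatencyHigh active for every other resource/verb combination.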

Just adding a voice here - the alarm began to trigger for us when we configured the image streams for scheduled updates for the official OpenShift images in the openshift namespace, on two different clusters.

Just adding a voice here - the alarm began to trigger for us when we configured the image streams for scheduled updates for the official OpenShift images in the openshift namespace, on two different clusters.

Are you using registry.access.redhat.com or registry.redhat.io?

@yocum137 I work on the same cluster @clcollins was referring to; we're not using either of those registries. The registries in our openshift namespace are mostly on docker.io or registry.centos.org.

We're OKD, not OCP.


/remove-lifecycle stale

We have this issue too:

  • Using OCP v3.11.146 (Kubernetes v1.11.0+d4cacc0).
  • After enabling importPolicy.scheduled: true for some imagestreams.
  • These imagestreams are pointing to registry.redhat.io.
  • They are correctly being synchronized.
  • They produce a KubeAPILatencyHigh alert in Prometheus (The API server has a 99th percentile latency of 7.826666666666666 seconds for POST imagestreamimports.)
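For context, the KubeAPILatencyHigh rule of that era fired when the 99th-percentile request latency exceeded a fixed threshold; a sketch of the expression, reconstructed from the upstream kubernetes-mixin rules (exact recording-rule and label names may differ by version):

```
cluster_quantile:apiserver_request_latencies:histogram_quantile{
  job="apiserver", quantile="0.99",
  subresource!="log",
  verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"
} > 1
```

Since a scheduled import routinely takes 1-8 seconds, any expression of this shape will fire; the downstream fix presumably added a matcher excluding imagestreamimports (an assumption, not confirmed in this thread).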

I will contact Red Hat Support about how to handle this.

For your reference: the same issue is being described at Red Hat Bugzilla - Bug 1670380 - Alertmanager triggers error when updating the image.


The issue should have been fixed in Red Hat OpenShift Container Platform 3.11.170, but I haven't tested it myself. Can someone confirm?

Related:

@trumbaut that's great that it's fixed downstream! Any idea where we could find that in openshift/origin? The last official release we have here is v3.11.0 from October 2018.

Did a bit of digging based on that "Port ... to 3.11" MR; maybe this is in the quay.io/openshift/origin-cluster-monitoring-operator images, perhaps on the v3.11.0 tag, but I have a hard time finding anything concrete about the history of images on Quay. We could probably find a CI process log somewhere that connects that MR to an image pushed to the origin-cluster-monitoring-operator repository. I'm willing to test in our dev cluster if we can find at least a shred of evidence that the change is actually available outside of OCP.

@nate-duke :

Any idea where we could find that in openshift/origin? The last official release we have here is v3.11.0 from October 2018.

The changes can be found at

But no, I have no idea whether new releases for OKD 3.11 will still be delivered. OKD 4 can be found at https://github.com/openshift/okd.

But no, I have no idea whether new releases for OKD 3.11 will still be delivered. OKD 4 can be found at https://github.com/openshift/okd.

We've since silenced this alert, but I did do the legwork of updating our images, so there's a chance we picked up these fixes if they were ever included in anything built for OKD 3.11.

I'm glad to see that the development of OKD 4 is proceeding. I've been following that repo for the last few months. I think we'll likely wait until FCOS and support for it in OKD 4.0 has matured just a bit more before we go building a new cluster to actually try and run it.

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
