Origin: High latency with resource="imagestreamimports" and verb="POST"

Created on 16 Nov 2018 · 22 comments · Source: openshift/origin

With the default monitoring from cluster-monitoring-operator, the alert KubeAPILatencyHigh periodically reports high latency on the API server.
The resource "imagestreamimports" is hit periodically because I have imagestreams with "importPolicy.scheduled" = true.

Version
oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://s-openshift.mycompany.com:443
openshift v3.11.0+6c2b013-59
kubernetes v1.11.0+d4cacc0
Steps To Reproduce
  1. Install Openshift 3.11 with Ansible
  2. Verify that openshift-monitoring is installed with default configuration
  3. Create some imagestreams with importPolicy.scheduled = true
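For reference, an imagestream that reproduces this looks roughly like the sketch below (the name, namespace, and image are placeholders, not from the report):

```yaml
# Hypothetical imagestream; any scheduled tag import will do.
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: nodejs
  namespace: my-project
spec:
  tags:
  - name: "10"
    from:
      kind: DockerImage
      name: docker.io/centos/nodejs-10-centos7:latest
    importPolicy:
      scheduled: true   # causes periodic POSTs to imagestreamimports
```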
Current Result

API call latency is quite high, 1-4 seconds.
[screenshot: Prometheus time-series graph of the API latency]

Expected Result

Fast API calls with low latency.

Additional Info

Maybe related to #14264.
I noticed these messages in the API server log:

I1116 10:15:42.621803       1 trace.go:76] Trace[2046733054]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:15:41.486368579 +0000 UTC m=+80018.770131016) (total time: 1.135397696s):
Trace[2046733054]: [1.13489078s] [1.134427655s] Object stored in database
I1116 10:19:28.517513       1 trace.go:76] Trace[1052075296]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:19:26.48801308 +0000 UTC m=+80243.771775530) (total time: 2.029461307s):
Trace[1052075296]: [2.028804306s] [2.028362867s] Object stored in database
I1116 10:26:59.096277       1 trace.go:76] Trace[1356838264]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:26:56.48712211 +0000 UTC m=+80693.770884485) (total time: 2.609073903s):
Trace[1356838264]: [2.608495955s] [2.607996341s] Object stored in database
I1116 10:34:28.377486       1 trace.go:76] Trace[1656023571]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:34:26.485752391 +0000 UTC m=+81143.769514793) (total time: 1.891700389s):
Trace[1656023571]: [1.891163987s] [1.890827358s] Object stored in database
I1116 10:42:01.529790       1 trace.go:76] Trace[40799593]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:41:56.677180492 +0000 UTC m=+81593.960942873) (total time: 4.852573087s):
Trace[40799593]: [4.851734298s] [4.851280253s] Object stored in database
I1116 10:45:42.642385       1 trace.go:76] Trace[556365896]: "Create /apis/image.openshift.io/v1/namespaces/my-project/imagestreamimports" (started: 2018-11-16 10:45:41.48695107 +0000 UTC m=+81818.770713475) (total time: 1.155388409s):
Trace[556365896]: [1.154500425s] [1.154029628s] Object stored in database
Labels: component/image, lifecycle/rotten

Most helpful comment

Just adding a voice here - the alarm began to trigger for us when we configured the image streams for scheduled updates for the official OpenShift images in the openshift namespace, on two different clusters.

All 22 comments

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

/remove-lifecycle stale

This API call goes to remote registries, and in general we can't guarantee that it will be fast.

+1

@dmage Can you explain a little bit more?
Some thoughts:
Setting importPolicy.scheduled: true in an imagestream's configuration enables a periodic sync with the remote image registry. I think some kind of scheduler checks for this configuration flag; it should then ask the remote registry for updated information and, once all the information is in RAM, push it to the API server in one request, which updates the API objects.

@Reamer we have the controller "openshift.io/image-import" that goes through all image streams that have "importPolicy.scheduled: true" and creates ImageStreamImport objects. The image API has a special handler for creating ImageStreamImport objects: it gets the list of remote registries from the ImageStreamImport object and from the image stream, fetches manifests from the remote registries, and writes information about new images back to the image stream.
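The object the controller creates might look like the following sketch (names are illustrative; only the shape of the ImageStreamImport spec is the point):

```yaml
# Sketch of an ImageStreamImport the "openshift.io/image-import" controller
# would POST; the API handler synchronously fetches manifests for it.
apiVersion: image.openshift.io/v1
kind: ImageStreamImport
metadata:
  name: nodejs
  namespace: my-project
spec:
  import: true          # write the results back to the image stream
  images:
  - from:
      kind: DockerImage
      name: docker.io/centos/nodejs-10-centos7:latest
    importPolicy:
      scheduled: true
```

Because the handler contacts the remote registry before the POST returns, the request latency includes the registry round-trip, which is why these requests routinely exceed one second.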

I disabled the alert rule in the Alertmanager configuration because it's too noisy. Maybe we should exclude this specific API endpoint from the alert rule in general, because I think nearly everyone who uses Docker Hub as a remote registry will have this problem.
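One way to quiet the alert without editing the rule itself is a scoped Alertmanager silence; a sketch using amtool (the URL is a placeholder, and the matcher labels are assumed to match those on the firing alert):

```shell
amtool silence add \
  --alertmanager.url=http://alertmanager-main.openshift-monitoring.svc:9093 \
  --comment="imagestreamimports POSTs hit remote registries; latency is expected" \
  alertname=KubeAPILatencyHigh resource=imagestreamimports verb=POST
```

This keeps KubeAPILatencyHigh active for every other resource/verb combination.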

Just adding a voice here - the alarm began to trigger for us when we configured the image streams for scheduled updates for the official OpenShift images in the openshift namespace, on two different clusters.

Just adding a voice here - the alarm began to trigger for us when we configured the image streams for scheduled updates for the official OpenShift images in the openshift namespace, on two different clusters.

Are you using registry.access.redhat.com or registry.redhat.io?

@yocum137 I work on the same cluster @clcollins was referring to; we're not using either of those registries. The registries in our openshift namespace are mostly on docker.io or registry.centos.org.

We're OKD, not OCP.


/remove-lifecycle stale

We have this issue too:

  • Using OCP v3.11.146 (Kubernetes v1.11.0+d4cacc0).
  • After enabling importPolicy.scheduled: true for some imagestreams.
  • These imagestreams are pointing to registry.redhat.io.
  • They are correctly being synchronized.
  • They produce a KubeAPILatencyHigh alert in Prometheus (The API server has a 99th percentile latency of 7.826666666666666 seconds for POST imagestreamimports.)
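For context, the KubeAPILatencyHigh rule of that era fired when the 99th-percentile request latency exceeded a fixed threshold; a sketch of the expression, reconstructed from the upstream kubernetes-mixin rules (exact recording-rule and label names may differ by version):

```
cluster_quantile:apiserver_request_latencies:histogram_quantile{
  job="apiserver", quantile="0.99",
  subresource!="log",
  verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"
} > 1
```

Since a scheduled import routinely takes 1-8 seconds, any expression of this shape will fire; the downstream fix presumably added a matcher excluding imagestreamimports (an assumption, not confirmed in this thread).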

I will contact Red Hat Support about how to handle this.

For your reference: the same issue is being described at Red Hat Bugzilla - Bug 1670380 - Alertmanager triggers error when updating the image.


The issue should have been fixed in Red Hat OpenShift Container Platform 3.11.170, but I haven't tested it myself. Can someone confirm?

Related:

@trumbaut that's great that it's fixed downstream! Any idea where we could find that in openshift/origin? The last official release we have here is v3.11.0 from October 2018.

Did a bit of digging based on that "Port ... to 3.11" MR; maybe this is in the quay.io/openshift/origin-cluster-monitoring-operator images, perhaps on the v3.11.0 tag, but I have a hard time finding anything concrete about the history of images on Quay. We could probably find a CI process log somewhere that connects that MR to an image pushed to the origin-cluster-monitoring-operator repository. I'm willing to test in our dev cluster if we can find at least a shred of evidence that the change is actually available outside of OCP.

@nate-duke :

Any idea where we could find that in openshift/origin? The last official release we have here is v3.11.0 from October 2018.

The changes can be found at

But no, I have no idea whether new releases for OKD 3.11 will still be delivered. OKD 4 can be found at https://github.com/openshift/okd.

But no, I have no idea whether new releases for OKD 3.11 will still be delivered. OKD 4 can be found at https://github.com/openshift/okd.

We've since silenced this alert, but I did do the legwork of updating our images, so there's a chance we picked up these fixes if they were ever included in anything built for OKD 3.11.

I'm glad to see that the development of OKD 4 is proceeding. I've been following that repo for the last few months. I think we'll likely wait until FCOS and support for it in OKD 4.0 has matured just a bit more before we go building a new cluster to actually try and run it.

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
