There are currently openshift-api issues that are making the cluster unusable.
openshift-apiserver is down with various connection refused errors:
E1204 19:21:10.940510 1 memcache.go:147] couldn't get resource list for samplesoperator.config.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/samplesoperator.config.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:10.948714 1 memcache.go:147] couldn't get resource list for servicecertsigner.config.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/servicecertsigner.config.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:10.953465 1 memcache.go:147] couldn't get resource list for tuned.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/tuned.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.805850 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.RoleBinding: Get https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.818657 1 reflector.go:136] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: Failed to list *core.LimitRange: Get https://172.30.0.1:443/api/v1/limitranges?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.818742 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
I cannot ssh into the master, there are TCP/connection errors throughout the pods, and for a time oc and kubectl commands also stop working. Eventually most of the pods end up in CrashLoopBackOff.
I don't expect any errors.
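For anyone triaging the same symptom: 172.30.0.1:443 is the in-cluster kubernetes service address, so these connection-refused errors mean the apiserver endpoints behind that service are unreachable. A rough first check (only when oc happens to respond, and assuming the default namespaces) is:
oc get endpoints kubernetes -n default              # should list at least one healthy apiserver endpoint
oc get pods --all-namespaces | grep -i apiserver    # are the apiserver pods themselves running?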
This issue came out of a conversation with @deads2k and @brancz.
When your server comes back up (it will crash and try to recover), quickly run:
oc label ns openshift-monitoring 'openshift.io/run-level=1'
oc create quota -nopenshift-monitoring stoppods --hard=pods=0
oc -n openshift-monitoring delete pods --all
and you should end up with a stable cluster, just one without any monitoring.
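A quick way to confirm the workaround took hold, and to undo it later once monitoring is wanted again (just a sketch, using the quota and label names from above):
oc get pods -n openshift-monitoring                         # should eventually show no pods
oc -n openshift-monitoring delete quota stoppods            # revert: drop the pods=0 quota
oc label ns openshift-monitoring openshift.io/run-level-    # revert: the trailing '-' removes the label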
we can bump domain memory for masters to 4Gi
https://github.com/openshift/installer/blob/ee73f72017bdfe681629e8d3f41cb5ae1d5b4775/pkg/asset/machines/libvirt/machines.go#L70
@openshift/sig-cloud
Currently only the installer provisions masters (because once the cluster is running you'd need manual intervention to attach new etcd nodes). And installer-launched masters got bumped to 4GB in openshift/installer#785 (just landed).
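If you want to double-check what a libvirt master actually got, virsh reports the allocated memory per domain (domain names vary with your cluster name, so this is only a sketch):
sudo virsh list --all                  # master domains usually have "master" in their name
sudo virsh dominfo <master-domain>     # "Max memory" should read roughly 4194304 KiB after the bump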
should be resolved now
Reopening because I'm still seeing these errors and TCP timeouts.
Seeing the following on openshift-apiserver:
E1206 23:15:36.782303 1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.783546 1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.785915 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.786749 1 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.788326 1 memcache.go:147] couldn't get resource list for oauth.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.789024 1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.789890 1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.791221 1 memcache.go:147] couldn't get resource list for route.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.792344 1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.793448 1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.794602 1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.795432 1 memcache.go:147] couldn't get resource list for packages.apps.redhat.com/v1alpha1: the server could not find the requested resource
E1206 23:15:46.170716 1 watch.go:212] unable to encode watch object: expected pointer, but got invalid kind
E1206 23:15:46.840537 1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.843581 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.849583 1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.850103 1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.851487 1 memcache.go:147] couldn't get resource list for route.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.852618 1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.853553 1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.854751 1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.909115 1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.921756 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.935902 1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.959769 1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server is currently unable to handle the request
The current API instability may be a symptom of some underlying master instability. I don't know what's going on yet, but in a recent CI run, there was a running die-off of pods before the machine-config daemon pulled the plug and rebooted the node. Notes for the MCD part in openshift/machine-config-operator#224. Notes on etcd-member (the first pod to die) in openshift/installer#844. I don't know what's going on there, but I can certainly see occasional master reboots causing connectivity issues like these.
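A rough way to check whether the API blips line up with a master reboot is to look for kubelet Rebooted / NodeNotReady events around the error timestamps, e.g.:
oc get events --all-namespaces | grep -iE 'rebooted|nodenotready'
oc get nodes    # check node readiness; NotReady flaps often coincide with the reboots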
I think this might have been resolved by openshift/machine-config-operator#225. Can anyone still reproduce? If not, can we close this?
I still see a similar error in origin-template-service-broker:
E0104 16:15:56.017707 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:16:26.196299 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:18:26.868674 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
W0104 16:19:11.850715 1 reflector.go:272] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
E0104 16:19:27.129940 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:20:57.681989 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:21:27.845002 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:21:57.932396 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:23:28.291818 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:25:28.747637 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:26:59.112399 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:27:59.312813 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
W0104 16:28:04.023096 1 reflector.go:272] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
E0104 16:31:30.049245 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:32:30.262111 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:33:00.432617 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:35:00.870254 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:36:01.147749 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
On the command line, I get a 503 error:
[root@openshift-master-1 ~]# oc get clusterserviceclass -n kube-service-catalog --loglevel=8
I0105 00:39:26.688956 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.690015 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.699546 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.713030 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.714057 20712 round_trippers.go:383] GET https://openshift.sunhocapital.com:8443/apis/servicecatalog.k8s.io/v1beta1/clusterserviceclasses?limit=500
I0105 00:39:26.714277 20712 round_trippers.go:390] Request Headers:
I0105 00:39:26.714498 20712 round_trippers.go:393] User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0105 00:39:26.714684 20712 round_trippers.go:393] Accept: application/json;as=Table;v=v1beta1;g=meta.k8s.io, application/json
I0105 00:39:26.742742 20712 round_trippers.go:408] Response Status: 503 Service Unavailable in 27 milliseconds
I0105 00:39:26.743036 20712 round_trippers.go:411] Response Headers:
I0105 00:39:26.743222 20712 round_trippers.go:414] Cache-Control: no-store
I0105 00:39:26.743457 20712 round_trippers.go:414] Content-Type: text/plain; charset=utf-8
I0105 00:39:26.743489 20712 round_trippers.go:414] X-Content-Type-Options: nosniff
I0105 00:39:26.743530 20712 round_trippers.go:414] Content-Length: 20
I0105 00:39:26.743544 20712 round_trippers.go:414] Date: Fri, 04 Jan 2019 16:39:26 GMT
I0105 00:39:26.743643 20712 request.go:897] Response Body: service unavailable
No resources found.
I0105 00:39:26.743843 20712 helpers.go:201] server response object: [{
"metadata": {},
"status": "Failure",
"message": "the server is currently unable to handle the request (get clusterserviceclasses.servicecatalog.k8s.io)",
"reason": "ServiceUnavailable",
"details": {
"group": "servicecatalog.k8s.io",
"kind": "clusterserviceclasses",
"causes": [
{
"reason": "UnexpectedServerResponse",
"message": "service unavailable"
}
]
},
"code": 503
}]
F0105 00:39:26.744007 20712 helpers.go:119] Error from server (ServiceUnavailable): the server is currently unable to handle the request (get clusterserviceclasses.servicecatalog.k8s.io)
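The 503 above comes from the aggregation layer: clusterserviceclasses is served by the aggregated servicecatalog.k8s.io API, so the next thing worth checking (a sketch, assuming the default kube-service-catalog namespace) is whether its APIService reports Available and whether the backing pods are healthy:
oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml   # status.conditions should show Available=True; if not, the message explains why
oc get pods -n kube-service-catalog                       # the apiserver pods backing the aggregated API
oc get endpoints -n kube-service-catalog                  # the service needs ready endpoints for the aggregator to proxy to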
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issue still present with v3.11 on openSUSE Tumbleweed.