There are currently openshift-api issues that are making the cluster unusable.
openshift-apiserver is down with various connection refused errors:
E1204 19:21:10.940510 1 memcache.go:147] couldn't get resource list for samplesoperator.config.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/samplesoperator.config.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:10.948714 1 memcache.go:147] couldn't get resource list for servicecertsigner.config.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/servicecertsigner.config.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:10.953465 1 memcache.go:147] couldn't get resource list for tuned.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/tuned.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.805850 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.RoleBinding: Get https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.818657 1 reflector.go:136] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: Failed to list *core.LimitRange: Get https://172.30.0.1:443/api/v1/limitranges?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.818742 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
I cannot ssh into the master, there are TCP/connection errors throughout the pods, and for a time oc and kubectl commands also stop working. Eventually most of the pods end up in CrashLoopBackOff.
I don't expect any errors.
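For anyone triaging the same symptom: 172.30.0.1:443 is the in-cluster kubernetes service address, so these connection-refused errors mean the apiserver endpoints behind that service are unreachable. A rough first check (only when oc happens to respond, and assuming the default namespaces) is:
oc get endpoints kubernetes -n default              # should list at least one healthy apiserver endpoint
oc get pods --all-namespaces | grep -i apiserver    # are the apiserver pods themselves running?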
This issue came out of a conversation with @deads2k and @brancz.
When your server comes back up (it will crash and try to recover), quickly run:
oc label ns openshift-monitoring 'openshift.io/run-level=1'
oc create quota -nopenshift-monitoring stoppods --hard=pods=0
oc -n openshift-monitoring delete pods --all
and you should end up with a stable cluster, just one without any monitoring.
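A quick way to confirm the workaround took hold, and to undo it later once monitoring is wanted again (just a sketch, using the quota and label names from above):
oc get pods -n openshift-monitoring                         # should eventually show no pods
oc -n openshift-monitoring delete quota stoppods            # revert: drop the pods=0 quota
oc label ns openshift-monitoring openshift.io/run-level-    # revert: the trailing '-' removes the label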
we can bump domain memory for masters to 4Gi
https://github.com/openshift/installer/blob/ee73f72017bdfe681629e8d3f41cb5ae1d5b4775/pkg/asset/machines/libvirt/machines.go#L70
@openshift/sig-cloud
Currently only the installer provisions masters (because once the cluster is running you'd need manual intervention to attach new etcd nodes). And installer-launched masters got bumped to 4GB in openshift/installer#785 (just landed).
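If you want to double-check what a libvirt master actually got, virsh reports the allocated memory per domain (domain names vary with your cluster name, so this is only a sketch):
sudo virsh list --all                  # master domains usually have "master" in their name
sudo virsh dominfo <master-domain>     # "Max memory" should read roughly 4194304 KiB after the bump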
should be resolved now
Reopening because I'm still seeing these errors and TCP timeouts.
Seeing the following on openshift-apiserver:
E1206 23:15:36.782303 1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.783546 1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.785915 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.786749 1 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.788326 1 memcache.go:147] couldn't get resource list for oauth.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.789024 1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.789890 1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.791221 1 memcache.go:147] couldn't get resource list for route.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.792344 1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.793448 1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server could not find the requested resource
E1206 23:15:36.794602 1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:36.795432 1 memcache.go:147] couldn't get resource list for packages.apps.redhat.com/v1alpha1: the server could not find the requested resource
E1206 23:15:46.170716 1 watch.go:212] unable to encode watch object: expected pointer, but got invalid kind
E1206 23:15:46.840537 1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.843581 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.849583 1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.850103 1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.851487 1 memcache.go:147] couldn't get resource list for route.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.852618 1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:46.853553 1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server could not find the requested resource
E1206 23:15:46.854751 1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.909115 1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.921756 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.935902 1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E1206 23:15:56.959769 1 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server is currently unable to handle the request
The current API instability may be a symptom of some underlying master instability. I don't know what's going on yet, but in a recent CI run, there was a running die-off of pods before the machine-config daemon pulled the plug and rebooted the node. Notes for the MCD part in openshift/machine-config-operator#224. Notes on etcd-member (the first pod to die) in openshift/installer#844. I don't know what's going on there, but I can certainly see occasional master reboots causing connectivity issues like these.
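A rough way to check whether the API blips line up with a master reboot is to look for kubelet Rebooted / NodeNotReady events around the error timestamps, e.g.:
oc get events --all-namespaces | grep -iE 'rebooted|nodenotready'
oc get nodes    # check node readiness; NotReady flaps often coincide with the reboots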
I think this might have been resolved by openshift/machine-config-operator#225. Can anyone still reproduce? If not, can we close this?
I still see a similar error in origin-template-service-broker:
E0104 16:15:56.017707 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:16:26.196299 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:18:26.868674 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
W0104 16:19:11.850715 1 reflector.go:272] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
E0104 16:19:27.129940 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:20:57.681989 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:21:27.845002 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:21:57.932396 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:23:28.291818 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:25:28.747637 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:26:59.112399 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:27:59.312813 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
W0104 16:28:04.023096 1 reflector.go:272] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
E0104 16:31:30.049245 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:32:30.262111 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:33:00.432617 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:35:00.870254 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:36:01.147749 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
On the command line, I get a 503 error:
[root@openshift-master-1 ~]# oc get clusterserviceclass -n kube-service-catalog --loglevel=8
I0105 00:39:26.688956 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.690015 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.699546 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.713030 20712 loader.go:359] Config loaded from file /root/.kube/config
I0105 00:39:26.714057 20712 round_trippers.go:383] GET https://openshift.sunhocapital.com:8443/apis/servicecatalog.k8s.io/v1beta1/clusterserviceclasses?limit=500
I0105 00:39:26.714277 20712 round_trippers.go:390] Request Headers:
I0105 00:39:26.714498 20712 round_trippers.go:393] User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0105 00:39:26.714684 20712 round_trippers.go:393] Accept: application/json;as=Table;v=v1beta1;g=meta.k8s.io, application/json
I0105 00:39:26.742742 20712 round_trippers.go:408] Response Status: 503 Service Unavailable in 27 milliseconds
I0105 00:39:26.743036 20712 round_trippers.go:411] Response Headers:
I0105 00:39:26.743222 20712 round_trippers.go:414] Cache-Control: no-store
I0105 00:39:26.743457 20712 round_trippers.go:414] Content-Type: text/plain; charset=utf-8
I0105 00:39:26.743489 20712 round_trippers.go:414] X-Content-Type-Options: nosniff
I0105 00:39:26.743530 20712 round_trippers.go:414] Content-Length: 20
I0105 00:39:26.743544 20712 round_trippers.go:414] Date: Fri, 04 Jan 2019 16:39:26 GMT
I0105 00:39:26.743643 20712 request.go:897] Response Body: service unavailable
No resources found.
I0105 00:39:26.743843 20712 helpers.go:201] server response object: [{
"metadata": {},
"status": "Failure",
"message": "the server is currently unable to handle the request (get clusterserviceclasses.servicecatalog.k8s.io)",
"reason": "ServiceUnavailable",
"details": {
"group": "servicecatalog.k8s.io",
"kind": "clusterserviceclasses",
"causes": [
{
"reason": "UnexpectedServerResponse",
"message": "service unavailable"
}
]
},
"code": 503
}]
F0105 00:39:26.744007 20712 helpers.go:119] Error from server (ServiceUnavailable): the server is currently unable to handle the request (get clusterserviceclasses.servicecatalog.k8s.io)
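The 503 above comes from the aggregation layer: clusterserviceclasses is served by the aggregated servicecatalog.k8s.io API, so the next thing worth checking (a sketch, assuming the default kube-service-catalog namespace) is whether its APIService reports Available and whether the backing pods are healthy:
oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml   # status.conditions should show Available=True; if not, the message explains why
oc get pods -n kube-service-catalog                       # the apiserver pods backing the aggregated API
oc get endpoints -n kube-service-catalog                  # the service needs ready endpoints for the aggregator to proxy to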
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issue still present with v3.11 on openSUSE Tumbleweed.