Describe the bug:
cert-manager installs its validation server into the Kubernetes API aggregation layer, but the server is broken (it is probably missing some endpoints):
$ kubectl api-resources
error: unable to retrieve the complete list of server APIs: admission.certmanager.k8s.io/v1beta1: the server is currently unable to handle the request
This breaks namespace deletion: when you delete a namespace, kube-controller-manager tries to list every item in that namespace. admission.certmanager.k8s.io does not implement list, so the controller-manager refuses to delete the namespace.
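To see which aggregated API is failing discovery, one option is to filter the APIService objects by their `Available` condition. This is a minimal sketch, assuming `kubectl` is pointed at the affected cluster; the function name `list_unavailable_apiservices` is made up for this example.

```shell
#!/usr/bin/env bash
# Print the names of APIServices whose "Available" condition is not "True".
# Hypothetical helper name; it only wraps standard kubectl and awk calls.
list_unavailable_apiservices() {
  kubectl get apiservices \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Available")].status}{"\n"}{end}' \
    | awk -F'\t' '$2 != "True" { print $1 }'
}
```

Calling `list_unavailable_apiservices` on a cluster showing the error above should list the APIService behind admission.certmanager.k8s.io/v1beta1, if that is indeed the one failing.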
Expected behaviour:
cert-manager should either implement a proper aggregated API server or move validation out of the API aggregation layer.
As a personal opinion, keeping it in the aggregation layer sounds cool :)
Environment details:
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-master+$Format:%h$", GitCommit:"10ecc6db83fd47a93eb0940e2e4434f2b0a5c3ec", GitTreeState:"clean", BuildDate:"2018-08-15T21:23:47Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.5-gke.5", GitCommit:"9aba9c1237d9d2347bef28652b93b1cba3aca6d8", GitTreeState:"clean", BuildDate:"2018-12-11T02:36:50Z", GoVersion:"go1.10.3b4", Compiler:"gc", Platform:"linux/amd64"}
GKE
0.6.0
helm
/kind bug
Thank you for discovering this 🙏
I had noticed occasionally my namespace required some extra work to get deleted, but I wasn't sure if this was down to some other transient failure.
I'll have to get this all set up, but your reasoning of not implementing list makes perfect sense 😄
We currently use the openshift/generic-admission-server package to create the webhook apiserver. Perhaps we should open an issue upstream there, and in the meantime potentially fork the repo/add it to third_party.
/priority important-soon
/area webhook
/kind bug
/milestone v0.7
/help
I'm taking a look into this now - how exactly did you reproduce this? Do you mean it breaks deleting the 'cert-manager' namespace, or it prevents deleting all namespaces?
Could you possibly share a brief step-by-step reproduction, so I can put a fix together? :smile:
I seem to be able to replicate this by:
1) Deploying cert-manager
2) Deleting the cert-manager namespace (but NOT deleting the APIService/ValidatingWebhook resource)
3) Attempting to delete any other namespace
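The steps above can be sketched as a script. This assumes step 1 (deploying cert-manager) is already done and that the cluster is disposable; `SCRATCH_NS` and the function name are hypothetical, chosen only for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical scratch namespace used only for this reproduction.
SCRATCH_NS="${SCRATCH_NS:-gc-test}"

reproduce_stuck_namespace() {
  # Step 2: delete only the namespace. The APIService and
  # ValidatingWebhookConfiguration are cluster-scoped and are left behind.
  kubectl delete namespace cert-manager
  # Step 3: create an unrelated namespace and delete it without waiting.
  kubectl create namespace "$SCRATCH_NS"
  kubectl delete namespace "$SCRATCH_NS" --wait=false
  # Print the namespace phase so a stuck deletion is visible.
  kubectl get namespace "$SCRATCH_NS" -o jsonpath='{.status.phase}'
}
```

If the problem reproduces, the last command would be expected to keep reporting the scratch namespace as terminating rather than the namespace disappearing.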
IMO, this is working-as-intended. You deleted the namespace that runs the webhook/apiservice resource, but you've not deleted the actual webhook and apiservice configuration resources.
You can 'fix' your cluster if you run into this issue, by properly running kubectl delete -f with the installation manifest you first used (as this will clean up the ValidatingWebhookConfiguration and APIService resources).
Installing these types of resources is seen as a 'privileged' operation in k8s land, and only administrators should be permitted to install them, as they can cause cluster instability if misconfigured.
I don't think there's really anything we need/can do to prevent this, aside from advising users of the proper way to uninstall cert-manager (i.e. not just deleting the namespace).
You can also replicate it by making the webhook pod unschedulable for example.
I think the best way forward is to expose the validation webhook through a regular HTTP server rather than exposing the endpoints through the API aggregation layer. That way, if the webhook is unavailable, it only takes down cert-manager-related actions (e.g. creating issuers or certificates) rather than affecting normal Kubernetes behavior.
> You can also replicate it by making the webhook pod unschedulable for example.
👍 yep, that's correct.
> I think the best way forward is to expose the validation webhook through a regular HTTP server rather than exposing the endpoints through the API aggregation layer.
We've avoided this, because it means we need to implement our own mTLS layer (as we need to perform mutual auth on the webhooks, because eventually we'll be using this pattern for more 'secure' operations).
I'm going to dig into the GC controller to see if there's any way we can have it not block if our APIService is down... it's not an insignificant amount of work to switch us away from utilising the aggregation layer, really 😬
So Kubernetes allows you to disable garbage collection for particular resource types by passing a flag to the Kubernetes apiserver.
This would help prevent this issue occurring, but otherwise this is sort of a can't-fix issue.
In order to avoid or fix the issue, you can remove the APIService resource and/or uninstall cert-manager properly (either with helm delete or kubectl delete).
In the meantime, I don't think there's anything else we can really do here 😬
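For a cluster that is already stuck, removing the leftover APIService could look like the sketch below. The default name is an assumption: APIService objects are named `<version>.<group>`, so it is inferred from the error message in the report rather than taken from the cert-manager manifests. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Assumed APIService name, derived from admission.certmanager.k8s.io/v1beta1.
APISERVICE="${APISERVICE:-v1beta1.admission.certmanager.k8s.io}"

remove_stale_apiservice() {
  # Confirm the object exists before deleting it; stop if it does not.
  kubectl get apiservice "$APISERVICE" || return 1
  kubectl delete apiservice "$APISERVICE"
}
```

This only removes the aggregation entry. The ValidatingWebhookConfiguration is a separate resource, so a full uninstall with the original manifest or with helm remains the cleaner route.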
/close
@munnerz: Closing this issue.