Cert-manager: Installing webhook as aggregation breaks namespace deletion

Created on 11 Feb 2019 · 8 Comments · Source: jetstack/cert-manager

Describe the bug:
cert-manager installs its validation server into the Kubernetes API aggregation layer, but it's a broken one (probably missing some endpoints):

$ kubectl api-resources
error: unable to retrieve the complete list of server APIs: admission.certmanager.k8s.io/v1beta1: the server is currently unable to handle the request

This breaks namespace deletion: when you delete a namespace, kube-controller-manager tries to list all items in the namespace. admission.certmanager.k8s.io does not implement list, so the controller-manager refuses to delete the namespace.
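To check which aggregated APIs are behind an error like the one above, one option is to list every APIService with its Available condition. The following is a minimal sketch, assuming kubectl access with permission to read APIService objects; the helper names list_apiservice_status and filter_unavailable are hypothetical and not part of cert-manager or kubectl.

```shell
#!/bin/sh
# Sketch: find aggregated APIs that the apiserver reports as not Available.
# The function names below are illustrative helpers, not existing tools.

# Emit "<name><TAB><Available status>" for every APIService in the cluster.
list_apiservice_status() {
  kubectl get apiservices -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Available")].status}{"\n"}{end}'
}

# Read those lines on stdin and print the names whose status is not "True".
# An APIService with no Available condition at all is also printed.
filter_unavailable() {
  awk -F '\t' '$2 != "True" { print $1 }'
}

# Usage (needs cluster access):
#   list_apiservice_status | filter_unavailable
```

If the pipeline prints an APIService that belongs to the webhook, that registration is a likely cause of the discovery failure.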

Expected behaviour:
cert-manager should either properly implement an aggregated API server or move validation out of the API aggregation layer.

As a personal opinion, keeping it in aggregation sounds cool :)

Environment details:

Kubernetes version (e.g. v1.10.2):

Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-master+$Format:%h$", GitCommit:"10ecc6db83fd47a93eb0940e2e4434f2b0a5c3ec", GitTreeState:"clean", BuildDate:"2018-08-15T21:23:47Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.5-gke.5", GitCommit:"9aba9c1237d9d2347bef28652b93b1cba3aca6d8", GitTreeState:"clean", BuildDate:"2018-12-11T02:36:50Z", GoVersion:"go1.10.3b4", Compiler:"gc", Platform:"linux/amd64"}

Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): GKE
cert-manager version (e.g. v0.4.0): 0.6.0
Install method (e.g. helm or static manifests): helm

/kind bug

area/webhook kind/bug lifecycle/active priority/awaiting-more-evidence

calind

👍11

All 8 comments

Thank you for discovering this 🙏

I had noticed occasionally my namespace required some extra work to get deleted, but I wasn't sure if this was down to some other transient failure.

I'll have to get this all set up, but your reasoning of not implementing list makes perfect sense 😄

We currently use the openshift/generic-admission-server package to create the webhook apiserver. Perhaps we should open an issue upstream there, and in the meantime potentially fork the repo/add it to third_party.

/priority important-soon
/area webhook
/kind bug
/milestone v0.7
/help

munnerz on 12 Feb 2019

@munnerz:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jetstack-bot on 12 Feb 2019

I'm taking a look into this now - how exactly did you reproduce this? Do you mean it breaks deleting the 'cert-manager' namespace, or it prevents deleting all namespaces?

Could you possibly share a brief step-by-step reproduction, so I can put a fix together? 😄

munnerz on 20 Feb 2019

I seem to be able to replicate this by:

1) Deploying cert-manager
2) Deleting the cert-manager namespace (but NOT deleting the APIService/ValidatingWebhook resource)
3) Attempting to delete any other namespace

IMO, this is working-as-intended. You deleted the namespace that runs the webhook/apiservice resource, but you've not deleted the actual webhook and apiservice configuration resources.

You can 'fix' your cluster if you run into this issue, by properly running kubectl delete -f with the installation manifest you first used (as this will clean up the ValidatingWebhookConfiguration and APIService resources).

Installing these types of resources is seen as a 'privileged' operation in k8s land, and only administrators should be permitted to install them, as they can cause cluster instability if misconfigured.

I don't think there's really anything we need/can do to prevent this, aside from advising users of the proper way to uninstall cert-manager (i.e. not just deleting the namespace).
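For a cluster that is already in this state, the leftover APIService can also be removed directly. The sketch below relies on the Kubernetes convention that an APIService object is named `<version>.<group>`, so the group/version printed in the discovery error maps to a name. The helper names apiservice_name and delete_stale_apiservice are hypothetical, and the delete needs cluster-admin rights.

```shell
#!/bin/sh
# Sketch: remove the APIService left behind after its namespace was deleted.
# The function names below are illustrative helpers, not existing tools.

# Turn "group/version" (as printed in the discovery error) into the
# APIService object name, which has the form "<version>.<group>".
apiservice_name() {
  group=${1%/*}     # text before the last "/"
  version=${1##*/}  # text after the last "/"
  printf '%s.%s\n' "$version" "$group"
}

# Delete the APIService registration for one group/version.
# This only removes the registration; it does not touch other resources.
delete_stale_apiservice() {
  kubectl delete apiservice "$(apiservice_name "$1")"
}

# Usage (needs cluster access):
#   delete_stale_apiservice admission.certmanager.k8s.io/v1beta1
```

The ValidatingWebhookConfiguration is a separate cluster-scoped object; `kubectl get validatingwebhookconfigurations` lists any that remain, and deleting with the original installation manifest is still the more complete cleanup.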

munnerz on 20 Feb 2019

👍1

You can also replicate it by making the webhook pod unschedulable for example.

I think the best way to move forward is to expose the validation webhook through a regular HTTP server rather than exposing the endpoints through the API aggregation layer. This way, if the webhook is unavailable, it only takes down cert-manager-related actions (e.g. creating issuers or certificates) rather than affecting normal Kubernetes behavior.

calind on 22 Feb 2019

👍1

You can also replicate it by making the webhook pod unschedulable for example.

👍 yep, that's correct.

I think the best way to move forward is to expose the validation webhook through a regular HTTP server rather than exposing the endpoints through the API aggregation layer.

We've avoided this, because it means we need to implement our own mTLS layer (as we need to perform mutual auth on the webhooks, because eventually we'll be using this pattern for more 'secure' operations).

I'm going to dig into the GC controller to see if there's any way we can have it not block if our APIService is down... it's not an insignificant amount of work to switch us away from utilising the aggregation layer, really 😬

munnerz on 22 Feb 2019

So Kubernetes allows you to disable garbage collection for particular resource types by passing a flag to the Kubernetes apiserver.

This would help prevent this issue occurring, but otherwise this is sort of a can't-fix issue.

In order to avoid or fix the issue, you can remove the APIService resource and/or uninstall cert-manager properly (either with helm delete or kubectl delete).

In the meantime, I don't think there's anything else we can really do here 😬

/close

munnerz on 6 Mar 2019

@munnerz: Closing this issue.
