Describe the bug:
When installing cert-manager with --wait, the webhook is not yet ready to serve requests when the installation finishes.
Expected behaviour:
Creating issuers and certificates should work immediately after helm install has completed when using the --wait flag.
Steps to reproduce the bug:
Install cert-manager:
helm install cert-manager \
  jetstack/cert-manager \
  --namespace cert-manager \
  --version v0.15.0 \
  --set installCRDs=true \
  --wait
Then immediately try to create an issuer (e.g. a ClusterIssuer), for instance:
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: ca-issuer
spec:
  ca:
    secretName: ca-key-pair
(assuming the ca-key-pair secret has already been provisioned).
Optionally, you can also issue a test certificate.
Often (5 out of 8 tries), creating the cluster issuer fails with an error:
Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: x509: certificate signed by unknown authority
When the issuer is created successfully, issuing a certificate directly afterwards often takes up to a minute (kubectl get certificates -A shows no ready state for some time). After the first certificate has been issued successfully, everything works as expected.
However, if I follow the same steps but wait some time (e.g. 60s) between finishing the cert-manager deployment and creating the cluster issuer, everything is fine and certificates are issued quickly.
Anything else we need to know?:
Environment details:
Probably irrelevant, but I'm testing an upgrade from version 0.13, so some time before deploying I delete the old cert-manager deployment, including the CRDs.
/kind bug
@Martin-Idel-SI thanks for the bug report.
This is a tricky one.
helm install --wait checks the Ready condition of all the Deployments / ReplicaSets / Pods in the chart.
The cert-manager Deployment already includes a ReadinessProbe, which not only checks that the webhook service is listening, but also checks the serving certificates of the webhook server.
However, the Kubernetes API server won't be able to connect to the webhook server until cainjector has added the CA bundle to the webhook configuration object.
And judging by the error in your description, this is why the API call fails soon after helm install --wait returns.
Unfortunately, the webhookconfiguration resource doesn't expose a status field indicating readiness.
See kubectl explain validatingwebhookconfiguration
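Since there is no status field to wait on, one thing you can inspect is whether cainjector has populated the caBundle yet. The following is only a sketch: ca_bundle_injected is a hypothetical helper, and the object name cert-manager-webhook is an assumption based on the release name used above. A non-empty caBundle also does not guarantee that the API server can already reach the webhook, so treat this as a hint rather than a readiness check.

```shell
#!/bin/sh
# ca_bundle_injected KIND NAME
# Hypothetical helper: succeeds if the named webhook configuration object
# has a non-empty clientConfig.caBundle on at least one of its webhooks.
# KIND is validatingwebhookconfiguration or mutatingwebhookconfiguration.
ca_bundle_injected() {
  bundle=$(kubectl get "$1" "$2" \
    -o jsonpath='{.webhooks[*].clientConfig.caBundle}' 2>/dev/null) || return 1
  [ -n "$bundle" ]
}

# Example (object name is an assumption, check with
# `kubectl get mutatingwebhookconfiguration`):
#   ca_bundle_injected mutatingwebhookconfiguration cert-manager-webhook \
#     && echo "CA bundle present"
```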
I suggest that you retry the kubectl apply step until it succeeds (with a timeout).
Alternatively, retry applying a test resource, such as the one described in https://cert-manager.io/docs/installation/kubernetes/#verifying-the-installation, and apply your actual resources only once you know the webhook is responding.
This is also what the Kubernetes project does in its own e2e tests:
https://github.com/kubernetes/kubernetes/blob/1700acb035ce24b760ff6e2c3fb6ee6246528ab1/test/e2e/apimachinery/webhook.go#L2405-L2429
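As a rough illustration of the retry-with-timeout idea, here is a minimal POSIX shell sketch. retry_until is a hypothetical helper name, and cluster-issuer.yaml is an assumed file containing the ClusterIssuer from the report; adjust the timeout and interval to taste.

```shell
#!/bin/sh
# retry_until LIMIT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0, or gives up with
# exit status 1 once LIMIT_SECONDS have elapsed since the first attempt.
retry_until() {
  limit="$1"
  interval="$2"
  shift 2
  deadline=$(( $(date +%s) + limit ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${limit}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example (file name is an assumption): keep applying for up to two minutes,
# pausing 5 seconds between attempts.
#   retry_until 120 5 kubectl apply -f cluster-issuer.yaml
```

The command is always attempted at least once, so a limit of 0 means "try once, do not retry".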
/area webhook
/area deployment
@wallrj: The label(s) area/deployment cannot be applied, because the repository doesn't have them
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@Martin-Idel-SI I hope that the polling suggestion above helps. I'll close this issue for now, but please re-open if you think there is a better solution to this problem.
/close
@wallrj: Closing this issue.
@wallrj Thank you for the explanation! I wasn't aware of the limitations of validatingwebhookconfiguration. That's unfortunate, of course, but I can see why you can't easily work around the issue.
Yes, deploying a test resource is what we currently do. I'm fine with this.