I noticed that when I annotate a few hundred Ingresses at once, cert-manager gets into a bad state and all orders stall. Orders in the "ready" state are never finalized, orders in the "pending" state are never authorized, and some newly created orders are never submitted to Let's Encrypt. It looks as if some loop is stuck.
I tried restarting cert-manager, but that doesn't seem to help. On startup, I did notice a bunch of errors that look like this:
E0223 17:53:48.315954 1 util.go:71] cert-manager/controller/orders/handleOwnedResource "msg"="error getting order referenced by resource" "error"="order.acme.cert-manager.io \"REDACTED\" not found" "related_resource_kind"="Order" "related_resource_name"="REDACTED" "related_resource_namespace"="REDACTED" "resource_kind"="Challenge" "resource_name"="REDACTED" "resource_namespace"="REDACTED"
It seems that while processing challenges, it can't find the order, but it looks to me like the order is there. To get things working again, I just deleted all the challenges and restarted cert-manager. The error above went away and everything started working again.
My guess is that because I issued so many orders at once, Let's Encrypt rate limited my account, which somehow puts cert-manager into a bad state that it can't recover from unless the bad challenges are deleted.
Expected behaviour:
Ideally, cert-manager would never get into a bad state, but if it does it would be great if it knew how to recover. Also, even if a few orders are in a bad state, it shouldn't stall ALL orders.
Steps to reproduce the bug:
This is difficult to reproduce and I haven't been able to simulate it, but my suspicion is that it's hitting this line:
https://github.com/jetstack/cert-manager/blob/e66f362e4da273a5fc9eff2c599bbe0c0676da5b/pkg/controller/util.go#L67
and that something is off with the owner reference, or with the order's Group or Kind.
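To make the suspicion above concrete, here is a minimal sketch of how a controller might resolve a Challenge's owner reference back to its Order. This is not the actual cert-manager code: the `OwnerRef` type, the `groupOf`, `findOwner`, and `resolveOrder` helpers, the `v1alpha2` version string, and the map standing in for the informer cache are all assumptions made for illustration. It only shows the two ways the lookup could fail: the reference's Group or Kind does not match, or it matches but the named order is missing from the cache (which would produce a "not found" error like the one in the log).

```go
package main

import (
	"fmt"
	"strings"
)

// OwnerRef is a simplified, hypothetical stand-in for a Kubernetes owner reference.
type OwnerRef struct {
	APIVersion string // "group/version"; the version used below is an assumption
	Kind       string
	Name       string
}

// groupOf extracts the API group from an apiVersion string.
// A bare version such as "v1" has no group, so "" is returned.
func groupOf(apiVersion string) string {
	if i := strings.LastIndex(apiVersion, "/"); i >= 0 {
		return apiVersion[:i]
	}
	return ""
}

// findOwner returns the name of the first owner reference whose group and
// kind both match. A mismatch in either field means no owner is found.
func findOwner(refs []OwnerRef, group, kind string) (string, bool) {
	for _, r := range refs {
		if groupOf(r.APIVersion) == group && r.Kind == kind {
			return r.Name, true
		}
	}
	return "", false
}

// resolveOrder looks up the owning order in a map that stands in for the
// controller's cache. It fails if no reference matches or the order is absent.
func resolveOrder(refs []OwnerRef, cache map[string]bool) (string, error) {
	name, ok := findOwner(refs, "acme.cert-manager.io", "Order")
	if !ok {
		return "", fmt.Errorf("no owner reference with a matching group and kind")
	}
	if !cache[name] {
		return "", fmt.Errorf("order.acme.cert-manager.io %q not found", name)
	}
	return name, nil
}

func main() {
	refs := []OwnerRef{{APIVersion: "acme.cert-manager.io/v1alpha2", Kind: "Order", Name: "example-order"}}

	// The order is present in the cache: the lookup succeeds.
	fmt.Println(resolveOrder(refs, map[string]bool{"example-order": true}))

	// The reference matches, but the order is missing from the cache.
	fmt.Println(resolveOrder(refs, map[string]bool{}))
}
```

If the real failure is of the second kind, the order could exist in the API server while being absent from whatever the controller consults at that moment, which would be consistent with the order appearing to be there when inspected by hand.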
Anything else we need to know?:
Environment details:
/kind bug
I have this issue as well - currently investigating
Kubernetes version (e.g. v1.10.2): v1.16.7
Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): kubespray
cert-manager version (e.g. v0.4.0): 0.14.0
Install method (e.g. helm or static manifests): static manifests
In my case, the majority of my certs are issued for a single domain. That domain hit a rate limit (visible in the events on some of the "pending" orders), and cert-manager seems to just sit and wait, never attempting to move on to other orders!
I did see plenty of the "error getting order referenced by resource" messages, and some of my developers were deleting and re-creating their orders, certs, challenges, etc.
The temporary solution was to remove the new ingresses/certificates that had an outstanding order object for the rate-limited domain. Removing them and restarting cert-manager quickly fixed the issue, as cert-manager moved on to other orders. I fear that if I re-enable those ingresses, certificate generation in my cluster will stop again.
This is an issue with cert-manager 0.13.1 and 0.14.0.
@jesseshieh and @erulabs I notice that the error message "error getting order referenced by resource" is also mentioned in an older issue: https://github.com/jetstack/cert-manager/issues/2667
I think this is a duplicate of that issue.
Let me know if you agree.
/area acme
/close
@wallrj: Closing this issue.