Cert-manager: v0.7 on GKE stops after temporary cert (no order events or challenges)

Created on 15 Mar 2019 · 34 comments · Source: jetstack/cert-manager

Describe the bug:
Certificate issuance appears stuck on the temporary certificate; no challenge is created and no certificate is issued by Let's Encrypt.

  Issuer Ref:
    Kind:       ClusterIssuer
    Name:       letsencrypt-production
  Secret Name:  tls-legacy-production
Status:
  Conditions:
    Last Transition Time:  2019-03-15T04:38:23Z
    Message:               Certificate issuance in progress. Temporary certificate issued.
    Reason:                TemporaryCertificate
    Status:                False
    Type:                  Ready
Events:
  Type    Reason              Age   From          Message
  ----    ------              ----  ----          -------
  Normal  Generated           13m   cert-manager  Generated new private key
  Normal  GenerateSelfSigned  13m   cert-manager  Generated temporary self signed certificate
  Normal  OrderCreated        13m   cert-manager  Created Order resource "tls-legacy-production-2711062190"

Expected behaviour:
Expect a challenge to be created and the temporary certificate to be replaced with the one issued by Let's Encrypt.

Steps to reproduce the bug:
https://hub.helm.sh/charts/jetstack/cert-manager

I successfully installed v0.6.0 on 3 other clusters, followed the exact same steps on a 4th, and it failed with missing certs. An existing issue described the temporary self-signed certificate behaviour and the plan to address it in v0.7.0, so I upgraded using the helm chart jetstack/cert-manager. I deleted everything, created the CRDs for the 0.7 release, then created the namespace, added the label, and installed the chart (roughly the sequence sketched below).
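For reference, the v0.7 install sequence looks roughly like this; the URLs and version tag follow the docs from that time and should be treated as illustrative rather than verified:

```
# Install the CRDs for the 0.7 release
kubectl apply -f https://raw.githubusercontent.com/jetstack/cert-manager/release-0.7/deploy/manifests/00-crds.yaml

# Create and label the namespace so the webhook skips validating its own resources
kubectl create namespace cert-manager
kubectl label namespace cert-manager certmanager.k8s.io/disable-validation=true

# Install the chart from the jetstack repo
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install --name cert-manager --namespace cert-manager --version v0.7.0 jetstack/cert-manager
```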

After that I created a ClusterIssuer with an http01 challenge like the others (using the same file); it looks roughly like the sketch below.
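A minimal sketch of such a ClusterIssuer for the v0.7 API; the original file wasn't shared, so apart from the issuer name (taken from the Certificate status above) the values are placeholders:

```
kubectl apply -f - <<EOF
apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com                     # placeholder
    privateKeySecretRef:
      name: letsencrypt-production-account-key   # placeholder
    http01: {}
EOF
```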

Finally, I created the Ingress, and after a few minutes it reported success. The only issue is that it's still serving the temporary certificate that was introduced to work around the GKE issue reported in v0.6.0.
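For context, the Ingress that triggers ingress-shim looks roughly like this; the host, service, and names are placeholders, and only the secret name comes from the Certificate status above:

```
kubectl apply -f - <<EOF
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: legacy                          # placeholder
  annotations:
    kubernetes.io/tls-acme: "true"
spec:
  tls:
  - hosts:
    - legacy.example.com                # placeholder
    secretName: tls-legacy-production
  rules:
  - host: legacy.example.com            # placeholder
    http:
      paths:
      - backend:
          serviceName: legacy           # placeholder
          servicePort: 80
EOF
```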

Extra info
The failing ingress / cluster has multiple distinct domains in its hosts, where the others usually have a single domain (with multiple hostnames). It shouldn't matter, but I'm noting it in case it's helpful.

Environment details:

  • Kubernetes version (e.g. v1.10.2): v1.12.5-gke.5
  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): GKE
  • cert-manager version (e.g. v0.4.0): v0.7.0
  • Install method (e.g. helm or static manifests): helm

/kind bug


Most helpful comment

I'm also experiencing this issue on 0.7. Running on AKS

All 34 comments

After a couple of hours the Events log disappears:

Status:
  Conditions:
    Last Transition Time:  2019-03-15T04:38:23Z
    Message:               Certificate issuance in progress. Temporary certificate issued.
    Reason:                TemporaryCertificate
    Status:                False
    Type:                  Ready
Events:                    <none>

but it still never triggered the Let's Encrypt challenge or replaced the temporary cert.
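When a Certificate sits in this state, the Order, Challenge, and controller logs usually say why. A sketch of the usual checks; the Order name is the one from the events above, the namespace placeholder and deployment name (default helm release name) are assumptions:

```
# List the ACME order/challenge resources cert-manager has created
kubectl get orders.certmanager.k8s.io,challenges.certmanager.k8s.io --all-namespaces

# Inspect the specific order created for this certificate
kubectl describe order tls-legacy-production-2711062190 -n <namespace>

# Follow the controller logs for errors mentioning this certificate
kubectl logs -n cert-manager deploy/cert-manager | grep -i tls-legacy-production
```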

I'm also experiencing this issue on 0.7. Running on AKS

Thanks for confirming @agolomoodysaada ... I also created related issue #1476, because when adding the CRDs Kubernetes errors out and you have to add a flag to make it work.

I would like to add the cert-manager log related to this.

E0315 14:44:43.337490       1 controller.go:208] challenges controller: Re-queuing item "<redacted-resource-name>" due to error processing: Could not validate CAA: Unexpected response code 'SERVFAIL' for <redacted-url>
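The SERVFAIL suggests the CAA lookup itself is failing at the DNS level. An easy check from outside the cluster (example.com standing in for the redacted domain):

```
# Ask for the CAA record directly and via a public resolver; a clean answer
# (possibly empty) is fine, SERVFAIL is not
dig example.com CAA +short
dig example.com CAA @8.8.8.8
```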

After downgrading to 0.6, resetting CRDs, deleting faulty secret, and restarting cert-manager, I got the issue resolved. This indicates to me that this is a regression in 0.7.

Hey @munnerz ... we'd love some insight into this if you can guide us through fixing what appears to be a regression in the latest release.

I also downgraded to stable/cert-manager and at first it failed to run the challenge. On a whim, I removed all but one domain from my ingress hosts and it worked with 0.6.2. I re-added the other domains to the ingress and it failed again. I'll create a new ticket for that issue; ultimately I had to provision multiple static IPs and implement a separate ingress per domain (multiple hostnames / subdomains for a single domain work, but not multiple unique domains).

I have the same problem on Azure with v0.7. It never gets beyond issuing the temporary cert.

I am experiencing this on Azure AKS 1.11.3.
When I change to the staging ACME URL, it does work.

I have the same problem on Oracle Cloud with v0.7.0 using dns01 validation.

I am experiencing this issue on a cluster built on Openstack by Kops using http01 validation.

Reverting to 0.6.2 resolved the issue.

There may be a regression here, but I had the same experience as folks in this thread. I'm on a new DigitalOcean K8s cluster now ... are any of you by chance using a default ClusterIssuer setting? That could be the regression.

I started out stuck at:

Certificate issuance in progress. Temporary certificate issued.

I got this after having successfully upgraded v0.6.0 to v0.7.0 on a different cluster without issue, but that cluster already had all of the certs I needed provisioned; here's what I think differed. I followed all of the instructions, and I set:

--set ingressShim.defaultIssuerName=letsencrypt-prod \
--set ingressShim.defaultIssuerKind=ClusterIssuer \
--set global.rbac.create=true

The letsencrypt-prod is a ClusterIssuer that I created with default acme challenge for HTTP01.

I found two things that needed to be fixed to make it work; it's not totally clear that either of them is a regression. First, it did not seem to pay attention to my ClusterIssuer defaults: I had placed the kubernetes.io/tls-acme: "true" annotation, but I also had to add certmanager.k8s.io/cluster-issuer: letsencrypt-prod before it attempted to complete the order (something like the command below). That might be a regression?
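For anyone wanting to try the same workaround, adding the annotation to an existing Ingress is a one-liner (ingress name and namespace are placeholders):

```
kubectl annotate ingress my-ingress -n my-namespace \
  certmanager.k8s.io/cluster-issuer=letsencrypt-prod
```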

But then my order failed, which I judge to be due to a default firewall that prevented the check. That is likely not everyone else's issue.

Come to think of it, I never checked whether the other cluster I upgraded was able to renew a cert or not. I looked, and I am seeing some:

Order docs/asdf.zxcv-123456789 is not in 'valid' state. Waiting for Order to transition before attempting to issue Certificate.

I tried adding the annotation to those, and it's not perfectly clear whether anything is wrong. It seems like the order had completed anyway, even without the annotation. But in 15 minutes, with my firewall changes, today's new DigitalOcean cluster should retry a challenge, and I think it should pass and issue a cert. So maybe it's just this annotation.

So are any of the other people who reported this issue trying to use a cert-manager configuration with a DefaultClusterIssuer?

  Normal   CertIssued     24s                cert-manager  Certificate issued successfully

I repeated the same experiment with a second host, and this time it worked fine with only kubernetes.io/tls-acme: "true" and no certmanager.k8s.io/cluster-issuer: letsencrypt-prod.

Don't know what might be wrong. Maybe my only issue from the beginning was actually the firewall.

Same issue here, but I'm using managed DigitalOcean Kubernetes.
Yesterday I used cert-manager in another DO cluster and it worked flawlessly.
Today I deployed another cluster with another domain, and now the challenge has been stuck in the described state for hours.

status:                                                                       
  conditions:                                                                 
  - lastTransitionTime: "2019-04-11T13:54:28Z"                                
    message: Certificate issuance in progress. Temporary certificate issued.  
    reason: TemporaryCertificate                                              
    status: "False"                                                           
    type: Ready                                                               

Is there any possibility to re-challenge the certificates?

@ChSch3000 My issue report was from a DOK8S managed cluster as well. It turned out that my problem was fundamentally due to not using a Load Balancer for ingress. My cluster has two nodes and by default 80 and 443 are both firewalled. I had configured Ingress using HostPort mode, with DaemonSet configuration, so the two nodes would each act as a load balancer... and cert-manager failed to issue certs because no external traffic was actually reaching the ingress.

Does that help? The challenge will be retried on its own after some number of hours; you can tell how long it intends to wait by reading the cert-manager pod logs. I believe cert-manager waits a while after a failed challenge so that the Let's Encrypt servers don't see it as hammering.
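Whatever the ingress setup, it's worth confirming from outside the cluster that port 80 on the domain actually answers, since that's what the HTTP01 check needs (domain and token here are placeholders):

```
# Should reach the ingress/solver, not time out at a firewall
curl -v http://example.com/.well-known/acme-challenge/test-token
```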

I was using the DO LoadBalancer for the Ingress, but it didn't succeed.
Anyway, after a few hours I decided to delete the cert-manager helm chart and all CRD objects (challenges/certificates and the generated secrets). Then I redeployed cert-manager, and after that the certificates were issued as expected.
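For what it's worth, deleting the whole chart usually isn't necessary; removing the stuck Order and the generated secret is generally enough to make cert-manager start over for that Certificate. A sketch using the names from the original report (namespace is a placeholder):

```
# Delete the failed order; the Certificate controller will create a fresh one
kubectl delete order tls-legacy-production-2711062190 -n <namespace>

# If the temporary certificate is wedged in the secret, delete that too
kubectl delete secret tls-legacy-production -n <namespace>
```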

Any progress on this one?

I am running cert-manager v0.7.2 and facing the same issue... after 30 minutes, still "Certificate issuance in progress. Temporary certificate issued."
Is downgrading the only solution now?

Since I'm on GKE and the latest stable k8s there is 1.12.x, my only solution was to revert. The stable chart was deprecated, so I added the jetstack repo with the --version 0.6.0 flag and it worked (helm chart).

I guess you can add the flag if using kubectl with manifests, but I never tried.

Good luck!


Tried downgrading, but no difference for me... reverted everything and started over again.

[edit]
I figured it out... I use Traefik as the Ingress controller and I had configured it to always redirect to HTTPS, so the ACME challenge wasn't able to reach its destination :-) With that disabled, it suddenly started to work.
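An easy way to spot this particular failure mode from outside: if the HTTP challenge path answers with a redirect instead of reaching the solver, the token may never be served, which is what was happening here (domain and token are placeholders):

```
# A 301/302/308 pointing at https:// here suggests an always-redirect rule is in the way
curl -sI http://example.com/.well-known/acme-challenge/test-token | head -n 3
```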

Experiencing the same on GCP with a GCE ingress.

Also experiencing the same on GCP with a nginx ingress.

I was also experiencing this issue on Azure AKS with any ingress with version 0.7.0, installed by static manifests. Downgrading to 0.6.2 worked to fix it, though.

Out of curiosity I tried the 0.8.0 release this morning, and it fixed the issue for me--I've been able to successfully issue certs on my cluster since the upgrade.

Had this issue with cert-manager 0.8.0 while it was performing DNS validation. It turned out to be an extra newline character in my DNS provider's API key, and the cert-manager pod logs helped me find the problem.
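A stray trailing newline is easy to confirm and fix; a sketch, with the secret name, key name, and environment variable as placeholders:

```
# Decode the stored key and look for a trailing 0a (newline) byte at the end
kubectl get secret dns-api-key -n cert-manager -o jsonpath='{.data.api-key}' | base64 -d | xxd | tail -n 2

# Recreate it without the newline: printf, unlike echo, doesn't append one
printf '%s' "$API_KEY" > api-key.txt
kubectl delete secret dns-api-key -n cert-manager
kubectl create secret generic dns-api-key -n cert-manager --from-file=api-key=api-key.txt
```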

Also getting this issue on DigitalOcean with a Traefik ingress. The challenge isn't even being created (using the Cloudflare DNS challenge). Used static manifests to deploy, v0.8.

Don't know if it helps, but following the logs of the cert-manager pod reveals this debug message:
"Need to create 0 challenges." Not sure why cert-manager thinks no challenge is necessary.

This was happening to me. Checking the logs on my cert-manager pod, I noticed that Let's Encrypt is upset with me for having too many orders for the subdomain I'm trying to use.

too many certificates already issued for exact set of domains: xxx.yyy.com:

The error points to this:

https://letsencrypt.org/docs/rate-limits/
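When you hit the duplicate-certificate limit, one option while debugging is to point the issuer at the Let's Encrypt staging endpoint, which has much higher limits (but issues untrusted certs). A sketch only; in practice you'd more likely keep a separate staging ClusterIssuer rather than patching the production one, and the issuer name here is taken from the report above:

```
kubectl patch clusterissuer letsencrypt-production --type merge \
  -p '{"spec":{"acme":{"server":"https://acme-staging-v02.api.letsencrypt.org/directory"}}}'
```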

Also facing the same issue on GKE with cert-manager 0.8.1, a GCE Ingress, and an HTTP01 challenge.

I just upgraded from 0.7 to 0.8.1 myself to resolve this issue on AKS. The problem still exists for my current HTTP01 challenge as well.

Experiencing the same problem. I followed the instructions here https://github.com/stefanprodan/istio-gke/blob/master/docs/istio/05-letsencrypt-setup.md, just with cert-manager 0.8.1 instead.

Like many others in this thread, I identified the cause of the problem and managed to solve it by checking the logs of the cert-manager pod. In my case, I had to manually enable Cloud DNS by visiting a URL in the GCP console (the URL was in the logs).

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

There is no specific issue here which needs to be addressed.
If you think you have found a bug, please create a new issue or feel free to chat on the Slack channel for general troubleshooting problems.

/close

@JoshVanL: Closing this issue.

In response to this:

There is no specific issue here which needs to be addressed.
If you think you have found a bug, please create a new issue or feel free to chat on the Slack channel for general troubleshooting problems.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
