cert-manager and/or kube-lego hit Let's Encrypt too aggressively

Created on 23 Mar 2018  ·  69 comments  ·  Source: jetstack/cert-manager

/kind bug

What happened:

Hi, I'm an engineer at Let's Encrypt. I think you may also have heard from my colleague @cpu. We're finding that a lot of our top clients (21 out of 25, by log volume) have the User-Agent "Go-http-client/1.1". Unfortunately, many of those clients are configured such that they're using excessive resources on Let's Encrypt's servers. They are requesting certificates for a small number of domains, but doing so repeatedly and very rapidly - some of them are consistently sending 14 HTTP requests per second.

After talking to some of the people running these clients, it seems like one common source of the problem is kube-lego or its successor cert-manager. As far as I can tell, these packages don't have any synchronization, meaning that some use cases, like on-demand certificate issuance, can lead to the problem I described above.

I notice that both kube-lego and cert-manager use the https://godoc.org/golang.org/x/crypto/acme package directly. Per the godoc, "Most common scenarios will want to use autocert subdirectory instead, which provides automatic access to certificates from Let's Encrypt and any other ACME-based CA." In particular, the autocert package provides synchronization to avoid unnecessarily concurrent requests.
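For illustration only (this is not cert-manager or kube-lego code), here is a minimal sketch of how autocert is typically wired up; the Manager caches issued certificates and serialises issuance per hostname, so concurrent TLS handshakes for the same name do not each trigger a new ACME request. The domain and cache path are placeholders.

package main

import (
	"crypto/tls"
	"net/http"

	"golang.org/x/crypto/acme/autocert"
)

func main() {
	m := &autocert.Manager{
		Prompt:     autocert.AcceptTOS,
		HostPolicy: autocert.HostWhitelist("example.com"),    // placeholder domain
		Cache:      autocert.DirCache("/var/cache/autocert"), // persisted across restarts
	}

	srv := &http.Server{
		Addr:      ":443",
		TLSConfig: &tls.Config{GetCertificate: m.GetCertificate},
	}
	// autocert deduplicates in-flight issuance per host and reuses the
	// cached certificate on subsequent handshakes.
	srv.ListenAndServeTLS("", "")
}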

Would you consider moving to the autocert package? Alternatively, I'd like to ask you to provide some internal synchronization in both kube-lego and cert-manager to avoid the problem.

For reference, here are the issues I've filed on the Go repo about it:

https://github.com/golang/go/issues/24497
https://github.com/golang/go/issues/24496

Thanks,
Jacob

Labels: area/acme, kind/bug, lifecycle/active, priority/important-longterm


All 69 comments

Hey @jsha

Thanks for opening the issue - you are correct, right now we don't do much to limit the number of requests to ACME servers in either cert-manager or kube-lego.

In terms of autocert, taking a brief look I am unsure how well it fits our use case. Notably, cert-manager does not itself solve HTTP challenges (but instead creates 'acmesolver' pods which do it for us). This is incompatible with autocert on account of the httpTokens field not being exported (https://github.com/golang/crypto/blob/master/acme/autocert/autocert.go#L169).

We could however take some of the code in there and rework our own use of golang.org/x/crypto/acme around it.

There are a few things to bear in mind:

  • cert-manager does not solve HTTP challenges itself, instead creating pods that are configured with challenge info to do it for us

  • work on the acmev2 branch (#309) will also make the validation process asynchronous. This will mean we don't 'block' cert-manager when performing validations, thus allowing us to perform more validations at once.

cert-manager is seen as a 'multi-tenant' controller, in that a user could define multiple Issuers, each corresponding to a different ACME private key/account. This means different rate limits need to be applied at different 'levels', namely at the account level and then at the IP level (and this also needs to be applied on a per-acme-server basis).

Keeping that state out of memory allows us to tolerate failures in cert-manager (e.g. the pod exiting, or the node cert-manager is running on failing) - right now we store no information in memory, and simply store the OrderURL back on the Certificate.status.acme.order.URL field.

It'd be great to work out where best to store information about rate limiting, and how it should be applied (per cert-manager, per issuer, per certificate or per domain). As far as I'm aware, the only rate limiting info returned by ACME is the Retry-After header, and there is no way to detect/discover quotas. This makes it very difficult for us to stay within those quotas.

FWIW, 14 RPS seems extremely high, even for our current implementation. Are you able to share any info on:

1) what kind of requests these are? (e.g. create order/certificate, or accept authz etc)
2) whether all those requests are from the same source IP
3) whether all those requests are for the same acme account
4) the response (or some details on the response) being returned

That'll help us get started tracking down that issue - even without rate limiting, I'd not expect to see this unless a user has upwards of 80 certificates that are being validated simultaneously.

Regarding the HTTP header, we can set a custom user agent fairly quickly and easily to help track this down further.

Looking at one such client, over the last 3 hours, they are requesting certificates for 6 different FQDNs. Four of the FQDNs have requests coming from a single IP. Each of those FQDNs has 7,122 new-authz requests during that three-hour period. The other two FQDNs have requests coming from a single, different IP. Those FQDNs have sent 42 new-authz requests.

All of the requests are from a single ACME account. Some fraction of the requests are succeeding and generating certificates as fast as Let's Encrypt rate limits will allow. The majority are rate limited at some point during the flow, often during new-cert.

Also, many of these are overlapping for the same hostname. That is, we see a new-authz request and a POST to the challenge URL (to request validation), but while validation is still ongoing, a new-authz request for the same domain comes in.

I've reached out to the owner of one of the misconfigured clients to see if they'd be willing to talk with you directly. Would you email me (my username at eff.org) so I can put you in touch assuming they say yes?

This is incompatible with autoCert on account of the httpTokens field not being exported (https://github.com/golang/crypto/blob/master/acme/autocert/autocert.go#L169).

If this was changed, would you be able to use autocert?

I think you may also have heard from my colleague @cpu.

Apologies, I think I left that initial conversation in a state where I said I'd return with more data. I dropped that ball during the V2 launch. Sorry folks!

Also, many of these are overlapping for the same hostname. That is, we see a new-authz request and a POST to the challenge URL (to request validation), but while validation is still ongoing, a new-authz request for the same domain comes in.

This symptom in particular suggests that the problem is lack of synchronization, rather than lack of rate limits. Ideally, cert-manager should have enough state to determine (a) that there is currently a valid certificate, so renewal is not needed, and (b) that there is a currently in-flight validation attempt, so no additional validation should be attempted. I think applying a simple rate limit would not fix either of these problems.

That's why I suggested autocert as an alternative, because the maintainers have informed me that it does provide such synchronization.
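As a concrete illustration of that kind of synchronization (not cert-manager's actual code), golang.org/x/sync/singleflight can collapse concurrent issuance attempts for the same domain into a single in-flight ACME flow; requestCertificate below is a hypothetical stand-in for the real new-authz/challenge/new-cert logic.

package issuance

import "golang.org/x/sync/singleflight"

var group singleflight.Group

// requestCertificate is a hypothetical helper; the real ACME flow
// (new-authz, challenge, new-cert) would live here.
func requestCertificate(domain string) (string, error) {
	return "certificate-for-" + domain, nil
}

// issue ensures only one ACME flow runs per domain at a time; concurrent
// callers for the same domain block and share the single result instead
// of each starting their own new-authz.
func issue(domain string) (string, error) {
	v, err, _ := group.Do(domain, func() (interface{}, error) {
		return requestCertificate(domain)
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}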

👍 I'd like to isolate whether it is cert-manager or kube-lego creating these extraneous requests.

kube-lego is pretty relentless with creating authz's iirc, whereas cert-manager caches the authorization URIs in case of temporary failure.

#309 will take this further and cache the order URL, and we're a lot more careful with creating new orders too.

As mentioned in #409, I think we can build this synchronisation mechanism and more into cert-manager with our own client wrapper implementation, by building per-user, per-request-type and per-server buckets for requests. This could be extended to per-domain, although we may need to be careful about creating too many buckets 😄
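A rough sketch of that bucket idea, assuming golang.org/x/time/rate; the key format and the limits are invented for illustration:

package acmelimit

import (
	"context"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// Limiters keeps one token bucket per key, where a key could encode
// "<acme-server>|<account>|<request-type>" or be extended to include the domain.
type Limiters struct {
	mu      sync.Mutex
	buckets map[string]*rate.Limiter
}

func New() *Limiters {
	return &Limiters{buckets: make(map[string]*rate.Limiter)}
}

// Wait blocks until the bucket for key permits another request. The limit
// here (one request per 10s with a burst of 5) is purely illustrative.
func (l *Limiters) Wait(ctx context.Context, key string) error {
	l.mu.Lock()
	lim, ok := l.buckets[key]
	if !ok {
		lim = rate.NewLimiter(rate.Every(10*time.Second), 5)
		l.buckets[key] = lim
	}
	l.mu.Unlock()
	return lim.Wait(ctx)
}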

I'd like to isolate whether it is cert-manager or kube-lego creating these extraneous requests.

Yep, me too! Given the User-Agent problem, do you have any ideas how to go about this? Maybe a first step would be to stand up a copy of each package with an example domain name and see if you can get the issue to reproduce locally. I'm assuming they have sufficient logging that you could look for duplicated new-authz requests in the logs?

We are one of the offending sites and we are using kube-lego. I'd be happy to cooperate in the process of getting this fixed. I've mentioned it to one of our engineers, with a link to this thread and he can chime in as needed. We don't have a ton of time to work on this so I'm hoping for a quick fix. Let me know what info you need from us or what workarounds we need to apply. I believe we had planned to update to cert-manager, but it doesn't sound like that fixes the problem just yet.

@jsha regarding synchronisation - I think we can implement this ourselves too at the same layer 😄. Having it unified should hopefully make it a lot easier for us to reason about what the client side limits we are placing on cert-manager are.

Given the User-Agent problem, do you have any ideas how to go about this?

Well the easiest thing here is to set a custom user agent, which we should probably do regardless. This won't make it immediately obvious though as to what's doing what.

Given the way kube-lego works, I'm inclined to suggest it is kube-lego and not cert-manager that is going into a new-authz loop. cert-manager caches authz URLs, whereas kube-lego will attempt to create a new authz every time validation fails, even if that validation failure was transient. This could definitely lead to the issues you're describing.

I'd imagine the exact failure could be down to kube-lego being, e.g.:

  • misconfigured in an RBAC-enabled environment
  • kube-lego unable to store some resource back in the k8s apiserver
  • kube-lego unable to access some required external service
  • any other failure during the validation 'sync'

This would cause it to:

1) create an authz with the server
2) try and store any data in the k8s apiserver
3) this request fails, causing the entire process to fail
4) kube-lego loops around.

cert-manager on the other hand will store the authz (or 'order') URLs, so that it doesn't need to keep creating a new one continuously (so at least if it is failing, multiple orders won't be created).

@james-pellow regarding cert-manager vs kube-lego - I'd have thought cert-manager behaves a lot better, but still not in an explicitly controlled and limited way. We will soon introduce explicit client side rate limits/synchronisation, which should help us not only reduce the instances of this happening, but also control them.

@jsha thank you for the direct engagement! As a keen cert-manager user it is great to see the outreach. ❤️

@jsha @munnerz before we leap to implement particular solutions, can we make sure we have good data on the problem? @jsha is seeing some really poorly behaved Go-http-client/1.1 clients. But it sounds like none of us know if that is cert-manager, or kube-lego, or kube-cert-manager, or none of these and actually some less-well-known application that happens to be so badly behaved that it monopolizes @jsha's worst-case Go-http-client/1.1 data.

@munnerz may I strongly advocate for a minor 0.2.4 release ASAP that just changes the User-Agent that cert-manager uses (or another way to identify cert-manager activity to @jsha)? Even if only a portion of users upgrade in the short term, @jsha will be able to get real data on aggregate cert-manager behavior. Then @munnerz, we either out ourselves as causing the problem and dig in to fix it, or identify that cert-manager is actually a popular but relatively well-behaved app 😄

Using cert-manager, and observing it in production, I would be surprised if it is making many duplicate requests. As @munnerz said, a pod is launched to resolve a challenge. And cert-manager is very chatty about adding an audit trail of k8s Events for every action. If high-rate duplicate requests are happening, I should be seeing ample evidence of excessive challenge pod launches and duplicate Events for the same certificate. I'm not seeing any of that.

Let's cooperate with @jsha so we have a data-driven response to the issue!

For the User-Agent issue: I looked briefly at sending a PR for kube-lego to add it, but was stymied by the fact that User-Agent, like other headers, is set on the request object, not on http.Client. Since crypto/acme doesn't expose the request object, it may be hard to reach in and set one. I filed an upstream bug for crypto/acme to provide a default User-Agent, and allow users of the library to add more detail: https://github.com/golang/go/issues/24496.

I definitely agree on getting more data before deciding on a solution. I think waiting for the upstream UA change, plus a UA change here, plus waiting for a significant client population to upgrade, would probably result in too long a collection cycle, and we'd wind up with only partial data anyhow.

@whereisaaron you mentioned that cert-manager is very chatty. Is kube-lego similarly chatty? Can you tell @james-pellow where to collect logs that might shine some light on what's going wrong?

@munnerz based on your description, plus anecdotal evidence from contacting subscribers, I think it's quite plausible that the problem exists only in kube-lego. However, it seems like not enough people are heeding the warning on the kube-lego repo, because we continue to see new clients with this behavior, not just old ones. So hopefully once we nail down the exact problem, we can land a fix in kube-lego if necessary.

Thanks all for your help thus far!

Hi folks -- I encountered this issue seemingly during my transition from kube-lego to cert-manager.

I made the switch on Mar 13, and @jsha confirmed my LE traffic as follows:

in the last 30 days, I saw zero traffic for "desertbluffs.com". I saw 263 requests on 3/14 UTC, then nothing until 3/17, on which I saw 384k requests mentioning desertbluffs.com. After that, the daily traffic stayed high

My ingress+tls configuration is here

One aggravating factor: this domain uses two DNS providers, Google Cloud DNS + Route 53, and I switched from http-01 to dns-01 auth in prep for wildcard certs with this kube-lego-to-cert-manager switch.

I'm running [email protected] (Helm chart ver) and can bring cert-manager back up with http-01 auth to see if the issue persists. Else let me know how I might proceed with any debugging that may be helpful!

Hi folks, I'm another of the offending sites and @jsha reached out to me to chime in here with our configuration.

We switched from kube-lego to cert-manager on 19 March. We run a fairly stock helm chart with ingressShim turned on. We use these extraArgs:

    extraArgs:
    - --default-issuer-name=letsencrypt-prod
    - --default-issuer-kind=ClusterIssuer

We use helm chart v0.2.3 with cert-manager v0.2.3.

Let me know if we can provide any additional information to help move this forward to resolution.

As with @nlopez we also switched from http-01 to dns-01 auth in preparation for wildcards. Our cluster-issuer looks like this.

apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: {{ .Values.certs.host }}
    email: {{ .Values.certs.email }}
    privateKeySecretRef:
      name: {{ .Values.certs.privateKeyName }}
    dns01:
      providers:
      - name: cloudflare
        cloudflare:
          email: {{ .Values.certs.dnsEmail }}
          apiKeySecretRef:
            name: {{ .Values.certs.dnsSecretName }}
            key: cloudflare_api_key

@jsha is there an HTTP request header we can more easily add that you can detect? Like X-ACME-Agent: cert-manager 0.2.3 or similar?

@jsha I don't know much about kube-lego. I trialed it in the early days, but didn't like it and bailed to help develop kube-cert-manager (a fork and rewrite of @kelseyhightower's ThirdPartyResource example).

@munnerz looks like people are pointing to a possible bug. @compleatang do any of your certs have a lot of SANs like @nlopez's example? I'm thinking maybe some sort of exponential amplification bug from the number of SANs in a cert.

Some more information, from my logs:

I0326 21:05:04.027430       1 sync.go:200] Certificate scheduled for renewal in -19 hours
E0326 21:05:04.027505       1 controller.go:196] certificates controller: Re-queuing item "REDACT-1" due to error processing: error picking challenge type to use for domain 'REDACT-1': no configured and supported challenge type found
I0326 21:05:04.027578       1 controller.go:187] certificates controller: syncing item 'REDACT-2'
I0326 21:05:04.027906       1 sync.go:281] Preparing certificate with issuer
I0326 21:05:04.028560       1 prepare.go:239] Compare "" with "https://acme-v01.api.letsencrypt.org/acme/reg/REDACT-4"
I0326 21:05:04.028607       1 prepare.go:239] Compare "" with "https://acme-v01.api.letsencrypt.org/acme/reg/REDACT-4"
I0326 21:05:04.209005       1 sync.go:286] Error preparing issuer for certificate: error picking challenge type to use for domain 'REDACT-1': no configured and supported challenge type found
I0326 21:05:04.221728       1 sync.go:200] Certificate scheduled for renewal in -19 hours
E0326 21:05:04.221788       1 controller.go:196] certificates controller: Re-queuing item "REDACT-3" due to error processing: error picking challenge type to use for domain 'REDACT-3': no configured and supported challenge type found
I0326 21:05:04.221826       1 controller.go:187] certificates controller: syncing item 'REDACT-2'
I0326 21:05:04.222102       1 sync.go:281] Preparing certificate with issuer
I0326 21:05:04.222861       1 prepare.go:239] Compare "" with "https://acme-v01.api.letsencrypt.org/acme/reg/REDACT-4"
I0326 21:05:04.422359       1 sync.go:200] Certificate scheduled for renewal in -847 hours

There are a lot of these sequences in there. The logs are chock full of them. It looks from a brief skim as if each domain is getting requeued and spun every second which is causing a lot of output.

@whereisaaron the most FQDNs I have on one TLS object is 5, so far fewer than @nlopez

Ah okay, this makes sense. Sounds like a misconfigured Certificate resource that doesn't have a challenge mechanism configured for at least one of the domains.

This WILL cause excessive lookups against the acme API, subject to standard backoff policies (exponential with a cap of ~1m iirc). This MAY cause multiple orders (or whatever they're called in acme v1) to be made. I'll need to double check the codebase for the acme issuer to be sure (sorry, on my phone right now but I'll have time to verify this tomorrow!). Ideally, this would 1) not cause new authzs to be created (instead reusing the old one that is currently not possible to fulfil due to invalid configuration) and 2) cache info about authzs on the Certificate.status, to prevent us hitting live APIs each time.

FWIW, without (2) being added, we should expect N (where N is the number of domains on a Certificate) requests every X seconds (the current back-off time, capped at something like 1 minute). This is however per misconfigured Certificate.

Here, a local rate limit would help to a certain extent, but really I think a longer backoff would help. Or potentially, in the face of some errors such as this one, which is not transient but caused by a misconfiguration, we could 'return nil' (i.e. indicate no error and thus not retry until the resource is updated/is valid) when returning from our sync loop.

I'll dig in a bit more tomorrow unless anyone else does in the meantime, and thanks for your logs 😄

If anyone else has logs available, that'd be really helpful tracking down the different failure modes we may face!

@munnerz so do I have something misconfigured on my end; or is this a code issue within cert-manager (or both, which is what it sounds like to me)? If it's on our end, happy to fix, but I must confess that I didn't really follow all of that so not really sure where to begin to rectify.

Could you share a copy of your Issuer resource as well as one of the affected Certificate resources so we can try and verify where the issue lies?

is there an HTTP request header we can more easily add that you can detect? Like X-ACME-Agent: cert-manager 0.2.3 or similar?

Unfortunately, the core problem is that crypto/acme allows you to override http.Client, but not set fields on individual requests. That rules out setting any custom headers. Fortunately, I've talked to the upstream maintainer of crypto/acme, and it sounds like it won't be too long.

Sounds like a misconfigured Certificate resource
that doesn’t have a challenge mechanism configured for at least one of the
domains.
This WILL cause excessive lookups against the acme API, subject to standard
backoff policies (exponential with a cap of ~1m iirc).

This sounds plausible, though there may also be a bug in the backoff policies. If I pick out a single domain name from each of the above clients, I see about 1 request to new-authz per second, meaning the client seems to not be backing off past 1 second. Also, I agree a much higher backoff cap would be good. I'd suggest maxing out at 24 hours.

With regard to overriding the default user agent - this can be done by setting a custom RoundTripper on the http.Client used by acme.Client 😄
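A minimal sketch of that approach (illustrative only; the version string is a placeholder):

package useragent

import "net/http"

// Transport sets a User-Agent header on every outgoing request, working
// around crypto/acme not exposing the request object directly.
type Transport struct {
	Agent string
	Next  http.RoundTripper
}

func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone before mutating, per the RoundTripper contract.
	r := req.Clone(req.Context())
	r.Header.Set("User-Agent", t.Agent)
	return t.Next.RoundTrip(r)
}

// Usage with acme.Client (placeholder version string):
//
//	client := &acme.Client{HTTPClient: &http.Client{
//		Transport: &Transport{Agent: "jetstack-cert-manager/vX.Y.Z", Next: http.DefaultTransport},
//	}}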

Read the thread and I think I am still a bit confused:

would this be fixed if the HTTPTokens field was exported: https://github.com/golang/crypto/blob/master/acme/autocert/autocert.go#L169

Because I think we could get that upstream and then you should be able to swap out the dependency on the acme package

Hey @jessfraz,

I'm not too sure - it might be possible, but given acmev2 is on the horizon, and the fact that autocert hasn't (yet) been updated to support it, I've not had a chance to investigate much further.

PR #309 introduces a lot of changes to how we utilise orders, and (from what I can see) results in far fewer calls to the Let's Encrypt API:

  • It adds an interface around the ACME client we use
  • Introduces 'middleware' to the ACME client which should help us diagnose and fix similar problems in future (e.g. through request logging and exposing prometheus metrics from the acme client)
  • Caches order details, including selected challenges, on the Certificate resource
  • Because of how the new v2 implementation of the acme package works, it results in far fewer Authz/Challenges being created (challenges are only performed if they are required)
  • Adjusts the rate limiting to back off more aggressively: 2*(2^numFailures) (see the sketch after this list)
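Read literally, that formula gives the following delays (treating the result as seconds and capping at 24h are my assumptions, not values taken from the PR):

package backoff

import "time"

// delayFor returns the retry delay after numFailures consecutive failures,
// following the 2*(2^numFailures) shape above. Treating the result as
// seconds and capping at 24h are illustrative assumptions.
func delayFor(numFailures uint) time.Duration {
	if numFailures > 16 {
		numFailures = 16 // avoid overflow; already beyond the cap
	}
	d := time.Duration(2*(1<<numFailures)) * time.Second
	if d > 24*time.Hour {
		d = 24 * time.Hour
	}
	return d
}

// delayFor(0) = 2s, delayFor(1) = 4s, delayFor(5) = 64s, delayFor(10) ≈ 34m.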

All of these combined lead me to think that a switch to autocert, which would involve a fairly large rewrite of our ACME issuer, would not bring much more value (and might make it harder to properly implement some of the new benefits listed above).

FWIW, I've just released v0.2.4 which contains @jsha's changes to set a user agent. Hopefully that will allow us to get some more insight into what is going on.

I'm also going to be cutting v0.3.0-alpha.1 this week, which will include PR #309 (switching to ACMEv2). Hopefully we'll see better behaving clients as a result!

FWIW, changing from dns-01 to http-01 validation seemingly reduced my traffic tremendously.

Thanks for the feedback @compleatang - I'll dig into this more.

Once #225 is merged, we'll have a way to expose metrics from cert-manager.

Perhaps it'll then be valuable to add in some kind of test to our testing pipeline that ensures the total number of requests made to the ACME server (overall, and per second/minute) is less than some sensible value. This would help us catch any regressions in future too, and give us an easy way to measure success of patches in an attempt to resolve this issue.

@jsha I know it has only been a few days that cert-manager v0.2.4 with the custom agent has been out, but are you able to spot any cert-manager activity on your end yet?

Thanks for the reminder. I looked at the log and do see some traffic from cert-manager. I see two problem cases:

  • Some clients request new-authz for the same name over and over again, and never POST to the corresponding challenge. They get success (201) each time, and receive the same authz due to authz reuse.

    • Another issue with the above case: the final backoff time appears to vary. For one client it's stable at every fifteen minutes; for another it's every thirty minutes. Ideally, unattended retries should back off exponentially all the way up to a day.

  • At least one client appears to have gotten stuck in a new-cert loop for a day, sending 6,137 new-cert requests and getting a 429 (rate limited) response each time. It seems like the logic that triggers new-cert needs some work as well.

I'm happy to intervene occasionally by providing logs analysis, but I'd definitely appreciate it if you could set up a test harness that exercises a few of the relevant test cases, and check the retry behavior in more detail. If each iteration requires doing a cert-manager release, and waiting a few days for logs analysis, a fix will take much longer than if you can test locally. A few ideas for test cases:

  • A configuration that contains an already-issued certificate that needs renewal, where the renewal attempts continuously fail. How frequent are retries?
  • A similar, failing setup with multiple config entries for the same certificate name. Does the rate of renewal attempts increase?
  • A setup where the client holds a valid authz, but new-cert requests fail every time (try both 429 and 500)

You may find https://github.com/letsencrypt/pebble useful for setting up such a harness, since it's simpler than Boulder and the intent is to allow hooks for testing (though such hooks are not yet written).

One last question: Based on the investigation you've done so far, have you been able to confirm that the problem exists with kube-lego as well?

BTW, as an example of how to do an integration test like the above: In Boulder we use https://github.com/jmhodges/clock to mock out our clocks for testing. We also instantiate our clock.Clock instances with a wrapper that reads the environment variable FAKE_CLOCK and uses its content if set. This allows us, for instance, to bring up the Boulder environment once, issue some certificates, and then "fast forward" time so that those certificates are near expiration. You could do something similar with cert-manager and Pebble.
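A minimal sketch of that wrapper (assuming the github.com/jmhodges/clock API; the RFC3339 parsing is illustrative rather than Boulder's exact code):

package fakeclock

import (
	"os"
	"time"

	"github.com/jmhodges/clock"
)

// New returns the real clock unless FAKE_CLOCK is set to an RFC3339
// timestamp, in which case it returns a fake clock frozen at that time.
// Tests can then "fast forward" an environment towards certificate expiry
// without waiting in real time.
func New() clock.Clock {
	v := os.Getenv("FAKE_CLOCK")
	if v == "" {
		return clock.New()
	}
	t, err := time.Parse(time.RFC3339, v)
	if err != nil {
		return clock.New() // fall back to the real clock on a bad value
	}
	fc := clock.NewFake()
	fc.Set(t)
	return fc
}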

Another offender here - I see the following in the logs multiple times per second.

Using chart cert-manager-0.2.9 with image quay.io/jetstack/cert-manager-controller:v0.2.4

The "must be no more than 63 characters" error is filed as #425.

I0419 05:45:19.215889       1 controller.go:187] certificates controller: syncing item 'jx-production/tls-jx-production-croc-hunter-jenkinsx'
I0419 05:45:19.215929       1 sync.go:107] Error checking existing TLS certificate: secret "tls-jx-production-croc-hunter-jenkinsx" not found
I0419 05:45:19.215959       1 sync.go:238] Preparing certificate with issuer
I0419 05:45:19.216290       1 prepare.go:239] Compare "" with "https://acme-v01.api.letsencrypt.org/acme/reg/31522552"
I0419 05:45:19.614653       1 sync.go:242] Error preparing issuer for certificate: error presenting acme authorization for domain "jx-production-croc-hunter-jenkinsx.jx-production.xxx.xxxxxxxx.xxx": error ensuring http01 challenge service: Service "cm-tls-jx-production-croc-hunter-jenkinsx-jkebb" is invalid: [metadata.labels: Invalid value: "jx-production-croc-hunter-jenkinsx.jx-production.xxx.xxxxxxxx.xxx": must be no more than 63 characters, spec.selector: Invalid value: "jx-production-croc-hunter-jenkinsx.jx-production.xxx.xxxxxxxx.xxx": must be no more than 63 characters]
E0419 05:45:19.618380       1 sync.go:190] [jx-production/tls-jx-production-croc-hunter-jenkinsx] Error getting certificate 'tls-jx-production-croc-hunter-jenkinsx': secret "tls-jx-production-croc-hunter-jenkinsx" not found
E0419 05:45:19.618417       1 controller.go:196] certificates controller: Re-queuing item "jx-production/tls-jx-production-croc-hunter-jenkinsx" due to error processing: error presenting acme authorization for domain "jx-production-croc-hunter-jenkinsx.jx-production.xxx.xxxxxxxx.xxx": error ensuring http01 challenge service: Service "cm-tls-jx-production-croc-hunter-jenkinsx-jkebb" is invalid: [metadata.labels: Invalid value: "jx-production-croc-hunter-jenkinsx.jx-production.xxx.xxxxxxxx.xxx": must be no more than 63 characters, spec.selector: Invalid value: "jx-production-croc-hunter-jenkinsx.jx-production.xxx.xxxxxxxx.xxx": must be no more than 63 characters]

@carlossg if you can try v0.3.0-alpha.1 and report back - there were some changes made that now allow us to support longer domain names.

We should probably also fix up some of the retry logic in 0.2 in order to ensure we don't retry too hard...

After I upgraded and switched to the v02 API it only runs every minute; now it fails with

E0419 15:29:14.563671       1 controller.go:186] certificates controller: Re-queuing item "jx-production/tls-jx-production-croc-hunter-jenkinsx" due to error processing: error getting certificate from acme server: acme: urn:ietf:params:acme:error:malformed: Error finalizing order :: CN was longer than 64 bytes

Following up here after commenting on https://github.com/jetstack/kube-lego/pull/326#issuecomment-383692989, to make sure we're all on the same page:

What we'd like is for clients to back off until they hit 24 hours, and then remember that backoff interval until they get a success. In other words, we have many many clients that fail every time they attempt issuance. We want those clients to never try more than once every 24 hours.

It seems like the cert-manager change you've implemented in https://github.com/jetstack/cert-manager/pull/496 will retry up to once per minute, and on average faster than that. Am I reading that PR correctly? If so, I'd like to request further work to achieve the behavior I outlined in the quote above. Thanks!

To give an idea of the scale of the issue, and why I keep pressing on this so much: Over the last 15 minutes, kube-lego and cert-manager* currently account for 39% of total requests to Let's Encrypt, but only 2% of successful issuances.

*Note: I'm counting all clients with Go-http-client/1.1 as kube-lego or cert-manager, so this could be including some other clients, but based on reaching out to a sample of users, I think kube-lego and cert-manager dominate the usage for that User-Agent.

So I've just dug into how kube-lego handles its workqueues further.

Based on the code in WatchEvents (https://github.com/jetstack/kube-lego/blob/master/pkg/kubelego/watch.go#L73) - it appears that whenever any Kubernetes Ingress resource is created, updated, or deleted, all Ingress resources are immediately scheduled for re-processing.

Ingresses that already have a valid certificate will be skipped, but any user with a number of failing/invalid ingresses will make requests to LE APIs in an attempt to validate those ingresses.

As part of those syncs, more updates will likely be made to ingresses, thus re-queuing these ingresses to be immediately reprocessed after the 'round' of processing ingresses fails.

The good news is we do only process one Ingress resource at a time, which should reduce the hits to the API somewhat (this could be a lot worse otherwise).

So, action items on this:

1) I'm going to create a PR shortly that adjusts how kube-lego uses the workqueue to:
a) process one ingress at a time, and not trigger all at once
b) use a rate limiter on a per-item basis, with a base backoff of 10 minutes and a max of 24h (from what I understand, and have seen, the ratelimit package caps out at the max rate limit; it doesn't reset once it is hit. It only resets once Forget() is called, which should only happen after a successful issuance) - see the sketch after this list

2) Add a global 10 minute rate limit after a failed validation - so if any validation fails, nothing will be processed for 10 minutes, on top of any per-item rate limits.
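A sketch of the per-item limiter described in (1b), using client-go's workqueue helpers (the exact wiring in kube-lego may differ):

package queue

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newIngressQueue builds a work queue whose per-item retry delay starts at
// 10 minutes and doubles up to a 24 hour cap. The delay for an item only
// resets once Forget(item) is called after a successful sync.
func newIngressQueue() workqueue.RateLimitingInterface {
	rl := workqueue.NewItemExponentialFailureRateLimiter(10*time.Minute, 24*time.Hour)
	return workqueue.NewRateLimitingQueue(rl)
}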

I will cc you shortly so you can take a look at this change - it may be easier/more certain to satisfy point (2) by simply accepting your PR instead.

I can then cut a new release of kube-lego, which will also include the new user agent changes so we can see if this performs as expected. I'd imagine it will not reduce the failure rate, but it should reduce the overall rate of requests as desired. I don't think it's feasible for us to build a rate limiting test framework around kube-lego in the short-term to measure this (and wouldn't fit within time frames).

cert-manager on the other hand should definitely have a test harness for this. Once again, thanks for the pointers and I will cc you on future PRs around this. It'll take some changes to the way we handle clocks (in order to make it more testable).

First off, thanks for all your attention to this issue! I have been getting in touch with high-traffic clients from our logs that send the "Go-http-client/1.1" user-agent and asking them to upgrade kube-lego or cert-manager if they're running it. Generally the folks that have upgraded have started sending a lot less traffic.

However, I think there is still an issue even in jetstack-cert-manager/v0.2.4 and v0.2.5. Over the last 24 hours, I see clients with these user-agents that are generating 4.5 rps, 2.8 rps, and 2.5 rps, for the same domain names over and over. There are several others in the ~1 rps range, and they drop off after that.

Hey @jsha and @cpu

Just checking back in on this. Have you seen many users upgrade to the new version of kube-lego, and are you seeing better results?

Ditto with cert-manager - the v0.3.0 stable release is out, and I'm keen to see if we've squashed these bugs.

As you have suggested, a testing framework that properly mocks time in order to verify this on our side is definitely required, and I'm hoping to get something together in time for v0.4 (approx. July 1st).

@munnerz if there is a small fix for this then probably worth a v0.2.6 release just for that. While 0.3.0 is stable, it is not a straightforward upgrade for existing clusters. I'll certainly test and use 0.3.x for all new clusters however.

I've contacted a number of users that were on the old kube-lego, and once they upgraded, they stopped sending excessive traffic.

For cert-manager, I see jetstack-cert-manager/v0.3.0-1e606b3eadfa069332b25feeb3d5aecc46d07ece somewhat outnumbering jetstack-cert-manager/v0.2.4 in terms of traffic.

v0.3.0 seems to have introduced a new bug. When I look at the top clients with v0.3.0, I see at least one client that is doing 1.26 rps of traffic for a single domain name. It's requested new-authz for that domain name 28 times in the last 15 minutes, but it's also made 4,600 requests for /directory, /acme/new-nonce, /acme/new-acct, and /acme/acct. In other words, it looks like it is constantly checking its own account status. There's no need to do this: Once a client has created an account, it can assume that account is good.

I'd like to repeat my recommendation to switch to autocert. It's actively maintained, and people don't seem to be having any problems with it. They also recently added support for dual certs (one for RSA and one for ECDSA). Earlier in this issue thread, it sounds like the main problem you had with switching to autocert was that it doesn't allow you to hook challenge solving and do it yourself. I think it would be worthwhile to ask the autocert maintainers for a hook to do that, and possibly send them a PR.

It's requested new-authz for that domain name 28 times in the last 15 minutes, but it's also made 4,600 requests for /directory, /acme/new-nonce, /acme/new-acct, and /acme/acct

So the account lookups are expected right now, as we verify the account is valid on each 'sync' of a Certificate.

The change to v0.3.0 changed our validation process to be asynchronous in order to better handle failures, however at the moment that can result in more queries being made than needed. I'm writing a PR at the moment that will cache as much information about the account/challenge as possible on our own Certificate resource, so that we can 'resume' the validation without making additional queries to the ACME server. We can then invalidate that cache when a 'failure' condition is detected (i.e. Order is in an Invalid state as opposed to Pending) or when a user changes their solver configuration.

I'd like to repeat my recommendation to switch to autocert.

I am definitely open to this, however given the design of autocert and the best practices for the design of controllers, I do not want to compromise our current feature set/regress.

It's also worth noting that autocert does not support ACMEv2 right now, which is an absolute blocker for us adopting it.

we verify the account is valid on each 'sync' of a Certificate

Is there a reason to do this? Most clients don't, and work just fine. It's extremely rare for an account to spontaneously become invalid. The simplest and best fix for this issue would just be to remove this code rather than to add caching.

The simplest and best fix for this issue would just be to remove this code rather than to add caching.

Due to the level-based programming design of Kubernetes controllers, we cannot just remove this bit of code. That bit of code, when the account is not registered, is what actually registers the account.

By storing the information on the Certificate resource (i.e. in our own API) we can assume that the account is registered and thus not re-attempt to register, but we cannot remove that code altogether.

An update on current status: Out of our top 25 most aggressive clients by IP, we see:

4 clients with jetstack-cert-manager/v0.2.4
1 client with jetstack-cert-manager/v0.3.0-alpha.1-....
11 clients with Go-http-client/1.1

To me this suggests two things:

  • Our biggest problem is still that people don't upgrade. We'll continue trying to reach out, but a lot of the email addresses are non-deliverable. If you have any channels you can use to get people on old versions to upgrade, I'd really appreciate it!

  • The v0.2.4 release is at least as aggressive as the previous releases were. I'd like to second @whereisaaron's request: since upgrades from v0.2.x to v0.3.x are non-trivial, I'd really appreciate you prioritizing fixes for the v0.2.x series so we can maximize the number of users upgrading.

Also, thanks for adding the informative User-Agent header! It's turning out to be really helpful figuring out who's on what version.

Oh and one other piece of good news: the kube-lego users that upgraded to v0.1.6 have all disappeared from the list of most aggressive clients. In other words, the kube-lego patch was very successful! If there are any lessons from that fix that can be applied to the cert-manager fix, that might help inform the next few patches.

@jsha the current v0.2.x release is v0.2.5 which includes rate limit improvements over v0.2.4. You didn't mention v0.2.5 in your aggressive list, does that indicate v0.2.5 is generally better than v0.2.4? Or just that I am the only one using it :-)

Oops, you're right @whereisaaron! I forgot there was a v0.2.5 that fixed the issues, even though that was mentioned earlier in this thread. :-) There indeed are a good number of people using v0.2.5, and v0.2.5 doesn't show up in the list of aggressive clients, so I think those fixes were successful. Yay!

@munnerz thanks for releasing v0.3.1 with an attempted fix! I'm looking at our logs, and I still see multiple clients exhibiting the rapid-account-update failure mode, so it looks like the fix was not effective.

How is progress on setting up your test harness? This latest bug seems like it doesn't need a particularly complex harness. If you run a local boulder (or pebble) and examine the logs while you bring up a test version of cert-manager, it seems like the behavior should reproduce.

There is also some movement on adding Prometheus metrics to cert-manager in PR #225. That will make it easy for us users to spot unusually high-rate events.

Following up on this again. I've saved a log query that gives us a numeric estimate of the problem by means of the "new-cert success ratio": the fraction of all requests that were successful new-cert requests. We expect this to be somewhat less than 1, because issuing also requires fetching the directory, creating the authorization(s) / order, fulfilling them, and polling. However, if it's significantly less than 1, that's bad. For kube-lego v0.1.6, which successfully fixed this set of problems, we see in the last 24 hours:

useragent | count | new-cert success ratio
jetstack-kube-lego/0.1.6-4eb6cd03 | 2,476 | 0.04039
jetstack-kube-lego/0.1.6-61705680 | 947 | 0.02746

For cert-manager, we see:

useragent | count | new-cert success ratio
jetstack-cert-manager/v0.3.0-1e606b3eadfa069332b25feeb3d5aecc46d07ece | 7,591,106 | 2.42389e-5
jetstack-cert-manager/v0.2.4 | 3,588,159 | 1.28199e-5
jetstack-cert-manager/v0.3.1-f804cb56fbfe9469f3aada6db0935f0e0abae194 | 2,037,311 | 5.54653e-5
jetstack-cert-manager/v0.3.2-f1833647406f9bd89fe5461b787ab8aaff5553cb | 648,020 | 4.78380e-4
jetstack-cert-manager/v0.2.5 | 565,817 | 2.03246e-4

In other words, cert-manager v0.3.2 is still two orders of magnitude less efficient than the fixed kube-lego in terms of requests made per issued certificate.

Thanks for the details Jacob!

I have a proposal in the works that should be ready by end of week and should make a big difference to this, as it will substantially change how we handle orders.

Right now, orders are being processed asynchronously - which means we query the acme server multiple times during an order. This has made our order handling logic quite complex, and has resulted in more calls than necessary being made to the acme server. Full details will be in the proposal that I am half way through writing!

In short, it will see us moving to a synchronous model of processing orders and authorizations, which will not only make this process easier to test, but also more similar to traditional ACME clients. I envision this will finally put this issue to bed, although I do not want to speak too soon!

I'll ping you once the proposal is ready. I'll be cutting v0.4 of cert-manager in the coming days, and am making this restructuring my primary focus for the v0.5 release (currently scheduled for 11th August).

Sorry for not keeping this issue more up to date - I have not forgotten it by any means!


That sounds like a good plan, @munnerz, and I agree that it's likely to go a long way towards fixing these issues.

@jsha just a quick update, there is a proposal out for review currently at https://github.com/jetstack/cert-manager/pull/809

I've also got an implementation of it here working well, and the number of API calls should be massively reduced. We now cache almost all of the API-side information in the Kubernetes API, and only 'resync' when we think there's a chance something may have changed.

As part of this work, I've also managed to build up a unit test harness to verify how and when we hit ACME APIs. There is still work to do to turn this into a full integration test framework, but it's a big improvement 😄 looking forward to getting this rolled out!

Thanks for your continued attention to this issue.

I've checked our logs, and the excessive traffic is still a problem in v0.5.0. I see several clients hitting us with 7 rps of traffic each, in a tight loop requesting /directory, /new-nonce, /acme/new-acct, and /acme/acct over and over again. This is the bug I mentioned on June 11.

At the time, you said:

So the account lookups are expected right now, as we verify the account is valid on each 'sync' of a Certificate.

I probably wasn't clear enough at the time: Please don't do this. It's unnecessary. If you make a renewal request, and the account is invalid, you'll get an error response. In other words, if a certificate is still valid, 'sync'ing it (in cert-manager terms) should generate zero requests to Let's Encrypt.

When can I expect you to cut a release that removes this account-check-on-sync code?

Thanks,
Jacob

Now that #788 has merged, clients using the 'canary' release (i.e. HEAD of master) should be utilising this new behaviour.

We now cache all details aggressively in the Kubernetes API server, and have the ability to add more complex rate limiting and quota control.

I'm still verifying there are not issues/regressions with this release, specifically by building out a more complex conformance suite that is able to verify renewal behaviour as well.

I've also made a few significant changes to the e2e testing framework too, which should mean we'll soon be able to perform a far wider number of checks on our own use of the pebble dev server during tests.

Sorry this has taken so long to get out, and that it's not quite there yet. I am hoping to be able to cut a release with this new behaviour within the next month or so.

I may also consider cutting an alpha release sooner, as this is a fairly significant change to our codebase once again.

Hopefully this will be the last significant refactor of our ACME implementation 🙈

I still see this buggy behavior in clients with the User-Agent string go-acme/2 jetstack-cert-manager/canary-3347fcc613ff1165b18b11a484008bb53aaec341, which seems to correspond to the current master.

We now cache all details aggressively in the Kubernetes API server, and have the ability to add more complex rate limiting and quota control.

This is different from removing the account fetching logic. Will you remove the account fetching logic? Usually if there is a bug that you are having trouble tracking down, it is much better to simplify than to add more complexity.

I've also made a few significant changes to the e2e testing framework too

Do you currently log all ACME requests? Given how common this behavior seems to be, I would think you could set up a cert-manager instance and leave it running for several days or weeks, then count the log lines. You shouldn't see more than a handful of requests per sixty days.

@jsha I've just spotted that we're using very small numbers on the 'issuers' controller specifically, which would definitely cause tight loops in cases where cert-manager is deployed incorrectly or is having an issue persisting its state to the API server.

I have summarised the change more clearly here: https://github.com/jetstack/cert-manager/pull/981

This should definitely reduce API usage significantly for at least some users (without deeper insight into those statistics I cannot be sure how many).

As I said, we'll be building out the test suite to more carefully inspect our API usage over the coming weeks 😄

This is different from removing the account fetching logic. Will you remove the account fetching logic?

This is not possible, as we register the ACME account on the user's behalf when they create an order resource.

One further thing we can do, if there are still issues after cutting a new patch release containing the above referenced patch, is to consider simply never re-verifying the ACME account after it has been registered the first time, unless the ACME URL has changed. Currently we re-verify the ACME account when cert-manager starts up (and rely on Kubernetes itself to apply restart back-off in the case of the cert-manager pod rapidly starting and exiting).
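A rough sketch of that idea; the field names below are hypothetical stand-ins, not cert-manager's actual API:

package issuer

import "context"

// issuerState is a hypothetical stand-in for wherever the registered
// account URI and server URL would actually be persisted (e.g. status
// fields on the Issuer resource).
type issuerState struct {
	ServerURL         string // configured ACME server
	AccountURI        string // set once registration has succeeded
	RegisteredAgainst string // server the account was registered with
}

// ensureAccount only talks to the ACME server when no account has been
// recorded yet, or when the configured server URL has changed since
// registration; otherwise it assumes the stored account is still valid.
func ensureAccount(ctx context.Context, st *issuerState, register func(context.Context) (string, error)) error {
	if st.AccountURI != "" && st.RegisteredAgainst == st.ServerURL {
		return nil // already registered against this server; skip re-verification
	}
	uri, err := register(ctx)
	if err != nil {
		return err
	}
	st.AccountURI = uri
	st.RegisteredAgainst = st.ServerURL
	return nil
}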

Do you currently log all ACME requests? Given how common this behavior seems to be, I would think you could set up a cert-manager instance and leave it running for several days or weeks, then count the log lines. You shouldn't see more than a handful of requests per thirty days.

We do log requests to the ACME server - however I don't believe the usage issues we are seeing are on 'the happy path'; rather, they are the result of misconfigurations in either the cert-manager deployment or its usage. For that reason, I am trying to build these misconfigurations into an e2e suite that we can run for longer periods of time to verify exactly this 😄

not ever re-verifying the ACME account after it has been registered the first time, unless the ACME URL has changed. Currently we re-verify the ACME account when cert-manager starts up (and rely on Kubernetes itself to apply restart back-off in the case of the cert-manager pod rapidly starting and exiting).

This is the right behavior, and is what most other clients do. Rather than wait for another release cycle, I'd like you to please incorporate this change in the next release.

We do log requests to the ACME server - however I don't believe the usage issues we are seeing is on 'the happy path'. rather it is a result of misconfigurations in either the cert-manager deployment, or its usage.

That seems likely, I agree. Have you tried running a server on the happy path for a while, though? I'd like to rule out the most straightforward reproduction cases sooner rather than blocking on a more complex e2e testing suite.

This is the right behavior, and is what most other clients do.

Ack.

Have you tried running a server on the happy path for a while, though? I'd like to rule out the most straightforward reproduction cases sooner rather than blocking on a more complex e2e testing suite.

Yes, we run this ourselves and don't see this kind of API usage 😄

Excellent, thanks for confirming. :-)

One more request: The User-Agent string for tagged releases looks like: go-acme/2 jetstack-cert-manager/v0.5.0-7924346bd84e41053cc508956b0a1b567c932416. Since that's a tagged release, the git commit is redundant and just consumes extra bytes in our logs. Could you omit that part of the User-Agent string for tagged releases?

Also, above you suggest that one of the failure cases may be that cert-manager crash-loops under certain conditions, and maybe the Kubernetes backoff is not being correctly applied. Can you post instructions so that affected users can check if their instance has been crash-looping?

Could you omit that part of the User-Agent string for tagged releases?

Yep, that should be fine :)

Also, above you suggest that one of the failure cases may be that cert-manager crash-loops under certain conditions, and maybe the Kubernetes backoff is not being correctly applied. Can you post instructions so that affected users can check if their instance has been crash-looping?

I wasn't suggesting that it wouldn't be applied correctly 😄 it is standard Kubernetes behaviour to apply its own CrashLoopBackOff policy to pods, which caps out at restarting once every 5 minutes after around 3 or 4 tries if I recall correctly :)

I wasn't suggesting that it wouldn't be applied correctly 😄 it is standard Kubernetes behaviour to apply its own CrashLoopBackOff policy to pods, which caps out at restarting once every 5 minutes after around 3 or 4 tries if I recall correctly :)

This wouldn't be sufficient to explain the behavior we're seeing, then, since clients are retrying much more frequently than once every 5 minutes, over a long period of time.

We've had an issue with cert-manager that would spam Let's Encrypt due to insufficient RBAC privileges. cert-manager was not allowed to update a cert secret which was about to expire, and it kept retrying very often until we noticed we were being rate-limited. It only had the create privilege...

Could this be one such failure case, @munnerz ?

This wouldn't be sufficient to explain the behavior we're seeing, then, since clients are retrying much more frequently than once every 5 minutes, over a long period of time.

I think the conversation has gotten a bit mixed up 😬 I wasn't suggesting that failure case is what's causing the excess traffic; I more had in mind the case @ahilsend has just described.

In those instances, the recent PR to adjust the workqueues will have great effect, as it will cause us to retry every minute (with a base of 10s) instead of every second (with a base of 10ms).

As I mentioned in the other PR, we can take the next steps in this instance and look at constructing a dedicated workqueue specifically for ACME issuers, which will allow us to apply rate limits more akin to what you're asking (i.e. days instead of minutes).

Hope that makes sense? 😄

(and thanks @ahilsend for the feedback!)

Hope that makes sense?

Yep, makes sense. Thanks for the clarification.

I'm seeing at least one client that's using v0.5.1 and is stuck in a loop creating a new account, creating a new order, validating its challenges, and issuing a certificate, doing about 500 requests per hour. This seems likely to be a separate bug from the account-fetching loop.

Also, a good handful of other v0.5.1 clients appear to have a problem where they endlessly poll orders, authorizations, and challenges so long as they remain pending. Cert-manager should give up polling any given object after a certain number of tries. For instance, Certbot will poll challenges up to 30 times at intervals of 3 seconds. Also: It's generally only necessary to poll either the authorization or the challenge for any given submitted challenge, not the whole [order, authz, challenge] chain.
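For reference, a generic bounded-polling sketch along the lines of what Certbot does; fetchAuthorizationStatus is a hypothetical helper, and a real implementation would also honour Retry-After:

package poll

import (
	"context"
	"errors"
	"time"
)

// fetchAuthorizationStatus is a hypothetical helper returning the current
// status of an ACME authorization ("pending", "valid" or "invalid").
func fetchAuthorizationStatus(ctx context.Context, authzURL string) (string, error) {
	return "pending", nil
}

// waitForAuthorization polls at most 30 times, 3 seconds apart, then gives
// up rather than polling the authorization indefinitely.
func waitForAuthorization(ctx context.Context, authzURL string) error {
	for i := 0; i < 30; i++ {
		status, err := fetchAuthorizationStatus(ctx, authzURL)
		if err != nil {
			return err
		}
		switch status {
		case "valid":
			return nil
		case "invalid":
			return errors.New("authorization failed")
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(3 * time.Second):
		}
	}
	return errors.New("gave up waiting for authorization after 30 attempts")
}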

stuck in a loop creating a new account, creating a new order, validating its challenges, and issuing a certificate

So to confirm, the client is calling RegisterAccount multiple times for the same private key/email address? Or the private key is changing each time?

This sounds to me like it could be down to invalid RBAC configuration as well, as it sounds like cert-manager is failing to persist any state.

We can actually add some 'pre-flight' checks in on start up that verifies that we do have permission to perform the actions we need to before actually starting the controllers properly. This won't help if a user breaks their RBAC configuration after starting cert-manager, but I think for 99% of cases it'll definitely help.

Cert-manager should give up polling any given object after a certain number of tries
Also: It's generally only necessary to poll either the authorization or the challenge for any given submitted challenge, not the whole [order, authz, challenge] chain

This is implemented in the HEAD of master 😄 v0.5.1 was a patch release, and does not contain the latest changes in master, which will make their way into v0.6 😄. We'll be testing v0.6 over the coming days, including putting a message out on our mailing list encouraging those in non-production environments to test out their configurations (before cutting the 'stable' v0.6 release)

Hey @jsha -

After the changes in v0.6 (and v0.6.1 after that), as well as some of the new changes scheduled to be included in v0.7, I think we've made significant improvements in reducing our API usage.

Can you confirm that on your side? Are you still seeing abnormal or abusive traffic patterns from newer cert-manager clients?

If not, or if you are seeing distinctly different traffic profiles that warrant changes, is the problem sufficiently resolved for us to close this issue and follow up with new issues for any new problems that arise?

Thanks to all for the support debugging and helping us resolve these issues so far!

Yep, I can confirm that as of right now, all the v0.6.* clients we see are sending much lower volumes of traffic. There's one v0.6.0 client I see that's sending 126k requests in the last 6 hours, which I'll look into. The second nearest contender is ~3.8k in 6 hours. And v0.6.1 has no excessive clients (though maybe that's because it's relatively new?).

I'm satisfied to close out this issue and maybe open new ones for more specific behaviors we see in later versions. Thanks so much for all your work on this!

Yep, please do reach out with details on that 126k/6h user - that seems very peculiar..!
