Hi,
I've been reviewing Let's Encrypt logs for subscribers that are using excessive resources. Some of the top users of resources are using the xenolf-acme user-agent. I'm sure this is not the maintainers' fault; there are probably a number of ways end-users can misconfigure the client to cause these problems. However, I'm wondering if there are things xenolf-acme could do to make it harder to reach these misconfigurations.
For instance, one failure mode I'm seeing is a client that polls a single challenge tens of thousands of times, at a rate of sixteen requests per second. Perhaps xenolf-acme could build in a hardcoded maximum poll rate of once per second, with a deadline of ten minutes, after which the challenge is no longer polled?
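Something along these lines, purely as a sketch; the Challenge type, pollFunc, and the function name here are made-up stand-ins rather than your actual API:

```go
package acmepoll

import (
	"context"
	"errors"
	"time"
)

// Challenge is a stand-in for the challenge object being polled.
type Challenge struct{ URI string }

// pollFunc is a stand-in for a single GET of the challenge resource.
type pollFunc func(ctx context.Context, c Challenge) (done bool, err error)

// waitForChallenge polls at most once per second and stops trying after
// ten minutes, so a misconfigured caller can't hammer the CA indefinitely.
func waitForChallenge(ctx context.Context, c Challenge, poll pollFunc) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Minute) // hard deadline
	defer cancel()

	ticker := time.NewTicker(1 * time.Second) // hardcoded max poll rate
	defer ticker.Stop()

	for {
		done, err := poll(ctx, c)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ticker.C:
			// time for the next poll
		case <-ctx.Done():
			return errors.New("gave up polling challenge after deadline")
		}
	}
}
```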
Similarly, I see some xenolf-acme clients that fail validations for the same domain repeatedly. For instance, one client retried the same failing domain about 1,100 times in a single day. It would be great if xenolf-acme implemented backoff on failing validations as described in https://letsencrypt.org/docs/integration-guide/.
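Roughly the kind of thing the integration guide has in mind; again only a sketch, with assumed names (validateFunc, validateWithBackoff) and illustrative delay values rather than anything from your codebase:

```go
package acmebackoff

import (
	"context"
	"time"
)

// validateFunc is a stand-in for one validation attempt against the CA.
type validateFunc func(ctx context.Context, domain string) error

// validateWithBackoff retries a failing domain with exponentially growing
// delays (1m, 2m, 4m, ... capped at 1h) instead of sending hundreds of
// identical failing requests per day.
func validateWithBackoff(ctx context.Context, domain string, validate validateFunc, maxAttempts int) error {
	delay := time.Minute
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if lastErr = validate(ctx, domain); lastErr == nil {
			return nil
		}
		// Back off before the next attempt.
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
		delay *= 2
		if delay > time.Hour {
			delay = time.Hour
		}
	}
	return lastErr
}
```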
Note that I can't tell from the logs whether these error situations are caused by running lego from the command line (in a tight loop) or by the in-memory client. The user-agent string is "Go-http-client/1.1 (linux; amd64) xenolf-acme". I suspect they are using the in-memory client, based on the rapid polling of the same challenge id for hours.
Also, it would be useful for xenolf-acme to include a version number in the user-agent string, so that after these issues are improved, I can easily tell whether a given client is using an older version or not.
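For illustration only; the version constant and the exact user-agent format below are assumptions, not your real values:

```go
package acmeua

import (
	"fmt"
	"net/http"
	"runtime"
)

// version is a hypothetical client version; the real value would come from
// the release process.
const version = "0.4.0"

// userAgent builds a UA string that includes the client version alongside
// the OS/arch info already being sent today.
func userAgent() string {
	return fmt.Sprintf("xenolf-acme/%s Go-http-client/1.1 (%s; %s)",
		version, runtime.GOOS, runtime.GOARCH)
}

// setUserAgent stamps an outgoing request with the versioned user-agent.
func setUserAgent(req *http.Request) {
	req.Header.Set("User-Agent", userAgent())
}
```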
Thanks!
I've done this via Caddy by accident multiple times. My scenario has been running Caddy under a restart-always policy: Caddy goes into a restart loop and continually fails challenges. Exiting when it can't get certs is apparently a design decision for Caddy, but it does quickly get me rate limited when I make configuration mistakes like the above.
I'm not sure whether my scenario can produce the kind of traffic to a single challenge that you're seeing, or whether repeated attempts result in new challenges each time.
I believe Caddy has its own user-agent, and I have seen it crash in a tight loop as you describe.
I can't be sure, but I think this case is different, partly because the ratio between the number of challenge-polling requests and the number of new-authz requests is really high, whereas in the Caddy case I think it's generally close to 1:1.
Yeah, Caddy will appear in the UA if it was the client. (And the Caddy docs clearly discourage automatically respawning if Caddy exits with exit code 1.)
@jsha Have you and @xenolf decided any way to go forward with this? (I thought I would have time but since our lab surprisingly got accepted into the Amazon Alexa competition I have been busier than I anticipated...)
We've gotten in touch with the person running the client and should find out more about what their setup looks like soon-ish. Hopefully that will lead us to some ideas about how to reduce the chances of similar breakage in the future. Thanks for your help so far!
After talking to the author of the client that was failing, it sounds like they are using xenolf-acme in library mode, and have it configured to auto-issue based on incoming SNI headers. As I understand it, xenolf-acme doesn't have any locking in this case, and can wind up with multiple goroutines attempting to issue in parallel, which can lead to excessive requests, and possibly indefinite retries?
My recommendation would be to add a sync.Map or similar, keyed on the domain names that have issuance requests in progress. If a user of the library requests issuance for a domain name that already has an issuance in progress, the call could either block until the original issuance completes or return an error immediately, depending on what you'd like your API to look like.
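Here's a minimal sketch of the blocking variant, where issuer and its obtain field are hypothetical stand-ins for your real issuance entry point:

```go
package acmelock

import "sync"

// issuance tracks one in-flight certificate request for a domain.
type issuance struct {
	done chan struct{}
	err  error
}

// issuer deduplicates concurrent issuance requests per domain name.
type issuer struct {
	inFlight sync.Map                  // domain -> *issuance
	obtain   func(domain string) error // stand-in for the real issuance call
}

// Obtain ensures only one issuance runs per domain at a time; concurrent
// callers for the same domain block until the first one finishes and then
// share its result (the alternative would be returning an error immediately).
func (i *issuer) Obtain(domain string) error {
	mine := &issuance{done: make(chan struct{})}
	actual, loaded := i.inFlight.LoadOrStore(domain, mine)
	iss := actual.(*issuance)
	if loaded {
		// Another goroutine is already issuing for this domain: wait for it.
		<-iss.done
		return iss.err
	}
	// We own the issuance for this domain.
	iss.err = i.obtain(domain)
	i.inFlight.Delete(domain)
	close(iss.done)
	return iss.err
}
```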
Thanks again for your help in debugging and brainstorming about this!
Any thoughts on the above idea about providing some form of locking or other synchronization to avoid excessive requests for the same domain name?
@jsha Does the problem still exist since lego v2?
If yes, do you have more information with the new user-agent?
In our logs, the pre-v2 clients still account for the most issuance. I do see a couple of v2 clients that are persistently sending about 1 rps of traffic, failing for the same domain name over and over again. Would you like me to ask the owners of those clients if I can put them in touch with you?
Yes, with pleasure, if I can help find a solution for those clients.
If needed, I'm elDez on your community support forum.
I've reached out to two people, one of them was investigating a bug and forgot to turn off their client (now fixed).
On closer look, these are the user-agents that top the list for "xenolf-acme/2":
containous-traefik/v1.7.8 xenolf-acme/2.1.0 (release; linux; amd64)
CertMagic Caddy/0.11.3 xenolf-acme/2.0.1 (detach; linux; amd64)
Would you like me to file separate bugs on the repos for those two projects?
@jsha Feel free to put me in touch with the Caddy users who are experiencing this. If you have more specific info about the Caddy clients, you can post it in an issue on the caddy repo.
(Do you have any issues with CertMagic alone, without Caddy in the UA string?)
@jsha Yes, thanks. If you could open a bug on Traefik, that would be useful to me (I'm a core maintainer of Traefik) for following the problem inside Traefik.
Separately, I'm working on improving the lego command line to better control all outgoing requests.