First, thanks for the fantastic, consistent work on this library. You have no idea how much I appreciate it. :pray:
I've been getting some reports lately of certificates not renewing, and it turns out it's because the TLS-ALPN challenge always fails when Cloudflare (or some other CDN) terminates TLS. Of course this is to be expected.
However, in our case, the TLS-ALPN and HTTP challenges are both enabled, yet only one gets used, because the selection of a challenge is deterministic and always returns TLS-ALPN first: https://github.com/go-acme/lego/blame/55572c26060b91518381fe99910dfacf46035544/challenge/resolver/solver_manager.go#L64
It used to be that renewals would eventually succeed because challenges were randomly ordered, so sometimes it would try the HTTP challenge, which often succeeds in the case of being fronted by a CDN.
So my feature request is that this behavior of random selection be restored, OR lego should retry a failed challenge with a different available challenge right away.
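To make the second option concrete, here is a minimal sketch of "retry with a different available challenge". It is not lego's API: `ChallengeType`, `SolveFunc`, `SolveWithFallback`, and `behindCDN` are hypothetical names used only for illustration, and the solver is a stand-in for a real ACME exchange. Note that under RFC 8555 a failed challenge invalidates its authorization, so a real implementation would probably need a fresh order for each attempt.

```go
package main

import (
	"errors"
	"fmt"
)

// ChallengeType names an ACME challenge type (hypothetical type for this sketch).
type ChallengeType string

const (
	TLSALPN01 ChallengeType = "tls-alpn-01"
	HTTP01    ChallengeType = "http-01"
)

// SolveFunc is a hypothetical hook that attempts one challenge type for a
// domain and returns nil on success. It stands in for a real ACME client call.
type SolveFunc func(domain string, typ ChallengeType) error

// SolveWithFallback tries each enabled challenge type in the given order and
// stops at the first success. It returns the type that succeeded, or all
// failures joined together if none did.
func SolveWithFallback(domain string, enabled []ChallengeType, solve SolveFunc) (ChallengeType, error) {
	if len(enabled) == 0 {
		return "", errors.New("no challenge types enabled")
	}
	var errs []error
	for _, typ := range enabled {
		err := solve(domain, typ)
		if err == nil {
			return typ, nil
		}
		// Keep the failure and move on to the next enabled type.
		errs = append(errs, fmt.Errorf("%s: %w", typ, err))
	}
	return "", errors.Join(errs...) // errors.Join needs Go 1.20 or newer
}

// behindCDN simulates a site whose TLS is terminated by a CDN: the TLS-ALPN
// challenge never reaches the origin, while the HTTP challenge is passed through.
func behindCDN(domain string, typ ChallengeType) error {
	if typ == TLSALPN01 {
		return errors.New("TLS terminated upstream")
	}
	return nil
}

func main() {
	typ, err := SolveWithFallback("example.com", []ChallengeType{TLSALPN01, HTTP01}, behindCDN)
	fmt.Println(typ, err)
}
```

With this shape the order stays deterministic (TLS-ALPN is still tried first), but the HTTP challenge is no longer dead weight when the first one fails.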
As it currently is, having both the TLS-ALPN and HTTP challenges enabled results in the HTTP challenge never being used, thus resulting in less reliability when there are network issues or -- heaven forbid -- a challenge type has to be disabled for security reasons, or the CA is doing maintenance, etc.
I would really love for both enabled challenges to be able to be used. It will greatly enhance the reliability of this package.
What do you think?
I will not randomize the challenges because this will create flaky behavior and a user will have more problems with the LE rate limits.
For me, a challenge should only be activated if it can be successful.
I think the application that uses lego has to find a way to determine (automatically or manually) whether a challenge can be done or not.
I will take a look if a pseudo retry/cycle is possible, but trying to use a challenge that cannot be solved does not seem to me to be very respectful of Let's Encrypt.
> I will not randomize the challenges because this will create flaky behavior and a user will have more problems with the LE rate limits.
Can you explain how changing the challenge order results in flaky behavior, exactly? And especially how it creates problems with rate limits? I don't see what this has to do with rate limits. If anything, it will help _fix_ rate limit issues because if the TLS challenge keeps failing, then the HTTP one will finish up the order/authz and free up that resource.
> I think the application that uses lego has to find a way to determine (automatically or manually) whether a challenge can be done or not.
This is extremely difficult to do reliably, without a third-party server to facilitate the check (which is expensive). Do you have any recommendations that work as reliably as trying the challenge itself (i.e. does not depend on certain assumptions about the client's DNS environment, etc)?
> trying to use a challenge that cannot be solved does not seem to me to be very respectful of Let's Encrypt.
Well, for one thing, it's not always easy to know when a challenge can't be solved. For another, this is only a failover mechanism, which will actually _free up_ resources once the challenge succeeds; the current behavior is causing rate limit issues. So this change would actually show more respect for the CA.
> I will take a look if a pseudo retry/cycle is possible
Fortunately, I am fairly confident this is possible!
My planned workaround is to simply reconfigure the ACME client and try the other challenge type if the first one fails. Hopefully I don't have to do this though, because then only Caddy users would benefit from this redundancy in case of problems. It'd be nice if all lego users had this.
One other thing I noticed... I do want to clarify that the current behavior went against my intuition -- the release notes for v2.0.1 are misleading:
> [lib] Check all challenges in a predictable order
This sounds like _all_ challenges are tried in a predictable order. That is fine, as long as _all_ challenges are in fact tried. But in reality, only _the first_ challenge is tried, which is the same every time, rendering the other ones useless.
When I say "all", I'm talking about the challenges returned by LE API, not user-defined challenges.
I understand that now, after we discussed it in Slack. :) (But you can see how it is confusing out of context. I admit I did get the wrong idea, apparently.)
In the meantime, I've worked around this in CertMagic by manually configuring a random challenge first as the only enabled challenge, then retrying with a totally different configuration using the next challenge type, and so on.
Update: Since then, I have implemented proper challenge randomization and fallbacks in acmez, which also learns which challenge types are the most successful and prioritizes those first.
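For readers wondering what "learns which challenge type is the most successful" could look like, here is a rough sketch. It is not acmez's actual code: `Tracker`, `Record`, `Order`, and the smoothed scoring formula are assumptions made for illustration. The idea is to count outcomes per challenge type, sort by observed success rate, and break ties randomly.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// stats counts outcomes for one challenge type.
type stats struct{ successes, total int }

// Tracker remembers how often each challenge type has worked (hypothetical
// type for this sketch). It is not safe for concurrent use; a real client
// would need locking.
type Tracker struct {
	byType map[string]*stats
}

// NewTracker returns an empty Tracker.
func NewTracker() *Tracker { return &Tracker{byType: make(map[string]*stats)} }

// Record stores the outcome of one attempt with the given challenge type.
func (t *Tracker) Record(typ string, ok bool) {
	s := t.byType[typ]
	if s == nil {
		s = &stats{}
		t.byType[typ] = s
	}
	s.total++
	if ok {
		s.successes++
	}
}

// score is a smoothed success rate, (successes+1)/(total+2), so a type that
// has never been tried gets a neutral 0.5 instead of 0. The smoothing is an
// assumption of this sketch.
func (t *Tracker) score(typ string) float64 {
	s := t.byType[typ]
	if s == nil {
		return 0.5
	}
	return float64(s.successes+1) / float64(s.total+2)
}

// Order returns a copy of types with the highest score first. The copy is
// shuffled before a stable sort, so types with equal scores come out in
// random order.
func (t *Tracker) Order(types []string) []string {
	out := append([]string(nil), types...)
	rand.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	sort.SliceStable(out, func(i, j int) bool { return t.score(out[i]) > t.score(out[j]) })
	return out
}

func main() {
	tr := NewTracker()
	tr.Record("tls-alpn-01", false) // e.g. TLS terminated by a CDN
	tr.Record("http-01", true)
	fmt.Println(tr.Order([]string{"tls-alpn-01", "http-01"}))
}
```

Before any history exists, every type scores 0.5, so the order is effectively random; after a few attempts, the type that keeps failing drifts to the back of the list while remaining available as a fallback.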