Caddy: Intelligent updating of SSL certs

Created on 23 Sep 2019 · 12 Comments · Source: caddyserver/caddy

1. What would you like to have changed?

I've been waiting for well over an hour for caddy to come back up after a restart because of the unintelligent way it handles updating certificates.

As you know, Linode often needs 15 minutes or so to propagate DNS changes. Because caddy wants to update certificates before it starts serving (even when it already has valid certificates), it's often offline for a minimum of 15 minutes.

To massively compound this issue, it also solves DNS challenges serially, so it needs an hour to complete four such DNS challenges instead of the 15 minutes it would take if they were handled in parallel.
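The arithmetic behind that complaint can be sketched as follows. This is a minimal illustration, assuming each DNS challenge is dominated by a fixed propagation wait (the 15-minute figure is the one quoted above). The function names are hypothetical, and this is not how Caddy or lego actually schedule challenges.

```python
def serial_wait(propagation_minutes):
    """Total wall-clock wait when challenges are solved one after another."""
    return sum(propagation_minutes)


def parallel_wait(propagation_minutes):
    """Total wall-clock wait when all challenges are started together."""
    return max(propagation_minutes, default=0)


# Four DNS challenges, each assumed to need about 15 minutes to propagate.
waits = [15, 15, 15, 15]
print(serial_wait(waits))    # 60
print(parallel_wait(waits))  # 15
```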

Furthermore, I'm close to hitting the Let's Encrypt limits because caddy has decided to update a whole bunch of certificates at the same time despite there being many weeks' difference in their expiry dates.

Caddy should be a lot smarter about updating certificates to avoid such extreme downtime and limit-induced errors.

It should start serving immediately using the valid certificates it already has and update certificates while it's running, instead of refusing to start serving until it's finished fetching certificates that do not need to be fetched right now. It should also update certificates at reasonable intervals, not all at once.

2. Why is this feature a useful, necessary, and/or important addition to this project?

Because a webserver that can take anywhere from 15 minutes to well north of an hour to start serving is not suitable for use in a production environment.

3. What alternatives are there, or what are you doing in the meantime to work around the lack of this feature?

Looking into using a different webserver. Caddy was supposed to take the hassle out of HTTPS, but this is a legitimately catastrophic issue.

4. Please link to any relevant issues, pull requests, or other discussions.

feature request

deanishe

Most helpful comment

Please try to understand: The problem isn't that caddy refused to start due to an error.

I am hearing you, believe me. :) I definitely understand the difference here.

The problem is that Caddy refuses to serve requests until it has updated certificates that do not need to be updated in order for Caddy to serve requests

But the certificates _do_ need to be updated, because they are expiring soon. Critically soon. And they failed for 3 weeks. How much longer would you like to have postponed the actual problem?

What it should have done is immediately start serving requests using the perfectly valid certs it already has and worry about updating them once it was up and running and doing its actual job. There is absolutely no reason it couldn't do exactly that.

On the contrary, if Caddy starts serving expiring/ed certificates that it knows will not renew, there are zero good outcomes: the error will continue to go undisclosed and TLS handshakes will fail or connections will be refused at some arbitrary time in the future. Instead of getting complaints that the server is aborting for "no good reason", we'll get complaints that Caddy's cert management "fails to renew certificates" (neither of which are accurate).

Yes, there might have been some issues updating certificates (I don't know—systemd only has the logs since the last reboot, which was today), but that's beside the point.

That's unfortunate, because it's really _not_ beside the point: it is the point.

Systemd absolutely can keep logs beyond reboots (I'm looking at some right now that go back to January last year, through dozens and dozens of reboots); if you have no logs before reboots, it's a system misconfiguration.

Caddy can also be configured to emit logs directly to persistent files (or to stdout/stderr) which persist beyond reboots.
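Whether the systemd journal is likely to survive a reboot can be checked with a rough heuristic like the one below. It assumes journald's default `Storage=auto` behaviour, where logs are kept on disk only if `/var/log/journal` exists. An explicit `Storage=` setting in `journald.conf` would override this, so treat the result as a hint rather than a definitive answer. `journal_looks_persistent` is a hypothetical helper name.

```python
from pathlib import Path


def journal_looks_persistent(root: str = "/") -> bool:
    """Heuristic: under Storage=auto, journald persists logs only if this directory exists."""
    return (Path(root) / "var" / "log" / "journal").is_dir()


if __name__ == "__main__":
    print("persistent journal directory present:", journal_looks_persistent())
```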

So, maybe we will never know. But I'm convinced that there would have been error messages in the logs for the last several weeks. Which leads me to this:

This is broken behaviour, tbh. Caddy should absolutely throw a bunch of dire warnings if there's a persistent problem with cert renewal.

You claim that giving 3 weeks of leniency is broken behavior, and that Caddy should "throw a bunch of dire warnings" -- yet that's exactly what it was doing: but they went unmonitored and now are gone.

There isn't enough information available to pinpoint the root cause.

So at this point---and I'm really really sorry to say this because I know this sucks---but given the circumstances, it could have been fixed by checking the logs.

Caddy 2 will probably be able to handle this differently because of how its configuration is loaded.

How on earth is unnecessarily shutting down a business for an extra week the "right" thing to do?

I didn't realize you were using this free software for your business! My bad. We recommend that businesses _always_ get a support contract so we can help ensure a proper setup of their web server and logs and prevent problems exactly like this one. I will have our sales guy get in touch to see what we can do for you.

mholt on 24 Sep 2019

❤2

All 12 comments

Hi, sorry that you're experiencing difficulties.

Allow me to be clear and blunt in the interest of helping most directly: it seems that just about every paragraph of your post represents a misunderstanding or an operator error contrary to best practices as documented. I'll do my best to help.

Because caddy wants to update certificates before it starts serving ... it's often offline for a minimum of 15 minutes.

Full stop for a moment -

I've been waiting for well over an hour for caddy to come back up after a restart

Don't stop the web server.
Don't stop the web server.
Don't stop the web server.

Please use SIGUSR1 for graceful, zero-downtime config reloads. This should go without saying, but if you stop the web server (no matter which web server you're using), your site will have downtime.
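For reference, a reload signal could be sent from a deploy script along these lines. This is a rough sketch, assuming a Unix host and that the Caddy process ID is already known (for example, read from a PID file). `request_reload` is a hypothetical helper, not part of Caddy.

```python
import os
import signal


def request_reload(pid: int) -> None:
    """Ask a running process to reload its config by sending it SIGUSR1 (Unix only)."""
    os.kill(pid, signal.SIGUSR1)
```

Calling `request_reload(pid)` has the same effect as running `kill -USR1 <pid>` from a shell. Note that it only helps while the process is still running, so it does not cover an unplanned reboot.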

To massively compound this issue, it also solves DNS challenges serially, so it needs an hour to complete four such DNS challenges instead of the 15 minutes it would take if they were handled in parallel.

There is some concurrency. However, this is not really a Caddy limitation; please see https://github.com/go-acme/lego/pull/237#issuecomment-450617110 for part of the discussion regarding the complexities of this issue.

Furthermore, I'm close to hitting the Let's Encrypt limits because caddy has decided to update a whole bunch of certificates at the same time despite there being many weeks' difference in their expiry dates.

Certain rate limits do not apply to renewals: https://letsencrypt.org/docs/rate-limits/

Caddy does not repeat renewal attempts for successfully-renewed-and-stored certificates, so the duplicate cert rate limit only applies if your storage is misconfigured (i.e. no write permissions, or wiping storage between runs).

Regardless of how similar their expiry dates, if all the certificates are needed and their expiration is near or past, Caddy must renew them in order to serve your sites properly and securely. This is true regardless of your web server or tooling: serving expired certificates is bad, whether it be sooner or later, the problems need to be solved.

Caddy should be a lot smarter about updating certificates to avoid such extreme downtime and limit-induced errors.

Any specific suggestions? Caddy is the "smartest" web server when it comes to certificates; by that I mean you'll be hard-pressed to find something better. We are always looking to make it better but need some specific ideas; simply saying "make it smarter" isn't productive.

In addition, this error is operator-caused, because the web server was stopped.

It should start serving immediately using the valid certificates it already has and update certificates while it's running

What should it do with the sites that don't have acceptable certificates in the meantime? If you give a web server a bad configuration at startup, and it can't do what you ask, that's an error that needs to be fixed.

It should also update certificates at reasonable intervals, not all at once.

It does renew them at reasonable intervals, unless you're asking for your sites to come online one-by-one over time? That's... weird, and confusing. Caddy's certificate maintenance happens in the background _while it is running_ but if you stop the web server then it has no choice but to "catch up" to what it couldn't do while it was stopped.

If you have a lot of domains, I recommend you use a wildcard certificate instead.

If a wildcard certificate can't be used, then LE's "Certificates per Registered Domain" rate limit doesn't apply to you because they must be different domains in that case.

mholt on 23 Sep 2019

❤1

Don't stop the web server.
Don't stop the web server.
Don't stop the web server.

This is impossible advice, tbh. The server had a problem and was automatically rebooted, which isn't exactly a one-in-a-million corner case, is it?

Caddy must renew them in order to serve your sites properly and securely.

It doesn't need to renew them on start-up before it starts serving, though, does it? Not when it's already got certs that are weeks from expiry.

This is true regardless of your web server or tooling

It's true that they need to renew certs occasionally, yes. It is most definitely not true that they refuse to serve any requests for an hour while they replace perfectly valid certs, which is the real issue here.

serving expired certificates is bad, whether it be sooner or later, the problems need to be solved.

I was quite explicit that caddy already had valid certificates.

Any specific suggestions?

As I stated in my first post (which you apparently didn't read very closely), caddy should start serving immediately when it already has valid certificates, even if they're only valid for another few hours.

Instead, it refuses to serve any requests until it's updated certs that are weeks from expiry.

Obviously, a bunch of Linode DNS challenges represent something of a worst-case-scenario, but one way or another, caddy's refusing to serve requests until it has replaced valid certificates results in unnecessary downtime.

What should it do with the sites that don't have acceptable certificates in the meantime?

Not relevant to this issue, tbh. Caddy had valid certificates.

If you give a web server a bad configuration at startup

I didn't give caddy a bad configuration. It behaved badly of its own accord.

That's... weird, and confusing.

It's not weird and confusing. You're merely proceeding from the incorrect assumption that I'm an idiot, as evidenced by your blaming it on "operator error" instead of, you know, acting in good faith and thinking.

but if you stop the web server then it has no choice but to "catch up" to what it couldn't do while it was stopped.

It was down for five minutes while the server rebooted. How much "catching up" could there possibly be to do?

If you have a lot of domains, I recommend you use a wildcard certificate instead.

I am using wildcards for the domains I control. That's why I had to wait over an hour for caddy to do four DNS challenges.

deanishe on 23 Sep 2019

👎2

This is impossible advice, tbh. The server had a problem and was automatically rebooted, which isn't exactly a one-in-a-million corner case, is it?

No, you're right, we need to handle that better, and I'm planning on it in Caddy 2 (which is already in beta! so it's not too far away).

It doesn't need to renew them on start-up before it starts serving, though, does it? Not when it's already got certs that are weeks from expiry.

Actually, at startup, Caddy doesn't error-out for certificates that are more than just 1 week from expiration:

https://sourcegraph.com/github.com/mholt/certmagic@ecf5d6b59edba20b5c963c92ff342d43e94114d5/-/blob/maintain.go#L457
https://sourcegraph.com/github.com/mholt/certmagic@ecf5d6b59edba20b5c963c92ff342d43e94114d5/-/blob/maintain.go#L169-178

That is very important; more on this below.

I was quite explicit that caddy already had valid certificates.

We've discussed this _at length_ before; there is important context to understand: https://github.com/caddyserver/caddy/issues/1680

As I stated in my first post (which you apparently didn't read very closely)

I did read closely. In fact, I'm answering like this because I have been down this road before and have dealt with similar issues previously. I'm giving you a detailed explanation of what is going on that is only possible _because_ I understand your issue so well.

caddy should start serving immediately when it already has valid certificates, even if they're only valid for another few hours.

This is where you must be misunderstanding something (or we have a bug, in which case, this needs to become a bug report, with complete logs, actual and expected behavior, etc, with no redactions).

It is important to understand this:

If Caddy refuses to start due to an error maintaining certificates for names it already has certificates for, then it means Caddy has already been trying to renew your certificates for _at least 3 weeks_ to no avail. Either that, or use of the domain names in the configuration was discontinued during that time. Or the server was just off completely during that time.

If the troublesome names were in fact active during the last 3-4 weeks and the web server was running during that time, then I bet if you go into your logs, you should see errors or some indication of a problem in maintaining certificates for those names.

I didn't give caddy a bad configuration. It behaved badly of its own accord.

OR -- more likely -- it is alerting you to a problem which had previously gone unnoticed, which is not bad behavior.

It's not weird and confusing. You're merely proceeding from the incorrect assumption that I'm an idiot, as evidenced by your blaming it on "operator error" instead of, you know, acting in good faith and thinking.

Please don't misunderstand. Like I said, I've gone down this road before. I had to deal with this same complaint during one of my most stressful days of grad school. Don't take it personally. There are factors that any software cannot account for alone; "operator" doesn't necessarily mean you -- any number of changes made by anyone to a network or piece of infrastructure can cause ACME failures, including those outside of your own control. This is why Caddy tries to renew 30 days out and only refuses to start when ACME operations produce errors 7 days out. It gives you time to find the problems and assess the situation. I know of many small/personal and larger/business deployments that have relied on this fact to their benefit. And this change came about _because_ in that issue I had to first squeeze the constructive juices out of the complaints and noise (which is what we're doing here); once we did that, a much better result came of it.
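The startup policy described here can be summarised in a small decision function. This is a sketch of the behaviour as stated in this thread (renew from 30 days out, treat a failed renewal as fatal only within 7 days), not a transcription of certmagic's code. The names, return strings, and exact boundary handling are assumptions made for illustration.

```python
from datetime import timedelta

RENEW_WINDOW = timedelta(days=30)  # renewal is attempted inside this window
FATAL_WINDOW = timedelta(days=7)   # a failed renewal at startup is fatal inside this one


def startup_outcome(time_left: timedelta, renewal_succeeds: bool) -> str:
    """Return what happens at startup for one certificate, per the thread's description."""
    if time_left >= RENEW_WINDOW:
        return "serve"  # not yet due for renewal
    if renewal_succeeds:
        return "renew, then serve"  # renewal runs before serving begins
    if time_left > FATAL_WINDOW:
        return "log error, serve old cert"  # failure tolerated for now
    return "refuse to start"
```

Note that in this model a certificate with 29.5 days left is renewed before serving begins, which is the wait the issue is about, while a refusal to start happens only when a renewal fails within the final week.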

It was down for five minutes while the server rebooted. How much "catching up" could there possibly be to do?

3 weeks or more, depending on how long errors have been occurring during certificate maintenance.

Anyway, if you post the full, unredacted process logs from the past 30 days (90 would be ideal, so we can be sure to see when the last successful cert operation was), then I'm sure we can get to the bottom of this more quickly.

mholt on 23 Sep 2019

If Caddy refuses to start due to an error maintaining certificates

Please try to understand: The problem isn't that caddy refused to start due to an error. The problem is that Caddy refuses to serve requests until it has updated certificates that do not need to be updated in order for Caddy to serve requests:

caddy[18081]: 2019/09/23 11:40:49 ... expires in 709h11m17.828879915s; attempting renewal
caddy[18081]: 2019/09/23 11:40:57 ... expires in 709h9m53.748007937s; attempting renewal
caddy[18081]: 2019/09/23 11:41:08 ... expires in 709h9m58.689769716s; attempting renewal
caddy[18081]: 2019/09/23 11:41:14 ... expires in 709h24m57.114919533s; attempting renewal
caddy[18081]: 2019/09/23 11:49:32 ... expires in 709h30m56.630442273s; attempting renewal
caddy[18081]: 2019/09/23 12:04:29 ... expires in 708h32m28.933654514s; attempting renewal
caddy[18081]: 2019/09/23 12:04:36 ... expires in 708h46m23.630403797s; attempting renewal
caddy[18081]: 2019/09/23 12:04:42 ... expires in 708h31m58.96196229s; attempting renewal
caddy[18081]: 2019/09/23 12:20:50 ... expires in 708h1m25.838814885s; attempting renewal
caddy[18081]: 2019/09/23 12:20:57 ... expires in 708h31m26.404918827s; attempting renewal
caddy[18081]: 2019/09/23 12:21:04 ... expires in 708h1m20.727002701s; attempting renewal
caddy[18081]: 2019/09/23 12:21:12 ... expires in 708h30m45.489772165s; attempting renewal
caddy[18081]: 2019/09/23 12:21:21 ... expires in 708h15m28.253586444s; attempting renewal
caddy[18081]: 2019/09/23 12:21:29 ... expires in 708h44m50.184341994s; attempting renewal
caddy[18081]: 2019/09/23 12:21:40 ... expires in 708h30m0.687983647s; attempting renewal
caddy[18081]: 2019/09/23 12:21:47 ... expires in 708h30m28.748713595s; attempting renewal
caddy[18081]: 2019/09/23 12:34:41 ... expires in 708h17m8.105060096s; attempting renewal
caddy[18081]: 2019/09/23 12:34:50 ... expires in 708h16m42.804472251s; attempting renewal
caddy[18081]: 2019/09/23 12:34:55 ... expires in 708h17m35.30678742s; attempting renewal
caddy[18081]: 2019/09/23 12:35:02 ... expires in 708h16m21.346697059s; attempting renewal
caddy[18081]: 2019/09/23 12:35:12 ... expires in 708h15m30.446041645s; attempting renewal
caddy[18081]: 2019/09/23 12:35:21 ... expires in 708h45m24.802825846s; attempting renewal
caddy[18081]: 2019/09/23 12:35:29 ... expires in 708h15m45.515672601s; attempting renewal
caddy[18081]: 2019/09/23 12:35:36 ... expires in 708h1m30.466095608s; attempting renewal
caddy[18081]: 2019/09/23 12:35:43 ... expires in 707h46m49.006811679s; attempting renewal
caddy[18081]: 2019/09/23 12:35:52 ... expires in 708h14m34.271097476s; attempting renewal
caddy[18081]: 2019/09/23 12:50:31 ... expires in 708h30m5.199944378s; attempting renewal
caddy[18081]: 2019/09/23 12:50:41 ... expires in 707h59m53.93006131s; attempting renewal
caddy[20885]: 2019/09/23 13:01:34 ... expires in 719h34m4.253208351s; attempting renewal
caddy[21016]: 2019/09/23 13:05:43 ... expires in 719h44m49.886556212s; attempting renewal
caddy[21166]: 2019/09/23 13:06:35 ... expires in 719h43m57.689160035s; attempting renewal
caddy[21559]: 2019/09/23 13:07:36 ... expires in 719h59m4.661532208s; attempting renewal
caddy[21559]: 2019/09/23 13:50:42 ... expires in 718h59m50.442869687s; attempting renewal
caddy[21559]: 2019/09/23 13:50:54 ... expires in 718h44m44.760484554s; attempting renewal

Over two hours it took Caddy to start serving requests for literally no good reason.

What it should have done is immediately start serving requests using the perfectly valid certs it already has and worry about updating them once it was up and running and doing its actual job. There is absolutely no reason it couldn't do exactly that.

Yes, there might have been some issues updating certificates (I don't know—systemd only has the logs since the last reboot, which was today), but that's beside the point.

The problem is that Caddy refuses to serve requests until it performs renewals that are _not necessary_ for it to serve requests.

I had to deal with this same complaint during one of my most stressful days of grad school.

Yeah, and unfortunately you don't seem to have learnt much from it despite all the valid criticism and thumbs-down…

A webserver with aspirations of a role in production must, first and foremost, serve requests. Under no circumstances is it acceptable for it not to do so unless it absolutely can't, nor for an issue with one site/cert to interfere with others it's serving.

only refuses to start when ACME operations produce errors 7 days out. It gives you time to find the problems and assess the situation

This is broken behaviour, tbh. Caddy should absolutely throw a bunch of dire warnings if there's a persistent problem with cert renewal.

But it must absolutely not refuse to start or serve a site unless it genuinely can't. How on earth is unnecessarily shutting down a business for an extra week the "right" thing to do?

deanishe on 24 Sep 2019

👎1

@deanishe Your attitude and tone are inappropriate. This is an open source project and @mholt is under no obligation to help you. He's doing the best he can without pay. If this is so business-critical, please consider paying for support https://caddyserver.com/products/licenses

francislavoie on 24 Sep 2019

I'll tell ya what though, because config in Caddy 2 is loaded async, what we can do is change it so that the behavior of certificate management (ACME) operations is more configurable.

It's still unclear exactly how that would look/work, but at least we don't have to act as if the operator is present in all cases at startup (again, because of how config is loaded differently in Caddy 2 vs. Caddy 1). What do you think of that?

mholt on 24 Sep 2019

Also, it occurred to me that you should be receiving emails from Let's Encrypt if your certs get close to expiring (this is why we recommend always providing an email address when running caddy the first time). Did you not get any? If not, we should look into why and make sure to fix that.

mholt on 24 Sep 2019

Your attitude and tone are inappropriate.

You're right. I'm sorry @mholt.

But the certificates do need to be updated, because they are expiring soon. Critically soon.

You didn't look at the logs I posted, did you?

The oldest certificate was 29.5 days from expiry. There had been no problems with renewals. No "operator error". They simply hadn't been due for renewal.
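The figures in the log excerpt earlier in the thread can be checked with a short script. This is a minimal sketch that handles only the `NhNmNs` form seen in those log lines, not every duration format Go can print. `days_until_expiry` is a hypothetical helper.

```python
import re

# Matches durations such as "709h11m17.828879915s"; each component is optional.
_DURATION = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?")


def days_until_expiry(duration: str) -> float:
    """Convert a Go-style duration such as '709h11m17.8s' into days."""
    match = _DURATION.fullmatch(duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return (hours + minutes / 60 + seconds / 3600) / 24


# Three samples copied from the log lines quoted in this thread.
for sample in ("709h11m17.828879915s", "707h46m49.006811679s", "719h59m4.661532208s"):
    print(f"{sample}: {days_until_expiry(sample):.2f} days")
```

Every sample comes out just under 30 days, which matches both the 29.5-day figure and the 30-day renewal window mentioned in the thread.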

deanishe on 25 Sep 2019

❤1

@deanishe I did look at the logs, but they're mostly snippets with parts missing, so I really have no idea what is going on. What are the full logs?

The oldest certificate was 29.5 days from expiry.

Caddy will attempt to renew certificates that are less than 30 days to expiration. _Failures_ of renewals at startup are only fatal if the certificate is a few days from expiration; hence failures renewing for the last ~3 weeks already.

There had been no problems with renewals.

Are you sure? How do you know?

mholt on 25 Sep 2019

What are the full logs?

IRRELEVANT. Will you please stop trying to pretend that Caddy's poor behaviour is something else's fault.

Failures of renewals at startup are only fatal if the certificate is a few days from expiration

FFS, please try to understand before one of us dies.

The problem is not that Caddy refuses to start at all. The problem is that it does not start serving requests until it has finished updating certificates that are still very, very valid. 29-point-something days and only just qualifying for a renewal in this case.

I didn't realize you were using this free software for your business!

I'm not. But you're trying to sell to businesses, and your dissembling in this issue is not exactly confidence-inspiring, is it?

deanishe on 26 Sep 2019

Okay, well, this thread has not been constructive, and there's not enough information here to fully understand the problem and make the nuanced changes in Caddy that would be required to satisfy your demands. So I'm going to close this.

I'm not. But you're trying to sell to businesses, and your dissembling in this issue ...

I'm not concealing anything. I'm the one asking for more logs so I can understand what happened, but you refuse to provide them.

is not exactly confidence-inspiring, is it?

I think this goes to show that my approach to debugging is more rigorous than what you want, and I'm sorry that's so frustrating for you. But I've taken more relaxed approaches before and caused everyone including me a lot more grief, so I've learned I can't do that anymore.

I am confident in the project's reliability after so many community contributions over the years. It probably still has bugs, so that's what I'm trying to understand here so we can fix them, but I can't do that if you insist on withholding the required information.