Caddy: Caddy unusable when acme server is down

Created on 19 May 2017  ·  42 Comments  ·  Source: caddyserver/caddy

1. What version of Caddy are you using (caddy -version)?

0.10.0

2. What are you trying to do?

Start caddy

3. What is your entire Caddyfile?

(any caddyfile with tls enabled)

4. How did you run Caddy (give the full command and describe the execution environment)?

/opt/caddy/caddy --log stdout --agree=true --conf=/opt/caddy/Caddyfile --root=/var/tmp [email protected]

7. What did you see instead (give full error messages and/or log)?

Activating privacy features...2017/05/19 09:47:19 get directory at 'https://acme-v01.api.letsencrypt.org/directory': failed to get json "https://acme-v01.api.letsencrypt.org/directory": Get https://acme-v01.api.letsencrypt.org/directory: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

8. How can someone who is starting from scratch reproduce the bug as minimally as possible?

With any basic caddyfile that has tls on, caddy will fail if the url https://acme-v01.api.letsencrypt.org/directory fails to give a response before the timeout.

There doesn't appear to be a workaround for this. Caddy should ignore the error if a certificate is already present and valid.

discussion

Most helpful comment

I should be finishing my paper for NIPS that is due at 1pm today.

Caddy requires certificates for sites that do not have one, have only an expired one cached locally, or have one cached locally that is expiring soon. Caddy's renewal window is 30 days before expiration.

In 410ece8 I've changed the "fatal" renewal window to 7 days. This gives you about 3 weeks of downtime or blockage before Caddy will refuse to start.

I'm rolling out an emergency release 0.10.3 in a few minutes.

All 42 comments

I worked around this by temporarily specifying -ca https://acme-staging.api.letsencrypt.org/directory, which is currently up and running...

Hi Jerome, thanks for the question. When a CA is down, that's really a problem, because it means Caddy can't obtain credentials it requires to serve your site securely. And serving a site insecurely is a bad idea.

There doesn't appear to be a workaround for this.

There are several, they're just not great. One is to provide your own certificates with the tls directive. Another is to disable HTTPS entirely by specifying your site address with http:// in it. Another is to, as you said, change CAs. However, changing to the staging CA provides an invalid certificate that isn't trusted.
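To make the first two workarounds concrete, here is a Caddyfile sketch (the domain and certificate paths are placeholders, not values from this thread):

```
# Workaround 1: supply your own certificate and key (hypothetical paths)
example.com {
    tls /etc/ssl/example.com.crt /etc/ssl/example.com.key
}

# Workaround 2: serve the site over plain HTTP, which disables automatic HTTPS
http://example.com {
    root /var/www/example
}
```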

Caddy should ignore the error if a certificate is already present and valid.

I disagree, this is a security and uptime issue that demands your attention.

So, this is not a bug and all is working as intended.

As someone considering using Caddy...

@mholt are you saying that Caddy needs to redownload the private certificate material every time it starts, even if it's within the validity window of a previous issuance? That it grabs and stores them in memory at start-up?

I disagree, this is a security and uptime issue that demands your attention.

I think the underlying problem is not a lack of security, the certificate is already present and valid.

I suggest that if there is a minimum amount of time left on the certificate (for example, at least 21 days), then Caddy can safely continue operating with the existing certificates.

The site can be securely served for some time, even without the ACME provider being online, and I see no problem with ignoring the error as long as sufficient time is left on the cert.

Otherwise, the uptime of the backend is directly tied to the uptime of the ACME provider, if they go down, Caddy goes down.

At minimum the error of an unavailable ACME provider should not be fatal one if a valid cert is present for all sites with a minimum lifetime of 21 days. There is no security issue in that situation as long as no new certificates are required.

@mholt Caddy behaves properly as long as the acme server was up when started; it doesn't need a permanent connection to the acme server, nor should it if the certs have already been established.

Please understand this makes Caddy impossible to start if the ACME server is down.

@mholt Agreed with the others. If the local cert is still valid, and the OCSP data is also current (I think it's refreshed weekly?), then why shouldn't caddy continue to start and serve up using them?

The CA should only matter at the time of renewal of either OCSP or the cert.

This way, Caddy could start and continue to retry in the background, like it does normally with renewals.

I disagree, this is a security and uptime issue that demands your attention.

What nonsense is this? A CA being down for a few minutes does not invalidate already cached certificates. Those are still perfectly valid for a few more weeks. Use them.

Until this is addressed I seriously can't imagine ever choosing Caddy for anything again. Simply ridiculous.

This really impacts my perception of Caddy as production-ready software.

@r04r @devlinzed agreed. I'm using it on two production sites now. And I was already bitten: the server wouldn't start because one of the sites couldn't get an LE cert (it was an existing site moved to a new server; DNS hadn't propagated), and all sites (even unrelated ones) went down as a result, since the whole Caddy server refused to start.

The tradeoffs being made here are a red flag for any web server that aims to be production-ready.

Back to nginx we go. Thanks!

Anyone looking for an alternative solution, which can issue LE certificates on the fly: take a look at the openresty project in combination with lua-resty-auto-ssl. We've been using it in production for a couple of months now.

Caddy should ignore the error if a certificate is already present and valid.

I disagree, this is a security and uptime issue that demands your attention.

@mholt As long as the local cert is valid, what's the security issue?

I disagree, this is a security and uptime issue that demands your attention.

Can you explain in a little more detail? I'm kind of on the fence here.

If you have a valid (non-expired) certificate, that vhost should be safe to start. If you need to fetch a new one, then that is definitely worth erroring out and refusing to start, until it's resolved.

Otherwise, you risk turning LetsEncrypt into a DoS amplification vector for nearly all Caddy users on the Internet.

My database doesn't blow up when I only have 300 MB of disk space left on the data volume, so why should my web server stop working when there are X days left on my local cert because the CA is down? This position is absurd and logically indefensible.

@GiorgioG You keep commenting without giving @mholt a chance to respond, and your latest comment seems increasingly agitated.

Maybe take a break from the computer for a bit? Cooler heads prevail.

Geez I wonder why I might be agitated...production site down because of a non-critical issue. I'll file this under 'Security theater' and switch back to nginx. Thanks... see ya.

I should be finishing my paper for NIPS that's due today, but since this is garnering a LOT of attention from HN, I should clarify some things.

@tobz

are you saying that Caddy needs to redownload the private certificate material every time it starts, even if it's within the validity window of a previous issuance?

No, definitely not. Caddy does _not_ require access to a CA while the certificate is valid and not expiring. If it is expiring soon (30 days), Caddy will attempt to renew so that your site will stay online.

That it grabs and stores them in memory at start-up?

No, they're stored on disk first, then loaded into memory.

If the local cert is still valid, and the OCSP data is also current (I think it's refreshed weekly?), then why shouldn't caddy continue to start and serve up using them?

Because there is an error that _will cause your site to go down_ and while the server is still starting, the operator (you) are there to handle the error. Caddy will not serve a site that it believes will go down while you are there to address the issue.

What if the certificate expires in 20 minutes or 20 seconds instead of 20 days? Where do you draw the line? We draw it at a conservative 30. Wherever we draw it, people are going to get bit when their CA is down.

It's important to note:

  • Caddy doesn't take your site offline because it can't renew a certificate. If you're starting Caddy for the first time, there _are_ no sites to take offline because they were already _not online._

  • If you're "restarting" Caddy by killing the process and restarting it, which _does_ take all your sites down, _stop it_ and use signal USR1, which is a graceful, zero-downtime restart that only applies successful reloads: https://caddyserver.com/docs/cli#usr1
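Concretely, a graceful reload might look like this (assuming a typical Linux setup; whether `systemctl reload` maps to USR1 depends on your service file):

```
# Graceful, zero-downtime config reload instead of a hard restart:
kill -USR1 "$(pidof caddy)"

# Or, under a systemd unit whose ExecReload sends USR1:
systemctl reload caddy
```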

I'm not a fan of this decision, but if you're down right now, you should be able to copy the cert and key from the .caddy directory and change to the manual setup of "tls /path/to/cert /path/to/key".

The Let's Encrypt OCSP responders were also having trouble. Note that Caddy is the only web server to staple OCSP by default. OCSP stapling errors are _not_ "fatal" to Caddy. Further, Caddy stores OCSP staples to disk to be able to weather downtime like this gracefully. Your OCSP is in better hands with Caddy than, say, a default nginx or Apache configuration. (Caddy checks OCSP every hour and updates it halfway through the validity window.)

What if the certificate expires in 20 minutes or 20 seconds instead of 20 days? Where do you draw the line? We draw it at a conservative 30. Wherever we draw it, people are going to get bit when their CA is down.

You could make it configurable so that people can draw the line for themselves and won't have a reason to complain to you

What if the certificate expires in 20 minutes or 20 seconds instead of 20 days? Where do you draw the line? We draw it at a conservative 30. Wherever we draw it, people are going to get bit when their CA is down.

You draw the line somewhere around what is a sensible downtime to expect from the ACME server. Let's Encrypt being down for 30 days is not something we should expect to happen.

If you're "restarting" Caddy by killing the process and restarting it,

There might be reasons to restart the whole machine, though.

@budgetneon

I'm not a fan of this decision, but if you're down right now, you should be able to copy the cert and key from the .caddy directory and change to the manual setup of "tls /path/to/cert /path/to/key".

No need to copy them, you can specify the paths directly as they are in $CADDYPATH (default ~/.caddy). I agree with you though, this is not a great idea in the long run.
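As a sketch only, the managed certificate and key could be referenced in place. The directory layout below is an assumption about Caddy 0.10.x's $CADDYPATH storage and may differ by version, so check your own ~/.caddy before relying on it:

```
example.com {
    tls /home/user/.caddy/acme/acme-v01.api.letsencrypt.org/sites/example.com/example.com.crt /home/user/.caddy/acme/acme-v01.api.letsencrypt.org/sites/example.com/example.com.key
}
```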

@mholt thank you for reacting so fast <3

@stephenwilliams

You could make it configurable so that people can draw the line for themselves and won't have a reason to complain to you

People will always have a reason to complain. If the window was 7 days instead of 30, it'd just be a different subset of users complaining.

Replace Caddy with NGINX

@Chronial

You draw the line somewhere around what is a sensible downtime to expect from the ACME server. Let's Encrypt being down for 30 days is not something we should expect to happen.

Exactly; which is why it's a conservative window. Note that downtime is not the only problem. We want to be more resistant to blocking attacks whereby packets between your server and the CA are blocked entirely. The attack would have to last a full 30 days to be successful.

@atonse

DNS hadn't propagated

You can use the DNS challenge, no DNS propagation necessary.
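A hedged sketch of what that could look like in a Caddyfile, using Cloudflare purely as an illustration (the provider name and the credential environment variables are assumptions; check the docs for your DNS provider):

```
example.com {
    tls {
        dns cloudflare
    }
}
```

The provider plugin typically reads API credentials from environment variables (for Cloudflare, something like CLOUDFLARE_EMAIL and CLOUDFLARE_API_KEY, assumed here).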

Because there is an error that will cause your site to go down and while the server is still starting, the operator (you) are there to handle the error. Caddy will not serve a site that it believes will go down while you are there to address the issue.

That's a reasonable position, but it does seem like there should be a way to override that in case the error is temporary, occurring for reasons beyond your control (i.e. Let's Encrypt being down), and you really do need to start the server _right now_ (e.g. because your site is currently experiencing downtime).

I know this probably won't get through to @GiorgioG at all, because arguing with an angry person almost never works, but I want to highlight for the rest of the community that this sort of behavior basically amounts to emotional blackmail.

If the window was 7 days instead of 30, it'd just be a different subset of users complaining.

Yep, totally. But if you let them specify the window, you'd only have the fringes of both subsets complaining. (If you default to 30, you'll probably see even less.)

This thread has gotten very distracted.

Nobody on this thread, to clarify, exactly zero people, is arguing that Caddy should serve invalid TLS certs, etc. People are arguing that if there is a valid cert and valid OCSP data (even if it's only valid for 10 seconds), Caddy should not refuse to start.

To your question about 20 minutes or 20 seconds: yes, it should serve the site for 20 minutes, and then stop. By then, who knows, maybe the CA will come back up. The point is, just because the cert may expire some time in the FUTURE doesn't make it any less valid at present, even if only for another 10 seconds.

If you have valid credentials, serve the site. If the credentials aren't valid, don't serve the site. What exactly is the debate here about 20 minutes, 20 hours, or 20 days?

Sorry @mholt, I didn't imagine the HN post would garner this much attention when I linked this there, didn't mean to drop a bombshell :/

If you're "restarting" Caddy by killing the process and restarting it, which does take all your sites down, stop it and use signal USR1 which is a graceful, zero-downtime restart that only applies successful reloads: https://caddyserver.com/docs/cli#usr1

Right, I got unlucky. My deployment uses Ansible's service module on Debian, which is an abstraction for systemctl restart caddy.service. The service file I'm using is the upstream one, which specifies ExecReload=/bin/kill -USR1 $MAINPID. So I do use USR1; however, this came out of a server hardware downgrade, which required a reboot. Therefore, Caddy had to boot from scratch.

Which brings me to the following point:

the operator (you) are there to handle the error

This is simply not true for larger-scale deployments, and if Caddy wants to accommodate those you cannot rely on that at all. Additionally, there is nothing to "handle"; if the acme server is down, there's just nothing anyone can do short of waiting for it to come back up. Using the ACME staging server was a shot in the dark and I was lucky it worked without serving bad certs :)

I should be finishing my paper for NIPS that is due at 1pm today.

Caddy requires certificates for sites that do not have one, have only an expired one cached locally, or have one cached locally that is expiring soon. Caddy's renewal window is 30 days before expiration.

In 410ece8 I've changed the "fatal" renewal window to 7 days. This gives you about 3 weeks of downtime or blockage before Caddy will refuse to start.

I'm rolling out an emergency release 0.10.3 in a few minutes.

@mholt Thanks so much!

Even among commercial software it's rare for _paying_ customers to be able to directly contact the primary developer of the software at all, let alone have a conversation with them and get a fix released in under 6 hours. This is some pretty amazing turnaround time, especially considering Caddy is completely free. So again, thanks a lot!

You're welcome. I'm sorry for the trouble.

Now I'm going back to my NIPS paper.

@enilfodne The theory is that, because Caddy starts attempting to auto-renew certs within 30 days of their expiration, the ACME server would have to be down for three weeks in order for the situation to arise.

In practice there's always the possibility that caddy was offline/unused all that time, but most of the time it matches up with the expectation.

@enilfodne

if the server is restarted/started with an LE cert that has less than 7 days left until renewal, the same issue will manifest if the LE infrastructure has issues.

If Caddy was running before, it should have renewed the certificate in the 3 weeks prior. If not, either the blocking attack or downtime is so long it's basically hopeless for the next 7 days anyway OR your site has been down for 3+ weeks already.

This effectively cuts the certificate lifetime by 7 days and puts strain on LE's infrastructure for no arguable increase in certificate safety.

No, because Caddy renews certificates 30 days out when Caddy is running. If it gets to the point where it still needs to renew and it's only 7 days out, then something is seriously, seriously wrong. Either you need a new CA or you're under attack. Both demand your direct attention. And note this only applies to process startup, not continuous running and not USR1 reloads, which are graceful (use them!).

Always using the existing files and periodically running upgrade (in the background, not on startup) to "refresh" the certificates.

This is exactly what Caddy does, in addition to checking on startup. That check at startup is essential.

I expect everyone in this thread is interested in the future of Caddy.

Not everyone, unfortunately. Some people are "switching back to nginx" to try to make a point. 🤷‍♂️

@mholt focus on NIPS, you've done more than enough here!

@mholt Because servers never reboot, right? Why is this issue still closed?

@r04r Explain. What's the problem?

I should be finishing my paper for NIPS that's due today

What's the paper about? No offense, never perceived you as a NIPS guy. 👍🏻

Wow, I'm impressed. @mholt closed an issue, fought off the unreasonable people, discussed with the reasonable people, allowed them to change his mind, pushed an emergency release with a great design that makes everybody happy and does not impact security in any way, and just about finished that NIPS paper - all in the span of a few hours.

This really impacts my perception of Caddy as production-ready software.

Just to add my 2¢ here:

  • We've been using Caddy in production for more than a year. It has provided us great service and helped us avoid all the hassle of doing certificate management manually/properly, which avoided costs, preserved our sanity, and allowed us to rapidly deploy a lot of infrastructure from scratch. In case our 2nd funding round works out as expected, Caddy is my top-priority FLOSS project to receive a donation.
  • Caddy isn't perfect, but which software is? At least I haven't found software before that puts so much focus on "simply works" while still providing a lot of power and flexibility. The bugs/issues we ran into were negligible and far below the baseline of what I expected from such young software.
  • The support through GitHub issues, Gitter discussions, etc. was awesome, always helpful, and based on technical, not ideological, talking points.
  • Stuff like "I'm back to nginx" helps no-one and just creates useless tension in such a discussion.

Now for a technical comment:
In our case, we're using Caddy (amongst other scenarios) as TLS endpoint/load-balancer which does nothing but to serve a lot of different domains and route them through to various backends (most of them fronted by Caddy as well where it handles rewrites etc.).
The configuration is generated through our config management, so it changes relatively often, which also means: new hosts/domains are added on a regular basis.
In this case, the new domains didn't have any pre-existing certificates yet, so Caddy failed completely on startup (yes, I know about the SIGUSR1 stuff, but we also have to consider the reboot scenario, and by default our config management also used restart instead of reload).

What I'd love to see: being able to configure Caddy in a way to startup despite a few hosts failing to acquire their certificate via ACME, so at least those with an existing certificate keep being properly served while the remaining ones just log their issues and fail gracefully, without tearing the whole Caddy process down, e.g.:

tls {
  fail_mode log|error
}

@mholt Thanks a lot for your work, comments and the release! Good luck with your paper!

@eliasp You should probably open a separate issue for that.

