Caddy: Suggestion - Raise the default timeout in Caddy's graceful shutdown

Created on 28 Jul 2017 · 19 Comments · Source: caddyserver/caddy

1. What version of Caddy are you using (caddy -version)?

Caddy 0.10.4

2. What are you trying to do?

Use Caddy as a Kubernetes ingress controller & load balancer

3. What is your entire Caddyfile?

localhost:81 {
  status 200 /status
}

:12015 {
  status 200 /healthz
}

myhost.mywebsite.com/ {
  log / stdout "{combined} {upstream}"
  tls {$ACME_EMAIL} {
    # https://github.com/mholt/caddy/issues/189
    alpn http/1.1
  }
  proxy / {
    policy round_robin
    header_upstream -X-Forwarded-For
    keepalive 0

    upstream 10.48.75.4:8080
    upstream 10.48.76.3:8080
    upstream 10.48.72.4:8080
    upstream 10.48.70.3:8080
    upstream 10.48.49.4:8080
    upstream 10.48.73.3:8080
    upstream 10.48.71.3:8080
  }
}

4. How did you run Caddy (give the full command and describe the execution environment)?

Running via https://github.com/wehco/caddy-ingress-controller

/usr/bin/caddy -conf /etc/Caddyfile -log stdout

5. Please paste any relevant HTTP request(s) here.

68.xx..xx.xx - - [28/Jul/2017:15:58:20 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.66.3:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:21 +0000] "GET /redacted HTTP/1.1" 200 97 "-" "http-kit/2.0" http://10.48.49.4:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:23 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.56.3:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:33 +0000] "GET /redacted HTTP/1.1" 200 98 "-" "http-kit/2.0" http://10.48.70.3:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:38 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.59.5:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:51 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.68.5:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:53 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.57.3:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:54 +0000] "GET /redacted HTTP/1.1" 200 97 "-" "http-kit/2.0" http://10.48.49.4:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:54 +0000] "GET /redacted HTTP/1.1" 200 97 "-" "http-kit/2.0" http://10.48.49.4:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:54 +0000] "GET /redacted HTTP/1.1" 200 107 "-" "http-kit/2.0" http://10.48.73.3:8080
68.xx..xx.xx - - [28/Jul/2017:15:58:54 +0000] "GET /redacted HTTP/1.1" 200 105 "-" "http-kit/2.0" http://10.48.71.3:8080

6. What did you expect to see?

Caddy should only proxy to upstreams in the current Caddyfile after a SIGUSR1 reload

7. What did you see instead (give full error messages and/or log)?

Caddy proxies to upstreams that were present in previously loaded Caddyfiles but have since been removed by a reload.

8. How can someone who is starting from scratch reproduce the bug as minimally as possible?

Configure a Caddyfile to proxy round-robin to multiple upstreams, remove and/or add an upstream, and reload the Caddyfile with SIGUSR1. Repeat this a few times and observe 502s to the removed backends.
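For anyone reproducing this from a controller process, here is a minimal Go sketch of the write-then-signal cycle described above; the Caddyfile path, contents, and PID are placeholders, not the actual controller code.

package main

import (
    "log"
    "os"
    "syscall"
)

// reloadCaddy writes an updated Caddyfile and asks the running Caddy
// process to gracefully reload it via SIGUSR1.
func reloadCaddy(caddyfile []byte, pid int) error {
    // Placeholder path: the Caddyfile Caddy was started with.
    if err := os.WriteFile("/etc/Caddyfile", caddyfile, 0644); err != nil {
        return err
    }
    return syscall.Kill(pid, syscall.SIGUSR1)
}

func main() {
    cfg := []byte(":12015 {\n  status 200 /healthz\n}\n") // placeholder config
    if err := reloadCaddy(cfg, 1234); err != nil {        // 1234 is a placeholder PID
        log.Fatal(err)
    }
}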

feature request

Most helpful comment

I've at last resolved this issue. The graceful HTTP server shutdown was timing out: the default shutdown timeout (-grace) is 5s, while the upstream requests it was still proxying were hanging with a write timeout of 20s. Raising the -grace timeout to 30s resolved the issue, and the 502s to ghost upstreams never came back.

@mholt I think it would be a good idea to raise the default timeout in Caddy's graceful shutdown to 30s here: https://github.com/mholt/caddy/blob/master/caddyhttp/httpserver/plugin.go#L27, since in my opinion it should be at least as long as the default request timeout to prevent this kind of race/error.

All 19 comments

Hi, thanks for filling out the issue template. Can you also show the full process logs before and after the reload? (Run caddy with -log ... to get a process log written somewhere.)

Sure, give me some time to reset and collect, and I'll get a comprehensive log.

caddy-redacted.log.gz

@mholt The first 502 coincides with this error:

2017/07/28 19:40:39 [INFO] SIGUSR1: Reloading
2017/07/28 19:40:39 [INFO] Reloading
2017/07/28 19:40:40 http: Server closed
2017/07/28 19:40:40 http: Server closed
2017/07/28 19:40:40 http: Server closed
68.0.0.0 - - [28/Jul/2017:19:40:42 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.70.3:8080
68.0.0.0 - - [28/Jul/2017:19:40:42 +0000] "GET /redacted HTTP/1.1" 200 595 "-" "http-kit/2.0" http://10.48.75.4:8080
68.0.0.0 - - [28/Jul/2017:19:40:43 +0000] "GET /redacted HTTP/1.1" 200 98 "-" "http-kit/2.0" http://10.48.73.3:8080
2017/07/28 19:40:45 [ERROR] Stopping :443: context deadline exceeded
2017/07/28 19:40:45 [INFO] Reloading complete
2017/07/28 19:40:45 http: Server closed
68.0.0.0 - - [28/Jul/2017:19:40:48 +0000] "GET /redacted HTTP/1.1" 200 862 "-" "http-kit/2.0" http://10.48.72.4:8080
68.0.0.0 - - [28/Jul/2017:19:40:48 +0000] "GET /redacted HTTP/1.1" 200 595 "-" "http-kit/2.0" http://10.48.76.3:8080
68.0.0.0 - - [28/Jul/2017:19:40:53 +0000] "GET /redacted HTTP/1.1" 200 98 "-" "http-kit/2.0" http://10.48.71.3:8080
68.0.0.0 - - [28/Jul/2017:19:40:54 +0000] "GET /redacted HTTP/1.1" 200 885 "-" "http-kit/2.0" http://10.48.49.4:8080
68.0.0.0 - - [28/Jul/2017:19:40:54 +0000] "GET /redacted HTTP/1.1" 200 596 "-" "http-kit/2.0" http://10.48.75.4:8080
68.0.0.0 - - [28/Jul/2017:19:41:00 +0000] "GET /redacted HTTP/1.1" 200 888 "-" "http-kit/2.0" http://10.48.73.3:8080
68.0.0.0 - - [28/Jul/2017:19:41:00 +0000] "GET /redacted HTTP/1.1" 200 596 "-" "http-kit/2.0" http://10.48.72.4:8080
68.0.0.0 - - [28/Jul/2017:19:41:03 +0000] "GET /redacted HTTP/1.1" 200 98 "-" "http-kit/2.0" http://10.48.76.3:8080
68.0.0.0 - - [28/Jul/2017:19:41:04 +0000] "GET /redacted HTTP/1.1" 502 16 "-" "http-kit/2.0" http://10.48.70.3:8080

@sundbry It takes up to -grace seconds for the reload to finish; notice that reloading does not complete until 3 seconds after your request came in. If all your backends go down before reloading is complete, that would cause a 502, for sure. What's your Caddyfile before and after? From this I can't tell which backends were added/removed. And are all backends available before and after the reload?

@mholt Please take a look at the attached caddy-redacted.log.gz; these 502s continue long after the reload has settled (the snippet above is just the first instance of a 502 in the log). The log also includes the printed Caddyfile as it changes over time. You can see Caddy connecting to removed servers hours after they were removed and the config was reloaded.

@mholt I understand you run on support contracts, btw, not trying to weasel free support out of you here. The purpose of my effort is to make Caddy viable as a Kubernetes ingress controller in https://github.com/wehco/caddy-ingress-controller/, where backend instances will frequently come and go (you can see my recent contributions there).

I understand you run on support contracts, btw, not trying to weasel free support out of you here.

I know; don't worry, if it's public on GitHub issues it's fair game. If my reply seemed short or uninformed, it's because I've been out and about and didn't have a chance to inspect your logs, only to scan your initial post.

I've tried to recreate the issue using your instructions above, and I can't replicate it: no 502 errors, Caddy doesn't route to the hosts that have been removed, and it _does_ route to new hosts that were added.

However, I noticed your logs have something mine doesn't:

[ERROR] Stopping :443: context deadline exceeded

This indicates an error while reloading. I'm not positive, since I can't determine the origin of this error, but it's possible that the reload isn't actually finishing successfully (it reports "completing" even if it failed), meaning it's not applying the new configuration at all. The way reloads work is that Caddy sets up the new configuration entirely independently of the old one; if the new configuration fails for some reason, OR if stopping the old configuration fails, Caddy should continue running as if nothing happened. This is how Caddy achieves zero downtime, even if a config is faulty or a reload fails.
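To illustrate that reload pattern, here is a heavily simplified Go sketch; the Instance type and helper names are made up for illustration, and this is not Caddy's actual code.

package main

import (
    "fmt"
    "log"
    "time"
)

// Instance stands in for a fully configured server (hypothetical type, not caddy.Instance).
type Instance struct{ config string }

func startInstance(config string) (*Instance, error) {
    if config == "" {
        return nil, fmt.Errorf("empty config")
    }
    // Listeners, vhosts, and proxies would be set up here, independently of the old instance.
    return &Instance{config: config}, nil
}

func (i *Instance) stopGracefully(grace time.Duration) error {
    // Waits up to the grace period for in-flight requests to finish; a timeout
    // here does not undo the switch to the new instance.
    return nil
}

// reload builds the new instance independently of the old one. If the new
// config is faulty, the old instance keeps serving; if stopping the old
// instance fails (e.g. with a deadline error), the new one still takes over.
func reload(old *Instance, newConfig string) *Instance {
    next, err := startInstance(newConfig)
    if err != nil {
        log.Printf("[ERROR] new config rejected, keeping old instance: %v", err)
        return old
    }
    if err := old.stopGracefully(5 * time.Second); err != nil {
        log.Printf("[ERROR] stopping old instance: %v", err)
    }
    return next
}

func main() {
    inst := &Instance{config: "old"}
    inst = reload(inst, "new")
    fmt.Println(inst.config) // "new"
}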

So my next goal is to try to get the same error you're seeing. What is the exact sequence of saving the modified Caddyfile, taking down upstreams and spinning up new ones, and signaling Caddy with USR1? Also, what are the values of the Caddyfile before/after each change to it?

EDIT: I do see the Caddyfile contents in the log, which is really helpful; so actually I guess I don't need the Caddyfile contents, but I'm still trying to ascertain the exact sequence of saving, loading, signaling, and taking upstreams up/down.

This is running on a single-core virtual server (a Google Cloud Platform f1-micro: 1 vCPU, 0.6 GB memory) with CPU load averaging around 45%. Could that be exposing some race condition?

The write-Caddyfile/reload sequence doesn't seem to be the problem; both happen on the same thread, and the reloads are at least a full minute apart before the first 502s start happening.

This is the routine:
https://github.com/wehco/caddy-ingress-controller/blob/master/pkg/cmd/controller/caddy.go#L147

Thanks for the info. When, in relation to that function, are the backends coming up (being provisioned, or whatever) / going down (being decommissioned)? (You'll have to forgive my ignorance; I'm not familiar with Kubernetes.)

In the meantime, I've verified, on a production server actually, that [ERROR] Stopping :443: context deadline exceeded does _not_ stop the reload from taking place. I saw that error when reloading a Caddyfile and the new Caddyfile was properly applied.

When, relative to reloading, are the backends being taken up/down?

The backends are going offline just _before_ the reload, and then when a new one comes online to replace it (e.g. during a rolling restart or machine failure), a Caddy reload will be triggered once the new server has become responsive.

The backends are going offline just before the reload

That seems like bad timing; shouldn't you wait until _after_ the reload when they're not needed anymore?

But that is an aside from the original issue.

When Caddy reloads, it literally creates a new caddy.Instance from scratch (from your raw Caddyfile input) and throws away the old one; the two configurations don't even know about each other, they don't share state. So the behavior you're seeing goes against my intuition.

I need more time and/or some help in diagnosing this.

Another aside: Are you sure you want to run Caddy as a separate process? If you're already controlling it with a Go program, why not just import the Caddy package and run it that way?
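For reference, embedding looks roughly like the sketch below. This is a rough recollection of the 0.10.x API as described on the wiki, not a definitive example; the Caddyfile contents are a placeholder, and the in-process reload call is only shown as a comment.

package main

import (
    "log"

    "github.com/mholt/caddy"
    _ "github.com/mholt/caddy/caddyhttp" // registers the HTTP server type and its directives
)

func main() {
    caddyfile := caddy.CaddyfileInput{
        Contents:       []byte(":12015 {\n  status 200 /healthz\n}\n"), // placeholder config
        ServerTypeName: "http",
    }
    instance, err := caddy.Start(caddyfile)
    if err != nil {
        log.Fatal(err)
    }
    // To reload in-process instead of signaling with SIGUSR1, something like:
    //   instance, err = instance.Restart(newCaddyfileInput)
    instance.Wait()
}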

In addition, your logs have lines like this:

I0728 19:40:39.594096       1 controller.go:439] ingress backend successfully reloaded...

But I cannot trace their origin. And they seem to appear _before_ Caddy reloads.

This is very strange. The line by itself doesn't make much sense to me. Is it cut in half?

Another aside (not that it matters for this issue, or does it?): how do you know that this variable is thread-safe? Does the race detector show anything? https://github.com/wehco/caddy-ingress-controller/blob/6d98d246ab3f77d26952c2b885940772143ebce9/pkg/cmd/controller/caddy.go#L43
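In case it helps, a minimal sketch of what guarding such a shared value can look like; the names here are hypothetical, not the controller's actual fields, and running the binary or its tests with the -race flag would surface unsynchronized access.

package controller

import "sync"

var (
    mu            sync.Mutex
    currentConfig []byte // shared between goroutines (hypothetical field)
)

// setConfig and getConfig copy the slice under the lock so callers never
// share the same backing array across goroutines.
func setConfig(b []byte) {
    mu.Lock()
    defer mu.Unlock()
    currentConfig = append([]byte(nil), b...)
}

func getConfig() []byte {
    mu.Lock()
    defer mu.Unlock()
    return append([]byte(nil), currentConfig...)
}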

I'm starting to think something might be wrong on your end instead, but I can't quite place what.

Ping, @sundbry - have you been able to look into this more? Do you have any more information since my replies?

@mholt This morning I rewrote the controller to embed caddy in the go program per your instructions on the wiki (thank you, it was quite simple to do).

Using the Loader interface, the Caddyfile is never persisted to disk now (in case this was some weird Docker filesystem bug); however, the issue still remains. I can send logs, but they look more or less the same. Caddy still attempts to connect to proxy upstreams that are no longer in the reloaded Caddyfile.

I'm now sure it has to do with the "context deadline exceeded" error. Looking into it. I'm just starting to get the hang of golang here...

2017/08/11 21:50:40 http: Server closed
2017/08/11 21:50:40 http: Server closed
2017/08/11 21:50:40 http: Server closed
2017/08/11 21:50:40 http: Server closed
2017/08/11 21:50:45 [ERROR] Stopping :443: context deadline exceeded
2017/08/11 21:50:45 [INFO] Reloading complete

I appreciate you looking into it. Keep me posted what you deduce!

I've at last resolved this issue. The graceful HTTP server shutdown was timing out: the default shutdown timeout (-grace) is 5s, while the upstream requests it was still proxying were hanging with a write timeout of 20s. Raising the -grace timeout to 30s resolved the issue, and the 502s to ghost upstreams never came back.

@mholt I think it would be a good idea to raise the default timeout in Caddy's graceful shutdown to 30s here: https://github.com/mholt/caddy/blob/master/caddyhttp/httpserver/plugin.go#L27, since in my opinion it should be at least as long as the default request timeout to prevent this kind of race/error.
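The race is easy to reproduce outside Caddy with just the standard library. A minimal sketch (not Caddy's code, with the timeouts shrunk for the demo): an in-flight request that outlives the shutdown context makes Shutdown return the same "context deadline exceeded" error seen in the logs above.

package main

import (
    "context"
    "log"
    "net/http"
    "time"
)

func main() {
    // A handler that hangs longer than the grace period, standing in for a
    // proxied upstream request with a 20s write timeout.
    slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(4 * time.Second)
        w.Write([]byte("done"))
    })
    srv := &http.Server{Addr: ":8080", Handler: slow}
    go srv.ListenAndServe()

    time.Sleep(200 * time.Millisecond)     // let the server start
    go http.Get("http://localhost:8080/") // an in-flight request
    time.Sleep(200 * time.Millisecond)

    // A grace period shorter than the in-flight request: Shutdown gives up and
    // returns context.DeadlineExceeded, matching "[ERROR] Stopping :443:
    // context deadline exceeded" in the logs above.
    ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        log.Printf("shutdown: %v", err)
    }
}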

@sundbry Oooo nice catch. Thanks for the detective work! I was beginning to wonder if there were timeouts involved but in what I had tried, I couldn't replicate long timeouts like that.

Lemme see if we can do something about this automatically... I will close the issue once I've looked into it!
