Caddy version (caddy -version): v0.9.1
Serving traffic via the proxy directive with max_fails 0, since only one backend is available.
:2015 {
    proxy / 127.0.0.1:8080 {
        transparent
        max_fails 0
    }
}
./caddy --agree --conf=caddyfile_test
Expected: no memory leak and no 100% CPU usage on 4 cores.
What happened instead: 4 cores at 100%, and 4 GB of memory filled in under 1 minute after the first backend failure.
Also, stdout goes crazy with goroutine dumps like:
goroutine 2826341 [sleep, 2 minutes]:
time.Sleep(0x2540be400)
/usr/local/go/src/runtime/time.go:59 +0xe1
github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc4202c5420, 0x2540be400)
/tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
/tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e
goroutine 2831798 [sleep, 2 minutes]:
time.Sleep(0x2540be400)
/usr/local/go/src/runtime/time.go:59 +0xe1
github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc4202c5420, 0x2540be400)
/tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
/tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e
As always, edge cases \o/
The fix for now was to set max_fails to 50, which doesn't get reached under usual circumstances.
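For reference, a sketch of that workaround against the Caddyfile above; 50 is just a ceiling that normal traffic is assumed not to reach:

:2015 {
    proxy / 127.0.0.1:8080 {
        transparent
        max_fails 50
    }
}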
I think this is a known issue and basically a duplicate of https://github.com/mholt/caddy/issues/754.
Try proxying to something not on the local network with max_fails 0.
@nemothekid We really need to get rid of max_fails 0 for localhost proxying. Or any/all proxying.
Although I'm not sure what that stdout output is all about (is the OS killing threads to free up memory?).
I didn't correlate this with #754; I thought it might be something new or not yet discovered.
Not sure about the OS-killing-threads theory.
Thanks for taking the time to consider it. Nothing high priority, I would say; more of a good-to-know, better-to-fix-in-the-long-run kind of thing.
Yeah, the stdout errors are caused by it launching as many goroutines as possible in that infinite loop (which explains why it's eating memory).
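As a standalone sketch of that failure mode (this is not the Caddy source; the types are stand-ins, though the 10-second sleep matches the 0x2540be400 ns in the traces above): every failed attempt bumps a fail counter and spawns a goroutine that only decrements it after FailTimeout, and because max_fails 0 never marks the host down, the retry loop keeps going and the goroutines pile up.

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// fakeHost is a stand-in for an upstream host; MaxFails == 0 means "never mark down".
type fakeHost struct {
	Fails    int32
	MaxFails int32
}

func (h *fakeHost) down() bool {
	return h.MaxFails != 0 && atomic.LoadInt32(&h.Fails) >= h.MaxFails
}

func main() {
	host := &fakeHost{MaxFails: 0}
	failTimeout := 10 * time.Second

	// Capped at 100000 iterations so this demo terminates; the real retry loop has no cap.
	for i := 0; i < 100000 && !host.down(); i++ {
		// Pretend the request to the backend just failed...
		atomic.AddInt32(&host.Fails, 1)
		// ...and schedule the failure to "expire" after FailTimeout.
		go func(h *fakeHost, d time.Duration) {
			time.Sleep(d)
			atomic.AddInt32(&h.Fails, -1)
		}(host, failTimeout)
	}

	// With MaxFails == 0 the loop never sees the host as down, so essentially
	// all of the sleeper goroutines are still alive (and holding stacks) here.
	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}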
max_fails 0 likely shouldn't be a valid directive; max_fails 1 should be the minimum (a sketch of that validation follows the diff below).
_Or_ we can change the meaning of max_fails 0. If the backend fails and max_fails is 0, we immediately return http.StatusBadGateway, but we don't mark the backend down (so the next request can still try that backend).
Something conceptually like:
diff --git a/proxy.go b/proxy2.go
index 89fa21a..c5c05c3 100644
--- a/proxy.go
+++ b/proxy2.go
@@ -157,6 +157,9 @@ func (p Proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) (int, error) {
 			time.Sleep(timeout)
 			atomic.AddInt32(&host.Fails, -1)
 		}(host, timeout)
+		if su, ok := upstream.(*staticUpstream); ok && su.MaxFails == 0 {
+			return http.StatusBadGateway, errUnreachable
+		}
 	}
 	return http.StatusBadGateway, errUnreachable
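And for the first option, a minimal hypothetical sketch of rejecting max_fails 0 at setup time; this is not Caddy's actual directive parser, the function name and error text are made up:

package main

import (
	"fmt"
	"strconv"
)

// parseMaxFails is a made-up helper: it parses the max_fails argument and
// rejects anything below 1.
func parseMaxFails(arg string) (int32, error) {
	n, err := strconv.Atoi(arg)
	if err != nil {
		return 0, fmt.Errorf("max_fails: invalid value %q: %v", arg, err)
	}
	if n < 1 {
		return 0, fmt.Errorf("max_fails must be at least 1, got %d", n)
	}
	return int32(n), nil
}

func main() {
	if _, err := parseMaxFails("0"); err != nil {
		fmt.Println(err) // max_fails must be at least 1, got 0
	}
}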
Yeah, I agree with you that max_fails 0 is always an error. My reasoning: Who is the site owner to say that their backend is _never_ down? The only authoritative answer to that question is the proxy that makes the request and gets a bad response (or none at all). I think max_fails 0 as-is gives users enough rope to hang themselves.
Or we can change the meaning of max_fails 0. If the backend fails and max_fails is 0, we immediately return http.StatusBadGateway, but we don't mark the backend down (so the next request can still try that backend).
I think that's money right there. Do you think the changed behavior should go in the CheckDown function instead? (I realize your diff is just a proof of concept.)
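As a purely hypothetical illustration of the CheckDown idea (the types below are stand-ins, not Caddy's real ones): a down-check predicate only sees the host, so with max_fails 0 it can only keep answering "not down", and it has no way to say "serve a 502 for this request but don't mark the host down" -- which is the per-request-context problem raised in the next comment.

package main

import "fmt"

// upstreamHost is a stand-in for a proxied backend, not Caddy's real type.
type upstreamHost struct {
	fails    int32
	maxFails int32 // 0 means the host is never marked down
}

// down is a CheckDown-style predicate: it can only look at the host itself.
func down(h *upstreamHost) bool {
	return h.maxFails != 0 && h.fails >= h.maxFails
}

func main() {
	h := &upstreamHost{fails: 3, maxFails: 0}
	// Prints false: with maxFails 0 the host is never reported down, no matter
	// how many failures it has accumulated.
	fmt.Println("down:", down(h))
}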
(thinking out loud)
I think that's money right there. Do you think the changed behavior should go in the CheckDown function instead? (I realize your diff is just a proof of concept.)
Not sure that would work, as then CheckDown would need some sort of context for each request. The only thing each request has to work with is the Upstream interface:
type Upstream interface {
	// The path this upstream host should be routed on
	From() string
	// Selects an upstream host to be routed to.
	Select(*http.Request) *UpstreamHost
	// Checks if subpath is not an ignored path
	AllowedPath(string) bool
}
We could add a ShouldRetry() bool function, and then the staticUpstream impl would have something like
func (u *foo) ShouldRetry() bool {
	return u.MaxFails != 0
}
This works for single-host proxies when max_fails is 0 (every request tries the backend), but if you have multiple backends, you will only try one backend before you quit. We could upgrade it to:
func (u *foo) ShouldRetry() bool {
	if u.MaxFails == 0 {
		pool := u.Hosts
		for _, host := range pool {
			if host.Fails == 0 {
				return true
			}
		}
		return false
	} else {
		return true
	}
}
This way we will always retry as long as there is a host that hasn't had a failing request within FailTimeout. However, this now means we end up with the infinite-loop issue again if FailTimeout is short enough. We could also end up with the infinite loop if we keep selecting the same backend (since we won't stop selecting it when max_fails is 0).
Yeah, I agree with you that max_fails 0 is always an error. My reasoning: Who is the site owner to say that their backend is never down? The only authoritative answer to that question is the proxy that makes the request and gets a bad response (or none at all). I think max_fails 0 as-is gives users enough rope to hang themselves.
@mholt May I ask when a backend is considered down in this context? The documentation of the proxy directive is not clear: is it only when a connection error occurs, or also when the backend returns an HTTP status outside of the 200-399 range? If it is the latter, marking max_fails 0 as invalid would be a problem for some use cases...
@tomasdeml
is it only if a connection error occurs or also when the backend returns an HTTP status outside of the 200-399 range?
Either of those -- an error response or an actual network error. So yes, I agree, max_fails 0 is problematic. :wink:
I got the same issue (memory leak when using max_fails 0).
I have a service that responds with some 403s or 404s, and without max_fails 0 it marks the backend as bad. I really don't get where this 200-399 range comes from; internal errors are >499 AFAIK.
is it only if a connection error occurs or also when the backend returns an HTTP status outside of the 200-399 range?
A backend is only marked down when a connection error occurs. Any other response is treated as successful.
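A standalone illustration of that distinction from the client side (not Caddy source; the URL is just a placeholder): only a transport-level failure -- connection refused, timeout, DNS error -- comes back as a Go error, while a 403/404/500 is a normal response, which is what "treated as successful" means above.

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Placeholder URL for illustration only.
	resp, err := http.Get("http://127.0.0.1:8080/")
	if err != nil {
		// This is the "connection error" case: the request never produced an
		// HTTP response (connection refused, timeout, DNS failure, ...).
		fmt.Println("transport error:", err)
		return
	}
	defer resp.Body.Close()

	// Any HTTP response lands here, including 403, 404, or 5xx statuses;
	// per the comment above, these do not count as backend failures.
	fmt.Println("got response with status:", resp.StatusCode)
}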
@nemothekid this code seems to mark the upstream as unhealthy when the status is outside 200-399: https://github.com/mholt/caddy/blob/master/caddyhttp/proxy/upstream.go#L346
That's only if you are using the health_check directive.
@nemothekid got it, thanks. What does a connection error mean, though? I only get "backend unreachable". I have no idea how to debug this; I can connect just fine to the backend using telnet.
I feel like people's backend services are flakier than they think... just because you can connect once with telnet doesn't mean that, amid the thousands of requests before that, one request didn't fail.
I do agree we need better error reporting in this case.
@nemothekid After re-reading your "thinking out loud" post above, what if we make a few changes:
^ These are just the way I'm leaning so far. What do you (anyone!) think?
Okay, I've pushed a change to the proxyerrs branch that improves error reporting. Still looking into improving the max_fails 0 situation and the single upstream host scenarios.
@stp-ip See PR #1135.