Caddy version (caddy -version): v0.9.1
Serving traffic via the proxy directive with max_fails 0, since only one backend is available.
:2015 {
    proxy / 127.0.0.1:8080 {
        transparent
        max_fails 0
    }
}
./caddy --agree --conf=caddyfile_test
Expected: no memory leak and no 100% CPU usage on 4 cores.
What happened instead: 4 cores at 100%, and 4 GB of memory filled in under 1 minute after the first backend failure.
Also, stdout goes crazy with goroutine dumps like:
goroutine 2826341 [sleep, 2 minutes]:
time.Sleep(0x2540be400)
/usr/local/go/src/runtime/time.go:59 +0xe1
github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc4202c5420, 0x2540be400)
/tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
/tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e
goroutine 2831798 [sleep, 2 minutes]:
time.Sleep(0x2540be400)
/usr/local/go/src/runtime/time.go:59 +0xe1
github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc4202c5420, 0x2540be400)
/tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
/tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e
As always, edge cases \o/
The fix for now was to set max_fails to 50, which doesn't get reached under usual circumstances.
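For reference, a sketch of that workaround against the Caddyfile above; 50 is just a ceiling that normal traffic is assumed not to reach:

:2015 {
    proxy / 127.0.0.1:8080 {
        transparent
        max_fails 50
    }
}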
I think this is a known issue and basically a duplicate of https://github.com/mholt/caddy/issues/754.
Try proxying to something not on the local network with max_fails 0.
@nemothekid We really need to get rid of max_fails 0 for localhost proxying. Or any/all proxying.
Although I'm not sure what that stdout output is all about (is the OS killing threads to free up memory?).
I didn't correlate this with #754; I thought it might be something new or not yet discovered.
Not sure about the OS-killing-threads theory.
Thanks for taking the time to consider it. Nothing high priority, I would say; more of a good-to-know, better-to-fix-in-the-long-run kind of thing.
Yeah, the stdout errors are caused by it launching as many goroutines as possible in that infinite loop (which explains why it's eating memory).
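As a standalone sketch of that failure mode (this is not the Caddy source; the types are stand-ins, though the 10-second sleep matches the 0x2540be400 ns in the traces above): every failed attempt bumps a fail counter and spawns a goroutine that only decrements it after FailTimeout, and because max_fails 0 never marks the host down, the retry loop keeps going and the goroutines pile up.

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// fakeHost is a stand-in for an upstream host; MaxFails == 0 means "never mark down".
type fakeHost struct {
	Fails    int32
	MaxFails int32
}

func (h *fakeHost) down() bool {
	return h.MaxFails != 0 && atomic.LoadInt32(&h.Fails) >= h.MaxFails
}

func main() {
	host := &fakeHost{MaxFails: 0}
	failTimeout := 10 * time.Second

	// Capped at 100000 iterations so this demo terminates; the real retry loop has no cap.
	for i := 0; i < 100000 && !host.down(); i++ {
		// Pretend the request to the backend just failed...
		atomic.AddInt32(&host.Fails, 1)
		// ...and schedule the failure to "expire" after FailTimeout.
		go func(h *fakeHost, d time.Duration) {
			time.Sleep(d)
			atomic.AddInt32(&h.Fails, -1)
		}(host, failTimeout)
	}

	// With MaxFails == 0 the loop never sees the host as down, so essentially
	// all of the sleeper goroutines are still alive (and holding stacks) here.
	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}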
max_fails 0 likely shouldn't be a valid directive; max_fails 1 should be the minimum (a sketch of that validation follows the diff below).
_Or_ we can change the meaning of max_fails 0. If the backend fails and max_fails is 0, we immediately return http.StatusBadGateway, but we don't mark the backend down (so the next request can still try that backend).
Something conceptually like:
diff --git a/proxy.go b/proxy2.go
index 89fa21a..c5c05c3 100644
--- a/proxy.go
+++ b/proxy2.go
@@ -157,6 +157,9 @@ func (p Proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) (int, error) {
 			time.Sleep(timeout)
 			atomic.AddInt32(&host.Fails, -1)
 		}(host, timeout)
+		if su, ok := upstream.(*staticUpstream); ok && su.MaxFails == 0 {
+			return http.StatusBadGateway, errUnreachable
+		}
 	}
 	return http.StatusBadGateway, errUnreachable
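And for the first option, a minimal hypothetical sketch of rejecting max_fails 0 at setup time; this is not Caddy's actual directive parser, the function name and error text are made up:

package main

import (
	"fmt"
	"strconv"
)

// parseMaxFails is a made-up helper: it parses the max_fails argument and
// rejects anything below 1.
func parseMaxFails(arg string) (int32, error) {
	n, err := strconv.Atoi(arg)
	if err != nil {
		return 0, fmt.Errorf("max_fails: invalid value %q: %v", arg, err)
	}
	if n < 1 {
		return 0, fmt.Errorf("max_fails must be at least 1, got %d", n)
	}
	return int32(n), nil
}

func main() {
	if _, err := parseMaxFails("0"); err != nil {
		fmt.Println(err) // max_fails must be at least 1, got 0
	}
}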
Yeah, I agree with you that max_fails 0 is always an error. My reasoning: Who is the site owner to say that their backend is _never_ down? The only authoritative answer to that question is the proxy that makes the request and gets a bad response (or none at all). I think max_fails 0 as-is gives users enough rope to hang themselves.
Or we can change the meaning of max_fails 0. If the backend fails and max_fails is 0, we immediately return http.StatusBadGateway, but we don't mark the backend down (so the next request can still try that backend).
I think that's money right there. Do you think the changed behavior should go in the CheckDown function instead? (I realize your diff is just a proof of concept.)
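As a purely hypothetical illustration of the CheckDown idea (the types below are stand-ins, not Caddy's real ones): a down-check predicate only sees the host, so with max_fails 0 it can only keep answering "not down", and it has no way to say "serve a 502 for this request but don't mark the host down" -- which is the per-request-context problem raised in the next comment.

package main

import "fmt"

// upstreamHost is a stand-in for a proxied backend, not Caddy's real type.
type upstreamHost struct {
	fails    int32
	maxFails int32 // 0 means the host is never marked down
}

// down is a CheckDown-style predicate: it can only look at the host itself.
func down(h *upstreamHost) bool {
	return h.maxFails != 0 && h.fails >= h.maxFails
}

func main() {
	h := &upstreamHost{fails: 3, maxFails: 0}
	// Prints false: with maxFails 0 the host is never reported down, no matter
	// how many failures it has accumulated.
	fmt.Println("down:", down(h))
}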
(thinking out loud)
I think that's money right there. Do you think the changed behavior should go in the CheckDown function instead? (I realize your diff is just a proof of concept.)
Not sure that would work, as then CheckDown would need some sort of context for each request. The only thing each request has to work with is the Upstream interface:
type Upstream interface {
	// The path this upstream host should be routed on
	From() string
	// Selects an upstream host to be routed to.
	Select(*http.Request) *UpstreamHost
	// Checks if subpath is not an ignored path
	AllowedPath(string) bool
}
We could add a ShouldRetry() bool function, and then the staticUpstream impl would have something like
func (u *foo) ShouldRetry() bool {
	return u.MaxFails != 0
}
This works for single-host proxies when max_fails is 0 (every request tries the backend), but if you have multiple backends, you will only try one backend before you quit. We could upgrade it to:
func (u *foo) ShouldRetry() bool {
	if u.MaxFails == 0 {
		pool := u.Hosts
		for _, host := range pool {
			if host.Fails == 0 {
				return true
			}
		}
		return false
	} else {
		return true
	}
}
This way we will always retry as long as there is a host that hasn't had a failing request within FailTimeout. However, this now means we end up with the infinite-loop issue again if FailTimeout is short enough. We could also end up with the infinite loop if we keep selecting the same backend (since we won't stop selecting it when max_fails is 0).
Yeah, I agree with you that max_fails 0 is always an error. My reasoning: Who is the site owner to say that their backend is never down? The only authoritative answer to that question is the proxy that makes the request and gets a bad response (or none at all). I think max_fails 0 as-is gives users enough rope to hang themselves.
@mholt May I ask when a backend is considered down in this context? The documentation of the proxy directive is not clear: is it only when a connection error occurs, or also when the backend returns an HTTP status outside of the 200-399 range? If it is the latter, marking max_fails 0 as invalid would be a problem for some use cases...
@tomasdeml
is it only if a connection error occurs or also when the backend returns an HTTP status outside of the 200-399 range?
Either of those -- an error response or an actual network error. So yes, I agree, max_fails 0 is problematic. :wink:
I got the same issue (memory leak when using max_fails 0).
I have a service that responds with some 403s or 404s, and without max_fails 0 it marks the backend as bad. I really don't get where this 200-399 range comes from; internal errors are >499 AFAIK.
is it only if a connection error occurs or also when the backend returns an HTTP status outside of the 200-399 range?
A backend is only marked down when a connection error occurs. Any other response is treated as successful.
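A standalone illustration of that distinction from the client side (not Caddy source; the URL is just a placeholder): only a transport-level failure -- connection refused, timeout, DNS error -- comes back as a Go error, while a 403/404/500 is a normal response, which is what "treated as successful" means above.

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Placeholder URL for illustration only.
	resp, err := http.Get("http://127.0.0.1:8080/")
	if err != nil {
		// This is the "connection error" case: the request never produced an
		// HTTP response (connection refused, timeout, DNS failure, ...).
		fmt.Println("transport error:", err)
		return
	}
	defer resp.Body.Close()

	// Any HTTP response lands here, including 403, 404, or 5xx statuses;
	// per the comment above, these do not count as backend failures.
	fmt.Println("got response with status:", resp.StatusCode)
}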
@nemothekid this code seems to mark the upstream as unhealthy when the status is outside 200-399: https://github.com/mholt/caddy/blob/master/caddyhttp/proxy/upstream.go#L346
That's only if you are using the health_check directive.
@nemothekid got it, thanks. What does a connection error mean, though? I only get "backend unreachable". I have no idea how to debug this; I can connect just fine to the backend using telnet.
I feel like people's backend services are flakier than they think... just because you can connect once with telnet doesn't mean that, amid the thousands of requests before that, one request didn't fail.
I do agree we need better error reporting in this case.
@nemothekid After re-reading your "thinking out loud" post above, what if we make a few changes:
^ These are just the way I'm leaning so far. What do you (anyone!) think?
Okay, I've pushed a change to the proxyerrs branch that improves error reporting. Still looking into improving the max_fails 0 situation and the single upstream host scenarios.
@stp-ip See PR #1135.