Caddy: Random 502 errors with Caddy 0.9.1 in reverse proxy mode, no load

Created on 31 Aug 2016 · 35 Comments · Source: caddyserver/caddy

1. What version of Caddy are you running (caddy -version)?

Caddy 0.9.1 (+e8e5595 Mon Aug 29 16:15:56 UTC 2016)

2. What are you trying to do?

Use Caddy as a reverse proxy in front of a Go web server

3. What is your entire Caddyfile?

utiliz.co {
   redir https://www.utiliz.co
}

www.utiliz.co {
  tls <email>
  gzip
  proxy / localhost:<XXXX> {
    header_upstream Host {host}
    header_upstream X-Real-IP {remote}
    header_upstream X-Forwarded-For {remote}
    header_upstream X-Forwarded-Proto {scheme}
  }
}

admin.utiliz.co {
  tls <email>
  gzip
  proxy / localhost:<YYYY> {
    header_upstream Host {host}
    header_upstream X-Real-IP {remote}
    header_upstream X-Forwarded-For {remote}
    header_upstream X-Forwarded-Proto {scheme}
  }
}

4. How did you run Caddy (give the full command and describe the execution environment)?

It's running under supervisord:

[program:caddy]
command=/path/caddy/caddy
directory=/path/caddy

5. What did you expect to see?

No errors

6. What did you see instead (give full error messages and/or log)?

Today I upgraded Caddy from 0.8.3 to 0.9.1. Throughout the day I then got complaints from users saying they received 502 errors. The errors happened on both the www.utiliz.co and admin.utiliz.co sites. There was never any significant load, maybe 1-2 concurrent users. Retrying immediately always worked fine. There is no problem at all with the backend server (to confirm it was not a load issue, I slammed the site with 500 concurrent users via boom and did not get any 502s at all). I never saw any 502 errors when using 0.8.3.

This is the pattern of 502s during the day today.

30/Aug/2016:15:25:11 +0000 [ERROR 502 /ic/checkout/ordersummary] unreachable backend
30/Aug/2016:15:25:11 +0000 [ERROR 502 /checkout/save] unreachable backend
30/Aug/2016:15:25:11 +0000 [ERROR 502 /favicon.ico] unreachable backend
30/Aug/2016:15:25:17 +0000 [ERROR 502 /features] unreachable backend
30/Aug/2016:16:10:58 +0000 [ERROR 502 /login/action] unreachable backend
30/Aug/2016:16:10:58 +0000 [ERROR 502 /login/action] unreachable backend
30/Aug/2016:16:10:59 +0000 [ERROR 502 /favicon.ico] unreachable backend
30/Aug/2016:16:11:03 +0000 [ERROR 502 /] unreachable backend
30/Aug/2016:22:37:07 +0000 [ERROR 502 /users] unreachable backend
30/Aug/2016:22:37:07 +0000 [ERROR 502 /] unreachable backend
30/Aug/2016:22:37:08 +0000 [ERROR 502 /favicon.ico] unreachable backend
31/Aug/2016:00:27:12 +0000 [ERROR 502 /checkout] unreachable backend
31/Aug/2016:00:27:18 +0000 [ERROR 502 /checkout] unreachable backend
31/Aug/2016:00:27:19 +0000 [ERROR 502 /checkout] unreachable backend

7. How can someone who is starting from scratch reproduce this behavior as minimally as possible?

I don't know, it seems random.

For now I am downgrading back to Caddy 0.8.3, which is disappointing.

help wanted

Most helpful comment

I just merged #1135 which drastically changes the logic regarding some of these parameters, and improves error reporting. I'm going to close this issue now in light of that, and would ask that all who have experienced this issue to pull the latest Caddy and try again -- note the changes in that PR may require a few minor changes to your Caddyfile for proper failover handling. (See the PR description for details.)

Also, unless a correlation can be exhibited between the 502 errors and the concurrent map accesses, I am treating them separately and will confine this issue to the 502 errors; we'll handle the issue with concurrent map accesses in another issue.

The changes will also go out with Caddy 0.9.3, so if you're not able to build from source, just wait for that release (coming any day now). Thank you!

All 35 comments

I'm affected by this too:

https://i.imgur.com/4JJlkzy.png

This doesn't hit the backend at all. It looks like a race condition to me.

This affects multiple servers, too.

This also seems to happen only over HTTPS, or at least that's how I've managed to reproduce it. Possibly because HTTP is faster.

This error occurs when the backend returns a non-success status code or some other error happens.

Will look into it later when I have a chance; someone else is welcome to help out!

I added a one-second delay to the view, and Caddy sometimes returns within 40ms with a 502, sometimes waits the full second and returns the response. Maybe something is making it think that the timeout was much less than it was?
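
For anyone trying to reproduce that experiment, here is a minimal sketch of a deliberately slow Go backend; the listen port and the one-second delay are assumptions for illustration, not taken from the original report.

package main

import (
    "net/http"
    "time"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(1 * time.Second) // simulate the delayed view described above
        w.Write([]byte("ok\n"))
    })
    http.ListenAndServe(":3000", nil) // placeholder backend port
}

With the proxy pointed at this backend, a healthy response should take roughly one second; a 502 that comes back after only a few dozen milliseconds means the proxy gave up before the backend ever answered.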

@mholt It's not a non-success status code, as the request doesn't hit the backend at all (there's nothing in the log).

Also, this happens to me with both Ruby and Python backends, on two unrelated servers.

It also never happens with cURL, whereas it happens 90% of the time on the same request with Firefox. Combined with the fact that it only happens over TLS, I'd suspect session resumption or HTTP/2.

@mholt my backend server isn't returning an error code. Something else is going on. I would be happy to build and run a custom caddyserver with extra debugging if it would help.

@kmanley That'd be helpful! Thanks. Try printing backendErr here: https://github.com/mholt/caddy/blob/a12230419611951692ee0bb5cd407e6e884be2c2/caddyhttp/proxy/proxy.go#L146

If you want to print it only when it's non-nil, try moving the print statement down after the nil check.

@mholt do you want me to build the 0.9.1 tag or head?

@mholt in fact would there be any downside to just including that change in head?

Head, please.

Oh, and we looked at writing the actual error by returning it instead of the generic "unreachable" error message. That's a change we could probably make too.

Okay I'm running the custom build that prints backendErr in my prod environment now, will let you know what I find out.

The error was not triggered yesterday, but it has happened twice this morning. In both cases backendErr is "net/http: request canceled". Is that helpful info?

Yes, that's good to know. Thanks! Hmm, so why is the request being canceled...

I guess the next step will be for me to add debug printing to ReverseProxy.ServeHTTP

Haven't had a chance to deploy a new version with more debug printing, but I have seen more of the same errors. Some are the same as above: "net/http: request canceled". But now I'm also seeing "net/http: request canceled while waiting for connection". The machine is idle when this happens and the backend is up and working fine.
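
"net/http: request canceled" and "request canceled while waiting for connection" are the errors Go's HTTP transport returns when an in-flight request is canceled before a response (or even a connection) arrives. A self-contained sketch that triggers the same error strings by canceling a request mid-flight via the Request.Cancel channel; the backend address is a placeholder:

package main

import (
    "fmt"
    "net/http"
    "time"
)

func main() {
    // Placeholder: assumes something slow is listening on localhost:3000.
    req, err := http.NewRequest("GET", "http://localhost:3000/slow", nil)
    if err != nil {
        panic(err)
    }

    cancel := make(chan struct{})
    req.Cancel = cancel // cancellation channel (predates context-based cancellation)

    // Cancel shortly after the round trip starts.
    go func() {
        time.Sleep(50 * time.Millisecond)
        close(cancel)
    }()

    _, err = http.DefaultTransport.RoundTrip(req)
    // Depending on how far the request got, this prints
    // "net/http: request canceled" or
    // "net/http: request canceled while waiting for connection".
    fmt.Println(err)
}

So the open question for the proxy is what is triggering that cancellation: the client going away, an HTTP/2 stream reset, or something internal.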

I'm also running into this. Running Caddy in a Docker container acting as a front end for a Go server. Caddyfile:

www.url.com url.com {
  tls [email protected]   
  gzip
  proxy / localhost:3000 {
    transparent
  }
}

It is not intermittent for me. If you need some debugging output I can help out. Here are the logs.

Activating privacy features... done.
http://www.url.com
http://url.com
https://www.url.com
https://url.com
05/Sep/2016:17:17:29 +0000 [ERROR 502 /] unreachable backend
05/Sep/2016:17:17:39 +0000 [ERROR 502 /] unreachable backend
05/Sep/2016:17:17:39 +0000 [ERROR 502 /favicon.ico] unreachable backend
05/Sep/2016:17:17:40 +0000 [ERROR 502 /] unreachable backend
05/Sep/2016:17:17:40 +0000 [ERROR 502 /favicon.ico] unreachable backend
05/Sep/2016:17:17:41 +0000 [ERROR 502 /] unreachable backend
05/Sep/2016:17:17:42 +0000 [ERROR 502 /favicon.ico] unreachable backend
05/Sep/2016:17:17:42 +0000 [ERROR 502 /] unreachable backend
05/Sep/2016:17:17:42 +0000 [ERROR 502 /favicon.ico] unreachable backend

This is happening for me as well, when I click a link on the page and right away click another link. Chrome cancels the first request and continues with the second one. Nginx (the proxied backend) sends a 499 response, then Caddy marks the proxy upstream as down and I have to wait 10 seconds to get a proper response again. I tried setting max_fails to 0, but requests then took about 3 seconds to respond. I went with fail_timeout 2s for now.
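
For reference, that tuning sits on the proxy directive; a rough Caddyfile sketch (the backend address and the values shown are placeholders, not recommendations):

proxy / localhost:3000 {
  transparent
  # how many failed requests mark an upstream as down (0 disables the check)
  max_fails 1
  # how long a failing upstream is treated as down
  fail_timeout 2s
}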

I have Caddy 0.9.1 configured as a proxy to an nginx server.

# nginx log
web_1        | 172.17.0.1 - - [06/Sep/2016:18:17:49 +0000] "GET /?page=6 HTTP/1.1" 499 0 "https://site.com/?page=7" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"

@mochja that's a different issue; if your backend returns an error, then Caddy marks it unreachable, and that's working as designed. @jackzampolin if it's not intermittent for you, then your backend is likely also returning an error status.

The issue described in here is random (or at least seems to be random) and the backend is _not_ returning an error status.

@kmanley Okay, so the part with nginx is different, but it has something to do with request canceling. @mholt a user can easily cancel a request in Chrome.

Having a similar issue using Caddy (0.9.1, Debian) as a reverse proxy for an Elixir/Erlang app.

  1. If I explicitly declare the HTTP version, it works fine.
  2. If I use the standard HTTPS rerouting, it crashes because of a missing Connection header (though the request does reach the server in my case).
  3. If I manually add proxy_header Connection upgrade, it works fine, but HTTPS seems to be off (the address bar shows it, but there is no lock icon).

I've seen an issue about the websocket preset, but it has been closed and is supposedly working.

EDIT: Sorry for the trouble; in my case the only problem is the missing Connection header in the websocket preset. I should probably move this to the appropriate issue.

I figured out a workaround: disabling HTTP/2 with -http2=false.

Just to follow up, I still see this once a day or so. I instrumented ServeHTTP and the error happens in RoundTrip. I'm not convinced all the comments on this thread are the same issue as what I'm seeing. What I see are completely random cancellations when the backend is under no load.
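
For anyone who wants to do the same kind of instrumentation without patching Caddy, the transport can be wrapped generically. A sketch using the standard library's reverse proxy (the backend URL and listen address are placeholders, and this is not Caddy's own code):

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

// loggingTransport logs every failed round trip before handing the error back.
type loggingTransport struct {
    next http.RoundTripper
}

func (t *loggingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
    resp, err := t.next.RoundTrip(req)
    if err != nil {
        log.Printf("[DEBUG] round trip to %s failed: %v", req.URL, err)
    }
    return resp, err
}

func main() {
    backend, err := url.Parse("http://localhost:3000") // placeholder backend
    if err != nil {
        log.Fatal(err)
    }
    rp := httputil.NewSingleHostReverseProxy(backend)
    rp.Transport = &loggingTransport{next: http.DefaultTransport}
    log.Fatal(http.ListenAndServe(":8080", rp)) // placeholder listen address
}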

I see the same, and the backend never even gets contacted.

Edit: This is already reported as #1045.

We also encountered this problem this morning. In our case it seems Caddy decides the backend is down when it returns lots of 404s.

We then set the proxy max_fails property to 0 because we considered these false positives. After that, we started seeing errors like this:

sep 09 09:38:52 vcs caddy[7902]: fatal error: concurrent map read and map write
sep 09 09:38:52 vcs caddy[7902]: goroutine 15786038 [running]:
sep 09 09:38:52 vcs caddy[7902]: runtime.throw(0xa24c1d, 0x21)
sep 09 09:38:52 vcs caddy[7902]: /usr/local/go/src/runtime/panic.go:566 +0x95 fp=0xc42be28cb8 sp=0xc42be28c98
sep 09 09:38:52 vcs caddy[7902]: runtime.mapaccess2_faststr(0x9d96e0, 0xc44cd5aa20, 0xa16536, 0xa, 0xc42be28dd8, 0x1)
sep 09 09:38:52 vcs caddy[7902]: /usr/local/go/src/runtime/hashmap_fast.go:306 +0x52b fp=0xc42be28d18 sp=0xc42be28cb8
sep 09 09:38:52 vcs caddy[7902]: net/http.(*Request).write(0xc4200cc3c0, 0xca4b80, 0xc437bb7700, 0x0, 0x0, 0x0, 0x0, 0x0)
sep 09 09:38:52 vcs caddy[7902]: /usr/local/go/src/net/http/request.go:506 +0x530 fp=0xc42be28e58 sp=0xc42be28d18
sep 09 09:38:52 vcs caddy[7902]: net/http.(*persistConn).writeLoop(0xc439791600)
sep 09 09:38:52 vcs caddy[7902]: /usr/local/go/src/net/http/transport.go:1644 +0x1ac fp=0xc42be28fa8 sp=0xc42be28e58
sep 09 09:38:52 vcs caddy[7902]: runtime.goexit()
sep 09 09:38:52 vcs caddy[7902]: /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42be28fb0 sp=0xc42be28fa8
sep 09 09:38:52 vcs caddy[7902]: created by net/http.(*Transport).dialConn
sep 09 09:38:52 vcs caddy[7902]: /usr/local/go/src/net/http/transport.go:1058 +0x50e
sep 09 09:38:52 vcs caddy[7902]: goroutine 1 [semacquire, 2 minutes]:
sep 09 09:38:52 vcs caddy[7902]: sync.runtime_Semacquire(0xc4200fe1bc)
sep 09 09:38:52 vcs caddy[7902]: /usr/local/go/src/runtime/sema.go:47 +0x30
sep 09 09:38:52 vcs caddy[7902]: sync.(*WaitGroup).Wait(0xc4200fe1b0)
sep 09 09:38:52 vcs caddy[7902]: /usr/local/go/src/sync/waitgroup.go:131 +0x97
sep 09 09:38:52 vcs caddy[7902]: github.com/mholt/caddy.(*Instance).Wait(0xc42019e5b0)
sep 09 09:38:52 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddy.go:379 +0x2f
sep 09 09:38:52 vcs caddy[7902]: github.com/mholt/caddy/caddy/caddymain.Run()
sep 09 09:38:52 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddy/caddymain/run.go:106 +0x28d
sep 09 09:38:52 vcs caddy[7902]: main.main()
sep 09 09:38:52 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddy/main.go:13 +0x1b
sep 09 09:38:52 vcs caddy[7902]: goroutine 5 [syscall, 2 minutes]:
sep 09 09:38:52 vcs caddy[7902]: os/signal.signal_recv(0x0)
sep 09 09:38:52 vcs caddy[7902]: /usr/local/go/src/runtime/sigqueue.go:116 +0x157
sep 09 09:38:52 vcs caddy[7902]: os/signal.loop()
sep 09 09:38:52 vcs caddy[7902]: /usr/local/go/src/os/signal/signal_unix.go:22 +0x22
sep 09 09:38:52 vcs caddy[7902]: created by os/signal.init.1
sep 09 09:38:52 vcs caddy[7902]: /usr/local/go/src/os/signal/signal_unix.go:28 +0x41
sep 09 09:38:52 vcs caddy[7902]: goroutine 6 [select, 2 minutes]:
sep 09 09:38:52 vcs caddy[7902]: github.com/mholt/caddy/caddytls.maintainAssets(0xc42000a900)
sep 09 09:38:52 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddytls/maintain.go:47 +0x498
sep 09 09:38:52 vcs caddy[7902]: created by github.com/mholt/caddy/caddytls.init.1

(I have so far been unable to find a simple testcase that produces this problem)

After a while it ended up eating 100% CPU and spamming thousands of these goroutine traces:


> sep 09 09:47:54 vcs caddy[7902]: goroutine 14702054 [sleep, 9 minutes]:
> sep 09 09:47:54 vcs caddy[7902]: time.Sleep(0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /usr/local/go/src/runtime/time.go:59 +0xe1
> sep 09 09:47:54 vcs caddy[7902]: github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc42021c770, 0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
> sep 09 09:47:54 vcs caddy[7902]: created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e
> sep 09 09:47:54 vcs caddy[7902]: goroutine 15021495 [sleep, 9 minutes]:
> sep 09 09:47:54 vcs caddy[7902]: time.Sleep(0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /usr/local/go/src/runtime/time.go:59 +0xe1
> sep 09 09:47:54 vcs caddy[7902]: github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc42021c770, 0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
> sep 09 09:47:54 vcs caddy[7902]: created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e
> sep 09 09:47:54 vcs caddy[7902]: goroutine 14621905 [sleep, 9 minutes]:
> sep 09 09:47:54 vcs caddy[7902]: time.Sleep(0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /usr/local/go/src/runtime/time.go:59 +0xe1
> sep 09 09:47:54 vcs caddy[7902]: github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc42021c770, 0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
> sep 09 09:47:54 vcs caddy[7902]: created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e
> sep 09 09:47:54 vcs caddy[7902]: goroutine 15094623 [sleep, 9 minutes]:
> sep 09 09:47:54 vcs caddy[7902]: time.Sleep(0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /usr/local/go/src/runtime/time.go:59 +0xe1
> sep 09 09:47:54 vcs caddy[7902]: github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc42021c770, 0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
> sep 09 09:47:54 vcs caddy[7902]: created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e
> sep 09 09:47:54 vcs caddy[7902]: goroutine 14763055 [sleep, 9 minutes]:
> sep 09 09:47:54 vcs caddy[7902]: time.Sleep(0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /usr/local/go/src/runtime/time.go:59 +0xe1
> sep 09 09:47:54 vcs caddy[7902]: github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc42021c770, 0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
> sep 09 09:47:54 vcs caddy[7902]: created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e
> sep 09 09:47:54 vcs caddy[7902]: goroutine 15109933 [sleep, 9 minutes]:
> sep 09 09:47:54 vcs caddy[7902]: time.Sleep(0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /usr/local/go/src/runtime/time.go:59 +0xe1
> sep 09 09:47:54 vcs caddy[7902]: github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc42021c770, 0x2540be400)
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
> sep 09 09:47:54 vcs caddy[7902]: created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
> sep 09 09:47:54 vcs caddy[7902]: /tmp/custombuild_6597_179179303/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e
> sep 09 09:47:54 vcs caddy[7902]: goroutine 15322344 [sleep, 9 minutes]:

I realise this isn't exactly a stellar report; I'll see if I can get some better info.
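
For what it's worth, "fatal error: concurrent map read and map write" in the first trace is the Go runtime detecting an unsynchronized map being used from multiple goroutines at once. A tiny standalone program (entirely unrelated to Caddy's code) that triggers the same fatal error:

package main

func main() {
    m := map[string]int{}

    go func() {
        for i := 0; ; i++ {
            m["k"] = i // write without synchronization
        }
    }()

    for {
        _ = m["k"] // concurrent read; the runtime aborts with the same fatal error
    }
}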

Aside from any bugs there may be, I think it should be possible to disable the health checking. When there is only one backend it should probably be disabled by default.

@martijnve Your issue may be a duplicate of #1045.

That seems to be true, sorry, I've updated my post.

@kmanley Does this still happen if you disable HTTP/2? (-http2=false)

@mholt at your suggestion I disabled http/2 in my prod environment, will monitor for several days and report back. Thanks for the idea.

I tried v0.9.1 with -http2=false and I'm getting fewer 502 errors. I tried the proxy option max_fails 0 and it is working as expected. My backends are Docker v1.12 in swarm mode, and I don't need to specify another backend in the Caddyfile; Docker swarm mode load-balances internally.

@mholt I am running in production with -http2=false; I did get one spurious 502 today but the underlying error was different. It came from ServeHTTP and the error was EOF.

0.9.2 did not fix this issue, but -http2=false still helps.

Just wanted to stop by and mention that we appear to be getting these occasional 502s for a few internal services we run (but not all of them), including a Python package index (pypicloud) and Sentry.

Just upgraded to 0.9.2, so I'm not sure if it's still happening, but the above comment seems to answer that ;(

I just merged #1135 which drastically changes the logic regarding some of these parameters, and improves error reporting. I'm going to close this issue now in light of that, and would ask that all who have experienced this issue to pull the latest Caddy and try again -- note the changes in that PR may require a few minor changes to your Caddyfile for proper failover handling. (See the PR description for details.)

Also, unless a correlation can be exhibited between the 502 errors and the concurrent map accesses, I am treating them separately and will confine this issue to the 502 errors; we'll handle the issue with concurrent map accesses in another issue.

The changes will also go out with Caddy 0.9.3, so if you're not able to build from source, just wait for that release (coming any day now). Thank you!
