Caddy version (caddy -version): v0.8.3 / v0.9.0
OS: Windows Server 2012 R2 running on an Azure 'Standard D3 v2' instance (4 cores, 14 GB memory)
Goal: measure the performance of the Caddy reverse proxy middleware under heavy load using Apache Bench.
Caddyfile_proxy:
http:// {
errors CaddyErrors.log
header / {
-Server
}
proxy / localhost:580 {
proxy_header Host {host}
proxy_header X-Real-IP {remote}
proxy_header X-Forwarded-Proto {scheme}
}
}
Caddyfile_upstream:
http://:580 {
errors CaddyErrors_upstream.log
header / {
-Server
}
root WebRoot
}
The WebRoot folder contains the file index.html:
<!DOCTYPE html><html><head><meta charset="utf-8"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="width=device-width,initial-scale=1"><title>1234567 12345</title></head><body></body></html>
Both Caddy instances are started with:
caddy.exe -conf=Caddyfile_proxy
and then, in a new shell:
caddy.exe -conf=Caddyfile_upstream
Expected: Apache Bench completing with 0 failed requests.
Instead, Apache Bench completed with the following result (note the failed requests):
This is ApacheBench, Version 2.3 <$Revision: 1748469 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking *** (be patient)
Server Software:
Server Hostname: ***
Server Port: 80
Document Path: /
Document Length: 224 bytes
Concurrency Level: 10000
Time taken for tests: 104.036 seconds
Complete requests: 1000000
Failed requests: 953057
(Connect: 0, Receive: 0, Length: 953057, Exceptions: 0)
Non-2xx responses: 953057
Keep-Alive requests: 1000000
Total transferred: 179237926 bytes
HTML transferred: 25764144 bytes
Requests per second: 9612.05 [#/sec] (mean)
Time per request: 1040.361 [ms] (mean)
Time per request: 0.104 [ms] (mean, across all concurrent requests)
Transfer rate: 1682.46 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.3 0 20
Processing: 62 445 1122.4 219 65240
Waiting: 62 445 1122.4 219 65240
Total: 62 445 1122.4 219 65240
Percentage of the requests served within a certain time (ms)
50% 219
66% 313
75% 359
80% 422
90% 656
95% 1391
98% 2416
99% 3957
100% 65240 (longest request)
CaddyErrors.log:
19/Jul/2016:07:33:31 +0000 [ERROR 502 /] unreachable backend
19/Jul/2016:07:33:31 +0000 [ERROR 502 /] unreachable backend
19/Jul/2016:07:33:31 +0000 [ERROR 502 /] unreachable backend
...
Run Apache Bench on another machine with the following parameters (note 10k concurrent connections):
ab.exe -n 1000000 -c 10000 -k http://remote-machine-running-caddy/
Hypothesis
The failed requests are most likely caused by the exhaustion of ephemeral TCP ports by the reverse proxy middleware. When querying the remote machine I get:
PS C:\Users\localadmin> Get-NetTCPConnection
LocalAddress LocalPort RemoteAddress RemotePort State
------------ --------- ------------- ---------- -----
127.0.0.1 64381 127.0.0.1 580 TimeWait
127.0.0.1 64380 127.0.0.1 580 TimeWait
127.0.0.1 64379 127.0.0.1 580 TimeWait
127.0.0.1 64378 127.0.0.1 580 TimeWait
127.0.0.1 64377 127.0.0.1 580 TimeWait
127.0.0.1 64376 127.0.0.1 580 TimeWait
127.0.0.1 64375 127.0.0.1 580 TimeWait
127.0.0.1 64374 127.0.0.1 580 TimeWait
127.0.0.1 64373 127.0.0.1 580 TimeWait
127.0.0.1 64372 127.0.0.1 580 TimeWait
127.0.0.1 64371 127.0.0.1 580 TimeWait
127.0.0.1 64370 127.0.0.1 580 TimeWait
...
(followed by thousands of entries)
PS C:\Users\localadmin> Get-NetTCPConnection -RemotePort 580 -LocalAddress '::1' | measure
Count : 0
Average :
Sum :
Maximum :
Minimum :
Property :
PS C:\Users\localadmin> Get-NetTCPConnection -RemotePort 580 -LocalAddress '127.0.0.1' | measure
Count : 15629
Average :
Sum :
Maximum :
Minimum :
Property :
PS C:\Users\localadmin> netsh int ipv4 show dynamicportrange tcp
Protocol tcp Dynamic Port Range
---------------------------------
Start Port : 49152
Number of Ports : 16384
Once the exhaustion occurs, the proxy starts failing by returning 502 Bad Gateway errors and recovers only after the ports are recycled.
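A quick way to watch this recycling in progress is to poll the TIME_WAIT count in a loop (a minimal PowerShell sketch using the same NetTCPIP cmdlets as above; the one-second interval is arbitrary):
while ($true) {
    # count loopback sockets to the upstream that are currently stuck in TIME_WAIT
    (Get-NetTCPConnection -RemotePort 580 -State TimeWait -ErrorAction SilentlyContinue | Measure-Object).Count
    Start-Sleep -Seconds 1
}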
The behaviour is worse on Caddy v0.9.0, as the ports seem to be exhausted faster. Also, after benchmarking v0.9.0, I see a large number of IPv6 connections as well:
PS C:\Users\localadmin> Get-NetTCPConnection -RemotePort 580 -LocalAddress '::1' | measure
Count : 11697
Average :
Sum :
Maximum :
Minimum :
Property :
PS C:\Users\localadmin> Get-NetTCPConnection -RemotePort 580 -LocalAddress '127.0.0.1' | measure
Count : 8286
Average :
Sum :
Maximum :
Minimum :
Property :
I can reproduce the issue with Gatling as well. To rule out issues with the caddy_upstream instance, I ran the benchmark against it directly and received no failed requests.
I know it is possible to reconfigure the OS to use more ephemeral ports and to shorten the TIME_WAIT period. The condition could also be mitigated by having multiple upstream hosts, but once it occurs, the proxy is basically dead until the ports are recycled.
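For reference, that OS-level mitigation would look roughly like this on Windows (a sketch only; the port range and delay values below are illustrative, and I believe the TcpTimedWaitDelay change only takes effect after a reboot):
# widen the dynamic (ephemeral) port range from the default 16384 ports
netsh int ipv4 set dynamicport tcp start=10000 num=55535
# shorten the TIME_WAIT delay to its documented 30-second minimum (value is in seconds)
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters' -Name TcpTimedWaitDelay -Value 30 -Type DWord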
Does caddy implement TCP connection pooling or is a new connection opened for every upstream request? Maybe there should be some connection limit to prevent the condition from happening.
When a proxy upstream has an error with one request, that upstream is marked as down for a certain period of time (default is 10s) to give it a chance to recover without suffering performance hits.
We're aware this could be improved; perhaps @nemothekid would have some ideas. Probably related to #667.
I'm not sure why 0.9 would be performing much differently here though.
In this case, since the ports are exhausted, every upstream will technically be marked as down (which is what causes the 502s). The actual reverse proxying is handled by Go's net/http code, and that should support connection pooling via keep-alive (see the DisableKeepAlives value on Transport). I don't immediately see why this wouldn't work; it will probably take a deeper dive into the pooling code in net/http.
@mholt
When a proxy upstream has an error with one request, that upstream is marked as down for a certain period of time (default is 10s) to give it a chance to recover without suffering performance hits.
Thank you for the information. I re-tested using boom (1M requests, 20k users) with the proxy options fail_timeout 0 and max_fails 0, and the behavior changed. Now there are no errors in the log and all requests seem to succeed 👍 (albeit after a while). This seems to be only a partial fix, as an attempt to contact the upstream is made every time.
However, even with this fix, the proxy seems to be making only a small number of requests to the upstream Caddy, as most of the connections linger in the TIME_WAIT state. So it still seems the connection pool is effectively unbounded, with a new upstream connection opened for every incoming request :-(
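For anyone re-running that test, the proxy block from Caddyfile_proxy with those two options added would look something like this (the exact placement inside the block is my assumption; the rest of the Caddyfile is unchanged):
proxy / localhost:580 {
    proxy_header Host {host}
    proxy_header X-Real-IP {remote}
    proxy_header X-Forwarded-Proto {scheme}
    fail_timeout 0
    max_fails 0
}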
Seeing the same in CoreDNS. I'm going to add some socket pooling for this.
@mholt, @nemothekid Could you reproduce the issue? I would be happy to provide more information if you need some.
Also, may I ask whether you plan to address the issue in the near future? IMO, with the default OS configuration, the proxy may not be very usable under production workloads.
Sorry, been busy with a few things -- I will get around to this soon! Anyone is welcome to take a look at this in the meantime.
I may be seeing a related issue when trying to reverse proxy the RethinkDB admin UI. I created a branch in my repo where the issue can be reproduced using Docker containers:
https://github.com/ubergarm/stack-in-a-box/tree/proxy-issue
$ docker --version
Docker version 1.11.2, build b9f10c9
$ docker-compose --version
docker-compose version 1.8.0-rc1, build 9bf6bc6
$ docker run --rm -it ubergarm/caddy --version
Caddy 0.9.0
$ cat /proc/version
Linux version 3.10-2-amd64 ([email protected]) (gcc version 4.7.3 (Debian 4.7.3-6) ) #1 SMP Debian 3.10.7-1 (2013-08-17)
# also confirmed the problem exists on new Ubuntu 16.04 desktop
Goal: reverse proxy the RethinkDB admin interface.
http://localhost {
redir https://localhost{uri}
}
https://localhost {
root /html
proxy /admin rdb:8080 {
without /admin
}
log stdout
errors stderr
tls self_signed
}
From the Dockerfile.caddy you can see that it calls caddy_linux_amd64 -agree. The RethinkDB container is linked to the Caddy container, so Caddy can reach it at the hostname rdb.
I expected to see all 200 responses.
Sometimes I would get a 502 error instead, but this never happens when hitting
the service directly instead of through the reverse proxy.
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:30:49 +0000] "POST /admin/ajax/reql/?conn_id=G+r4LTLCk3MAw0arPi4GAQ== HTTP/2.0" 200 86
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:30:50 +0000] "POST /admin/ajax/reql/?conn_id=G+r4LTLCk3MAw0arPi4GAQ== HTTP/2.0" 200 86
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:30:52 +0000] "POST /admin/ajax/reql/?conn_id=G+r4LTLCk3MAw0arPi4GAQ== HTTP/2.0" 200 86
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:30:52 +0000] "POST /admin/ajax/reql/?conn_id=BhMc83y3tIBZ3hAEbwakVg== HTTP/2.0" 200 308
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:30:52 +0000] "POST /admin/ajax/reql/?conn_id=fOvF/MDLQxOqZ2CzXClMUA== HTTP/2.0" 200 451
caddy_1 | 30/Jul/2016:17:30:52 +0000 [ERROR 502 /ajax/reql/] unreachable backend
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:30:52 +0000] "POST /admin/ajax/reql/?conn_id=ZLD9NCq4/ytwEolR8YHn4Q== HTTP/2.0" 502 16
caddy_1 | 30/Jul/2016:17:30:53 +0000 [ERROR 502 /admin/ajax/reql/] unreachable backend
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:30:53 +0000] "POST /admin/ajax/reql/?conn_id=G+r4LTLCk3MAw0arPi4GAQ== HTTP/2.0" 502 16
caddy_1 | 30/Jul/2016:17:30:57 +0000 [ERROR 502 /admin/ajax/reql/] unreachable backend
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:30:57 +0000] "POST /admin/ajax/reql/?conn_id=BhMc83y3tIBZ3hAEbwakVg== HTTP/2.0" 502 16
caddy_1 | 30/Jul/2016:17:30:57 +0000 [ERROR 502 /admin/ajax/reql/] unreachable backend
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:30:57 +0000] "POST /admin/ajax/reql/?conn_id=fOvF/MDLQxOqZ2CzXClMUA== HTTP/2.0" 502 16
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:20 +0000] "POST /admin/ajax/reql/close-connection?conn_id=fOvF/MDLQxOqZ2CzXClMUA== HTTP/2.0" 200 0
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:20 +0000] "POST /admin/ajax/reql/close-connection?conn_id=G+r4LTLCk3MAw0arPi4GAQ== HTTP/2.0" 200 0
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/open-new-connection?cacheBuster=0.8526856273987615 HTTP/2.0" 200 24
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/?conn_id=JR35LdPBV83pMJiV+ulBew== HTTP/2.0" 200 136
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/close-connection?conn_id=JR35LdPBV83pMJiV+ulBew== HTTP/2.0" 200 0
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/open-new-connection?cacheBuster=0.06600066906892221 HTTP/2.0" 200 24
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/open-new-connection?cacheBuster=0.3583739526723704 HTTP/2.0" 200 24
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/open-new-connection?cacheBuster=0.9304607503322764 HTTP/2.0" 200 24
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:22 +0000] "POST /admin/ajax/reql/?conn_id=JmLkP3jf4Xfc7MEr6Lyyyw== HTTP/2.0" 200 308
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:22 +0000] "POST /admin/ajax/reql/?conn_id=WQdrwkyUaT+BZB1z3lKxbg== HTTP/2.0" 200 86
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:22 +0000] "POST /admin/ajax/reql/?conn_id=HisapNMBfQhcmne46hqqPg== HTTP/2.0" 200 450
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:23 +0000] "POST /admin/ajax/reql/?conn_id=WQdrwkyUaT+BZB1z3lKxbg== HTTP/2.0" 200 86
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:24 +0000] "POST /admin/ajax/reql/?conn_id=WQdrwkyUaT+BZB1z3lKxbg== HTTP/2.0" 200 86
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:25 +0000] "POST /admin/ajax/reql/?conn_id=WQdrwkyUaT+BZB1z3lKxbg== HTTP/2.0" 200 86
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:26 +0000] "POST /admin/ajax/reql/?conn_id=WQdrwkyUaT+BZB1z3lKxbg== HTTP/2.0" 200 85
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:27 +0000] "POST /admin/ajax/reql/?conn_id=JmLkP3jf4Xfc7MEr6Lyyyw== HTTP/2.0" 200 308
caddy_1 | 172.18.0.1 - [30/Jul/2016:17:31:27 +0000] "POST /admin/ajax/reql/?conn_id=HisapNMBfQhcmne46hqqPg== HTTP/2.0" 200 451
git clone https://github.com/ubergarm/stack-in-a-box
cd stack-in-a-box
git checkout proxy-issue
docker-compose build
docker-compose up
Keep an eye on the TCP port status (watch how the TIME_WAIT (tw) count increases):
watch -n.5 -d cat /proc/net/sockstat
Open a browser to https://localhost/admin.
Each reload of the main page adds just under 100 to tw.
Reload the main page and click around the UI, or just wait; as the number of TCP tw sockets grows, a request will eventually be missed. Then, after the default 10-second timeout, the OS may have recycled some sockets and it starts working again.
Thanks @tomasdeml, setting fail_timeout 0 and max_fails 0 alleviates the symptoms for now.
So I was able to actually "fix" the issue by setting MaxIdleConnsPerHost on the transport to something relatively high (I set it to 8192; the default is 2).
diff --git a/caddyhttp/proxy/reverseproxy.go b/caddyhttp/proxy/reverseproxy.go
index e6f759d..2ef45cb 100644
--- a/caddyhttp/proxy/reverseproxy.go
+++ b/caddyhttp/proxy/reverseproxy.go
@@ -167,7 +167,7 @@ func (p *ReverseProxy) ServeHTTP(rw http.ResponseWriter, outreq *http.Request, r
if transport == nil {
transport = http.DefaultTransport
}
-
+ transport.(*http.Transport).MaxIdleConnsPerHost = 8192
p.Director(outreq)
outreq.Proto = "HTTP/1.1"
outreq.ProtoMajor = 1
b@cass3:~$ ab -n 1000000 -c 10000 -k http://10.128.0.5:4040/
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.128.0.5 (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests
Server Software:
Server Hostname: 10.128.0.5
Server Port: 4040
Document Path: /
Document Length: 226 bytes
Concurrency Level: 10000
Time taken for tests: 75.011 seconds
Complete requests: 1000000
Failed requests: 0
Keep-Alive requests: 1000000
Total transferred: 458000000 bytes
HTML transferred: 226000000 bytes
Requests per second: 13331.30 [#/sec] (mean)
Time per request: 750.115 [ms] (mean)
Time per request: 0.075 [ms] (mean, across all concurrent requests)
Transfer rate: 5962.63 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 3 32.5 0 439
Processing: 7 743 300.9 687 3477
Waiting: 7 733 286.8 684 3376
Total: 7 746 302.4 688 3477
Percentage of the requests served within a certain time (ms)
50% 688
66% 797
75% 884
80% 945
90% 1134
95% 1317
98% 1538
99% 1775
100% 3477 (longest request)
Like I suspected, the "problematic" code isn't in Caddy but in Go's HTTP connection pooling implementation (in transport.go). When there is a large number of incoming connections, the pooled idle connections essentially don't matter and a new connection is opened (and possibly closed) for every request.
The pooling functionality has changed a bit in tip; however, I'm not sure whether the new code fixes this issue.
Possible suggestions for moving forward:
1. Expose MaxIdleConnsPerHost as a tuning parameter (something high enough for ab to work; I'm not sure what the optimal number should be).
2. Rely on the existing pooling code in net/http with a larger idle pool.
I recommend we go with option 1 regardless of any other changes, as it's probably a tuning parameter we need to expose.
W.r.t. option 2, I'm not 100% sure the current pooling code is performant. When I tried 2048 for MaxIdleConnsPerHost I still got errors, and I'm not sure whether a server like nginx would be able to handle the same benchmark with the same number of idle connections. I'd also need more info on how nginx manages idle connections to know for sure (for example, does nginx temporarily grow the pool and only close connections after some timeout?).
https://engineering.gosquared.com/optimising-nginx-node-js-and-networking-for-heavy-workloads (see nginx section)
If you are hitting this error, consider increasing the MaxIdleConnsPerHost option (if I end up writing this PR, it will likely be called keepalive, like nginx). I'm not sure what the optimal values are, but if you have 10,000 concurrent connections, the default pool size (2) is likely not enough.
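As a rough illustration of the idea only (this is not Caddy's actual code), a proxy that wants a bigger idle pool would build one dedicated Transport up front and reuse it for every request, instead of mutating the shared http.DefaultTransport inside ServeHTTP as the quick patch above does:

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// upstream address taken from the example Caddyfile in this thread
	upstream, err := url.Parse("http://localhost:580")
	if err != nil {
		log.Fatal(err)
	}

	// One Transport, created once and shared by all requests. Keep-alives are
	// enabled by default (DisableKeepAlives is false); the important knob is
	// MaxIdleConnsPerHost, which otherwise falls back to
	// http.DefaultMaxIdleConnsPerHost (2).
	transport := &http.Transport{
		MaxIdleConnsPerHost: 8192, // illustrative value from the patch above, not a recommendation
	}

	proxy := httputil.NewSingleHostReverseProxy(upstream)
	proxy.Transport = transport

	srv := &http.Server{
		Addr:         ":80",
		Handler:      proxy,
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 30 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}

Mutating http.DefaultTransport from inside the request handler works for a quick experiment, but it changes global state shared by every HTTP client in the process, so a dedicated Transport is the safer shape if this ever becomes a real tuning knob.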
Setting fail_timeout 0 and max_fails 0 only masks the issue: it simply keeps retrying to contact the upstream server until the OS finally clears some of the sockets in TIME_WAIT, at which point it can communicate again.