Caddy: Exhaustion of dynamic TCP ports by reverse proxy middleware

Created on 19 Jul 2016  ·  10 Comments  ·  Source: caddyserver/caddy

1. What version of Caddy are you running (caddy -version)?

v0.8.3 / v0.9.0
OS Windows Server 2012 R2 running on Azure instance 'Standard D3 v2' (4 cores, 14 GB memory)

2. What are you trying to do?

Measure the performance of the Caddy reverse proxy middleware under heavy load using Apache Bench.

3. What is your entire Caddyfile?

caddyfile_proxy:

http:// {
    errors CaddyErrors.log
    header / {
        -Server
    }
    proxy / localhost:580 {
        proxy_header Host {host}
        proxy_header X-Real-IP {remote}
        proxy_header X-Forwarded-Proto {scheme}
    }
}

caddyfile_upstream:

http://:580 {
    errors CaddyErrors_upstream.log
    header / {
        -Server
    }
    root WebRoot
}

The WebRoot folder contains file index.html:

<!DOCTYPE html><html><head><meta charset="utf-8"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="width=device-width,initial-scale=1"><title>1234567 12345</title></head><body></body></html>

4. How did you run Caddy (give the full command and describe the execution environment)?

caddy.exe -conf=Caddyfile_proxy

and then, in a new shell:

caddy.exe -conf=Caddyfile_upstream

5. What did you expect to see?

Apache Bench completing with 0 failed requests.

6. What did you see instead (give full error messages and/or log)?

Apache Bench with the following result (note the failed requests):

This is ApacheBench, Version 2.3 <$Revision: 1748469 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking *** (be patient)

Server Software:        
Server Hostname:        ***
Server Port:            80

Document Path:          /
Document Length:        224 bytes

Concurrency Level:      10000
Time taken for tests:   104.036 seconds
Complete requests:      1000000
Failed requests:        953057
   (Connect: 0, Receive: 0, Length: 953057, Exceptions: 0)
Non-2xx responses:      953057
Keep-Alive requests:    1000000
Total transferred:      179237926 bytes
HTML transferred:       25764144 bytes
Requests per second:    9612.05 [#/sec] (mean)
Time per request:       1040.361 [ms] (mean)
Time per request:       0.104 [ms] (mean, across all concurrent requests)
Transfer rate:          1682.46 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0      20
Processing:    62  445 1122.4    219   65240
Waiting:       62  445 1122.4    219   65240
Total:         62  445 1122.4    219   65240

Percentage of the requests served within a certain time (ms)
  50%    219
  66%    313
  75%    359
  80%    422
  90%    656
  95%   1391
  98%   2416
  99%   3957
 100%  65240 (longest request)

CaddyErrors.log:

19/Jul/2016:07:33:31 +0000 [ERROR 502 /] unreachable backend
19/Jul/2016:07:33:31 +0000 [ERROR 502 /] unreachable backend
19/Jul/2016:07:33:31 +0000 [ERROR 502 /] unreachable backend
...

7. How can someone who is starting from scratch reproduce this behavior as minimally as possible?

Run Apache Bench on another machine with the following parameters (note 10k concurrent connections):

ab.exe -n 1000000 -c 10000 -k http://remote-machine-running-caddy/

Hypothesis
The failed requests are most likely caused by the exhaustion of ephemeral TCP ports by the reverse proxy middleware. When querying TCP connections on the machine running Caddy, I get:

PS C:\Users\localadmin> Get-NetTCPConnection

LocalAddress                        LocalPort RemoteAddress                       RemotePort State
------------                        --------- -------------                       ---------- -----
127.0.0.1                           64381     127.0.0.1                           580        TimeWait
127.0.0.1                           64380     127.0.0.1                           580        TimeWait
127.0.0.1                           64379     127.0.0.1                           580        TimeWait
127.0.0.1                           64378     127.0.0.1                           580        TimeWait
127.0.0.1                           64377     127.0.0.1                           580        TimeWait
127.0.0.1                           64376     127.0.0.1                           580        TimeWait
127.0.0.1                           64375     127.0.0.1                           580        TimeWait
127.0.0.1                           64374     127.0.0.1                           580        TimeWait
127.0.0.1                           64373     127.0.0.1                           580        TimeWait
127.0.0.1                           64372     127.0.0.1                           580        TimeWait
127.0.0.1                           64371     127.0.0.1                           580        TimeWait
127.0.0.1                           64370     127.0.0.1                           580        TimeWait
...
(followed by thousands of entries)

PS C:\Users\localadmin> Get-NetTCPConnection -RemotePort 580 -LocalAddress '::1' | measure
Count    : 0
Average  :
Sum      :
Maximum  :
Minimum  :
Property :
PS C:\Users\localadmin> Get-NetTCPConnection -RemotePort 580 -LocalAddress '127.0.0.1' | measure
Count    : 15629
Average  :
Sum      :
Maximum  :
Minimum  :
Property :
PS C:\Users\localadmin> netsh int ipv4 show dynamicportrange tcp
Protocol tcp Dynamic Port Range
---------------------------------
Start Port      : 49152
Number of Ports : 16384

Once the exhaustion occurs, the proxy starts failing by returning 502 Bad Gateway errors and recovers only after the ports are recycled.
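
As a rough sanity check (assuming Windows holds closed sockets in TIME_WAIT for on the order of two minutes): with only 16,384 dynamic ports and no connection reuse, the machine can sustain roughly 16384 / 120 ≈ 136 new upstream connections per second. At ~9,600 requests per second with one upstream connection per request, the whole range is burned through in under two seconds, which matches how quickly the 502s start appearing.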

The behaviour is worse on Caddy v0.9.0, as the ports seem to be exhausted faster. Also, after benchmarking v0.9.0, I see a large number of IPv6 connections as well:

PS C:\Users\localadmin> Get-NetTCPConnection -RemotePort 580 -LocalAddress '::1' | measure
Count    : 11697
Average  :
Sum      :
Maximum  :
Minimum  :
Property :
PS C:\Users\localadmin> Get-NetTCPConnection -RemotePort 580 -LocalAddress '127.0.0.1' | measure
Count    : 8286
Average  :
Sum      :
Maximum  :
Minimum  :
Property :

I can reproduce the issue with Gatling as well. To rule out issues with the caddy_upstream instance, I ran the benchmark against it directly and received no failed requests.

I know it is possible to reconfigure the OS to use more ephemeral ports and to shorten the TIME_WAIT period. The condition could also be mitigated by having multiple upstream hosts, but once it occurs, the proxy is basically dead until the ports are recycled.
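
For reference, this is the kind of OS-level tuning I mean (the exact values are only illustrative, not recommendations): widening the dynamic port range with netsh and shortening TIME_WAIT via the TcpTimedWaitDelay registry value (in seconds, minimum 30):

netsh int ipv4 set dynamicport tcp start=10000 num=55536
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f

Both only delay the exhaustion; they do not address the lack of connection reuse.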

Does Caddy implement TCP connection pooling, or is a new connection opened for every upstream request? Maybe there should be some connection limit to prevent this condition from happening.

All 10 comments

When a proxy upstream has an error with one request, that upstream is marked as down for a certain period of time (default is 10s) to give it a chance to recover without suffering performance hits.

We're aware this could be improved; perhaps @nemothekid would have some ideas. Probably related to #667.

I'm not sure why 0.9 would be performing much differently here though.

In this case, since the ports are exhausted, every upstream will technically be marked as down (which causes the 502s). The actual reverse proxying is handled by Go's net/http code, and that should support connection pooling (via keep-alive; see the DisableKeepAlives value on Transport). I don't immediately see why this wouldn't work - it will probably take a deeper dive into the pooling code in net/http.
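
To make that concrete, here is a minimal sketch (plain net/http, not Caddy's code) of the Transport fields in play. The stock Transport keeps only http.DefaultMaxIdleConnsPerHost (2) idle connections per host, so under heavy concurrency most upstream connections cannot be returned to the pool and are closed, ending up in TIME_WAIT on the proxy side:

package main

import (
    "fmt"
    "net/http"
    "time"
)

func main() {
    // Explicit values for the fields discussed above. Keep-alive (connection
    // reuse) is on by default; DisableKeepAlives would turn it off entirely.
    transport := &http.Transport{
        DisableKeepAlives:   false,
        MaxIdleConnsPerHost: 2, // same as http.DefaultMaxIdleConnsPerHost
    }
    client := &http.Client{Transport: transport, Timeout: 10 * time.Second}

    // Any connection that cannot fit into the 2 idle slots when a request
    // finishes is closed by the client side and lands in TIME_WAIT.
    resp, err := client.Get("http://localhost:580/")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    resp.Body.Close()
    fmt.Println("status:", resp.Status)
}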

@mholt

When a proxy upstream has an error with one request, that upstream is marked as down for a certain period of time (default is 10s) to give it a chance to recover without suffering performance hits.

Thank you for the information. I re-tested using _boom_ (1M reqs, 20k users) with the proxy options fail_timeout 0 and max_fails 0, and the behavior changed. Now there are no errors in the log and all requests seem to succeed :+1: (albeit after a while). This is only a partial fix, though, as an attempt to contact the upstream is made every time.

However, even with this fix, the proxy seems to be making only a small number of requests to the upstream Caddy, as most of the connections linger in the TIME_WAIT state. So it still seems there is no effective connection pooling - a new upstream connection is opened for every incoming request :-(
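
For reference, this is roughly how the mitigation looks when applied to the caddyfile_proxy from the top of this issue (same proxy options I used, added to the existing block):

http:// {
    errors CaddyErrors.log
    header / {
        -Server
    }
    proxy / localhost:580 {
        proxy_header Host {host}
        proxy_header X-Real-IP {remote}
        proxy_header X-Forwarded-Proto {scheme}
        fail_timeout 0
        max_fails 0
    }
}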

Seeing the same in CoreDNS. I'm going to add some socket pooling for this.

@mholt, @nemothekid Could you reproduce the issue? I would be happy to provide more information if you need some.

Also, may I ask if you plan to address the issue in the near future? IMO, with the default OS configuration, the proxy may not be very usable under production workloads.

Sorry, been busy with a few things -- I will get around to this soon! Anyone is welcome to take a look at this in the meantime.

I may be seeing a related issue when trying to reverse proxy the RethinkDB admin UI. I created a branch in my repo where one can reproduce the issue using Docker containers:

https://github.com/ubergarm/stack-in-a-box/tree/proxy-issue

1. What version of Caddy are you running (caddy -version)?

$ docker --version
Docker version 1.11.2, build b9f10c9
$ docker-compose --version
docker-compose version 1.8.0-rc1, build 9bf6bc6
$ docker run --rm -it ubergarm/caddy --version
Caddy 0.9.0
$ cat /proc/version
Linux version 3.10-2-amd64 ([email protected]) (gcc version 4.7.3 (Debian 4.7.3-6) ) #1 SMP Debian 3.10.7-1 (2013-08-17)
# also confirmed the problem exists on new Ubuntu 16.04 desktop

2. What are you trying to do?

Reverse proxy rethinkdb Admin interface.

3. What is your entire Caddyfile?

http://localhost {
    redir https://localhost{uri}
}

https://localhost {
    root /html

    proxy /admin rdb:8080 {
        without /admin
    }

    log stdout
    errors stderr

    tls self_signed
}

4. How did you run Caddy (give the full command and describe the execution environment)?

From the Dockerfile.caddy you can see it is calling caddy_linux_amd64 -agree. The RethinkDB container is linked to the Caddy container, so Caddy can reach it at the hostname rdb.

5. What did you expect to see?

I expected to see all 200 responses.

6. What did you see instead (give full error messages and/or log)?

Sometimes I would get a 502 error instead, but this never happens when hitting the service directly instead of through the reverse proxy.

caddy_1  | 172.18.0.1 - [30/Jul/2016:17:30:49 +0000] "POST /admin/ajax/reql/?conn_id=G+r4LTLCk3MAw0arPi4GAQ== HTTP/2.0" 200 86
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:30:50 +0000] "POST /admin/ajax/reql/?conn_id=G+r4LTLCk3MAw0arPi4GAQ== HTTP/2.0" 200 86
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:30:52 +0000] "POST /admin/ajax/reql/?conn_id=G+r4LTLCk3MAw0arPi4GAQ== HTTP/2.0" 200 86
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:30:52 +0000] "POST /admin/ajax/reql/?conn_id=BhMc83y3tIBZ3hAEbwakVg== HTTP/2.0" 200 308
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:30:52 +0000] "POST /admin/ajax/reql/?conn_id=fOvF/MDLQxOqZ2CzXClMUA== HTTP/2.0" 200 451
caddy_1  | 30/Jul/2016:17:30:52 +0000 [ERROR 502 /ajax/reql/] unreachable backend
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:30:52 +0000] "POST /admin/ajax/reql/?conn_id=ZLD9NCq4/ytwEolR8YHn4Q== HTTP/2.0" 502 16
caddy_1  | 30/Jul/2016:17:30:53 +0000 [ERROR 502 /admin/ajax/reql/] unreachable backend
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:30:53 +0000] "POST /admin/ajax/reql/?conn_id=G+r4LTLCk3MAw0arPi4GAQ== HTTP/2.0" 502 16
caddy_1  | 30/Jul/2016:17:30:57 +0000 [ERROR 502 /admin/ajax/reql/] unreachable backend
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:30:57 +0000] "POST /admin/ajax/reql/?conn_id=BhMc83y3tIBZ3hAEbwakVg== HTTP/2.0" 502 16
caddy_1  | 30/Jul/2016:17:30:57 +0000 [ERROR 502 /admin/ajax/reql/] unreachable backend
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:30:57 +0000] "POST /admin/ajax/reql/?conn_id=fOvF/MDLQxOqZ2CzXClMUA== HTTP/2.0" 502 16
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:20 +0000] "POST /admin/ajax/reql/close-connection?conn_id=fOvF/MDLQxOqZ2CzXClMUA== HTTP/2.0" 200 0
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:20 +0000] "POST /admin/ajax/reql/close-connection?conn_id=G+r4LTLCk3MAw0arPi4GAQ== HTTP/2.0" 200 0
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/open-new-connection?cacheBuster=0.8526856273987615 HTTP/2.0" 200 24
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/?conn_id=JR35LdPBV83pMJiV+ulBew== HTTP/2.0" 200 136
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/close-connection?conn_id=JR35LdPBV83pMJiV+ulBew== HTTP/2.0" 200 0
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/open-new-connection?cacheBuster=0.06600066906892221 HTTP/2.0" 200 24
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/open-new-connection?cacheBuster=0.3583739526723704 HTTP/2.0" 200 24
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:21 +0000] "POST /admin/ajax/reql/open-new-connection?cacheBuster=0.9304607503322764 HTTP/2.0" 200 24
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:22 +0000] "POST /admin/ajax/reql/?conn_id=JmLkP3jf4Xfc7MEr6Lyyyw== HTTP/2.0" 200 308
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:22 +0000] "POST /admin/ajax/reql/?conn_id=WQdrwkyUaT+BZB1z3lKxbg== HTTP/2.0" 200 86
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:22 +0000] "POST /admin/ajax/reql/?conn_id=HisapNMBfQhcmne46hqqPg== HTTP/2.0" 200 450
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:23 +0000] "POST /admin/ajax/reql/?conn_id=WQdrwkyUaT+BZB1z3lKxbg== HTTP/2.0" 200 86
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:24 +0000] "POST /admin/ajax/reql/?conn_id=WQdrwkyUaT+BZB1z3lKxbg== HTTP/2.0" 200 86
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:25 +0000] "POST /admin/ajax/reql/?conn_id=WQdrwkyUaT+BZB1z3lKxbg== HTTP/2.0" 200 86
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:26 +0000] "POST /admin/ajax/reql/?conn_id=WQdrwkyUaT+BZB1z3lKxbg== HTTP/2.0" 200 85
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:27 +0000] "POST /admin/ajax/reql/?conn_id=JmLkP3jf4Xfc7MEr6Lyyyw== HTTP/2.0" 200 308
caddy_1  | 172.18.0.1 - [30/Jul/2016:17:31:27 +0000] "POST /admin/ajax/reql/?conn_id=HisapNMBfQhcmne46hqqPg== HTTP/2.0" 200 451

7. How can someone who is starting from scratch reproduce this behavior as minimally as possible?

To reproduce the issue:

git clone https://github.com/ubergarm/stack-in-a-box
cd stack-in-a-box
git checkout proxy-issue
docker-compose build
docker-compose up

Keep an eye on the TCP socket status (see how the TIME_WAIT (tw) count increases):

watch -n.5 -d cat /proc/net/sockstat

Open a browser to https://localhost/admin.

Each reload of the main page adds just under 100 to tw.

Hypothesis

Reload the main page and click around the UI, or just wait; it seems that as the number of TCP tw sockets grows, a request will eventually be missed. Then, after the default 10-second timeout, the OS may have recycled some sockets and it starts working again.

Quick Fix

Thanks @tomasdeml, setting fail_timeout 0 and max_fails 0 alleviates the symptoms for now.

So I was able to actually "fix" the issue by setting MaxIdleConnsPerHost on the transport to something relatively high (I set it to 8192, the default is 2).

diff --git a/caddyhttp/proxy/reverseproxy.go b/caddyhttp/proxy/reverseproxy.go
index e6f759d..2ef45cb 100644
--- a/caddyhttp/proxy/reverseproxy.go
+++ b/caddyhttp/proxy/reverseproxy.go
@@ -167,7 +167,7 @@ func (p *ReverseProxy) ServeHTTP(rw http.ResponseWriter, outreq *http.Request, r
        if transport == nil {
                transport = http.DefaultTransport
        }
-
+       transport.(*http.Transport).MaxIdleConnsPerHost = 8192
        p.Director(outreq)
        outreq.Proto = "HTTP/1.1"
        outreq.ProtoMajor = 1
b@cass3:~$ ab -n 1000000 -c 10000 -k http://10.128.0.5:4040/
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.128.0.5 (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests


Server Software:        
Server Hostname:        10.128.0.5
Server Port:            4040

Document Path:          /
Document Length:        226 bytes

Concurrency Level:      10000
Time taken for tests:   75.011 seconds
Complete requests:      1000000
Failed requests:        0
Keep-Alive requests:    1000000
Total transferred:      458000000 bytes
HTML transferred:       226000000 bytes
Requests per second:    13331.30 [#/sec] (mean)
Time per request:       750.115 [ms] (mean)
Time per request:       0.075 [ms] (mean, across all concurrent requests)
Transfer rate:          5962.63 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    3  32.5      0     439
Processing:     7  743 300.9    687    3477
Waiting:        7  733 286.8    684    3376
Total:          7  746 302.4    688    3477

Percentage of the requests served within a certain time (ms)
  50%    688
  66%    797
  75%    884
  80%    945
  90%   1134
  95%   1317
  98%   1538
  99%   1775
 100%   3477 (longest request)

Like I suspected, the "problematic" code isn't in Caddy but in Go's HTTP pooling implementation (in transport.go). When there is a large number of incoming connections, the number of pooled idle connections essentially doesn't matter and a new connection is opened (and maybe closed?) for every request.

The pooling functionality has changed a bit in Go tip; however, I'm not sure whether the new code fixes this issue.

Possible suggestions for moving forward:

  1. Expose a keepalive directive (like nginx's) that allows you to specify the maximum number of idle connections you want to keep around (although I had to set it pretty high for ab to work - I'm not sure what the optimal number should be); a hypothetical sketch of such a directive follows this list.
  2. See if there are any optimizations you can make in golang's connection pooling to alleviate this issue.
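
To make option 1 concrete, a hypothetical sketch of what such a directive could look like in the Caddyfile from this issue (the keepalive name and its value are made up for illustration; nothing like this exists in Caddy yet):

proxy / localhost:580 {
    proxy_header Host {host}
    keepalive 8192   # hypothetical: maximum idle connections to keep open to this upstream
}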

I recommend we go with option 1 regardless of any other changes, as it's probably a tuning parameter we need to expose anyway.

With respect to option 2, I'm not 100% sure the current pooling code is performant. When I tried with MaxIdleConnsPerHost set to 2048, I still got errors, and I'm not sure whether a server like nginx would be able to handle the same benchmark with the same number of idle connections. I'd also need more info on how nginx manages idle connections to know for sure (for example, does nginx temporarily grow the pool and only close connections after some timeout?).

https://engineering.gosquared.com/optimising-nginx-node-js-and-networking-for-heavy-workloads (see nginx section)

If you are hitting this error, consider increasing the MaxIdleConnsPerHost option (if I end up writing this PR, it will likely be called keepalive, like nginx's). I'm not sure what the optimal values are, but if you have 10,000 concurrent connections, the default pool size (2) is likely not enough.
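
If you want something to experiment with outside of Caddy, here is a minimal, self-contained sketch of the same idea using Go's httputil reverse proxy with an enlarged idle pool (the 8192 value and the ports simply mirror the numbers used in this thread):

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    // Upstream to proxy to; :580 mirrors the upstream used in this issue.
    upstream, err := url.Parse("http://localhost:580")
    if err != nil {
        log.Fatal(err)
    }

    proxy := httputil.NewSingleHostReverseProxy(upstream)

    // Keep many more idle upstream connections around for reuse instead of
    // closing them after each request (the stock default is 2).
    proxy.Transport = &http.Transport{
        MaxIdleConnsPerHost: 8192,
    }

    // Listen on :4040, the port used in the ab run above.
    log.Fatal(http.ListenAndServe(":4040", proxy))
}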

Setting fail_timeout 0 and max_fails 0 only masks the issue; it simply keeps retrying to contact the upstream server until the OS finally clears some of the sockets in TIME_WAIT, at which point the proxy can communicate again.
