Using gunicorn 19.4.5 inside a docker image, which is running on Ubuntu 14.04 LTS on AWS, I see constant worker timeouts when running with >1 sync worker and the default 30s timeout period. This may be related to #588 or #942, although this is a significantly newer version of gunicorn so I couldn't tell for sure.
Circumstantial evidence over several runs seems to suggest that one of the workers is not affected, and it's only the N-1 remaining workers that keep failing and restarting. However, I was only able to get a few data points that suggest this, so don't take it as a strong signal.
Here are things I checked:
I ran `stat -L` on the deleted (unlinked) worker temp files, but preventing gunicorn from unlinking the file also results in `stat -L` showing that the ctime is being updated as usual. I added some code to `workertmp.py` to log the ctime read on each check, and here is one such log (interleaved between all workers):
[2016-01-29 00:21:38 +0000] [3238] [INFO] Starting gunicorn 19.4.0
[2016-01-29 00:21:38 +0000] [3238] [INFO] Listening at: http://0.0.0.0:5000 (3238)
[2016-01-29 00:21:38 +0000] [3238] [INFO] Using worker: sync
[2016-01-29 00:21:38 +0000] [3243] [INFO] Booting worker with pid: 3243
[2016-01-29 00:21:38 +0000] [3244] [INFO] Booting worker with pid: 3244
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026899.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026900.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026900.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026900.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026900.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026905.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026905.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026905.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026905.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026909.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026910.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026909.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026910.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026909.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026910.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026909.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026910.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026909.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026933.73
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026934.74
FOOBAR: modify /tmp/wgun[2016-01-29 00:22:25 +0000] [3238] [CRITICAL] WORKER TIMEOUT (pid:3243)
[2016-01-29 00:22:25 +0000] [3243] [INFO] Worker exiting (pid: 3243)
[2016-01-29 00:22:25 +0000] [3330] [INFO] Booting worker with pid: 3330
icorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunic54026934.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time:fy] [3238] [CRITICAL] WORKER TIMEOUT (pid:3243)
[2016-01-29 00p/2:25 +0000] [3243] [INFO] Worker exiting (pid: 3243)
[2016-0ic29 00:22:25 +0000] [3330] [INFO] Booting worker with pid: 33myicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorntiyjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt45ime: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026945.82
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026946.01
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026946.01
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026946.01
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8corn-myjLwI time:fy] [3238] [CRITICAL] WORKER TIMEOUT (pid:32BA)
[2016-01-29 00p/2:25 +0000] [3243] [INFO] Worker exiting (pdify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 timeodify /tmp/wgunicorn-myjLwI time: 1454026964.74
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026965.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026965.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026965.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026965.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026969.74
FOO[2016-01-29 00:22:59 +0000] [3238] [CRITICAL] WORKER TIMEOUT (pid:3330)
[2016-01-29 00:22:59 +0000] [3330] [INFO] Worker exiting (pid: 3330)
icorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026939.74
FOOBAR: modify /tmp/wgunicorn-LwI time: 1454026965.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 tI time: 1454026940.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026940.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026940.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026940.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026944.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
[2016-01-29 00:22:59 +0000] [3396] [INFO] Booting worker with pid: 3396
BAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/w79.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026979.97
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
80.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454029.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454029.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454029.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
WORKER TIMEOUT (pid:3396)
[2016-01-29 00:23:31 +0000] [3396] [INFO] Worker exiting (pid: 3396)
BAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026974.74
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
FOOBAR: modify /tmp/wgunicorn-Mw6nicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027005.0
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027010.0
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027010.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027011.08
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027011.28
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027011.28
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027011.28
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmprn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-Znicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027028.98
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wguicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wguicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wguicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn--ZmVaVS time: 1454027014.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027035.0
FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
FOOBAR: modify /t[2016-01-29 00:24:05 +0000] [3238] [CRITICAL] WORKER TIMEOUT (pid:3453)
[2016-01-29 00:24:05 +0000] [3453] [INFO] Worker exiting (pid: 3453)
Hmm, from the log above I can see that the time is never changing. I will check.
Thanks, I appreciate it!
I might be affected by this bug.
I am running gunicorn inside a Docker image on AWS (Elastic Beanstalk + EC2 Container Service).
There are constant worker timeouts that start happening almost immediately after initialization, and only when running the Docker image on AWS.
The same HTTP request takes sometimes a few ms, sometimes 20-30 seconds. Doesn't matter if it's for a non-existent page, a simple non-authenticated "Hello world"-string page or a page that fetches records from a database.
Additional details:
Environments _not_ affected:
Amazon container logs:
[2016-06-15 16:46:15 +0000] [1] [DEBUG] Current configuration:
cert_reqs: 0
post_request: <function PostRequest.post_request at 0x7fb13e187c80>
worker_connections: 1000
tmp_upload_dir: None
enable_stdio_inheritance: False
timeout: 30
pre_fork: <function Prefork.pre_fork at 0x7fb13e1871e0>
worker_int: <function WorkerInt.worker_int at 0x7fb13e1876a8>
backlog: 2048
max_requests: 0
max_requests_jitter: 0
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
logconfig: None
syslog_addr: udp://localhost:514
preload_app: False
pidfile: None
umask: 0
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
paste: None
proxy_protocol: False
worker_class: sync
on_starting: <function OnStarting.on_starting at 0x7fb13e414c80>
worker_abort: <function WorkerAbort.worker_abort at 0x7fb13e187840>
worker_exit: <function WorkerExit.worker_exit at 0x7fb13e187e18>
config: config/gunicorn.py
nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7fb13e18a048>
daemon: False
accesslog: None
errorlog: -
loglevel: debug
syslog_prefix: None
ssl_version: 3
suppress_ragged_eofs: True
limit_request_field_size: 8190
reload: False
logger_class: gunicorn.glogging.Logger
statsd_host: None
keyfile: None
raw_env: []
threads: 1
django_settings: None
proc_name: None
proxy_allow_ips: ['127.0.0.1']
limit_request_fields: 100
ciphers: TLSv1
check_config: False
do_handshake_on_connect: False
post_worker_init: <function PostWorkerInit.post_worker_init at 0x7fb13e187510>
graceful_timeout: 30
worker_tmp_dir: None
certfile: None
sendfile: None
keepalive: 2
chdir: /app/testapp
when_ready: <function WhenReady.when_ready at 0x7fb13e187048>
ca_certs: None
on_exit: <function OnExit.on_exit at 0x7fb13e18a1e0>
spew: False
bind: [':8000']
post_fork: <function Postfork.post_fork at 0x7fb13e187378>
limit_request_line: 4094
syslog_facility: user
workers: 3
syslog: False
pre_request: <function PreRequest.pre_request at 0x7fb13e187b70>
user: 0
group: 0
forwarded_allow_ips: ['127.0.0.1']
pythonpath: None
on_reload: <function OnReload.on_reload at 0x7fb13e414e18>
pre_exec: <function PreExec.pre_exec at 0x7fb13e1879d8>
default_proc_name: config.wsgi
statsd_prefix:
[2016-06-15 16:46:15 +0000] [1] [INFO] Starting gunicorn 19.5.0
[2016-06-15 16:46:15 +0000] [1] [DEBUG] Arbiter booted
[2016-06-15 16:46:15 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2016-06-15 16:46:15 +0000] [1] [INFO] Using worker: sync
[2016-06-15 16:46:15 +0000] [8] [INFO] Booting worker with pid: 8
[2016-06-15 16:46:15 +0000] [9] [INFO] Booting worker with pid: 9
[2016-06-15 16:46:15 +0000] [10] [INFO] Booting worker with pid: 10
[2016-06-15 16:46:15 +0000] [1] [DEBUG] 3 workers
[2016-06-15 16:46:52 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
[2016-06-15 16:46:52 +0000] [9] [INFO] Worker exiting (pid: 9)
[2016-06-15 16:46:52 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:10)
[2016-06-15 16:46:52 +0000] [10] [INFO] Worker exiting (pid: 10)
[2016-06-15 16:46:53 +0000] [20] [INFO] Booting worker with pid: 20
[2016-06-15 16:46:53 +0000] [1] [DEBUG] 2 workers
[2016-06-15 16:46:53 +0000] [21] [INFO] Booting worker with pid: 21
[2016-06-15 16:46:53 +0000] [1] [DEBUG] 3 workers
[2016-06-15 16:47:23 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
(...)
Thanks.
I am seeing this too. My suspicion is the Elastic Load Balancer. We used to use `paste` at a previous company and it was causing issues with the ELB. I believe the ELB re-uses connections somehow, and this was causing hiccups in `paste`.
I am now at a new job where we use gunicorn, and everything was fine until we began to put the server behind an ELB; the webapp is now often very unresponsive and seemingly hangs.
Try using the `gevent` workers instead; this has solved the issue for me.
So is this an issue with the ELB keeping connections alive and tying up the sync worker? That would strike me as very weird since we deliberately signal that the connection should be closed with the sync worker: https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/sync.py#L168
We could close the connection deliberately, though. If anyone wants to test moving the `client.close()` call to be unconditional, that might be an interesting test: https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/sync.py#L199
You could also make sure that the idle timeout on the ELB is lower than the worker timeout, as a quick sanity check that this is even the cause of the issue: http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html
@tilgovi the ELB thing is my current best guess, because a) it happens only if gunicorn is behind an ELB and b) the issue is fixed by using the `gevent` worker. But then again, who knows...
It happens only behind an ELB, and only if the protocol on the ELB is set to HTTP or HTTPS.
If you change the protocol to TCP or SSL, the problem disappears.
Does anyone know how to fix it while still using sync workers? I have used the fix I described (TCP instead of HTTP) until now, but now I need to monitor access logs and see the X-Forwarded-For header, which is only enabled for HTTP/HTTPS on the ELB :)
Strange, just to make sure we are on the same page, this means there is some kind of optimization when using the HTTP/HTTPS option in the ELB? And this optimization is causing issues with the sync workers?
@ecs I think that in the case of HTTP/HTTPS the load balancer reuses the same TCP connection, which the ELB keeps alive for a long time, or something of that sort. But it doesn't explain why the gevent worker works fine... (although I haven't tested gevent yet, just read it in these comments)
Btw, there are certainly a lot of factors (gunicorn version, worker class, inside Docker or not, ELB protocol) which can influence this, so it would be great if someone could test and confirm that the TCP protocol on the ELB works where HTTP does not. I'm testing on Docker hosted on Ubuntu 14.04 and gunicorn version 19.3.0.
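For anyone who wants to script the TCP-listener workaround described above, here is a rough sketch using boto3; it assumes a Classic ELB, and the load balancer name and ports are placeholders rather than values from this thread:

```python
# Hypothetical sketch: switch a Classic ELB listener from HTTP to plain TCP.
# "my-elb" and the port numbers are placeholders.
import boto3

elb = boto3.client("elb")

# Remove the existing HTTP listener on port 80...
elb.delete_load_balancer_listeners(LoadBalancerName="my-elb", LoadBalancerPorts=[80])

# ...and recreate it as TCP, forwarding straight to gunicorn on the instance port.
elb.create_load_balancer_listeners(
    LoadBalancerName="my-elb",
    Listeners=[{
        "Protocol": "TCP",
        "LoadBalancerPort": 80,
        "InstanceProtocol": "TCP",
        "InstancePort": 8000,
    }],
)
```

Keep in mind that, as noted above, a TCP listener means the ELB no longer injects the X-Forwarded-For header.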
I have the same problem:
[2016-10-03 12:13:17 +0000] [6] [CRITICAL] WORKER TIMEOUT (pid:16)
[2016-10-03 12:13:17 +0000] [16] [INFO] Worker exiting (pid: 16)
[2016-10-03 12:13:17 +0000] [6] [CRITICAL] WORKER TIMEOUT (pid:20)
[2016-10-03 12:13:17 +0000] [20] [INFO] Worker exiting (pid: 20)
[2016-10-03 12:13:17 +0000] [33] [INFO] Booting worker with pid: 33
My setup:
I am using kube-aws to set up a kubernetes cluster with core-os on AWS: https://github.com/coreos/coreos-kubernetes/tree/master/multi-node/aws
I have a Python app using gunicorn and sync workers. The service for it uses an HTTPS load balancer with my site's certificate. All workers seem to be considered timed out after the default timeout of 30s. They are eventually all restarted, and Kubernetes eventually restarts the container when it exits; these restarts happen continuously. Someone above mentioned that gevent solves the problem, and I may try that out, but I'd really like to understand why this is happening.
We're seeing this issue too currently, I'll try and post more details when I get a chance. It is related to the sync worker as moving to gevent does indeed solve the issue. This is on AWS with ECS and ELB.
Had the same weird problem as well with one of our apps. Changing the listeners from HTTP to TCP indeed solves the problem.
Can anyone provide me with a step-by-step way to reproduce the issue (config steps and such)? Or at least point me to a simple way to set up that environment :) I will have a look at it this week.
Relying on HTTP for such a balancer generally requires handling keep-alive behind it (except for TCP), so anything using a sync worker may be an issue until the sync worker is updated to support it.
Anyway let me know.
Just set up a basic Flask app and run it with gunicorn with sync workers inside docker on AWS, with HTTP load-balancing and the configuration from above:
Using gunicorn 19.4.5 inside a docker image, which is running on Ubuntu 14.04 LTS on AWS, I see constant worker timeouts when running with >1 sync worker and the default 30s timeout period.
You don't have to make any requests to the application, just monitor the logs and you'll see workers timing out.
I am currently using gevent workers with a 65s timeout (the AWS ELB timeout is 60s, AFAIK). I still sometimes experience 504 errors even when only one client makes requests, with 1 request in flight at a time, but I wasn't able to pin down a reason just yet.
@obi1kenobi Did you try to increase the keepalive to beyond the ELB timeout option?
@lenucksi the keepalive is already higher than the ELB timeout by 5s -- hopefully the servers' clocks don't drift quite that much :)
I too encountered this issue running gunicorn 19.6 inside a Docker container on an EC2 instance behind an ELB (using the default HTTP protocol for reverse-proxying to the EC2 instance).
As suggested earlier, setting the worker class to gevent resolved this issue for me (where previously I had been using the default `sync` worker class).
I'll try and document a minimal example for reproducing the issue.
To corroborate what others have said, the ELB seems to be opening a connection to the EC2 instance and holding it open in a way that locks up gunicorn from handling additional requests.
When that held-open connection terminates due to gunicorn timing out the request from the ELB, all pending requests queued up behind it also fail (including the ELB's health check, causing the instance to be removed from the ELB, making this harder to debug ;) ).
This symptom suggests this is a duplicate of what's been observed in this previous issue that is unrelated to Docker: https://github.com/benoitc/gunicorn/issues/588
@jelis Do you know if this is related to some kubernetes configuration? Was your issue solved?
I'm also on an ELB with SSL termination. I'm seeing this in my logs:
[2017-03-01 07:36:32 +0000] [1249] [DEBUG] GET /healthcheck
[01/Mar/2017:07:36:32 +0000] "GET /healthcheck HTTP/1.1" 200 22 "-" "ELB-HealthChecker/1.0"
[2017-03-01 07:36:37 +0000] [12] [CRITICAL] WORKER TIMEOUT (pid:1203)
[2017-03-01 07:36:37 +0000] [12] [CRITICAL] WORKER TIMEOUT (pid:1225)
[2017-03-01 07:36:37 +0000] [12] [CRITICAL] WORKER TIMEOUT (pid:1234)
[2017-03-01 07:36:37 +0000] [12] [CRITICAL] WORKER TIMEOUT (pid:1243)
[2017-03-01 07:36:37 +0000] [1203] [INFO] Worker exiting (pid: 1203)
[2017-03-01 07:36:37 +0000] [1225] [INFO] Worker exiting (pid: 1225)
[2017-03-01 07:36:37 +0000] [1243] [INFO] Worker exiting (pid: 1243)
[2017-03-01 07:36:37 +0000] [1234] [INFO] Worker exiting (pid: 1234)
[2017-03-01 07:36:37 +0000] [1256] [INFO] Booting worker with pid: 1256
[2017-03-01 07:36:37 +0000] [12] [DEBUG] 14 workers
[2017-03-01 07:36:37 +0000] [1257] [INFO] Booting worker with pid: 1257
[2017-03-01 07:36:37 +0000] [1262] [INFO] Booting worker with pid: 1262
[2017-03-01 07:36:37 +0000] [1265] [INFO] Booting worker with pid: 1265
[2017-03-01 07:36:37 +0000] [12] [DEBUG] 16 workers
[2017-03-01 07:36:40 +0000] [1262] [DEBUG] GET /healthcheck
Update - I just switched from gunicorn to gevent (http://flask.pocoo.org/docs/0.12/deploying/wsgi-standalone/#gevent) and the problem is resolved. This is in reference to my above comment https://github.com/benoitc/gunicorn/issues/1194#issuecomment-283269046
I have been hammering on this issue for a couple of days. Everyone's notes here are fantastic, and are giving us a number of avenues to pursue fixes/workarounds/mitigations. (the log of my work is in the bug we're tracking here)
I just attempted the use of gevent with a boost in the worker count and that didn't help us at all - in fact it surprisingly made it look worse - all five gevent workers died off at the same time, which is a new behaviour from what I've read above:
[2017-03-30 23:38:37 +0000] [5] [INFO] Starting gunicorn 19.7.1
[2017-03-30 23:38:37 +0000] [5] [INFO] Listening at: http://0.0.0.0:8000 (5)
[2017-03-30 23:38:37 +0000] [5] [INFO] Using worker: eventlet
[2017-03-30 23:38:38 +0000] [8] [INFO] Booting worker with pid: 8
[2017-03-30 23:38:40 +0000] [9] [INFO] Booting worker with pid: 9
[2017-03-30 23:38:41 +0000] [11] [INFO] Booting worker with pid: 11
[2017-03-30 23:38:42 +0000] [12] [INFO] Booting worker with pid: 12
[2017-03-30 23:38:42 +0000] [10] [INFO] Booting worker with pid: 10
[2017-03-30 23:40:08 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:8)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:9)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:10)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:11)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:12)
[2017-03-30 23:40:13 +0000] [8] [INFO] Worker exiting (pid: 8)
[2017-03-30 23:40:13 +0000] [10] [INFO] Worker exiting (pid: 10)
[2017-03-30 23:40:16 +0000] [16] [INFO] Booting worker with pid: 16
[2017-03-30 23:40:16 +0000] [17] [INFO] Booting worker with pid: 17
[2017-03-30 23:40:18 +0000] [18] [INFO] Booting worker with pid: 18
[2017-03-30 23:40:20 +0000] [19] [INFO] Booting worker with pid: 19
[2017-03-30 23:40:23 +0000] [20] [INFO] Booting worker with pid: 20
This is likely some screw-up on my part, and wasn't the same behaviour as one of my project counterparts (with a similar but not identical codebase in a near-identical AWS/CloudFormation/ALB/ECS deployment). However I figured I'd report it in case it inspires any new insights into the interactions between ALB, ECS, Docker and gunicorn.
I have a similar setup (convox running ELB->ECS Docker Container->Gunicorn->Django) and I can confirm that the following configuration changes seem to have helped:
keepalive = 75 # needs to be longer than the ELB idle timeout
timeout = 90
worker_class = 'gevent'
I set my ELB idle timeout to 60 seconds.
Prior to changing these settings I was using the defaults for keepalive, timeout, and the sync worker.
Interestingly, this issue started exhibiting at 4am UTC April 10. The app had been running fine without any changes since the previous Thursday.
Just piling on: no Docker, but a bare AWS instance was having the same issue behind an ELB. It works fine on OS X, but on Ubuntu 16 behind an ELB the same issue cropped up. The `gevent` worker class seems to fix it [for me].
Can anyone in this thread check this?
https://serverfault.com/questions/782022/keepalive-setting-for-gunicorn-behind-elb-without-nginx
@erikcw seems to have found this solution to work. I'd like to close this issue. If anyone can reproduce this after raising the Gunicorn worker timeout and keepalive timeout, please say so.
Alternatively, lower the idle timeout for ELB.
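To make the relationship concrete, a minimal gunicorn.conf.py sketch is below. The numbers are illustrative only (they mirror the kind of values reported elsewhere in this thread) and assume the ELB idle timeout has been lowered to 60 seconds:

```python
# gunicorn.conf.py -- illustrative values, not an official recommendation.
# Assumes the ELB idle timeout is 60s, so both values below stay above it.
worker_class = "gevent"  # the worker class most reporters here had success with
workers = 3
keepalive = 75           # > ELB idle timeout, so gunicorn never closes the connection first
timeout = 90             # > ELB idle timeout, so a held-open connection can't trip the worker timeout
```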
Couldn't a client in the wild (or a nefarious client) act like the ELB and cause the same issue?
@six8 yes. ELB doesn't protect you from this either. The documentation discusses this: http://docs.gunicorn.org/en/latest/design.html#choosing-a-worker-type
Some people don't care about this threat and just want to deploy Gunicorn, at the edge or behind an ELB. The issue discussed here is worker timeouts caused by the ELB itself. I would like to provide recommended configuration that makes gunicorn work behind ELB with sync workers and without an additional reverse proxy.
I tried @erikcw's solution as well by changing the gunicorn invocation to:
CMD [ "gunicorn", "-k", "gevent", "--timeout", "10", "--keep-alive", "75", "--paste", "/etc/jzoo/app.ini" ]
Unfortunately that does not help:
[2017-07-04 08:36:23 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:24)
[2017-07-04 08:38:11 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:28)
[2017-07-04 08:39:39 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:29)
[2017-07-04 08:44:14 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:25)
[2017-07-04 08:44:55 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:31)
I have an application running on ECS behind an ALB with Gunicorn sync workers. I've not seen any issue yet.
Can anyone provide a reproducible deployment?
Piling on as well: seeing this issue using Kubernetes on AWS with an ELB. We currently have 8 workers running. Per https://serverfault.com/questions/782022/keepalive-setting-for-gunicorn-behind-elb-without-nginx/ we're going to raise `keepalive`, and per this thread we're going to try the `gevent` workers instead. But right now, what's failing are some performance tests for API calls that shouldn't be blocking or causing timeouts on their own. We would like to know what's going on here with `sync` workers. Even if we get this working per some of the good guidelines in this thread, we'd like a bit more confidence about what's going on before deploying to production.
Setting `--worker-class` to `gevent` worked for me. Thanks for the comments and help!
@wichert this is very strange. With the gevent worker, the workers should notify the arbiter of their liveness on a separate greenlet from any connections. Your keep-alive setting should not affect this at all. Something is either wrong with your application, such that gevent is not working as designed, or something about the setup is problematic that might be common to these other scenarios.
Please, everyone on this thread, provide an application that reproduces this problem. I have not been able to reproduce it. Without steps to reproduce, I am tempted to close. There are a mix of recommendations and maybe very different issues buried within these comments. It is hard to know what to do with this issue.
@tilgovi I experienced this problem when not specifying the worker class to gevent (e.g. using the default sync worker). My setup is virtualenv using pypy3 (PyPy 5.8.0-beta0), falcon 1.2, etc (understanding Falcon is not officially supported with pypy3). I used the cookiecutter from https://github.com/7ideas/cookiecutter-falcon and upgraded to all the latest packages and in trying to load the sample_resource in the browser and hitting it many times with httpie would intermittently see the worker crash error (e.g. the request would be delayed and then I would see the error; if I didn't cancel the request it would eventually respond). This only happened with the default worker class of sync. When I specified it to use gevent, everything worked.
Thank you, @jhillhouse92, I will give it a try.
I'm continuing to see problems with timeouts, @tilgovi. I will be troubleshooting this more tomorrow and will post an update with more precise information. Please do keep this issue open; you're right that there may be a variety of problems going on, but the general theme is that gunicorn workers (maybe just with the sync worker class, tbd) are having timeout issues when using gunicorn with an application deployed using AWS/ELB. There is something wrong here, even if we haven't isolated the exact problem. It's worth digging further. I'll update tomorrow.
Worth noting, I experienced this issue locally not within Docker/AWS/ELB with the above stack I mentioned.
So I'm using gevent, trying to capture my access logs, and trying to deploy my app using AWS/ELB.
dockerfile commands:
FROM cortex/base
CMD ["gunicorn", "-w", "8", "-b", "0.0.0.0:5000", "--capture-output", "--enable-stdio-inheritance", "--access-logfile", "-", "--worker-class", "gevent", "-t", "60", "--log-level", "debug", "app:app"]
I'm getting crazy OOM errors in relation to Error code 14, regardless of what I set the limits to. Has anyone else seen this in the context of this issue? Previously, I was able to see worker timeouts. With my current configs, I can't reproduce the timeouts / get that far.
[2017-07-19 18:27:39 +0000] [34] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 578, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.5/site-packages/gunicorn/workers/ggevent.py", line 190, in init_process
super(GeventWorker, self).init_process()
File "/usr/local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 126, in init_process
self.load_wsgi()
File "/usr/local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 135, in load_wsgi
self.wsgi = self.app.wsgi()
File "/usr/local/lib/python3.5/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/usr/local/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 65, in load
return self.load_wsgiapp()
File "/usr/local/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 52, in load_wsgiapp
return util.import_app(self.app_uri)
File "/usr/local/lib/python3.5/site-packages/gunicorn/util.py", line 376, in import_app
__import__(module)
File "/code/app/__init__.py", line 2, in <module>
from flask_alchy import Alchy
File "/usr/local/lib/python3.5/site-packages/flask_alchy.py", line 8, in <module>
from flask_sqlalchemy import SQLAlchemy
File "/usr/local/lib/python3.5/site-packages/flask_sqlalchemy/__init__.py", line 18, in <module>
import sqlalchemy
File "/usr/local/lib/python3.5/site-packages/sqlalchemy/__init__.py", line 9, in <module>
from .sql import (
File "/usr/local/lib/python3.5/site-packages/sqlalchemy/sql/__init__.py", line 8, in <module>
from .expression import (
File "/usr/local/lib/python3.5/site-packages/sqlalchemy/sql/expression.py", line 34, in <module>
from .functions import func, modifier, FunctionElement, Function
File "/usr/local/lib/python3.5/site-packages/sqlalchemy/sql/functions.py", line 11, in <module>
from . import sqltypes, schema
File "<frozen importlib._bootstrap>", line 969, in _find_and_load
File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 658, in exec_module
File "<frozen importlib._bootstrap_external>", line 763, in get_code
File "<frozen importlib._bootstrap_external>", line 816, in get_data
OSError: [Errno 14] Bad address
To reiterate, I don't see this locally without Docker. I don't see this locally with Docker. This, and the timeout issues, only become apparent in the Docker/AWS/ELB stack (for me).
@tilgovi One thing I initially did wrong (as can be seen in the snippet I posted above) is that I had the timeout wrongly set to 10 seconds. After changing that to 75 seconds the number of worker timeouts went down significantly to perhaps a couple per day. For completeness this is the exact command I am using to start our app:
gunicorn -k gevent --timeout 90 --keep-alive 75 --paste /etc/jzoo/app.ini "$@"
The .ini file looks like this:
[DEFAULT]
sqlalchemy.url = postgresql://XXXX
[app:main]
use = egg:jzoo.api.admin
[server:main]
use = egg:gunicorn#main
bind = 0.0.0.0:5000
workers = 4
preload_app = true
loglevel = warning
# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic,jzoo
# .. more logging configuration
The amount of traffic does not seem to be a factor - I see this happening in our QA environment which only sees very little traffic.
Looking at how often this happens I am wondering if this could be related to app restarts?
I have a somewhat similar problem. After some investigation, I think it's probably a network connection error, not gunicorn's fault.
Here's my setup:
Funnily, most requests from the app server to the postgres server were OK, except one. That query seems to run forever, which times out the gunicorn worker.
The problem is gone if I use the app locally.
The problem is also gone if I don't dockerize the app server.
My temporary fix is to open all connections between the app server and the postgres server through the AWS security group settings. Now everything works properly again.
My guess is that there's a connection problem between the Python container and AWS. I'll update after examining the AWS log stream thoroughly.
Thank you!! Changing the ELB and instance protocol from HTTP to TCP worked just fine 👍
The problem doesn't seem specific to AWS, but to Docker itself when combined with Gunicorn. I see this with `gevent` as well, at least in our production environment.
@tilgovi Here's a simple app to recreate this error (every now and then, haven't found the pattern, perhaps you might):
app.py:
import falcon

class HelloResource:
    def on_get(self, req, resp):
        resp.media = {'info': 'hello world'}

app = falcon.API()
app.add_route('/', HelloResource())
Dockerfile:
FROM python:3.6
WORKDIR /code
RUN pip install gunicorn falcon
COPY app.py .
CMD ["gunicorn", "app:app", "--bind", "0.0.0.0:8000"]
To run: docker build -t gunicorn-and-docker . && docker run -p 8000:8000 -it --rm gunicorn-and-docker
The logs will look like this (at least for me on macOS Sierra using Docker 17.06.2-ce-mac27 (19124)):
[2017-09-23 22:31:00 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2017-09-23 22:31:00 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2017-09-23 22:31:00 +0000] [1] [INFO] Using worker: sync
[2017-09-23 22:31:00 +0000] [9] [INFO] Booting worker with pid: 9
[2017-09-23 23:22:45 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
[2017-09-23 23:22:45 +0000] [9] [INFO] Worker exiting (pid: 9)
[2017-09-23 23:22:45 +0000] [11] [INFO] Booting worker with pid: 11
[2017-09-23 23:39:46 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:11)
[2017-09-23 23:39:46 +0000] [11] [INFO] Worker exiting (pid: 11)
[2017-09-23 23:39:46 +0000] [13] [INFO] Booting worker with pid: 13
[2017-09-23 23:41:10 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:13)
[2017-09-23 23:41:10 +0000] [13] [INFO] Worker exiting (pid: 13)
[2017-09-23 23:41:10 +0000] [15] [INFO] Booting worker with pid: 15
[2017-09-23 23:42:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:15)
[2017-09-23 23:42:27 +0000] [15] [INFO] Worker exiting (pid: 15)
[2017-09-23 23:42:27 +0000] [17] [INFO] Booting worker with pid: 17
[2017-09-23 23:43:44 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:17)
[2017-09-23 23:43:44 +0000] [17] [INFO] Worker exiting (pid: 17)
[2017-09-23 23:43:44 +0000] [19] [INFO] Booting worker with pid: 19
[2017-09-24 18:37:12 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:19)
[2017-09-24 18:37:12 +0000] [19] [INFO] Worker exiting (pid: 19)
[2017-09-24 18:37:12 +0000] [21] [INFO] Booting worker with pid: 21
[2017-09-24 19:49:20 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:21)
[2017-09-24 19:49:20 +0000] [21] [INFO] Worker exiting (pid: 21)
[2017-09-24 19:49:20 +0000] [23] [INFO] Booting worker with pid: 23
[2017-09-24 19:50:37 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:23)
[2017-09-24 19:50:37 +0000] [23] [INFO] Worker exiting (pid: 23)
[2017-09-24 19:50:37 +0000] [25] [INFO] Booting worker with pid: 25
[2017-09-24 20:25:48 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:25)
[2017-09-24 20:25:48 +0000] [25] [INFO] Worker exiting (pid: 25)
[2017-09-24 20:25:48 +0000] [27] [INFO] Booting worker with pid: 27
Hi,
I've recently bumped into the same problem.
In my case (ECS + gunicorn 19.7.1 + Django 1.11.5) I (apparently) managed to solve it by simply increasing the amount of memory dedicated to the container in the task definition (from 128 to 256mb).
No more errors so far. I'll report here if it happens again.
Adding to the chorus here a bit, but I was able to solve this only by switching to gevent.
I tried to increase timeout to 90 and keepalive to 75 with the sync worker with no luck.
Also, interestingly, my ELB in this scenario is only doing TCP level forwarding and not HTTP.
Okay, I found some references that explain this. [1]
It could be solved with small changes on the Gunicorn side. If, in the sync worker, after accepting a new connection, we called `select()` with a short timeout, then we could handle these connections. It could get a little complicated, especially with multiple listening sockets. The simplest thing would be to have a timeout to first byte. I'm not sure how that would interact with SSL, either.
[1] https://cloudavail.com/2013/12/21/aws-elb-pre-open-connection-expose-part-1/
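As a rough illustration of that "timeout to first byte" idea (this is not gunicorn's actual code; the name `wait_for_first_byte` and the 5-second value are made up for the sketch):

```python
import select
import socket

FIRST_BYTE_TIMEOUT = 5.0  # illustrative value, would need tuning


def wait_for_first_byte(client: socket.socket) -> bool:
    """After accept(), wait briefly for request data before parsing.

    Returns False for connections (such as an ELB pre-open) that have not
    sent anything yet, so the worker is not tied up on an idle socket.
    """
    readable, _, _ = select.select([client], [], [], FIRST_BYTE_TIMEOUT)
    if not readable:
        client.close()
        return False
    return True
```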
If you are experiencing this and do not want to use gevent/eventlet/asyncio, I recommend switching to the threaded worker (provide the `--threads` setting).
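For reference, a minimal sketch of that threaded-worker setup as a gunicorn.conf.py (the worker and thread counts are illustrative, not values from this comment):

```python
# gunicorn.conf.py -- threaded worker sketch; the numbers are illustrative.
worker_class = "gthread"  # the threaded worker; also selected implicitly when threads > 1
workers = 2
threads = 4               # same effect as passing --threads 4 on the command line
timeout = 30
```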
@tilgovi The issue seems to exist with the gevent worker as well, if you look at my example above.
@jmagnusson your example (from this comment: https://github.com/benoitc/gunicorn/issues/1194#issuecomment-331740187) seems to be using the sync worker.
[2017-09-23 22:31:00 +0000] [1] [INFO] Using worker: sync
@tilgovi True, but I've experienced this with the `gevent` worker as well, in production. I'll try to find some time to reproduce it with gevent.
This may just be me (being silly), but in the EC2 security group inbound rules I had a custom TCP rule for port 5000 (gunicorn here) allowing only a single IP (my IP: XX.XX.XX.XX/32); changing this to allow all IPs helped.
I am running into this trouble, too.
[2018-01-02 16:38:03 +0800] [24355] [INFO] Starting gunicorn 19.7.1
[2018-01-02 16:38:03 +0800] [24355] [DEBUG] Arbiter booted
[2018-01-02 16:38:03 +0800] [24355] [INFO] Listening at: http://0.0.0.0:8080 (24355)
[2018-01-02 16:38:03 +0800] [24355] [INFO] Using worker: gevent
[2018-01-02 16:38:03 +0800] [24358] [INFO] Booting worker with pid: 24358
[2018-01-02 16:38:03 +0800] [24355] [DEBUG] 1 workers
[2018-01-02 16:38:03 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:04 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:05 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:06 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:07 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:08 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:09 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:10 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:10 +0800] [24358] [DEBUG] GET /v1/bj2/CC/uuid
[2018-01-02 16:38:10 +0800] [24358] [DEBUG] Closing connection.
[2018-01-02 16:38:41 +0800] [24355] [CRITICAL] WORKER TIMEOUT (pid:24358)
[2018-01-02 16:38:41 +0800] [24358] [INFO] Worker exiting (pid: 24358)
[2018-01-02 16:38:41 +0800] [24381] [INFO] Booting worker with pid: 24381
[2018-01-02 16:38:41 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:42 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:43 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:44 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:45 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:46 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:47 +0800] [24381] [INFO] worker pid 24381 notify
......
[2018-01-02 16:48:20 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:48:51 +0800] [24355] [CRITICAL] WORKER TIMEOUT (pid:24381)
[2018-01-02 16:48:51 +0800] [24381] [INFO] Worker exiting (pid: 24381)
[2018-01-02 16:48:51 +0800] [24703] [INFO] Booting worker with pid: 24703
[2018-01-02 16:48:51 +0800] [24703] [INFO] worker pid 24703 notify
[2018-01-02 16:48:51 +0800] [24703] [DEBUG] GET /v1/bj2/CC/uuid
[2018-01-02 16:48:51 +0800] [24703] [DEBUG] Closing connection.
I added the debug log `worker pid {WORKER_PID} notify`.
According to the debug log, the master killed the timed-out worker with `SIGTERM`; when the graceful timeout was reached, the master would kill it with `SIGKILL` coercively.

Python: 3.6.3
Gevent: 1.2.2
Gunicorn: 19.7.1
RUN CMD: `gunicorn --worker-class gevent --log-level debug --bind 0.0.0.0:8080 app`

NOTICE: When I use `eventlet` as the worker, it's OK. I guess it's a problem with gevent or the gevent worker.
I'm going to close this issue since I think it's sort of out of date. As best I can determine, the observed behavior is due to AWS Classic Load Balancers doing a pre-open optimization. This optimization is only done by Classic Load Balancers, according to the documentation on how load balancing works. Switching to Application Load Balancers should resolve the issue. Additionally, as documented in this blog post, it's possible to request AWS to disable the pre-open optimization or to increase the worker timeout for Gunicorn.
@tilgovi FWIW I saw this with ALB. I've never used classic load balancers.
@tilgovi I too observed this repeatedly in ALB-only environments, never having used classic. Even with worker timeout adjustments, we continued to see this.
@tilgovi I met with this. I've never used AWS.
Someone has to help reproduce this, then, and try to debug it. I cannot reproduce it with ALB and I cannot see how to reproduce it outside AWS.
@xgfone you may want to open your own issue, the title of this issue is "docker + AWS"
Did anyone experience this with the AWS load balancers but without Docker? I'm a little confused, I have to admit :)
My impression was that the issue was really more related to Docker - at least I had exactly the same issues running multiple gunicorn sync workers inside Docker on my local machine.
This config worked fine without Docker on Ubuntu and RHEL machines, both bare metal and virtualized in Digital Ocean:
workers = 2
worker_class = 'sync'
worker_connections = 1000
timeout = 30
keepalive = 2
Inside Docker (also RHEL, everything else the same), the workers always ran into timeout and kept restarting permanently. Taking the recommendations from this thread, I've switched to this, which works perfectly fine:
workers = 1
threads = 8
worker_class = 'sync'
worker_connections = 1000
timeout = 30
keepalive = 2
If you think this is unrelated to this thread, I'm happy to remove it from here and put it someplace else ;-)
Best, Boris
@tilgovi
This error occurs in `arbiter.py`, in `murder_workers`. The `murder_workers` function is invoked when the `SIG_QUEUE` is empty (e.g. while waiting for signals).
It would then check every second to see if the timeout was reached. The time remaining to live was always 1 sec or 0 secs, because the gevent worker, in a controlled infinite loop (`while self.alive`), invoked `self.notify`, which updates a file. The updates to that file are how gunicorn checks when the last active time occurred.
Thus, the only way the time-to-live check in the `murder_workers` function fails (i.e. says the worker has been idle too long) is if the gevent `alive` property was changed. The following methods control this:
- `handle_quit`
- `handle_request` (if max requests were received; by default this is 2^64 requests, or 2^32 on a 32-bit OS)
- `changed` (invoked from the reloader if a file was changed)
- `handle_exit`
- `handle_abort`
These methods get invoked from signals:
- `handle_quit` - SIGQUIT, SIGINT
- `handle_exit` - SIGTERM
- `handle_abort` - SIGABRT
Thus, there are 2 possible problems that I see:
1. The `alive` property is being changed via those methods when it shouldn't be
2. The signals that invoke them are being sent when they shouldn't be
I think the more probable cause is the 2nd one. The main source of signals is within the arbiter file, triggered by these exceptions:
- `StopIteration`
- `KeyboardInterrupt`
- `HaltServer`
- `Exception` (unhandled exception)
I think the `StopIteration` exception is the most likely culprit, triggered in the async.py file at `req = six.next(parser)`. It appears that this `next` method is just invoking `parser.next()`.
`Parser.next()` raises a `StopIteration` exception under the following 2 conditions:
1. If the message indicates it should close:
   1. HTTP Connection close (and keep-alive false)
   2. WSGI - request method not HEAD, response_length is none, and status code is > 200 (but not 204 or 304)
2. If `self.mesg` (request message) is falsey
I'm not sure what exactly could be causing that unintentionally, but out of 62 comments with this issue being seen across multiple apps and environments, I think this is worth investigating further. I recommend re-opening it or, if we don't like the title, creating a new global issue for workers critically timing out.
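To make the mechanism described above easier to follow, here is a heavily simplified paraphrase of the heartbeat: the worker "notifies" by touching a temp file, and the arbiter compares that file's ctime against the timeout. This is only a sketch, not gunicorn's actual code:

```python
import os
import tempfile
import time


class WorkerTmp:
    """Simplified stand-in for gunicorn's per-worker temp-file heartbeat."""

    def __init__(self):
        self._fd, self.path = tempfile.mkstemp(prefix="wgunicorn-")
        self._spin = 0

    def notify(self):
        # Flipping the file mode updates its ctime, which the arbiter reads back.
        self._spin = (self._spin + 1) % 2
        os.fchmod(self._fd, self._spin)

    def last_update(self):
        return os.fstat(self._fd).st_ctime


def murder_workers(workers, timeout=30):
    """Simplified arbiter check: flag workers whose heartbeat file is stale."""
    for pid, tmp in workers.items():
        if time.time() - tmp.last_update() > timeout:
            print("CRITICAL: WORKER TIMEOUT (pid:%s)" % pid)
```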
I'll re-open this and try to test these combinations:
If anyone wants to create a repo that can reproduce this, that would be helpful, but I can probably do it this weekend.
I'm having this problem and I'm not using AWS.
My setup is NGINX -> Docker container (Gunicorn with gevent workers + Django). I'm using the default Gunicorn timeouts.
Any tips?
Thanks.
Hi @davidmir,
I had the same problem, changing the worker type did not help. Switching to a single sync worker with multiple threads did the trick for me:
workers = 1
threads = 8
worker_class = 'sync'
worker_connections = 1000
timeout = 30
keepalive = 2
Best, Boris
While it's mentioned some above, to clarify from my own debugging of this issue, some of these are out of memory errors; especially if it's resolved by changing the worker count down to 1.
This is especially problematic on Docker if you have something like this in your config:
workers = multiprocessing.cpu_count() * 2 + 1
But are limiting the memory available to the container; the synchronous worker will start up multiple potentially heavyweight python interpreters. Note that CPU count is based on the overall CPU count of the machine, not what you've provisioned the docker container to use; if you're running a limited container you should base the worker count on that instead.
While it varies from application to application, my rule of thumb is 256 megs per worker in a lean configuration and 512 to be safe.
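One hedged way to avoid that trap is to size the worker count from the container's CPU quota rather than the host's CPU count. The sketch below assumes cgroup v1 paths and falls back to multiprocessing.cpu_count() when no quota is set:

```python
# gunicorn.conf.py -- sketch of sizing workers from the container's CPU limit.
# Assumes cgroup v1; cgroup v2 exposes the quota via /sys/fs/cgroup/cpu.max instead.
import multiprocessing


def container_cpus():
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0 and period > 0:
            return max(1, quota // period)
    except OSError:
        pass
    return multiprocessing.cpu_count()


workers = container_cpus() * 2 + 1
```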
Faced the same issue using a Classic ELB (HTTPS => HTTP) with Gunicorn sync workers.
As others mentioned, the easiest fix was to use the `-k gevent` argument instead of the default sync worker.
We saw this same issue crop up again for an entirely new Docker container (hosted in ECS behind ALB), and found an entirely different (but in hindsight, should've-been-obvious) solution: allocate more memory to the container in CloudFormation template.
We suffered from this issue last year and never quite found a smoking gun root cause, but tonight I spotted it on a new container, and happened to notice that the container was eating up all the memory we'd allocated. As soon as I doubled the allocation, the containers stopped throwing this error (even before I allocated more than it would ultimately consume):
https://github.com/hackoregon/civic-devops/issues/157
Facing a similar issue here. The code is as simple as:

    while True:
        sleep(10)

and it can lead to the worker dying (without this code our deploy is normal).
It is very puzzling, because `sleep` should send a heartbeat already.
I have also already changed to gevent; it did not help. Does anyone have any thoughts?
sample output:
[2018-06-16 16:00:16 +0000] [136] [CRITICAL] WORKER TIMEOUT (pid:521)
[2018-06-16 16:00:16 +0000] [136] [CRITICAL] WORKER TIMEOUT (pid:522)
[2018-06-16 16:00:16 +0000] [136] [CRITICAL] WORKER TIMEOUT (pid:523)
[2018-06-16 16:00:16 +0000] [136] [CRITICAL] WORKER TIMEOUT (pid:524)
[2018-06-16 16:00:16 +0000] [136] [CRITICAL] WORKER TIMEOUT (pid:525)
[2018-06-16 16:00:16 +0000] [524] [INFO] Worker exiting (pid: 524)
[2018-06-16 16:00:16 +0000] [521] [INFO] Worker exiting (pid: 521)
[2018-06-16 16:00:16 +0000] [523] [INFO] Worker exiting (pid: 523)
[2018-06-16 16:00:16 +0000] [522] [INFO] Worker exiting (pid: 522)
[2018-06-16 16:00:16 +0000] [525] [INFO] Worker exiting (pid: 525)
[2018-06-16 16:00:17 +0000] [531] [INFO] Booting worker with pid: 531
/usr/local/lib/python3.6/site-packages/gunicorn/workers/ggevent.py:65: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016
monkey.patch_all(subprocess=True)
[2018-06-16 16:00:17 +0000] [136] [DEBUG] 1 workers
[2018-06-16 16:00:17 +0000] [532] [INFO] Booting worker with pid: 532
[2018-06-16 16:00:17 +0000] [533] [INFO] Booting worker with pid: 533
/usr/local/lib/python3.6/site-packages/gunicorn/workers/ggevent.py:65: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016
monkey.patch_all(subprocess=True)
[2018-06-16 16:00:17 +0000] [534] [INFO] Booting worker with pid: 534
/usr/local/lib/python3.6/site-packages/gunicorn/workers/ggevent.py:65: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016
monkey.patch_all(subprocess=True)
[2018-06-16 16:00:17 +0000] [535] [INFO] Booting worker with pid: 535
[2018-06-16 16:00:17 +0000] [136] [DEBUG] 5 workers
Sorry to muddy the waters here, but just wanted to share my experience. My stack is HTTP ELB > Nginx > gunicorn/gevent on EC2, with processes run using `circusd`.
The problem I found was that after instrumenting my Django app with elastic-apm, my HTTP healthchecks would start failing almost immediately and gunicorn would become unresponsive after a few seconds.
I tried a few local `wrk` benchmarks against my healthcheck: sync was getting about 300 rps for 1-5 workers, but gevent would get 1 rps. Turning off APM brought gevent up to about 50 rps, with a negligible increase for sync.
Didn't notice any strange memory usage in valgrind, typically about 4 MB for gunicorn and 40 MB per worker.
Anyway, updating gunicorn 19.6.0 > 19.9.0 and gevent 1.2.2 > 1.3.6 is giving me a rock-steady ELB healthcheck again and ~300 rps locally. I had deliberately not been updating gunicorn/gevent until I had some APM logging in place; it seems that doing so has prompted some serendipity. Updating gevent gave the greatest boost, understandably, but then updating gunicorn seemed to add another 10% or so.
Haven't looked at the changelogs in much detail yet; perhaps the maintainers can shed light on the most likely cause?
Anyway, updating gunicorn 19.6.0 > 19.9.0 and gevent 1.2.2 > 1.3.6 is giving me a rock-steady ELB healthcheck again and ~300 rps locally. ... Updating gevent gave the greatest boost
Wow, that's pretty cool!
Closing issue; no activity in a while.
FWIW, I'm still having this issue @benoitc :-\
Django 2.1.5
gunicorn 19.8.1
gevent 1.2.2
AKS 1.15.7 with 5 Nodes (standard B2's - 2 vcores and 2GB memory)
3 pods/docker containers, each of which has:
spec:
  containers:
    - name: web-container
      command: ["newrelic-admin"]
      args: [
        "run-program",
        "gunicorn",
        "--limit-request-field_size=16380", # limit also specified in nginx
        "--workers=3", # (2 x vcpu) + 1. vcpu < limit defined in resources
        "-k=gevent",
        "-b=0.0.0.0:8000",
        "ourapp.wsgi",
        "--timeout 60",
        "--log-level=debug",
      ]
      resources:
        requests:
          memory: "200Mi"
          cpu: "250m"
        limits:
          memory: "3000Mi"
          cpu: "2000m"
Same issue here with EKS, Nginx, and an NLB.
@salmicrosoft @sinkr OK, but please open a new issue and provide any logs you have that could help us figure out what happened. Also, what does your app do? Do you have any long-running requests?
I had the same issue on ECS Fargate, but it was related to incorrect keepalive/timeout settings in Docker/gunicorn.conf.py; it was:
keepalive (not set)
timeout = 2
Before fixing those, it worked well with a container running locally but failed on ECS Fargate; now it works correctly in both environments.