I am seeing a weird behavior where Gunicorn does not shutdown immediately when sent a SIGTERM. I see this with Python 3.5 on Ubuntu 16.04. The following environments (that I have access to) _do not_ exhibit this issue:
Refer to https://gist.github.com/suriya/1d9a11ea9ad33a5b1df0b2ba7ed0cd0e on how to reproduce this issue.
After debugging a while, I found that the Gunicorn arbiter properly signals the child worker. The child workers handle the signal properly as well (on all versions). However, something prevents the loop in run_for_one
function https://github.com/benoitc/gunicorn/blob/9d158bec94c91c62cd5cc28a27268a5b8804d795/gunicorn/workers/sync.py#L58 from exiting right away with Python 3.5.
Could Python 3.5's change in the behavior of select.select()
https://docs.python.org/3/library/select.html#select.select explain this issue.
Do you have open connections when this happens? If so, that is expected. SIGTERM is a graceful shutdown and subject to the --graceful-timeout
setting (default 30s).
At present, at least on the master branch, I observe a worker exiting within 1s of sending SIGTERM to the arbiter, tested on Python 3.5.
There is no open connection. As shown in the gist, All that the script does is start gunicorn and send a SIGTERM.
Happens to me with the master branch as well. This is with Python 3.5 on Ubuntu 16.04. Maybe, this happens because of Ubuntu.
$ python --version
Python 3.5.1+
$ gunicorn --version
gunicorn (version 19.5.0)
$ time ./run-and-stop-gunicorn.sh
[2016-05-10 01:32:59 +0000] [8537] [INFO] Starting gunicorn 19.5.0
[2016-05-10 01:32:59 +0000] [8537] [INFO] Listening at: http://127.0.0.1:8000 (8537)
[2016-05-10 01:32:59 +0000] [8537] [INFO] Using worker: sync
[2016-05-10 01:32:59 +0000] [8541] [INFO] Booting worker with pid: 8541
[2016-05-10 01:33:00 +0000] [8537] [INFO] Handling signal: term
[2016-05-10 01:33:30 +0000] [8537] [INFO] Shutting down: Master
real 0m31.102s
user 0m0.320s
sys 0m0.012s
The same behavior happens with Python 3.5, Gunicorn (master, commit 9d158bec94c91c62cd5cc28a27268a5b8804d795), Centos 7.
$ time ./run-and-stop-gunicorn.sh
CentOS Linux release 7.2.1511 (Core)
Python 3.5.1
[2016-05-10 04:25:12 +0000] [1448] [INFO] Starting gunicorn 19.5.0
[2016-05-10 04:25:12 +0000] [1448] [INFO] Listening at: http://127.0.0.1:8000 (1448)
[2016-05-10 04:25:12 +0000] [1448] [INFO] Using worker: sync
[2016-05-10 04:25:12 +0000] [1452] [INFO] Booting worker with pid: 1452
[2016-05-10 04:25:13 +0000] [1448] [INFO] Handling signal: term
[2016-05-10 04:25:43 +0000] [1448] [INFO] Shutting down: Master
real 0m31.109s
user 0m0.231s
sys 0m0.053s
Python 3.4 does not have this issue.
$ time ./run-and-stop-gunicorn.sh
CentOS Linux release 7.2.1511 (Core)
Python 3.4.4
[2016-05-10 04:24:09 +0000] [1386] [INFO] Starting gunicorn 19.5.0
[2016-05-10 04:24:09 +0000] [1386] [INFO] Listening at: http://127.0.0.1:8000 (1386)
[2016-05-10 04:24:09 +0000] [1386] [INFO] Using worker: sync
[2016-05-10 04:24:09 +0000] [1390] [INFO] Booting worker with pid: 1390
[2016-05-10 04:24:09 +0000] [1386] [INFO] Handling signal: term
[2016-05-10 04:24:09 +0000] [1390] [INFO] Worker exiting (pid: 1390)
[2016-05-10 04:24:09 +0000] [1386] [INFO] Shutting down: Master
real 0m1.103s
user 0m0.240s
sys 0m0.058s
You're right. I do observe this on Python 3. Perhaps I made a mistake earlier when I was testing.
I also agree with the diagnosis.
I don't believe this affects any of the async workers or the threaded worker.
PEP 475 https://www.python.org/dev/peps/pep-0475/#backward-compatibility has some information how this should be addressed.
Applications relying on the fact that system calls are interrupted with InterruptedError will hang. The authors of this PEP don't think that such applications exist, since they would be exposed to other issues such as race conditions (there is an opportunity for deadlock if the signal comes before the system call). Besides, such code would be non-portable.
In any case, those applications must be fixed to handle signals differently, to have a reliable behaviour on all platforms and all Python versions. A possible strategy is to set up a signal handler raising a well-defined exception, or use a wakeup file descriptor.
For applications using event loops, signal.set_wakeup_fd() is the recommanded option to handle signals. Python's low-level signal handler will write signal numbers into the file descriptor and the event loop will be awaken to read them. The event loop can handle those signals without the restriction of signal handlers (for example, the loop can be woken up in any thread, not just the main thread).
Just tested it and was able to reproduce it. At least for me gunicorn exit is delayed. Removing the --timeout
arguments makes the problem gone.
After debugging a little, it appears that after receiving the signal, the select is indeed blocking there:
https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/sync.py#L35
I am not sure yet how to solve the issue, I will reread the PEP 475 carefully this morning to see if it can give a solution, thanks for the link!
@suriya @tilgovi please test #1257 and let me know :)
The test script run-and-stop-gunicorn.sh behaves identically with Python 3.5 and Python 3.4. The pull request addresses the SIGTERM issue.
@benoitc Are there other parts of the gunicorn that break because of PEP 475?
@suriya not to my knowledge, I tested with other workers and it was fine.
@suriya so the patch is fixing your issue right?
@benoitc Yes, absolutely fixing the issue. Thank you.
I was just wondering if there would be other places that might need to be fixed since the effects of PEP 475 seem to be pervasive.
@suriya normally not, but will keep an eye on it :-)
For those who arrive here and wonder in what version in gunicorn this was fixed, it's 19.5.0
Most helpful comment
For those who arrive here and wonder in what version in gunicorn this was fixed, it's 19.5.0