(Sorry for the monologue here: simple things got complicated and I ended up digging through the stack. Hopefully what I've documented is helpful for the reader, however.)
As I've understood, by default:
- After 30 seconds (configurable with timeout) of request processing, the gunicorn master process sends SIGTERM to the worker process, to initiate a graceful restart.
- After a further 30 seconds (configurable with graceful_timeout), the master process sends SIGKILL. Seems like this signal is also sent when the worker _does_ gracefully shut down during the graceful_timeout period (https://github.com/benoitc/gunicorn/commit/d1a09732256fa8db900a1fe75a71466cf2645ef9).

The questions:
- The SIGTERM signal - in practice, what happens during request processing? Does it just set a flag for the WSGI application (on the werkzeug level) that it should shut down once request processing is complete? Or does SIGTERM already somehow affect ongoing request processing - kill IO connections or something to speed up request processing...?
- On SIGKILL, I guess request processing is just forcefully aborted.
I could file a tiny PR to improve docs about this, if I get understanding how things actually work.
Hmm, I think https://github.com/benoitc/gunicorn/issues/1236#issuecomment-254059927 confirms my assumption that SIGTERM simply sets the worker to shut down after request processing is complete (and stops it from accepting any new connections).
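If that's the case, I imagine the worker loop is conceptually something like this (a toy sketch with made-up names, not gunicorn's actual worker code):

```python
import signal
import time

class ToyWorker:
    """Hypothetical worker, only to illustrate flag-based graceful shutdown."""

    def __init__(self):
        self.alive = True

    def handle_term(self, signum, frame):
        # The handler only flips a flag: the request currently being
        # processed is not interrupted in any way.
        self.alive = False

    def handle_one_request(self):
        time.sleep(1)  # stand-in for accept() + calling the WSGI app

    def run(self):
        signal.signal(signal.SIGTERM, self.handle_term)
        while self.alive:
            # After SIGTERM the loop condition turns false, so no new
            # connections are accepted and the process exits cleanly.
            self.handle_one_request()

if __name__ == "__main__":
    ToyWorker().run()
```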
Seems like my interpretation of timeout and graceful_timeout was wrong. Both periods are actually measured from the start of request processing. So, by default, because both settings are set to 30 seconds, there is effectively no graceful restart. If I run with --graceful-timeout 15 --timeout 30, that should mean a graceful restart is initiated at 15 seconds and the worker is force-killed at 30 seconds if the request did not complete before that.

However, it seems that if the response is returned between graceful_timeout and timeout, the worker is not restarted after all? Shouldn't it be?
I tested with this app.py:

```python
import time

from flask import Flask

app = Flask(__name__)

@app.route('/foo')
def foo():
    time.sleep(3)
    return 'ok'
```
Then:
```
12:51 $ gunicorn app:app --timeout 5 --graceful-timeout 1
[2017-04-03 12:51:37 +0300] [356] [INFO] Starting gunicorn 19.6.0
[2017-04-03 12:51:37 +0300] [356] [INFO] Listening at: http://127.0.0.1:8000 (356)
[2017-04-03 12:51:37 +0300] [356] [INFO] Using worker: sync
[2017-04-03 12:51:37 +0300] [359] [INFO] Booting worker with pid: 359
```
Then I send curl localhost:8000/foo, which returns after 3 seconds. But nothing happens in gunicorn - I see no trace of a graceful restart being initiated or having happened?
It seems that on timeout, SystemExit(1,) is raised, aborting the current request processing in Flask. What code or signal generates it, I can't say. The exception propagates through the Flask stack, and any teardown_request handlers see it. There's enough time to log something, but if you do time.sleep(1) or something else time-consuming in the handler, it gets silently killed. It's as if there were 100-200 ms before the process is actually forcefully terminated, and I am wondering what this delay is. It's not the graceful timeout - that setting has no impact on the delay. I would expect the process to just be forcefully killed in place, instead of seeing SystemExit thrown through the stack and then the process potentially being killed mid-air anyway.
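To illustrate what I tested (a minimal hypothetical repro; the route and handler names are mine, not from gunicorn or Flask docs):

```python
import logging
import time

from flask import Flask

app = Flask(__name__)
log = logging.getLogger(__name__)

@app.route('/slow')
def slow():
    time.sleep(60)  # long enough to trip gunicorn's worker timeout
    return 'ok'

@app.teardown_request
def log_teardown(exc):
    # When the worker times out, this appears to be called with the
    # SystemExit that aborted the request, and a quick log line makes
    # it out...
    log.error('teardown_request called with %r', exc)
    # ...but anything slow here (e.g. time.sleep(1)) seems to be cut
    # short, because the process is killed a moment later.
```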
In fact, I don't see graceful_timeout doing anything - maybe it's not supported for sync workers, or maybe it doesn't work "stand-alone" (or together with timeout) - only when you manually send SIGTERM?
Also, what might be weird is that https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L392 does not check the graceful flag at all. I guess https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L390 ensures that self.WORKERS is empty, so the graceful timeout is not waited on when doing a non-graceful stop.
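For my own understanding, the stop sequence seems to boil down to roughly the following. This is a conceptual sketch only - stop_workers is a made-up helper, not how arbiter.py is actually written:

```python
import errno
import os
import signal
import time

def stop_workers(pids, graceful=True, graceful_timeout=30):
    """Conceptual sketch of a master stopping its workers."""
    # A graceful stop asks workers to finish their current request (TERM);
    # a non-graceful stop asks for a quick shutdown (QUIT).
    sig = signal.SIGTERM if graceful else signal.SIGQUIT
    for pid in pids:
        os.kill(pid, sig)

    # Wait up to graceful_timeout for the children to exit on their own.
    alive = set(pids)
    deadline = time.time() + graceful_timeout
    while alive and time.time() < deadline:
        time.sleep(0.1)
        for pid in list(alive):
            exited_pid, _status = os.waitpid(pid, os.WNOHANG)
            if exited_pid:
                alive.discard(pid)

    # Anything still running is killed. Sending KILL unconditionally is
    # harmless for workers that already exited (they are simply gone),
    # which is how I read the commit linked earlier.
    for pid in alive:
        try:
            os.kill(pid, signal.SIGKILL)
        except OSError as exc:
            if exc.errno != errno.ESRCH:
                raise
```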
@benoitc @tilgovi Care to give a hand here? Hopefully my write-up above makes sense...
@tuco86 The graceful timeout is only available when you quit the arbiter, upgrade it (USR2), send an HUP signal to the arbiter, or send a QUIT signal to the worker. I.e. it's only used when the action is a normal shutdown or restart.
The timeout is here to prevent busy workers from blocking other requests. If a worker doesn't notify the arbiter within the timeout, the worker is simply exited and the connection with the client is closed.
Uh, alright. Does timeout have any effect when you:

> quit the arbiter, upgrade it (USR2), send an HUP signal to the arbiter or send a QUIT signal to the worker

I mean, what if the worker doesn't shut down within graceful_timeout - will timeout kick in after that and the worker get forcefully killed, or is it left up to the user to send SIGQUIT in case workers don't die gracefully?
> QUIT signal to the worker

I assume you meant TERM here (as QUIT is documented as _quick shutdown_ for both master and workers)?
If the worker doesn't shut down during the graceful time, it will be killed without any further delay.
Of course. Thanks for clearing things up!
@benoitc Asking in the context of this old ticket - what does the last sentence of the timeout documentation actually mean?
> Generally set to thirty seconds. Only set this noticeably higher if you're sure of the repercussions for sync workers. For the non sync workers it just means that the worker process is still communicating and is not tied to the length of time required to handle a single request.
Not being a native English speaker, I have a hard time understanding this. Does it mean that timeout is not supported for non-sync workers? Because that's what I seem to be witnessing: I'm using gthread workers and timeout is not kicking in and killing too-slow requests.
@tuukkamustonen --timeout is not meant as a request timeout. It's meant as a liveness check for workers. For sync workers, this functions as a request timeout because the worker cannot do anything other than process the request. The asynchronous workers heartbeat even while they are handling long running requests, so unless the worker blocks/freezes it won't be killed.
Maybe it would be a good idea for us to change the name if other people find this confusing.
@tilgovi timeout is just fine, although something like worker_timeout might be more descriptive. I just got confused initially because timeout and graceful_timeout are declared next to each other in the documentation, so my brain assumed they are tightly connected, while actually they are not.
> For sync workers, this functions as a request timeout because the worker cannot do anything other than process the request. The asynchronous workers heartbeat even while they are handling long running requests, so unless the worker blocks/freezes it won't be killed.
Would you have an example of when timeout kicks in with non-sync workers? Is it something that should never happen, really - maybe only if there's a bug that causes the worker to block/freeze?
That's correct. An asynchronous worker that relies on an event loop core might perform a CPU intensive procedure that doesn't yield within the timeout.
Not only a bug, in other words. Although, sometimes it may indicate a bug, such as a call to a blocking I/O function when an asyncio protocol would be more appropriate.
Getting stuck in a CPU-intensive task is a good example, thanks.
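Just to check my understanding with a hypothetical example: with an event-loop worker (gevent/eventlet), a handler like this never yields, so the worker presumably can't heartbeat and would eventually hit WORKER TIMEOUT even though nothing is really "stuck":

```python
from flask import Flask

app = Flask(__name__)

@app.route('/crunch')
def crunch():
    # Pure-Python number crunching: nothing here yields control back to
    # the event loop, so an async worker cannot send its heartbeat while
    # this runs and may be killed once --timeout is exceeded.
    total = 0
    for i in range(100 * 1000 * 1000):
        total += i * i
    return str(total)
```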
Calling blocking I/O in async code is one as well, but I'm not sure how it applies to this context - I'm running a traditional Flask app with blocking code, but running it with an async worker (gthread) without any sort of monkey patching. And it works just fine. I know this isn't really in the context of this ticket anymore, but doesn't mixing and matching async/sync code like this cause problems?
Also, what is the heartbeat interval? What would be a sane value to use for timeout with non-sync workers?
The gthread worker is not asynchronous, but it does have a main thread for the heartbeat, so it won't time out either. In the case of that worker, you probably won't see a timeout unless the worker is very overloaded or, more likely, you call a C extension module that doesn't release the GIL.
You probably don't have to change the timeout unless you start seeing worker timeouts.
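Conceptually the liveness check amounts to something like this (a simplified sketch, not the actual implementation - if I recall correctly the real mechanism touches a temporary file per worker):

```python
import os
import time

# Worker side: "notify" by touching a per-worker heartbeat file from the
# main loop (or, for gthread, from the main thread).
def notify(heartbeat_path):
    os.utime(heartbeat_path, None)

# Arbiter side: a worker whose file has not been touched within `timeout`
# seconds is considered blocked and gets killed and restarted.
def worker_is_stale(heartbeat_path, timeout):
    return time.time() - os.path.getmtime(heartbeat_path) > timeout
```

The exact notification interval isn't something you normally tune; the point is just that timeout needs to be comfortably longer than any stretch during which the worker can't reach its heartbeat.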
Alright. Just one more thing:
> The gthread worker is not asynchronous
It may be a little bit confusing that the gthread worker is not async but is listed under "AsyncIO Workers" at http://docs.gunicorn.org/en/stable/design.html#asyncio-workers. Besides that, using "threads" doesn't require asyncio, so that also raises questions for the reader. Just saying this from the perspective of a naive user - I'm sure it's all technically well-founded.

So, in a nutshell, the gthread worker is implemented with the asyncio lib but it spawns threads to handle sync code. Correct me if wrong.
Glad you asked!
The threaded worker does not use asyncio and does not inherit from the base asynchronous worker class.
We should clarify the documentation. I think it may have been listed as async because the worker timeout is handled concurrently, making it behave more like the async workers than the sync worker with respect to ability to handle long requests and concurrent requests.
It would be great to clarify the documentation and make it more accurately describe all the workers.
Yeah, the gthread worker shouldn't be listed under asyncio workers. Maybe having a section that describes the design of each worker would be better?
Re-opening this so we can track it as work to clarify the section on worker types and timeouts.
@tilgovi
> --timeout is not meant as a request timeout. It's meant as a liveness check for workers. For sync workers, this functions as a request timeout because the worker cannot do anything other than process the request. The asynchronous workers heartbeat even while they are handling long running requests, so unless the worker blocks/freezes it won't be killed.
Is there a request timeout option available for async workers? In other words, how can one make the arbiter kill a worker that did not process a request within a specified time?
@aschatten there is not, unfortunately. See also #1658.
> kill a worker that did not process a request within a specified time
As a worker may be processing multiple requests concurrently, killing the whole worker because one request times out sounds pretty extreme. Wouldn't that result in all the other requests getting killed in vain?

I recall uWSGI was planning to introduce thread-based killing in 2.1 or so, though that probably applies to sync/threaded workers only (and my recollection on this is vague).
> As a worker may be processing multiple requests concurrently, killing the whole worker because one request times out sounds pretty extreme. Wouldn't that result in all the other requests getting killed in vain?
The approach can be the same as for max_request, where there is a separate implementation for every worker type.
We are working on a release this week, at which point it _may_ be time to branch for R20, where we plan to tackle a few major things. That might be the right time to make the current timeout into a proper request timeout for every worker type.
Commenting here instead of filing a separate issue since I'm trying to understand how timeout is supposed to work and I'm not sure whether this is a bug or not.
The IMO unexpected behaviour I'm seeing is this:
Every max-requests'th request (the one after which the worker will be restarted) times out, whereas the other requests complete successfully. In the example below, 4 requests are performed; requests 1, 2, and 4 succeed, whereas request 3 fails.
Relevant configuration:
```python
import time

def app(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain; charset=utf-8')])
    time.sleep(5)
    return [b"Hello World\n"]
```
gunicorn:
```
gunicorn --log-level debug -k gthread -t 4 --max-requests 3 "app:app"
...
[2018-02-08 10:11:59 +0200] [28592] [INFO] Starting gunicorn 19.7.1
[2018-02-08 10:11:59 +0200] [28592] [DEBUG] Arbiter booted
[2018-02-08 10:11:59 +0200] [28592] [INFO] Listening at: http://127.0.0.1:8000 (28592)
[2018-02-08 10:11:59 +0200] [28592] [INFO] Using worker: gthread
[2018-02-08 10:11:59 +0200] [28595] [INFO] Booting worker with pid: 28595
[2018-02-08 10:11:59 +0200] [28592] [DEBUG] 1 workers
[2018-02-08 10:12:06 +0200] [28595] [DEBUG] GET /
[2018-02-08 10:12:11 +0200] [28595] [DEBUG] Closing connection.
[2018-02-08 10:12:15 +0200] [28595] [DEBUG] GET /
[2018-02-08 10:12:20 +0200] [28595] [DEBUG] Closing connection.
[2018-02-08 10:12:23 +0200] [28595] [DEBUG] GET /
[2018-02-08 10:12:23 +0200] [28595] [INFO] Autorestarting worker after current request.
[2018-02-08 10:12:27 +0200] [28592] [CRITICAL] WORKER TIMEOUT (pid:28595)
[2018-02-08 10:12:27 +0200] [28595] [INFO] Worker exiting (pid: 28595)
[2018-02-08 10:12:28 +0200] [28595] [DEBUG] Closing connection.
[2018-02-08 10:12:28 +0200] [28599] [INFO] Booting worker with pid: 28599
[2018-02-08 10:12:32 +0200] [28599] [DEBUG] GET /
[2018-02-08 10:12:37 +0200] [28599] [DEBUG] Closing connection.
^C[2018-02-08 10:12:39 +0200] [28592] [INFO] Handling signal: int
```
Client:
```
[salonen@mac ~]$ curl http://127.0.0.1:8000
Hello World
[salonen@mac ~]$ curl http://127.0.0.1:8000
Hello World
[salonen@mac ~]$ curl http://127.0.0.1:8000
curl: (52) Empty reply from server
[salonen@mac ~]$ curl http://127.0.0.1:8000
Hello World
```
What should be the plan there? I have in mind the following:
Should it be 20.0 or could we postpone it?
postponing.
Hey, so this won't be part of 20.0?
> That might be the right time to make the current timeout into a proper request timeout for every worker type.
Clarified. @lucas03 it's unclear what a request timeout means there. Please open a ticket if you need something specific.