Hi,
Is there something that should be done with Gunicorn regarding idle connections?
I've stumbled upon a problem closely related to this, but somewhat worse. By default, Amazon Elastic Load Balancer (ELB) keeps a connection to the application server open, closing it only after no data has been sent for 60s. This exhausts the available Gunicorn workers very quickly, causing new connections to wait up to a minute to be answered.
As a workaround, I've reduced the ELB idle timeout to 1s, but this doesn't cover every use case. A high-latency client that takes more than a second to complete any step of the connection will face a 504 Gateway Timeout error.
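(For reference, that timeout change can also be made programmatically; a rough boto3 sketch, with a placeholder load balancer name:)
# Sketch only: lower the idle timeout on a classic ELB via boto3.
# "my-elb" is a placeholder, not the actual load balancer name.
import boto3

elb = boto3.client("elb")  # classic ELB API
elb.modify_load_balancer_attributes(
    LoadBalancerName="my-elb",
    LoadBalancerAttributes={"ConnectionSettings": {"IdleTimeout": 1}},
)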
I've tried other approaches, like changing the Gunicorn worker class (gevent, eventlet, etc.) and putting it behind an nginx reverse proxy, but none of them helped. Does anyone have any ideas about what else I can do?
Regards,
Tiago.
What happened when you used gevent? That should have alleviated the problem, assuming you created enough workers (processes) and worker_connections (greenlets) to handle your concurrent load.
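Something along these lines, as a rough gunicorn.conf.py sketch (the numbers are illustrative, not a recommendation):
# gunicorn.conf.py -- illustrative values, not a tested configuration.
bind = "0.0.0.0:8000"
worker_class = "gevent"    # cooperative workers; idle keep-alive sockets don't block a process
workers = 4                # one process per core is a common starting point
worker_connections = 1000  # greenlets per worker; bounds concurrent connections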
Ron,
The behavior didn't change, whether using the default "sync" worker class or "gevent". The default number of worker connections is 1000, so that shouldn't be the problem (I was testing a non-production URL).
What is meant by "exhausts the ... workers"? The workers stop serving requests and just hang in keep-alive?
Randall,
Yes, exactly. There's no problem if I access the application directly using a web browser, curl, or wget. But if I try to access it through an ELB, it can hang for up to 60s (the default idle timeout), waiting for an idle connection to be closed.
@myhro the number of worker_connections has no impact on the sync worker. It's more likely a keepalive issue. ELB can be tricked into not keeping connections alive.
Anyway, do you have any logs on the gunicorn side that could help us? Also, can you share your ELB configuration?
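For instance, with an async worker you can raise gunicorn's keepalive above the ELB idle timeout; a rough sketch (the sync worker ignores this setting):
# gunicorn.conf.py -- sketch; only async workers honor keepalive.
worker_class = "gevent"
keepalive = 75  # seconds; longer than the ELB's default 60s idle timeout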
@myhro any news?
Hi Benoit,
Sorry for the delay. Moving between jobs and cities here.
Anyway, do you have any logs on the gunicorn side that could help us?
Started this example application:
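(The app itself isn't shown in the thread; a minimal Flask web.py matching the web:app entry point could look like this:)
# web.py -- a stand-in for the test app; the original isn't shown here.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello, world!\n"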
$ gunicorn --bind 0.0.0.0:8000 --log-level debug web:app
(...)
[2016-07-27 14:43:50 +0000] [8340] [INFO] Starting gunicorn 19.6.0
[2016-07-27 14:43:50 +0000] [8340] [DEBUG] Arbiter booted
[2016-07-27 14:43:50 +0000] [8340] [INFO] Listening at: http://0.0.0.0:8000 (8340)
[2016-07-27 14:43:50 +0000] [8340] [INFO] Using worker: sync
[2016-07-27 14:43:50 +0000] [8345] [INFO] Booting worker with pid: 8345
[2016-07-27 14:43:50 +0000] [8340] [DEBUG] 1 workers
[2016-07-27 14:43:53 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:43:56 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:43:58 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:44:00 +0000] [8345] [DEBUG] GET /
So far, things are pretty straightforward: requests are answered instantly when accessing Gunicorn directly. The problem starts when the instance is added to the load balancer:
[2016-07-27 14:44:32 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:44:32 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:44:33 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:45:04 +0000] [8340] [CRITICAL] WORKER TIMEOUT (pid:8345)
[2016-07-27 14:45:04 +0000] [8345] [INFO] Worker exiting (pid: 8345)
[2016-07-27 14:45:04 +0000] [8348] [INFO] Booting worker with pid: 8348
[2016-07-27 14:45:33 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:33 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:33 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:33 +0000] [8348] [DEBUG] GET /
[2016-07-27 14:45:34 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:34 +0000] [8348] [DEBUG] GET /
[2016-07-27 14:45:37 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:37 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:37 +0000] [8348] [DEBUG] GET /
[2016-07-27 14:45:37 +0000] [8348] [DEBUG] GET /
[2016-07-27 14:45:37 +0000] [8348] [DEBUG] GET /
At this point, some requests are answered and some aren't. The ELB health check hits some timeouts, decides the instance isn't working anymore ("Instance has failed at least the Unhealthy Threshold number of health checks consecutively."), and kicks it from its instance pool.
Also, can you share your ELB configuration?
Pretty much the default:
Port Configuration:
  80 (HTTP) forwarding to 8000 (HTTP)
  Stickiness: Disabled
Connection Settings:
  Idle timeout: 60 seconds
  Connection Draining: Enabled, 300 seconds
Health Check:
  Ping Target: HTTP:8000/
  Timeout: 5 seconds
  Interval: 30 seconds
  Unhealthy threshold: 2
  Healthy threshold: 10
Is there anything else that I may provide to help you to debug this issue?
I'm going to close this as the issue is very old and has not been updated in some years now. As far as I understand it, some issues like these may be solved by moving from ELB to ALB, as ALB does not pre-open connections to the backend. See https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html