Hi,
Is there something that should be done with Gunicorn regarding idle connections?
I've stumbled upon a problem closely related to this, but somewhat worse. By default, Amazon Elastic Load Balancer (ELB) keeps a connection to the application server open, closing it only after no data has been sent for 60s. This exhausts the available Gunicorn workers very quickly, causing new connections to wait up to a minute to be answered.
As a workaround, I've reduced the ELB idle timeout to 1s, but this doesn't cover every use case. A high-latency client that takes more than a second to complete any step of the connection will face a 504 Gateway Timeout error.
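(For reference, that timeout change can also be made programmatically; a rough boto3 sketch, with a placeholder load balancer name:)
# Sketch only: lower the idle timeout on a classic ELB via boto3.
# "my-elb" is a placeholder, not the actual load balancer name.
import boto3

elb = boto3.client("elb")  # classic ELB API
elb.modify_load_balancer_attributes(
    LoadBalancerName="my-elb",
    LoadBalancerAttributes={"ConnectionSettings": {"IdleTimeout": 1}},
)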
I've tried other approaches, like changing the Gunicorn worker class (gevent, eventlet, etc.) and putting it behind an nginx reverse proxy, but none of them helped. Does anyone have any ideas about what else I can do?
Regards,
Tiago.
What happened when you used gevent? That should have alleviated the problem, assuming you created enough workers (processes) and worker_connections (greenlets) to handle your concurrent load.
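Something along these lines, as a rough gunicorn.conf.py sketch (the numbers are illustrative, not a recommendation):
# gunicorn.conf.py -- illustrative values, not a tested configuration.
bind = "0.0.0.0:8000"
worker_class = "gevent"    # cooperative workers; idle keep-alive sockets don't block a process
workers = 4                # one process per core is a common starting point
worker_connections = 1000  # greenlets per worker; bounds concurrent connections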
Ron,
The behavior didn't change, whether using the default "sync" worker class or "gevent". The default number of worker connections is 1000, so that shouldn't be the problem (I was testing a non-production URL).
What is meant by "exhausts the ... workers"? The workers stop serving requests and just hang in keep-alive?
Randall,
Yes, exactly. There's no problem if I access the application directly using a web browser, curl, or wget. But if I try to access it through an ELB, it can hang for up to 60s (the default idle timeout), waiting for an idle connection to be closed.
@myhro the number of worker_connections has no impact on the sync worker. It's more likely a keepalive issue. ELB can be tricked into not keeping connections alive.
Anyway, do you have any logs on the gunicorn side that could help us? Also, can you share your ELB configuration?
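For instance, with an async worker you can raise gunicorn's keepalive above the ELB idle timeout; a rough sketch (the sync worker ignores this setting):
# gunicorn.conf.py -- sketch; only async workers honor keepalive.
worker_class = "gevent"
keepalive = 75  # seconds; longer than the ELB's default 60s idle timeout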
@myhro any news?
Hi Benoit,
Sorry for the delay. Moving between jobs and cities here.
Anyway, do you have any logs on the gunicorn side that could help us?
Started this example application:
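(The app itself isn't shown in the thread; a minimal Flask web.py matching the web:app entry point could look like this:)
# web.py -- a stand-in for the test app; the original isn't shown here.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello, world!\n"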
$ gunicorn --bind 0.0.0.0:8000 --log-level debug web:app
(...)
[2016-07-27 14:43:50 +0000] [8340] [INFO] Starting gunicorn 19.6.0
[2016-07-27 14:43:50 +0000] [8340] [DEBUG] Arbiter booted
[2016-07-27 14:43:50 +0000] [8340] [INFO] Listening at: http://0.0.0.0:8000 (8340)
[2016-07-27 14:43:50 +0000] [8340] [INFO] Using worker: sync
[2016-07-27 14:43:50 +0000] [8345] [INFO] Booting worker with pid: 8345
[2016-07-27 14:43:50 +0000] [8340] [DEBUG] 1 workers
[2016-07-27 14:43:53 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:43:56 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:43:58 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:44:00 +0000] [8345] [DEBUG] GET /
So far, things are pretty straightforward: requests are answered instantly when accessing Gunicorn directly. The problem starts when the instance is added to the load balancer:
[2016-07-27 14:44:32 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:44:32 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:44:33 +0000] [8345] [DEBUG] GET /
[2016-07-27 14:45:04 +0000] [8340] [CRITICAL] WORKER TIMEOUT (pid:8345)
[2016-07-27 14:45:04 +0000] [8345] [INFO] Worker exiting (pid: 8345)
[2016-07-27 14:45:04 +0000] [8348] [INFO] Booting worker with pid: 8348
[2016-07-27 14:45:33 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:33 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:33 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:33 +0000] [8348] [DEBUG] GET /
[2016-07-27 14:45:34 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:34 +0000] [8348] [DEBUG] GET /
[2016-07-27 14:45:37 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:37 +0000] [8348] [DEBUG] Closing connection.
[2016-07-27 14:45:37 +0000] [8348] [DEBUG] GET /
[2016-07-27 14:45:37 +0000] [8348] [DEBUG] GET /
[2016-07-27 14:45:37 +0000] [8348] [DEBUG] GET /
At this point, some requests are answered and some aren't. The ELB health check hits some timeouts, decides the instance isn't working anymore ("Instance has failed at least the Unhealthy Threshold number of health checks consecutively."), and kicks it from its instance pool.
Also, can you share your ELB configuration?
Pretty much the default:
Port Configuration:
  80 (HTTP) forwarding to 8000 (HTTP)
  Stickiness: Disabled
Connection Settings:
  Idle timeout: 60 seconds
  Connection Draining: Enabled, 300 seconds
Health Check:
  Ping Target: HTTP:8000/
  Timeout: 5 seconds
  Interval: 30 seconds
  Unhealthy threshold: 2
  Healthy threshold: 10
Is there anything else that I may provide to help you to debug this issue?
I'm going to close this as the issue is very old and has not been updated in some years now. As far as I understand it, some issues like these may be solved by moving from ELB to ALB, as ALB does not pre-open connections to the backend. See https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html