I'm investigating a strange problem that occurs when using gunicorn 19.0.0 or newer with an application that uses my Flask-SocketIO server, which upgrades client connections to WebSocket.
The issue, as observed from the client side, is that the connection is extremely slow. Upon inspection of the Heroku logs, the slowness appears to be caused by constant timeouts and reconnects. The timeouts occur with both the eventlet and gevent workers, each of which uses a completely different WebSocket implementation, which I think eliminates eventlet, gevent and the server-side WebSocket code as possible culprits. You can see detailed Heroku logs on this bug report.
I discovered that switching to gunicorn 18 makes the problem disappear completely. With 18 the WebSocket connection is rock solid.
I have also found that using any of the 19.x releases locally (without a reverse proxy in front of it) works fine, so it appears the problem is related to having the Heroku router/reverse proxy in front of gunicorn.
I went through the list of changes in release 19.0.0 and can't really see anything that could be causing this. Has anyone seen this problem before? Any idea what can cause it?
Edit: forgot to mention that in all local and heroku tests to troubleshoot this issue I used Python 2.7.10.
Sounds like a possible regression. @benoitc @berkerpeksag should we add this to milestone 20?
@tilgovi definitely. I can see 56b5f4038f60626c439e4f6be39128c63d452c53 as a possible cause. @miguelgrinberg can you try reverting that change and testing, by chance?
@miguelgrinberg also which version of python are you using?
@benoitc All the tests are Python 2.7.10. I tried reverting the commit you indicated, but the result was the same (i.e. same failures on Heroku, works ok locally).
Note that reverting the commit on top of master was not clean, I had to manually adjust a couple minor things. You can see my reverted commit here.
I need to create a Heroku account to test this issue. Will try that ASAP.
@miguelgrinberg it looks like the issue has been solved, could you confirm?
No, unfortunately I still see the reconnection errors with 19.4.5 and with master. Here is the app on heroku, if you want to see the errors: http://socketio1.herokuapp.com/ (see the errors in the browser's console). The normal behavior for this app is to establish a websocket connection and then the console log should stay quiet, as all the communications occur on the socket. With gunicorn 18.0 this application runs without any errors.
Can you post some gunicorn logs in debug mode? Looking at the Flask error, it seems likely that Heroku is closing the request, but I'm not sure.
@benoitc So I took a closer look at the problem. As a reminder, this problem only occurs when this application is running on Heroku. The same application, running locally with the same gunicorn version, works fine.
Using gunicorn master, I reproduce the problem with this Procfile:
web: gunicorn -k eventlet app:app
If I add `--spew` to the command, then strangely the problem goes away and the WebSocket connection works fine (though somewhat slowly, probably due to the copious amount of debug logging). Without `--spew`, the WebSocket connection is never established; the client keeps trying to connect. I noticed the following stack trace in the log:
2016-01-24T01:56:12.013385+00:00 app[web.1]: [2016-01-24 01:56:12 +0000] [10] [ERROR] Error handling request /socket.io/?EIO=3&transport=websocket&sid=36cf395c8e6446039e9ea956a42a3618
2016-01-24T01:56:12.013389+00:00 app[web.1]: Traceback (most recent call last):
2016-01-24T01:56:12.013390+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/workers/async.py", line 52, in handle
2016-01-24T01:56:12.013390+00:00 app[web.1]: self.handle_request(listener_name, req, client, addr)
2016-01-24T01:56:12.013391+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/workers/async.py", line 114, in handle_request
2016-01-24T01:56:12.013392+00:00 app[web.1]: resp.close()
2016-01-24T01:56:12.013392+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/http/wsgi.py", line 408, in close
2016-01-24T01:56:12.013393+00:00 app[web.1]: self.send_headers()
2016-01-24T01:56:12.013393+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/http/wsgi.py", line 324, in send_headers
2016-01-24T01:56:12.013394+00:00 app[web.1]: tosend = self.default_headers()
2016-01-24T01:56:12.013394+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/http/wsgi.py", line 305, in default_headers
2016-01-24T01:56:12.013395+00:00 app[web.1]: elif self.should_close():
2016-01-24T01:56:12.013396+00:00 app[web.1]: if self.status_code < 200 or self.status_code in (204, 304):
2016-01-24T01:56:12.013396+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/http/wsgi.py", line 235, in should_close
2016-01-24T01:56:12.013397+00:00 app[web.1]: AttributeError: 'Response' object has no attribute 'status_code'
I think this is an actual bug in the `Response` class. Note that the `status_code` attribute is only added to the object when `start_response` is called; the constructor does not initialize this attribute. Looking at the WebSocket code in eventlet, it seems `start_response` is never called during the entire handshake: eventlet writes the `101 Switching Protocols` response directly to the socket. That, I think, explains this stack trace. It does not explain why this never errors in gunicorn 18, which seems to have the same problem in the `Response` class.
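To illustrate, here is a minimal, self-contained sketch of that failure mode (a hypothetical class for illustration, not gunicorn's actual code): the attribute exists only after `start_response` runs, so any code path that skips it, like eventlet writing the 101 handshake directly to the socket, trips over the missing attribute later.

```python
class Response:
    def __init__(self):
        self.headers_sent = False
        # note: no self.status_code here -- mirrors the reported bug

    def start_response(self, status):
        # status_code is created only on this code path
        self.status_code = int(status.split(" ", 1)[0])

    def should_close(self):
        # raises AttributeError if start_response was never called
        return self.status_code < 200 or self.status_code in (204, 304)


resp = Response()
try:
    # the WebSocket handshake path never called start_response
    resp.should_close()
    crashed = False
except AttributeError:
    crashed = True
print(crashed)  # True: the same AttributeError as in the traceback
```

Initializing `status_code` in the constructor (or guarding `should_close`) would make the close path safe even when the response bypasses `start_response`.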
I'll keep digging and if I can make it work I'll submit a patch.
Well, after some more debugging I found what's different between R18 and R19 that breaks websocket applications.
Applications based on the Flask-SocketIO extension are stateful. If multiple workers are used, then sticky sessions are required at the load balancer. Unfortunately gunicorn does not support sticky sessions, so when using gunicorn the application must run with a single worker.
The difference between R18 and R19 is that R19 uses the `WEB_CONCURRENCY` environment variable to determine the default number of workers, while R18 defaults to one worker. Heroku sets `WEB_CONCURRENCY=2`, so when running under R19 the application runs in an unsupported configuration. So that was it: adding `-w 1` to the command line makes everything work.
I think the problem with the response class I indicated above should be addressed, though. I think there is also a problem with the `ALREADY_HANDLED` hack, which looks like it was put in specifically to support the WebSocket handshake. The problem is that eventlet uses its own `ALREADY_HANDLED` variable in its WebSocket code. Maybe gunicorn should accept eventlet's sentinel as well as its own.
But anyway, the main problem turned out to be a false alarm. I should have investigated sooner; sorry for dragging this issue out for so long!
@miguelgrinberg thanks a lot for your investigation! Seems like you spotted the right issue.
Anyway, how does Flask-SocketIO store the sticky session? It seems like it can use Redis, but... Just asking because I may have a solution for it in the upcoming branch of gunicorn.
Thanks again for all the work above. I will leave the ticket open for now, so I can work on it.
@benoitc Flask-SocketIO keeps the client session in memory. If you have one server process, then it is easy, all the clients are kept in that single process. When working with multiple servers, each server owns a subset of the clients, so in that case, it is required that all the requests for a given client are always sent to the same server. Any operations that require working on the complete client list are coordinated through messages on a message queue such as Redis or RabbitMQ.
The setup I recommend for the multi-server scenario is nginx as the load balancer with the `ip_hash` option to enable sticky sessions. Behind nginx, multiple gunicorn processes with a single worker each, using either eventlet or gevent. Then a Redis store is added for the server-to-server communication.
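As a sketch (the upstream name and ports are placeholders, not from the thread), the nginx side of that setup might look like:

```nginx
upstream socketio_backends {
    # ip_hash pins each client IP to one backend: sticky sessions
    ip_hash;
    server 127.0.0.1:5000;   # gunicorn -k eventlet -w 1 instance
    server 127.0.0.1:5001;   # another single-worker gunicorn instance
}

server {
    listen 80;
    location / {
        proxy_pass http://socketio_backends;
        # headers needed to pass the WebSocket upgrade through nginx
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
    }
}
```

Each backend keeps its own clients in memory, and the Redis message queue handles any operation that spans clients on different backends.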
It would be awesome if you decide to support something similar to nginx's `ip_hash` dispatching. That would enable Flask-SocketIO multi-server setups on Heroku and similar platforms where the load balancer is not controlled by the user.
I'm going to close this per the diagnosis in https://github.com/benoitc/gunicorn/issues/1147#issuecomment-174245319.
If anyone is interested, I see three possible new issues. Without judgment, they are:
@miguelgrinberg how did you finally get the app running on Heroku? I'm having a problem deploying the socketio app; I'd appreciate any help.
@Jaysins the example gunicorn commands I have in the Socket.IO documentation should work on Heroku. You need to force a single worker process with `-w 1`, and use eventlet or gevent. If you do those two things, I think Socket.IO should work.
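Concretely (with `module:app` standing in for the actual application entry point), the Procfile line would be:

```
web: gunicorn --worker-class eventlet -w 1 module:app
```

The `-w 1` is the important part; it overrides Heroku's `WEB_CONCURRENCY=2` default discussed earlier in this thread.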
@miguelgrinberg where can i see the documentation please?
@Jaysins https://flask-socketio.readthedocs.io/en/latest/#gunicorn-web-server
So you're saying that if I include `gunicorn --worker-class eventlet -w 1 module:app` in my Procfile and follow the usual Flask steps, it should work fine? I'm not really a Flask person, just branched out, so I'm kind of new to most of these things.
@Jaysins Help with what? Shouldn't you talk to Heroku to solve your login issue if you can't solve it on your own?
aight thanks
@miguelgrinberg sorry, but does eventlet allow just one user at a time?
@Jaysins the default for gunicorn is 1000 simultaneous connections for eventlet or gevent. If you only get one, my guess is that you are doing something that is not async-friendly and is blocking everything.
@miguelgrinberg thanks, you really helped a lot
@miguelgrinberg Given the fact that gunicorn can only hold 1000 simultaneous connections for eventlet/gevent, do you have any suggestions as to how I could scale my application to accommodate more websocket connections on heroku? For background information, my app is supposed to allow users to have multiple private real-time conversations in the browser which will require multiple connections per users. I'm using flask_socketio. Any advice would be highly appreciated.
@cheickmec the 1000 limit is a default; you can increase it with the `--worker-connections` argument, though I'm not sure a Heroku dyno can handle that many connections anyway. Flask-SocketIO supports horizontal scaling, but I have never investigated how feasible that option is on Heroku, as you will need some sort of load balancer that can distribute requests between multiple dynos. See https://flask-socketio.readthedocs.io/en/latest/#using-multiple-workers for information on horizontally scaling a Flask-SocketIO app.
Side note: did you notice that this is the gunicorn issue tracker? I'm not sure this unrelated discussion belongs here.
Thank you @miguelgrinberg
Thanks @miguelgrinberg
Changing my Procfile to what you've got listed in the documentation with one worker fixed the heroku socketio connection issues.
I did `gunicorn --worker-class eventlet -w 1 module:app` but it's still not working.
Same here. I've tried multiple variations of eventlet/gevent with gunicorn 19.x and gunicorn 18.0.0, and have tried all the Procfiles listed above. Here's a more detailed summary: https://stackoverflow.com/questions/52143869/flask-socketio-takes-extremely-long-to-connect-on-heroku
Still unable to get a connection with heroku to flask_socketio