I'm investigating a strange problem that occurs when using gunicorn 19.0.0 or newer with an application that uses my Flask-SocketIO server, which upgrades client connections to WebSocket.
The issue, as observed from the client side, is that the connection is extremely slow. Upon inspection of the Heroku logs, the slowness appears to be caused by constant timeouts and reconnects. The timeouts occur with both the eventlet and gevent workers, each of which uses a completely different WebSocket implementation, which I think eliminates eventlet, gevent and the server-side WebSocket code as possible culprits. You can see detailed Heroku logs on this bug report.
I discovered that switching to gunicorn 18 makes the problem disappear completely. With 18 the WebSocket connection is rock solid.
I have also found that using any of the 19.x releases locally (without a reverse proxy in front of it) works fine, so it appears the problem is related to having the Heroku router/reverse proxy in front of gunicorn.
I went through the list of changes in release 19.0.0 and can't really see anything that could be causing this. Has anyone seen this problem before? Any idea what can cause it?
Edit: forgot to mention that in all local and heroku tests to troubleshoot this issue I used Python 2.7.10.
Sounds like a possible regression. @benoitc @berkerpeksag should we add this to milestone 20?
@tilgovi definitely. I can see 56b5f4038f60626c439e4f6be39128c63d452c53 as a possible cause. @miguelgrinberg can you try reverting that change and testing, by chance?
@miguelgrinberg also which version of python are you using?
@benoitc All the tests are Python 2.7.10. I tried reverting the commit you indicated, but the result was the same (i.e. same failures on Heroku, works ok locally).
Note that reverting the commit on top of master was not clean, I had to manually adjust a couple minor things. You can see my reverted commit here.
I need to create a Heroku account to test this issue. Will try that ASAP.
@miguelgrinberg it looks like the issue has been solved, could you confirm?
No, unfortunately I still see the reconnection errors with 19.4.5 and with master. Here is the app on heroku, if you want to see the errors: http://socketio1.herokuapp.com/ (see the errors in the browser's console). The normal behavior for this app is to establish a websocket connection and then the console log should stay quiet, as all the communications occur on the socket. With gunicorn 18.0 this application runs without any errors.
Can you post some gunicorn logs in debug mode? Looking at the Flask error, it seems likely that Heroku is closing the request, but I'm not sure.
@benoitc So I took a closer look at the problem. As a reminder, this problem only occurs when this application is running on Heroku. The same application, running locally with the same gunicorn version, works fine.
Using gunicorn master, I reproduce the problem with this Procfile:
web: gunicorn -k eventlet app:app
If I add `--spew` to the command, then strangely the problem goes away and the WebSocket connection works fine (though somewhat slowly, probably due to the copious amount of debug logging). Without `--spew`, the WebSocket connection is never established; the client keeps trying to connect. I noticed the following stack trace in the log:
2016-01-24T01:56:12.013385+00:00 app[web.1]: [2016-01-24 01:56:12 +0000] [10] [ERROR] Error handling request /socket.io/?EIO=3&transport=websocket&sid=36cf395c8e6446039e9ea956a42a3618
2016-01-24T01:56:12.013389+00:00 app[web.1]: Traceback (most recent call last):
2016-01-24T01:56:12.013390+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/workers/async.py", line 52, in handle
2016-01-24T01:56:12.013390+00:00 app[web.1]: self.handle_request(listener_name, req, client, addr)
2016-01-24T01:56:12.013391+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/workers/async.py", line 114, in handle_request
2016-01-24T01:56:12.013392+00:00 app[web.1]: resp.close()
2016-01-24T01:56:12.013392+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/http/wsgi.py", line 408, in close
2016-01-24T01:56:12.013393+00:00 app[web.1]: self.send_headers()
2016-01-24T01:56:12.013393+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/http/wsgi.py", line 324, in send_headers
2016-01-24T01:56:12.013394+00:00 app[web.1]: tosend = self.default_headers()
2016-01-24T01:56:12.013394+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/http/wsgi.py", line 305, in default_headers
2016-01-24T01:56:12.013395+00:00 app[web.1]: elif self.should_close():
2016-01-24T01:56:12.013396+00:00 app[web.1]: if self.status_code < 200 or self.status_code in (204, 304):
2016-01-24T01:56:12.013396+00:00 app[web.1]: File "/app/.heroku/src/gunicorn-master/gunicorn/http/wsgi.py", line 235, in should_close
2016-01-24T01:56:12.013397+00:00 app[web.1]: AttributeError: 'Response' object has no attribute 'status_code'
I think this is an actual bug in the `Response` class. Note that the `status_code` attribute is only added to the object when `start_response` is called; the constructor does not initialize this attribute. Looking at the WebSocket code in eventlet, it seems `start_response` is never called during the entire handshake: eventlet writes the `101 Switching Protocols` response directly to the socket. That, I think, explains this stack trace. It does not explain why this never errors in gunicorn 18, which seems to have the same problem in the `Response` class.
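To illustrate, here is a minimal, self-contained sketch of that failure mode (a hypothetical class for illustration, not gunicorn's actual code): the attribute exists only after `start_response` runs, so any code path that skips it, like eventlet writing the 101 handshake directly to the socket, trips over the missing attribute later.

```python
class Response:
    def __init__(self):
        self.headers_sent = False
        # note: no self.status_code here -- mirrors the reported bug

    def start_response(self, status):
        # status_code is created only on this code path
        self.status_code = int(status.split(" ", 1)[0])

    def should_close(self):
        # raises AttributeError if start_response was never called
        return self.status_code < 200 or self.status_code in (204, 304)


resp = Response()
try:
    # the WebSocket handshake path never called start_response
    resp.should_close()
    crashed = False
except AttributeError:
    crashed = True
print(crashed)  # True: the same AttributeError as in the traceback
```

Initializing `status_code` in the constructor (or guarding `should_close`) would make the close path safe even when the response bypasses `start_response`.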
I'll keep digging and if I can make it work I'll submit a patch.
Well, after some more debugging I found what's different between R18 and R19 that breaks websocket applications.
Applications based on the Flask-SocketIO extension are stateful. If multiple workers are used, then sticky sessions are required at the load balancer. Unfortunately gunicorn does not support sticky sessions, so when using gunicorn the application must run with a single worker.
The difference between R18 and R19 is that R19 uses the `WEB_CONCURRENCY` environment variable to determine the default number of workers, while R18 defaults to one worker. Heroku sets `WEB_CONCURRENCY=2`, so when running under R19 the application runs in an unsupported configuration. So that was it: adding `-w 1` to the command line makes everything work.
I think the problem with the response class I indicated above should be addressed, though. I think there is also a problem with the `ALREADY_HANDLED` hack, which looks like it was put in specifically to support the WebSocket handshake. The problem is that eventlet uses its own `ALREADY_HANDLED` variable in its WebSocket code. Maybe gunicorn should accept eventlet's sentinel as well as its own.
But anyway, the main problem turned out to be a false alarm. I should have investigated sooner; sorry for dragging this issue out for so long!
@miguelgrinberg thanks a lot for your investigation! Seems like you spotted the right issue.
Anyway, how does Flask-SocketIO store the sticky session? It seems like it can use Redis, but... Just asking because I may have a solution for it in the upcoming branch of gunicorn.
Thanks again for all the work above. I will leave the ticket open for now, so I can work on it.
@benoitc Flask-SocketIO keeps the client session in memory. If you have one server process, then it is easy, all the clients are kept in that single process. When working with multiple servers, each server owns a subset of the clients, so in that case, it is required that all the requests for a given client are always sent to the same server. Any operations that require working on the complete client list are coordinated through messages on a message queue such as Redis or RabbitMQ.
The setup I recommend for the multi-server scenario is nginx as the load balancer with the `ip_hash` option to enable sticky sessions. Behind nginx, multiple gunicorn processes with a single worker each, using either eventlet or gevent. Then a Redis store is added for the server-to-server communication.
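As a sketch (the upstream name and ports are placeholders, not from the thread), the nginx side of that setup might look like:

```nginx
upstream socketio_backends {
    # ip_hash pins each client IP to one backend: sticky sessions
    ip_hash;
    server 127.0.0.1:5000;   # gunicorn -k eventlet -w 1 instance
    server 127.0.0.1:5001;   # another single-worker gunicorn instance
}

server {
    listen 80;
    location / {
        proxy_pass http://socketio_backends;
        # headers needed to pass the WebSocket upgrade through nginx
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
    }
}
```

Each backend keeps its own clients in memory, and the Redis message queue handles any operation that spans clients on different backends.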
It would be awesome if you decide to support something similar to nginx's `ip_hash` dispatching. That would enable Flask-SocketIO multi-server setups on Heroku and similar platforms where the load balancer is not controlled by the user.
I'm going to close this per the diagnosis in https://github.com/benoitc/gunicorn/issues/1147#issuecomment-174245319.
If anyone is interested, I see three possible new issues. Without judgment, they are:
@miguelgrinberg how did you finally get the app running on Heroku? I'm having a problem deploying the socketio app; I'd appreciate any help.
@Jaysins the example gunicorn commands I have in the Socket.IO documentation should work on Heroku. You need to force a single worker process with `-w 1`, and use eventlet or gevent. If you do those two things, I think Socket.IO should work.
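Concretely (with `module:app` standing in for the actual application entry point), the Procfile line would be:

```
web: gunicorn --worker-class eventlet -w 1 module:app
```

The `-w 1` is the important part; it overrides Heroku's `WEB_CONCURRENCY=2` default discussed earlier in this thread.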
@miguelgrinberg where can i see the documentation please?
@Jaysins https://flask-socketio.readthedocs.io/en/latest/#gunicorn-web-server
So you're saying that if I include `gunicorn --worker-class eventlet -w 1 module:app` in my Procfile and follow the usual Flask steps, it should work fine? I'm not really a Flask person, just branched out, so I'm kind of new to most of these things.
@Jaysins Help with what? Shouldn't you talk to Heroku to solve your login issue if you can't solve it on your own?
aight thanks
@miguelgrinberg sorry, but does eventlet allow just one user at a time?
@Jaysins the default for gunicorn is 1000 simultaneous connections for eventlet or gevent. If you only get one, my guess is that you are doing something that is not async-friendly and is blocking everything.
@miguelgrinberg thanks, you really helped a lot
@miguelgrinberg Given the fact that gunicorn can only hold 1000 simultaneous connections for eventlet/gevent, do you have any suggestions as to how I could scale my application to accommodate more websocket connections on heroku? For background information, my app is supposed to allow users to have multiple private real-time conversations in the browser which will require multiple connections per users. I'm using flask_socketio. Any advice would be highly appreciated.
@cheickmec the 1000 limit is a default; you can increase it with the `--worker-connections` argument, though I'm not sure a Heroku dyno can handle that many connections anyway. Flask-SocketIO supports horizontal scaling, but I have never investigated how feasible that option is on Heroku, as you will need some sort of load balancer that can distribute requests between multiple dynos. See https://flask-socketio.readthedocs.io/en/latest/#using-multiple-workers for information on horizontally scaling a Flask-SocketIO app.
Side note: did you notice that this is the gunicorn issue tracker? I'm not sure this unrelated discussion belongs here.
Thank you @miguelgrinberg
Thanks @miguelgrinberg
Changing my Procfile to what you've got listed in the documentation with one worker fixed the heroku socketio connection issues.
I did `gunicorn --worker-class eventlet -w 1 module:app` but it's still not working.
Same here. I've tried multiple variations of eventlet/gevent with gunicorn 19.x and gunicorn 18.0.0, and have tried all the Procfiles listed above. Here's a more detailed summary: https://stackoverflow.com/questions/52143869/flask-socketio-takes-extremely-long-to-connect-on-heroku
Still unable to get a connection with heroku to flask_socketio