Gunicorn: POST fails when returning >13k response on Heroku

Created on 4 Aug 2014  路  34Comments  路  Source: benoitc/gunicorn

Hi, we hit this issue in production using Flask + Gunicorn + Heroku and couldn't find a cause or a workaround.

For one particular POST request with POST parameters, the request would fail with an H18 error (sock=backend) in Heroku's router indicating that the server closed the socket when it shouldn't have.

We started decreasing the response size of that failing endpoint until we narrowed it down to around the 13k mark. If we sent less than 13k, the response would always work. If we sent more than 13k, the response would almost always not work.

Code to reproduce this is available at https://github.com/erjiang/gunicorn-issue - just deploy the repo to Heroku as is and follow the instructions in the README.

( Feedback Requested unconfirmed help wanted - Bugs -

Most helpful comment

I was able to reproduce using the testcase at https://github.com/erjiang/gunicorn-issue (which is using gunicorn 19.9.0, Python 2.7.14, sync worker, --workers 4). Of note is that gunicorn's access log output reports it believes it has returned an HTTP 200.

Updating to Python 3.7.3 + gunicorn master, and reducing to --workers 1 had no effect on reproducibility, however switching from sync worker to gevent made the error occur less frequently (though it still did). Using --log-level debug didn't reveal anything significant (the only additional output during the request was the [DEBUG] POST /test1 line).

Next I tried --spew, however the issue no longer reproduced. This led me to try adding a time.sleep(1) before the resp.close() here which similarly prevented the issue.

As such it seems the cause is that the socket send buffer might not be empty at time of the close(), which can cause the response to be lost:

Note: close() releases the resource associated with a connection but does not necessarily close the connection immediately. If you want to close the connection in a timely fashion, call shutdown() before close().

(See https://docs.python.org/3/library/socket.html#socket.socket.close)

Adding sock.shutdown(socket.SHUT_RDWR) (docs) prior to the sock.close() here resolved the issue for me. An alternative fix perhaps might be to use SO_LINGER, though from what I've read it has trade-offs.

Docs on this subject are hard to come by, but I found:
https://stackoverflow.com/questions/8874021/close-socket-directly-after-send-unsafe
https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable

Hope that helps :-)

All 34 comments

Really helpful report, thanks @erjiang.

i've no heroku account to test from. Can anyone with such account test it? cc @tilgovi @kennethreitz

Happy to but I probably won't get to it soon.

As a quick sanity check I ran it locally and checked a few things with curl to compare waitress and gunicorn:

  • [x] Content-Length the same
  • [x] Same body content
  • [x] Same transfer encoding (neither one specifies chunked, both use Content-Length)

Next I'm curious about maybe whether there are differences at the TCP level. I'll tcpdump them and see if I notice anything fishy.

I did notice that even with the same curl line gunicorn drops the connection but waitress leaves it open. No clues from that yet, but it's the _only_ thing I could see that was different.

@tilgovi I guess that the behaviour you see with waitress could be reproduced with the threaded worker. Anyway thanks to take care of this :)

HI all,
I am experiencing the same problem. Did any of you got the chance to examine this issue more thoroughly?
@tilgovi @erjiang @benoitc

Cheers
Maxim

@maximkgn are you also using flask? Any more details?

I am using django 1.7.
We had a certain post response which was always longer than 13k, and with a certain probability ~0.5 the response in the client would be truncated to a bit over 13k. In the heroku logs we saw the same h18 error, and after we made sure no error is happening in our python code, we had to conclude it happens in the gunicorn layer between heroku and our python.
When we switched to waitress/uwsgi the bug stopped happening..

@maximkgn what happen if you use the --threads setting ?

Anyone able to test this?

I have the same problem with flask and gunicorn (tested versions 19.3 and 19.4.5). @benoitc I tried 1, 2, and 4 threads (with the --threads) option, and it makes no difference.

Let me know if I can help test this in any way?

@cbaines what the requests look like?

Friendpaste is able to accept more than 1M posts.... so there is certainly not a limit inside gunicorn.

never had an answer. closing the issue since it's not reproducible. Feel free to reopen one if needed.

Still reproduces after updating dependencies to include Flask 1.0.2 and gunicorn 19.9.0. Might be nice to get the attention of someone at Heroku about this though - I hear they have some dedicated Python people.

See latest commit here: https://github.com/erjiang/gunicorn-issue/

I am also receiving this H18 error on a large GET request regularly.

Switching to waitress did fix the issue. Not sure why gunicorn produces it, but the same exact code is being executed.

response body is 21.54 KB

Still reproduces after updating dependencies to include Flask 1.0.2 and gunicorn 19.9.0. Might be nice to get the attention of someone at Heroku about this though - I hear they have some dedicated Python people.

See latest commit here: https://github.com/erjiang/gunicorn-issue/

I created a support ticket on Heroku. Will update here if anything useful comes from it.

@benoitc looks like @erjiang provided a reproducible example. Could we open this back up?

Re-opened. I'll self-assign and take a look when I can.

Still reproduces after updating dependencies to include Flask 1.0.2 and gunicorn 19.9.0. Might be nice to get the attention of someone at Heroku about this though - I hear they have some dedicated Python people.
See latest commit here: https://github.com/erjiang/gunicorn-issue/

I created a support ticket on Heroku. Will update here if anything useful comes from it.

Did you get a reply from heroku?

I was able to reproduce using the testcase at https://github.com/erjiang/gunicorn-issue (which is using gunicorn 19.9.0, Python 2.7.14, sync worker, --workers 4). Of note is that gunicorn's access log output reports it believes it has returned an HTTP 200.

Updating to Python 3.7.3 + gunicorn master, and reducing to --workers 1 had no effect on reproducibility, however switching from sync worker to gevent made the error occur less frequently (though it still did). Using --log-level debug didn't reveal anything significant (the only additional output during the request was the [DEBUG] POST /test1 line).

Next I tried --spew, however the issue no longer reproduced. This led me to try adding a time.sleep(1) before the resp.close() here which similarly prevented the issue.

As such it seems the cause is that the socket send buffer might not be empty at time of the close(), which can cause the response to be lost:

Note: close() releases the resource associated with a connection but does not necessarily close the connection immediately. If you want to close the connection in a timely fashion, call shutdown() before close().

(See https://docs.python.org/3/library/socket.html#socket.socket.close)

Adding sock.shutdown(socket.SHUT_RDWR) (docs) prior to the sock.close() here resolved the issue for me. An alternative fix perhaps might be to use SO_LINGER, though from what I've read it has trade-offs.

Docs on this subject are hard to come by, but I found:
https://stackoverflow.com/questions/8874021/close-socket-directly-after-send-unsafe
https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable

Hope that helps :-)

Full STR:

  1. Create a free Heroku account at https://signup.heroku.com
  2. Install the Heroku CLI (see https://devcenter.heroku.com/articles/heroku-cli)
  3. Log into the CLI using heroku login
  4. git clone https://github.com/erjiang/gunicorn-issue && cd gunicorn-issue
  5. heroku create (this creates a free Heroku app with randomly generated name and configures a git remote named heroku)
  6. git push heroku master
  7. curl --data "foo=bar" https://YOUR_GENERATED_APP_NAME.herokuapp.com/test1 (fails >75% of the time)
  8. When finished, run heroku destroy to delete the app.

@tilgovi Sounds like @edmorley produced a plausible explanation of what's wrong. Are you able to take a look and see what the right fix is? I could also submit a PR to add sock.shutdown() but I don't know enough to say whether it's the right fix or whether it would negatively affect other situations.

Hello, I hit the same problem with 503 KB response size. Response data is JSON array.
Observed behavior is:

  1. I see truncated response body and the http client (Chrome, curl) is still waiting for the response.
  2. ~75% of requests experience response time between 120-130 seconds. The remaining requests resolve under 400 ms.
  3. Requests with small response size are fast.

It's reproducible on both:

  1. local Docker installation on Windows 10
  2. Running docker container on AWS ECS

Runtime environment setup
meinheld-gunicorn-docker image tagged as _python3.6_ with Python 3.6.7, Flask 1.0.2, flask-restplus 0.12.1, simpe Flask-caching

Docker configuration: 3 CPUs, RAM 1024 MB

Gunicorn configuration:

In https://github.com/benoitc/gunicorn/issues/2015 someone else had problems with a meinheld worker hanging, and using a different worker type solved the problem. I wonder if there's a general issue with it. @stapetro are you able to try a different worker?

Hello @jamadden ,
Your suggestion fixed the issue. There's no problem with both _gevent_ and _gthread_ worker classes. I moved away from meinheld. Thank you for the quick reply and help! :)

Full STR:

  1. Create a free Heroku account at https://signup.heroku.com
  2. Install the Heroku CLI (see https://devcenter.heroku.com/articles/heroku-cli)
  3. Log into the CLI using heroku login
  4. git clone https://github.com/erjiang/gunicorn-issue && cd gunicorn-issue
  5. heroku create (this creates a free Heroku app with randomly generated name and configures a git remote named heroku)
  6. git push heroku master
  7. curl --data "foo=bar" https://YOUR_GENERATED_APP_NAME.herokuapp.com/test1 (fails >75% of the time)
  8. When finished, run heroku destroy to delete the app.

I had very similar behaviour on my app, and found that when using curl -H instead of curl --data (since it's a GET request) it works for my app (Django, Gunicorn, Heruko). I have not tested on the gunicorn-issue app. Thought this might be useful to someone.

@mikkelhn Yesss. An app with Flask/Flask RestPlus and Gunicorn behaves in this way: replying to POST request gives 503 error [if payload > 13k], while the error does not happen if the app replies to a GET. Exactly the same code!
Can anyone explain this very annoying behavior? Is switching to waitress the only workaround to fix this issue? I feel that modifying Gunicorn "by hand" is not a viable solution...

I went ahead and opened a PR to call shutdown() before close(). Frankly, it's a bit wild that Heroku continues to recommend Gunicorn when it's broken by default on Heroku.

If, as @erijang correctly states, Heroku recommends Gunicorn when Gunicorn is not the way to go: which are simple and viable alternatives to Gunicorn (and how to best configure them on Heroku)?
AFAIK, many customers choose Heroku just because it should not require a deep knowledge in server architectures and configuration details... :|

@RinaldoNani what do you mean? Also which worker are we talking about? .

@benoitc This issue affects multiple worker types, as mentioned in:
https://github.com/benoitc/gunicorn/issues/840#issuecomment-482491267

Hi @benoitc. As I mentioned in a previous post, we have deployed a pretty simple Flask / FlaskRestPlus app on Heroku, following with care Heroku's guidelines for Python/Flask server side application deployment (which, as I understand, suggest the use of Gunicorn sync "web" workers).

The behavior of our app reflects the title of this thread.

Tested locally, everything works fine, the app delivers 20k+ JSON with no problems; but when the app is deployed on Heroku, the 503 error issue becomes systematic: even with literally no traffic, the output is not delivered.
As others pointed out, the logs show that at HTTP level everything seems ok (200 response code is logged).
If the payload is less than 13k, then Heroku/Gunicorn respond to POSTs as expected.
We followed @mikkelhn idea of avoiding POST (?!?) endpoints and use GET instead, and this seems a (not very nice) way to tackle the problem.

We are not Gunicorn experts, and frankly we expected that our simple use case could work "out of the box".
If you have any suggestions to help us out, we will be forever grateful :)

@RinaldoNani Shot in the dark... somewhere in your request handler, try reading all of request.data. For example:

@route('/whatever', methods=['POST'])
def whatever_handler():
    str(request.data)
    return flask.jsonify(...)

Does that have any effect on your errors?

I am writing this at 1:00AM after hustling with H18 issue for over 2 weeks now (couldn't wait to share).

I am working with huge dataset and responding 18K to 20K records to plot. H18 came as a very random error. It would work fine sometimes, but would throw "Content-header length doesn't match" on all the browsers. I tried almost all the solutions discussed about this issue but didn't have any luck. There were 2 things I tried which worked finally:

  1. Changed the POST request to GET.
  2. My data had NaN/Null values so I changed my model and provided a default value. (I think this solved the issue)
    After this, I stopped getting this error.
    Hope this could help someone!
Was this page helpful?
0 / 5 - 0 ratings