I'm seeing a similar error to the one documented at https://github.com/benoitc/gunicorn/issues/1507, except it occurs even if there isn't an exception thrown in my code.
I have a Django app, and I noticed that the following pseudocode causes any gunicorn worker to suddenly close the connection (even when the timeout is set very high) right after the call to logger.error flushes. The gunicorn worker dies with no logs emitted, and a new one spins up. Note that the logic causes a database model to be created that must be cleaned up if validation fails:
@api_view(["POST"])
def post_data(request):
try:
obj = create_object()
data = serialize(obj)
return Response(data)
except ValidationError as e:
if obj:
obj.delete()
logger.error("Validation failed")
return Response({'error': str(e)}, status=400)
Our error log is configured to send errors to Sentry.
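The exact config isn't shown here; as one possible shape, a raven-based setup routing errors to Sentry would look roughly like this (handler and logger names are illustrative, not our real ones):

# Rough sketch of the kind of LOGGING config in play (names illustrative);
# errors are routed to Sentry through the raven client's Django handler.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "sentry": {
            "level": "ERROR",
            "class": "raven.contrib.django.raven_compat.handlers.SentryHandler",
        },
    },
    "loggers": {
        "myapp": {"handlers": ["sentry"], "level": "ERROR"},
    },
}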
What fixes the problem:
Any of the following prevents gunicorn from killing the worker:
- Removing the call to obj.delete(). Note that when we delete this object, we make two external API calls.
- Moving logger.error above the line with obj.delete() (see the sketch after this list).
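For illustration, this is a minimal sketch of that second workaround, reusing the same placeholder names as the snippet above; it only reorders the except block, it doesn't address the underlying issue:

# Reordered variant that does NOT kill the worker: log first, then delete.
@api_view(["POST"])
def post_data(request):
    obj = None
    try:
        obj = create_object()
        data = serialize(obj)
        return Response(data)
    except ValidationError as e:
        logger.error("Validation failed")  # log first...
        if obj:
            obj.delete()                   # ...then delete (two external API calls)
        return Response({'error': str(e)}, status=400)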
Not a surprise, but when gunicorn closes the connection I see this from nginx: [error] 27#0: *552 upstream prematurely closed connection while reading response header from upstream...
Any ideas here?
Does disabling Sentry and just using file logging fix the problem also?
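For example, something along these lines, with the Sentry handler swapped for a plain file handler, would keep the logger.error call but take Sentry out of the path (names and path below are illustrative):

# Illustrative file-only variant for testing; no Sentry handler involved.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "file": {
            "level": "ERROR",
            "class": "logging.FileHandler",
            "filename": "/tmp/app-errors.log",  # placeholder path
        },
    },
    "loggers": {
        "myapp": {"handlers": ["file"], "level": "ERROR"},
    },
}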
No error on the gunicorn side either?
I ran into troubling intermittent 502 errors with a django app using gunicorn / nginx-proxy / postgres in a docker setup and believe this may be related.
In my case, I had nginx-proxy, a django/gunicorn and a postgres container all on the same default network.
The nginx error I was getting was:
564 upstream prematurely closed connection while reading response header from upstream, client: [client ip] [host], request: "GET [failed path] HTTP/2.0", upstream: "http:// [internal IP to the postgres server] :5432/ [failed path] /", host: "[host]"
I started thinking this was an nginx-proxy error, since something like this is mentioned over there.
I noticed that every one of these 502s was preceded by a postgres log entry, LOG: invalid length of startup packet. This is supposedly a benign error log from postgres.
The source of this error appears to be nginx's passive health check behavior, which I believe could be triggering postgres's incomplete startup packet error. Under nginx's default configuration, a single failure _or lack of response_ within the default 10-second fail_timeout will result in a 502.
I could show this by refreshing constantly and observing that the 502 timestamps appeared at 10-second intervals.
I am not sure about this, but I suspect gunicorn is detecting the "benign" postgres error and, because of #1507, treating an upstream error as an error with the client.
The result is that gunicorn closes the connection prematurely, which nginx then accurately reports as an upstream error of its own.
However, to answer @benoitc's question: no, there is no error reported on gunicorn's side, which made this even harder to figure out.
What worked for me
I went through many troubleshooting steps eliminating possibilities; @jamadden's mention of Sentry is what got me here (disabling that and Mixpanel were two of my earlier troubleshooting steps).
I finally tried setting fail_timeout on the postgres server in the upstream block to a very high number, and the 502s stopped. This is because nginx now waits a very long time before it takes issue with an error from gunicorn. Interestingly, setting fail_timeout=0 did not work for me, and actually caused the errors to happen constantly instead of "intermittently" or occasionally.
I can't say that this is exactly what is going on, or if #1507 is at fault. I am pretty sure my workaround could be improved, though I'm not sure what else to do and must move on.
I have not yet seen these issues between postgres, gunicorn, and nginx (or rather nginx-proxy) linked anywhere, so I hope this finds someone.
Here's the upstream block of my nginx.conf:
upstream my.hostname.com {
    ## Can be connected with "app_default" network
    # app_nginx-proxy-letsencrypt_1
    server 172.21.0.5 down;
    ## Can be connected with "app_default" network
    # app_web_1
    server 172.21.0.3:8000;
    ## Can be connected with "app_default" network
    # app_db_1
    server 172.21.0.2:5432 max_fails=1 fail_timeout=1000000000s;
}
If this is at all what a person is going through: nginx-proxy depends on docker-gen, so you can't make lasting manual changes to the conf file, but you can try this out by editing the conf inside the container, restarting nginx, and doing the refreshes to check the nginx output for errors. Good luck!
edit:
One more detail. In my process of troubleshooting this, I manually set my postgres server's logging threshold to panic, which caused the log entries that otherwise accompanied every 502 to disappear. However, the upstream-of-nginx failures pointing at the postgres server continued to occur at the 10-second interval used by nginx's passive health checks.
So if anything, I'd suppose that gunicorn is detecting the errors even when the postgres log entries are not being emitted.
The following Docker image was built against a more recent nginx-proxy version and includes @banagale's fix: hackbg/nginx-proxy.