Here is a reproducible example of the problem mentioned in https://github.com/benoitc/gunicorn/issues/1663.

`app_oom.py`:
```python
#!flask/bin/python
import logging
import traceback
from flask import Flask, jsonify

app = Flask(__name__)
app.logger.addHandler(logging.StreamHandler())
app.logger.setLevel('DEBUG')

@app.route('/hello_oom', methods=['GET'])
def hello_oom():
    try:
        huge_list = []
        while True:
            huge_list.extend([1] * 100000)
        logging.info('Success')
        return jsonify({'output': 'Success'}), 201
    except:
        logging.error(traceback.format_exc())
        return jsonify({'error': traceback.format_exc(), 'error_code': 10}), 500  # Unknown error

if __name__ == '__main__':
    app.run(host='0.0.0.0', debug=False)
```
`Dockerfile.oom`:
```dockerfile
FROM ubuntu:16.04
RUN apt update -y \
    && apt install -y gcc python3-pip python3-dev \
    && pip3 install flask==1.1.1 gunicorn==19.9.0
COPY app_oom.py app_oom.py
EXPOSE 8888
CMD ["gunicorn", "-b", "0.0.0.0:8888", "--timeout", "20000", "--workers", "1", "--log-level", "debug", "app_oom:app"]
```
```sh
docker build -t app_oom . -f Dockerfile.oom
docker run -ti -p 8888:8888 -m 500m --name app_oom --rm app_oom
```
When I make a request, there is no error message in the logs, only a message about a new worker booting:
```
[2019-12-06 14:48:56 +0000] [12] [INFO] Booting worker with pid: 12
[2019-12-06 14:48:56 +0000] [1] [DEBUG] 1 workers
[2019-12-06 14:48:58 +0000] [12] [DEBUG] GET /hello_oom
[2019-12-06 14:49:32 +0000] [14] [INFO] Booting worker with pid: 14
```
And the client gets no response (using `curl http://0.0.0.0:8888/hello_oom`):

```
curl: (52) Empty reply from server
```
@benoitc
I've just been hit by exactly the same issue with a Django app deployed on Kubernetes. In my case I had set a low memory limit, which caused the OOM killer to be invoked once that limit was exceeded.

Do you see evidence of this if you run `dmesg` on the test machine?
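A quick way to confirm OOM killer involvement is to scan the kernel log for its messages. A minimal sketch, assuming a Linux host where `dmesg` is readable by the current user (the exact wording of the kernel messages varies by kernel version):

```python
# Scan dmesg output for kernel OOM killer activity (sketch, Linux only).
import re
import subprocess

kernel_log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in kernel_log.splitlines():
    if re.search(r"out of memory|oom-kill|killed process", line, re.IGNORECASE):
        print(line)
```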
From my initial findings I'm not sure what gunicorn can do here: the OOM killer sends a SIGKILL, which isn't possible to handle or do cleanup from, so the process silently disappearing is completely expected. The code path which writes the "Booting worker" message is simply a loop that spawns a new worker whenever the worker count is lower than expected - https://github.com/benoitc/gunicorn/blob/d765f0d123fff5da0f36da8f087a8dd0da778411/gunicorn/arbiter.py#L615
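To illustrate why the worker can't log anything on its way out, here is a small self-contained sketch (not gunicorn code) showing that Python can register handlers for SIGTERM or SIGQUIT but not for SIGKILL:

```python
import signal

def handler(signum, frame):
    print(f"caught signal {signum}, could flush logs / clean up here")

# SIGTERM and SIGQUIT can be intercepted, so a process gets a chance to react.
signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGQUIT, handler)

# SIGKILL cannot: the kernel terminates the process without giving it any say.
try:
    signal.signal(signal.SIGKILL, handler)
except OSError as exc:
    print(f"registering a SIGKILL handler fails: {exc}")
```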
I'll wait for others to comment in case I have this wrong, but I think this is a documentation issue: this failure mode should be explained. I'd be happy to assist with writing that after losing many hours to troubleshooting this issue!
You should configure the OOM killer to send a SIGQUIT or SIGTERM if you want fast recovery of a worker. Another way is to reduce the timeout; that way the arbiter will restart the worker faster. Sending SIGKILL to a worker is not a good idea if you want gunicorn to catch the signal quickly: SIGKILL can't be caught, so the arbiter only detects the death once the worker has failed to notify that it's alive.
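As an illustration of the second suggestion, here is a minimal `gunicorn.conf.py` sketch for the reproduction above (the file name and values are assumptions; the point is only that a timeout far below the 20000 seconds used in the Dockerfile lets the arbiter react to a missing heartbeat much sooner):

```python
# gunicorn.conf.py (hypothetical) -- run with: gunicorn -c gunicorn.conf.py app_oom:app
bind = "0.0.0.0:8888"
workers = 1
loglevel = "debug"

# A worker that stops sending heartbeats for this many seconds is killed and
# replaced by the arbiter; 30 s instead of 20000 s means a much faster restart
# after a worker silently disappears.
timeout = 30
graceful_timeout = 30
```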
These issues in Kubernetes are related to this one:
In any case I don't think it's an issue specific to Gunicorn. I would suggest increasing the memory of your pods if it happens that you often go over the limit.

Maybe the documentation could also be augmented with some information about it; any patch describing that would be great :)
I'm writing documentation update now, PR to follow shortly.
I am thinking we could also have the worker killed before the OOM killer does it, by monitoring its memory. This would improve the whole thing. I will open an issue about it.
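There is no such built-in monitor today, but as a rough illustration of the idea, a hypothetical `gunicorn.conf.py` hook could check the worker's memory after each request and ask for a graceful restart before the kernel steps in. Note this only helps when memory accumulates across requests, not when a single request blows past the limit as in the reproduction above; the threshold and names here are assumptions:

```python
# Hypothetical sketch, not an existing gunicorn feature.
# Assumes Linux, where resource.getrusage reports ru_maxrss in KiB.
import os
import resource
import signal

MAX_WORKER_RSS_KIB = 400 * 1024  # assumed ceiling, kept below the 500m container limit

def post_request(worker, req, environ, resp):
    # Peak resident set size of this worker process so far.
    rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if rss_kib > MAX_WORKER_RSS_KIB:
        worker.log.warning("worker RSS %s KiB over limit, restarting", rss_kib)
        # SIGTERM, unlike SIGKILL, is handled by the worker, so it can shut
        # down cleanly and the arbiter will boot a replacement.
        os.kill(os.getpid(), signal.SIGTERM)
```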
@benoitc How do you feel about an additional log entry when a worker was killed due to a signal? Proposal: #2475
Closed in #2475.
Most helpful comment
https://github.com/benoitc/gunicorn/pull/2219