Here is a reproducible example of the problem mentioned in https://github.com/benoitc/gunicorn/issues/1663.

`app_oom.py`:
```python
#!flask/bin/python
import logging
import traceback
from flask import Flask, jsonify

app = Flask(__name__)
app.logger.addHandler(logging.StreamHandler())
app.logger.setLevel('DEBUG')

@app.route('/hello_oom', methods=['GET'])
def hello_oom():
    try:
        huge_list = []
        while True:
            huge_list.extend([1] * 100000)
        logging.info('Success')
        return jsonify({'output': 'Success'}), 201
    except:
        logging.error(traceback.format_exc())
        return jsonify({'error': traceback.format_exc(), 'error_code': 10}), 500  # Unknown error

if __name__ == '__main__':
    app.run(host='0.0.0.0', debug=False)
```
`Dockerfile.oom`:
```dockerfile
FROM ubuntu:16.04
RUN apt update -y \
    && apt install -y gcc python3-pip python3-dev \
    && pip3 install flask==1.1.1 gunicorn==19.9.0
COPY app_oom.py app_oom.py
EXPOSE 8888
CMD ["gunicorn", "-b", "0.0.0.0:8888", "--timeout", "20000", "--workers", "1", "--log-level", "debug", "app_oom:app"]
```
```sh
docker build -t app_oom . -f Dockerfile.oom
docker run -ti -p 8888:8888 -m 500m --name app_oom --rm app_oom
```
When I make a request, there is no error message in the logs, only a message about a new worker booting:
```
[2019-12-06 14:48:56 +0000] [12] [INFO] Booting worker with pid: 12
[2019-12-06 14:48:56 +0000] [1] [DEBUG] 1 workers
[2019-12-06 14:48:58 +0000] [12] [DEBUG] GET /hello_oom
[2019-12-06 14:49:32 +0000] [14] [INFO] Booting worker with pid: 14
```
And the client gets no response (using `curl http://0.0.0.0:8888/hello_oom`):

```
curl: (52) Empty reply from server
```
@benoitc
I've just been hit by exactly the same issue with a Django app deployed on Kubernetes. In my case I had set a low memory limit, which caused the OOM killer to be invoked once that limit was exceeded.

Do you see evidence of this if you run `dmesg` on the test machine?
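A quick way to confirm OOM killer involvement is to scan the kernel log for its messages. A minimal sketch, assuming a Linux host where `dmesg` is readable by the current user (the exact wording of the kernel messages varies by kernel version):

```python
# Scan dmesg output for kernel OOM killer activity (sketch, Linux only).
import re
import subprocess

kernel_log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in kernel_log.splitlines():
    if re.search(r"out of memory|oom-kill|killed process", line, re.IGNORECASE):
        print(line)
```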
From my initial findings I'm not sure what gunicorn can do here: the OOM killer sends a SIGKILL, which isn't possible to handle or do cleanup from, so the process silently disappearing is completely expected. The code path which writes the "Booting worker" message is simply a loop that spawns a new worker whenever the worker count is lower than expected - https://github.com/benoitc/gunicorn/blob/d765f0d123fff5da0f36da8f087a8dd0da778411/gunicorn/arbiter.py#L615
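To illustrate why the worker can't log anything on its way out, here is a small self-contained sketch (not gunicorn code) showing that Python can register handlers for SIGTERM or SIGQUIT but not for SIGKILL:

```python
import signal

def handler(signum, frame):
    print(f"caught signal {signum}, could flush logs / clean up here")

# SIGTERM and SIGQUIT can be intercepted, so a process gets a chance to react.
signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGQUIT, handler)

# SIGKILL cannot: the kernel terminates the process without giving it any say.
try:
    signal.signal(signal.SIGKILL, handler)
except OSError as exc:
    print(f"registering a SIGKILL handler fails: {exc}")
```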
I'll wait for others to comment in case I have this wrong, but I think this is a documentation issue: this failure mode should be explained. I'd be happy to assist with writing that after losing many hours to troubleshooting this issue!
You should configure the OOM killer to send a SIGQUIT or SIGTERM if you want fast recovery of a worker. Another way is to reduce the timeout; that way the arbiter will restart the worker faster. Sending SIGKILL to a worker is not a good idea if you want gunicorn to catch the signal quickly: SIGKILL can't be caught, so the arbiter only detects the death once the worker has failed to notify that it's alive.
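As an illustration of the second suggestion, here is a minimal `gunicorn.conf.py` sketch for the reproduction above (the file name and values are assumptions; the point is only that a timeout far below the 20000 seconds used in the Dockerfile lets the arbiter react to a missing heartbeat much sooner):

```python
# gunicorn.conf.py (hypothetical) -- run with: gunicorn -c gunicorn.conf.py app_oom:app
bind = "0.0.0.0:8888"
workers = 1
loglevel = "debug"

# A worker that stops sending heartbeats for this many seconds is killed and
# replaced by the arbiter; 30 s instead of 20000 s means a much faster restart
# after a worker silently disappears.
timeout = 30
graceful_timeout = 30
```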
These issues in Kubernetes are related to this one:
In any case I don't think it's an issue specific to Gunicorn. I would suggest increasing the memory of your pods if it happens that you often go over the limit.

Maybe the documentation could also be augmented with some information about it; any patch describing that would be great :)
I'm writing documentation update now, PR to follow shortly.
I am thinking we could also have the worker killed before the OOM killer does it, by monitoring its memory. This would improve the whole thing. I will open an issue about it.
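There is no such built-in monitor today, but as a rough illustration of the idea, a hypothetical `gunicorn.conf.py` hook could check the worker's memory after each request and ask for a graceful restart before the kernel steps in. Note this only helps when memory accumulates across requests, not when a single request blows past the limit as in the reproduction above; the threshold and names here are assumptions:

```python
# Hypothetical sketch, not an existing gunicorn feature.
# Assumes Linux, where resource.getrusage reports ru_maxrss in KiB.
import os
import resource
import signal

MAX_WORKER_RSS_KIB = 400 * 1024  # assumed ceiling, kept below the 500m container limit

def post_request(worker, req, environ, resp):
    # Peak resident set size of this worker process so far.
    rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if rss_kib > MAX_WORKER_RSS_KIB:
        worker.log.warning("worker RSS %s KiB over limit, restarting", rss_kib)
        # SIGTERM, unlike SIGKILL, is handled by the worker, so it can shut
        # down cleanly and the arbiter will boot a replacement.
        os.kill(os.getpid(), signal.SIGTERM)
```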
@benoitc How do you feel about an additional log entry when a worker was killed due to a signal? Proposal: #2475
Closed in #2475.
Most helpful comment
https://github.com/benoitc/gunicorn/pull/2219