Celery worker getting stuck and consuming a lot of resident memory.
Version: celery 3.1.17
Strace
celery]# strace -p 8401
Process 8401 attached - interrupt to quit
read(10,
celery]# lsof -n -p 8401 | egrep -v '(DIR|REG)'
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 8401 dsl 0r FIFO 0,8 0t0 124716100 pipe
python 8401 dsl 1w FIFO 0,8 0t0 124716101 pipe
python 8401 dsl 2w FIFO 0,8 0t0 124716101 pipe
python 8401 dsl 6r FIFO 0,8 0t0 124716462 pipe
python 8401 dsl 7w FIFO 0,8 0t0 124716462 pipe
python 8401 dsl 8r FIFO 0,8 0t0 124716463 pipe
python 8401 dsl 9w FIFO 0,8 0t0 124716463 pipe
python 8401 dsl 10r FIFO 0,8 0t0 124716464 pipe
python 8401 dsl 13w FIFO 0,8 0t0 124716465 pipe
python 8401 dsl 14r FIFO 0,8 0t0 124716466 pipe
python 8401 dsl 15r CHR 1,3 0t0 3662 /dev/null
python 8401 dsl 16w FIFO 0,8 0t0 124716467 pipe
Pstack dump
celery]# pstack 8401
It's stuck on a pipe read.
proc]# ls -l /proc/8401/fd
total 0
lr-x------ 1 dsl dsl 64 May 5 17:26 0 -> pipe:[124716100]
l-wx------ 1 dsl dsl 64 May 5 17:26 1 -> pipe:[124716101]
lr-x------ 1 dsl dsl 64 May 5 17:26 10 -> pipe:[124716464]
l-wx------ 1 dsl dsl 64 May 5 17:26 13 -> pipe:[124716465]
@girishbin
Are you using redis as the broker? Seeing a similar symptom with the redis broker on celery 3.1.8 and billiard 3.3.0.16. No high memory consumption though.
Same here. @joostdevries it happens quite often to us; hard to say under what conditions. We have 4 workers using the redis backend.
Workers' log before they get stuck:
[2015-08-07 16:50:40,140: INFO/MainProcess] Task feeds.transformers.rss_atom.by[6dbd5f0d-222b-4c5c-bd22-5e05bb63447b] succeeded in 0.0153002970037s: {}
[2015-08-07 16:50:40,141: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[8a582425-456d-4e49-93c8-eb375967cac5]
[2015-08-07 16:50:40,155: INFO/MainProcess] Task feeds.transformers.rss_atom.by[3f60a721-2a6e-4494-bda7-b7b939efe66a] succeeded in 0.00693402900652s: {}
[2015-08-07 16:50:40,157: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[486cc62d-d330-467f-b30c-02005c2038b6]
[2015-08-07 16:50:40,171: INFO/MainProcess] Task feeds.transformers.rss_atom.by[8a582425-456d-4e49-93c8-eb375967cac5] succeeded in 0.0071912699932s: {}
[2015-08-07 16:50:40,173: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[538a71a6-0c33-494a-adfe-575f7465e9d4]
[2015-08-07 16:50:40,188: INFO/MainProcess] Task feeds.transformers.rss_atom.by[486cc62d-d330-467f-b30c-02005c2038b6] succeeded in 0.0155014329939s: {}
[2015-08-07 16:50:40,189: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[2ad9294a-7553-468a-a3dc-73a8f2cea188]
[2015-08-07 16:50:40,203: INFO/MainProcess] Task feeds.transformers.rss_atom.by[538a71a6-0c33-494a-adfe-575f7465e9d4] succeeded in 0.0153862849984s: {}
[2015-08-07 16:50:40,205: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[f7371090-aa05-4194-a327-cb41d1165b7e]
[2015-08-07 16:50:40,220: INFO/MainProcess] Task feeds.transformers.rss_atom.by[2ad9294a-7553-468a-a3dc-73a8f2cea188] succeeded in 0.0158518639946s: {}
[2015-08-07 16:50:40,222: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[6d35c0b2-9c5a-425f-9405-9f7e1fb3aa41]
[2015-08-07 16:50:40,236: INFO/MainProcess] Task feeds.transformers.rss_atom.by[f7371090-aa05-4194-a327-cb41d1165b7e] succeeded in 0.00751440098975s: {}
[2015-08-07 16:50:40,238: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[ed664157-a065-4b15-9bbe-633edf96d230]
[2015-08-07 16:50:40,252: INFO/MainProcess] Task feeds.transformers.rss_atom.by[6d35c0b2-9c5a-425f-9405-9f7e1fb3aa41] succeeded in 0.00709322700277s: {}
[2015-08-07 16:50:40,254: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[683bc593-44c1-4145-b698-1f2ba66a43bd]
[2015-08-07 16:50:40,260: INFO/MainProcess] Task feeds.transformers.rss_atom.by[ed664157-a065-4b15-9bbe-633edf96d230] succeeded in 0.015573162993s: {}
[2015-08-07 16:50:40,275: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[d8018f2b-a1f0-4112-97ba-335cd597be1c]
[2015-08-07 16:50:40,282: INFO/MainProcess] Task feeds.transformers.rss_atom.by[683bc593-44c1-4145-b698-1f2ba66a43bd] succeeded in 0.0205264859978s: {}
[2015-08-07 16:50:40,292: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[92f938ce-894a-49f4-8c89-a090c37b71c8]
[2015-08-07 16:50:40,299: INFO/MainProcess] Task feeds.transformers.rss_atom.by[d8018f2b-a1f0-4112-97ba-335cd597be1c] succeeded in 0.016343981988s: {}
[2015-08-07 16:50:40,302: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[ff12882d-61b8-403e-97ac-b096580de5f0]
[2015-08-07 16:50:40,318: INFO/MainProcess] Task feeds.transformers.rss_atom.by[92f938ce-894a-49f4-8c89-a090c37b71c8] succeeded in 0.0184662840038s: {}
[2015-08-07 16:50:40,330: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[3e0ff326-cd83-4d88-8dee-f9c167a44f6a]
[2015-08-07 16:50:40,338: INFO/MainProcess] Task feeds.transformers.rss_atom.by[ff12882d-61b8-403e-97ac-b096580de5f0] succeeded in 0.0187683639961s: {}
[2015-08-07 16:50:40,341: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[f12efd44-b2be-42fd-8872-5c979421bea3]
[2015-08-07 16:50:40,357: INFO/MainProcess] Task feeds.transformers.rss_atom.by[3e0ff326-cd83-4d88-8dee-f9c167a44f6a] succeeded in 0.0182138609962s: {}
[2015-08-07 16:50:40,359: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[57d8ba02-db1f-40d4-88f9-1d3d83939c2a]
[2015-08-07 16:50:40,374: INFO/MainProcess] Task feeds.transformers.rss_atom.by[f12efd44-b2be-42fd-8872-5c979421bea3] succeeded in 0.0165290409932s: {}
[2015-08-07 16:50:40,385: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[8ae76799-d19d-4acf-b2b4-cfe9d77e3000]
[2015-08-07 16:50:40,393: INFO/MainProcess] Task feeds.transformers.rss_atom.by[57d8ba02-db1f-40d4-88f9-1d3d83939c2a] succeeded in 0.0185826079978s: {}
[2015-08-07 16:50:40,405: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[ca5e9d23-f660-4f88-b7d1-d30b91750727]
[2015-08-07 16:50:40,414: INFO/MainProcess] Task feeds.transformers.rss_atom.by[8ae76799-d19d-4acf-b2b4-cfe9d77e3000] succeeded in 0.0205218030023s: {}
[2015-08-07 16:50:40,417: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[5cfdbbe8-3da1-45b5-b83e-599303ad02eb]
[2015-08-07 16:50:40,433: INFO/MainProcess] Task feeds.transformers.rss_atom.by[ca5e9d23-f660-4f88-b7d1-d30b91750727] succeeded in 0.0175381449953s: {}
[2015-08-07 16:50:40,436: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[cb864c28-b3c5-4def-a5de-8c15049c6d29]
[2015-08-07 16:50:40,453: INFO/MainProcess] Task feeds.transformers.rss_atom.by[5cfdbbe8-3da1-45b5-b83e-599303ad02eb] succeeded in 0.0190771400084s: {}
[2015-08-07 16:50:40,456: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[cc85e444-7969-45ed-98a4-f9ab074db260]
[2015-08-07 16:50:40,473: INFO/MainProcess] Task feeds.transformers.rss_atom.by[cb864c28-b3c5-4def-a5de-8c15049c6d29] succeeded in 0.0191195910011s: {}
[2015-08-07 16:50:40,476: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[4696c784-819d-4278-87fd-b74c2aab4c57]
[2015-08-07 16:50:40,491: INFO/MainProcess] Task feeds.transformers.rss_atom.by[cc85e444-7969-45ed-98a4-f9ab074db260] succeeded in 0.0098425810138s: {}
[2015-08-07 16:50:40,494: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[bce5ad3e-dd4a-4bbc-bb69-9585597ae010]
[2015-08-07 16:50:40,501: INFO/MainProcess] Task feeds.transformers.rss_atom.by[4696c784-819d-4278-87fd-b74c2aab4c57] succeeded in 0.0174191809929s: {}
[2015-08-07 16:50:40,512: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[cfefb0bb-399c-423e-9032-add27dabd6df]
[2015-08-07 16:50:40,526: INFO/MainProcess] Task feeds.transformers.rss_atom.by[bce5ad3e-dd4a-4bbc-bb69-9585597ae010] succeeded in 0.0162074130058s: {}
[2015-08-07 16:50:40,528: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[51f11154-3dbc-448a-a3a4-00c2fe7ab782]
[2015-08-07 16:50:40,543: INFO/MainProcess] Task feeds.transformers.rss_atom.by[cfefb0bb-399c-423e-9032-add27dabd6df] succeeded in 0.0163563999959s: {}
[2015-08-07 16:50:40,545: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[fb900a0c-403f-4941-aa6a-f242312e5d85]
[2015-08-07 16:50:40,560: INFO/MainProcess] Task feeds.transformers.rss_atom.by[51f11154-3dbc-448a-a3a4-00c2fe7ab782] succeeded in 0.0162344239943s: {}
[2015-08-07 16:50:40,563: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[330fff18-71e0-4d9e-9ad1-0111d44f7e03]
No CPU usage; memory usage as follows (output from top):
31059 celery 20 0 517176 64388 3456 S 0.0 1.7 0:14.86 celery
31058 celery 20 0 517204 64308 3456 S 0.0 1.7 0:14.32 celery
31062 celery 20 0 517044 64308 3456 S 0.0 1.7 0:14.88 celery
31061 celery 20 0 516912 63508 3064 S 0.0 1.7 0:14.32 celery
31046 celery 20 0 369344 63396 6964 S 0.0 1.7 1:09.70 celery
16967 celery 20 0 366648 57396 7612 S 0.0 1.6 0:04.04 celery
Will try setting BROKER_TRANSPORT_OPTIONS = {'socket_timeout': 10} and see if it helps.
@domenkozar Did it help?
Maybe it's not:
[root@ip-172-30-0-183 ec2-user]# lsof -p 9253|grep 6379
celery 9253 celery 10u IPv4 2738405 0t0 TCP ip-172-30-0-183.ec2.internal:37155->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 9253 celery 23u IPv4 2737905 0t0 TCP ip-172-30-0-183.ec2.internal:37090->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 9253 celery 33u IPv4 2737956 0t0 TCP ip-172-30-0-183.ec2.internal:37098->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 9253 celery 36u IPv4 2737990 0t0 TCP ip-172-30-0-183.ec2.internal:37105->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 9253 celery 46u IPv4 2739860 0t0 TCP ip-172-30-0-183.ec2.internal:37297->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
[root@ip-172-30-0-183 ec2-user]# lsof -p 9253|grep CLOSE_WAIT
celery 9253 celery 11u IPv4 2739163 0t0 TCP ip-172-30-0-183.ec2.internal:45206->wordpress.com:http (CLOSE_WAIT)
celery 9253 celery 16u IPv4 2737802 0t0 TCP ip-172-30-0-183.ec2.internal:53865->205.251.242.33:https (CLOSE_WAIT)
celery 9253 celery 32u IPv4 2764510 0t0 TCP ip-172-30-0-183.ec2.internal:48242->ec2-52-2-102-195.compute-1.amazonaws.com:http (CLOSE_WAIT)
celery 9253 celery 37u IPv4 2739073 0t0 TCP ip-172-30-0-183.ec2.internal:45198->wordpress.com:http (CLOSE_WAIT)
celery 9253 celery 38u IPv4 2738052 0t0 TCP ip-172-30-0-183.ec2.internal:36962->230-156-220-74-available.ilandcloud.com:http (CLOSE_WAIT)
celery 9253 celery 39u IPv4 2738313 0t0 TCP ip-172-30-0-183.ec2.internal:46067->qb-in-f118.1e100.net:http (CLOSE_WAIT)
celery 9253 celery 40u IPv4 2739202 0t0 TCP ip-172-30-0-183.ec2.internal:43252->r2.ycpi.vip.dcb.yahoo.net:http (CLOSE_WAIT)
celery 9253 celery 42u IPv4 2739382 0t0 TCP ip-172-30-0-183.ec2.internal:45228->wordpress.com:http (CLOSE_WAIT)
celery 9253 celery 43u IPv4 2739488 0t0 TCP ip-172-30-0-183.ec2.internal:38920->wordpress.com:http (CLOSE_WAIT)
celery 9253 celery 45u IPv4 2739667 0t0 TCP ip-172-30-0-183.ec2.internal:57721->ec2-54-165-198-100.compute-1.amazonaws.com:https (CLOSE_WAIT)
Flower is showing all workers as offline, workers appear to be just waiting for a task.
New hypothesis: I think the major culprit was that redis ran out of memory (there are some OOM traces), and the master is unable to send new tasks to workers.
The strace you are seeing looks normal to me; is it not just waiting for more tasks? I guess it's possible that it was sent incomplete data (seeing as there is something in the buffer).
Yes, workers appear to just be waiting for more tasks, so something is going on with the master celery process. The task queue has 100k items.
So I've cleared the celery redis queue, and celery beat on the master process is not scheduling new tasks. Looking at gdb, it's doing what it should:
(gdb) py-list
115 buf.seek(self.bytes_written)
116 marker = 0
117
118 try:
119 while True:
>120 data = self._sock.recv(socket_read_size)
121 # an empty string indicates the server shutdown the socket
122 if isinstance(data, bytes) and len(data) == 0:
123 raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
124 buf.write(data)
125 data_length = len(data)
Aha, found which of the connections was the problem, from CLIENT LIST:
id=497 addr=172.30.0.183:45591 fd=20 name= age=1142 idle=1142 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL
It's interesting that it shows cmd=NULL, and the connection is coming from the master celery worker. @ask any ideas?
So the way I see it, the redis connection pool gets exhausted. If I kill some random connections on the redis side, they go to the CLOSE_WAIT state, but the celery master process keeps them open.
We have many, many small tasks, so tasks are getting picked up really fast.
You can see there are 11 connections, and 2 of them were killed from the redis side with the CLIENT KILL command:
[root@ip-172-30-0-183 ec2-user]# lsof -p 24747|grep TCP
celery 24747 celery 10u IPv4 3619402 0t0 TCP ip-172-30-0-183.ec2.internal:45535->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 11u IPv4 3619425 0t0 TCP ip-172-30-0-183.ec2.internal:45544->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 16u sock 0,6 0t0 3113866 protocol: TCP
celery 24747 celery 34u IPv4 3619406 0t0 TCP ip-172-30-0-183.ec2.internal:45537->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 35u IPv4 3619421 0t0 TCP ip-172-30-0-183.ec2.internal:45543->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 36u IPv4 3619417 0t0 TCP ip-172-30-0-183.ec2.internal:45542->ip-172-30-3-169.ec2.internal:6379 (CLOSE_WAIT)
celery 24747 celery 37u IPv4 3619431 0t0 TCP ip-172-30-0-183.ec2.internal:45545->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 38u IPv4 3619435 0t0 TCP ip-172-30-0-183.ec2.internal:45546->ip-172-30-3-169.ec2.internal:6379 (CLOSE_WAIT)
celery 24747 celery 39u IPv4 3619437 0t0 TCP ip-172-30-0-183.ec2.internal:45547->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 40u IPv4 3703870 0t0 TCP ip-172-30-0-183.ec2.internal:46884->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 41u IPv4 3619441 0t0 TCP ip-172-30-0-183.ec2.internal:45549->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 42u IPv4 3619443 0t0 TCP ip-172-30-0-183.ec2.internal:45550->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
So it's also not the connection pool being exhausted. I've increased it to 20, and currently it got stuck with 8 connections.
This is the traceback that happens when I kill the cmd=NULL connection:
[2015-09-06 11:34:44,336: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/opt/cmgd/venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 278, in start
blueprint.start(self)
File "/opt/cmgd/venv/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
step.start(parent)
File "/opt/cmgd/venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 821, in start
c.loop(*c.loop_args())
File "/opt/cmgd/venv/lib/python2.7/site-packages/celery/worker/loops.py", line 70, in asynloop
next(loop)
File "/opt/cmgd/venv/lib/python2.7/site-packages/kombu/async/hub.py", line 267, in create_loop
tick_callback()
File "/opt/cmgd/venv/lib/python2.7/site-packages/kombu/transport/redis.py", line 942, in on_poll_start
[add_reader(fd, on_readable, fd) for fd in cycle.fds]
File "/opt/cmgd/venv/lib/python2.7/site-packages/kombu/async/hub.py", line 201, in add_reader
return self.add(fds, callback, READ | ERR, args)
File "/opt/cmgd/venv/lib/python2.7/site-packages/kombu/async/hub.py", line 152, in add
self.poller.register(fd, flags)
File "/opt/cmgd/venv/lib/python2.7/site-packages/kombu/utils/eventio.py", line 78, in register
self._epoll.register(fd, events)
IOError: [Errno 9] Bad file descriptor
[2015-09-06 11:34:44,339: WARNING/MainProcess] Restoring 1 unacknowledged message(s).
[2015-09-06 11:34:44,398: INFO/MainProcess] Connected to redis://mas-re-13j3y6r7dd9ui.vfavzi.0001.use1.cache.am
I think I found a way for everyone to reproduce this: schedule a few thousand dummy tasks that don't do anything.
4 workers process around 170 tasks/s and then get stuck.
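For illustration, a minimal reproduction along those lines might look like this (module name, broker URL and task count are assumptions, not something from this thread):
from celery import Celery

app = Celery('repro', broker='redis://localhost:6379/0')

@app.task
def noop():
    # Dummy task that does nothing.
    return None

if __name__ == '__main__':
    # Enqueue a few thousand no-op tasks as fast as possible.
    for _ in range(5000):
        noop.delay()
Run a few workers against it (for example: celery -A repro worker -c 4) and watch whether they stall after some tens of thousands of tasks.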
And flower graphs showing the hang: http://i.imgur.com/1q0A8DN.png
We're seeing the same problems here.
RabbitMQ does the job just fine at the moment.
We prefer redis for performance reasons.
Our workers hang after a short period with redis; no error messages or anything that looks suspicious.
Setting socket_timeout did not resolve the issue.
@domenkozar What configuration are you using? I execute 100k tasks using redis often with the stress test suite.
If it's consuming a lot of memory, do you have the fanout_prefix and fanout_patterns transport options set? http://docs.celeryproject.org/en/latest/getting-started/brokers/redis.html#caveats
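For reference, those caveat settings live in the broker transport options; a sketch of what that looks like (values here are only an example):
BROKER_TRANSPORT_OPTIONS = {
    'fanout_prefix': True,    # see the redis caveats document linked above
    'fanout_patterns': True,
}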
@ask unfortunately I don't have access to that environment anymore, but we didn't change the fanout_* settings.
In my case, I see the worker's main process in a tight loop at 100% CPU while the job-doing processes sit quietly doing nothing.
strace indicates that fd=3 (always the same fd) is being continuously polled, and the poll returns POLLIN | POLLHUP.
strace -p 9988
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
lsof says:
celeryd: 9988 sentry 3u sock 0,6 0t0 566337941 can't identify protocol
I've read somewhere this is an indication of a half-closed socket connection.
But that's as far as I've gotten. Reading this thread gives me more hints to look into the next time the issue kicks in.
I'm on the same project that @domenkozar was working on, and we're seeing this problem again. strace shows the worker processes hanging in a read() syscall.
Interestingly, the hangs went away for months and only started to recur recently when we added a bunch more work to the system by processing more tasks, so in some way it's timing- or load-dependent.
Package versions: amqp 1.4.9, billiard 3.3.0.23, kombu 3.0.35, celery 3.1.23.
How should I go about debugging this? Should I be trying to instrument the socket operations in billiard, in kombu, or in celery?
@akuchling can you use redis-cli and post the output of CLIENT LIST?
Also, I'd double-check that all servers are using kombu 3.0.35.
The kombu version is in fact 3.0.35. CLIENT LIST returns the following: http://dpaste.com/2GGCECS (IP addresses slightly redacted).
Interesting! In that dpaste, there isn't a client listed where cmd=NULL, so maybe this is a new failure mode. It doesn't seem to me that we're filling up Redis's memory at the moment; our data is around 3 GB on a 6 GB AWS machine.
We do have socket_timeout set to 10, but that doesn't fix the problem. Instead, celery ends up doing a few tasks every 10 seconds and then freezes again. Note how the seconds figure is incrementing (37, 47, 57) in this dpaste: http://dpaste.com/2AZYEH4
I found myself looking at on_stop_not_started() in celery/concurrency/asynpool.py because it showed up in a GDB stack trace, and was puzzled by the use of pending_remove_fd.
It's initially given the value of an empty set. Then the code does for fd in outqueues: self._flush_outqueue(fd, pending_remove_fd.discard, ...) and afterwards removes the contents of pending_remove_fd from outqueues.
Bug: the set is initially empty, and _flush_outqueue() is only ever calling pending_remove_fd.discard(). So how does the pending_remove_fd set ever get filled in? Was it supposed to be initialized as set(outqueues) to make a copy of the starting set of fds? Or have I misread something?
@akuchling good catch, I'm guessing it should be pending_remove_fds.add instead of pending_remove_fds.discard there.
So at the end of that loop it will remove the flushed outqueues when it does the outqueues.difference_update(pending_remove_fd).
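To make that concrete, here is a minimal sketch of the pattern being discussed (simplified names, not the actual celery source):
def flush_all(outqueues, flush_outqueue):
    pending_remove_fd = set()  # starts out empty
    for fd in outqueues:
        # Passing .discard as the callback means nothing is ever recorded in
        # the set; passing .add collects each flushed fd for removal below.
        flush_outqueue(fd, pending_remove_fd.add)
    # Drop every flushed fd from the set of queues still to be handled.
    outqueues.difference_update(pending_remove_fd)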
Thanks! Should this change also be committed to the 3.1 branch (which I've done in a fork so that I can test it)?
Unfortunately the pending_remove_fd patch doesn't fix all hanging for us, though it doesn't seem to cause any problems. With the patch, I've also gotten a worker/master hanging: the master is doing a read() from the worker, and the worker is in a read() in Billiard_conn_recv(), being called from pickle.load(). Any suggestions for what this might be?
Closing this, as we don't have the resources to complete this task.
It may be fixed in master; let's see if it comes back after the 4.0 release.
@akuchling @domenkozar - was the discussion on this issue continued anywhere / did you find any resolution?
Trawling through the issues list, there's a lot of subjective "something is broken", but this appears to be the closest to the issues we're seeing.
Sadly none of us work at the company anymore where this was happening in production.
@akuchling took over my previous efforts; maybe he can give the latest update on whether this bug was really fixed or not.
@JonPeel: I don't remember if we ever found a resolution. To summarize the changes we applied:
BROKER_TRANSPORT_OPTIONS = {
    'socket_timeout': 10,  # timeout in case of network errors
    'fanout_prefix': True,
    'fanout_patterns': True,
}
@JonPeel Now I'm running 4.0.22 (and planning to update to 4.1). I haven't seen this issue anymore.
I do find, from time to time, one worker with -c1 (for serialization) being unresponsive after reaching the maximum number of jobs (2000). It seems that when trying to kill and respawn the child worker it gets stuck. But I haven't explored deeply. Oddly enough, restarting the worker makes it work OK, even recycling the child worker after reaching another 2000 jobs. So I don't think it's related to Celery but to our app.
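For context, the setup described above roughly corresponds to a configuration like this (a sketch with assumed names; the 2000-job limit is the only figure taken from the comment):
from celery import Celery

# Hypothetical app; the broker URL is an assumption.
app = Celery('myapp', broker='redis://localhost:6379/0')
app.conf.worker_concurrency = 1              # the -c1 worker mentioned above
app.conf.worker_max_tasks_per_child = 2000   # child is recycled after 2000 tasks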
Thanks for the replies; will open a new ticket if our issue re-occurs. (For reference: Celery 4.1, Redis broker and result backend, redis-py 2.10.5, Redis 3.2.0, SSL enabled.)
From what I've seen, we're getting workers stuck listening on connections that are doing nothing (in the CLOSE_WAIT state), and a restart resumes business as usual.
From that, https://github.com/andymccurdy/redis-py/issues/306 looks more relevant to us than it was above. If so, it's a) an issue in redis-py, not celery, and b) seems to be positively affected by the socket_timeout in BROKER_TRANSPORT_OPTIONS as suggested above. A restart was required every 24 hours or so; it no longer appears to be a problem.
I am running into this issue on:
Celery 4.1.0
redis-py 2.10.6
Celery hangs in redis-py's connection.py on return sock.recv(*args, **kwargs). Here is the stack:
[2017-11-24 04:11:08,423: WARNING/MainProcess] File "/usr/bin/celery", line 11, in <module>
sys.exit(main())
[2017-11-24 04:11:08,423: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/__main__.py", line 14, in main
_main()
[2017-11-24 04:11:08,423: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/celery.py", line 326, in main
cmd.execute_from_commandline(argv)
[2017-11-24 04:11:08,424: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/celery.py", line 488, in execute_from_commandline
super(CeleryCommand, self).execute_from_commandline(argv)))
[2017-11-24 04:11:08,425: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/base.py", line 281, in execute_from_commandline
return self.handle_argv(self.prog_name, argv[1:])
[2017-11-24 04:11:08,425: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/celery.py", line 480, in handle_argv
return self.execute(command, argv)
[2017-11-24 04:11:08,425: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/celery.py", line 412, in execute
).run_from_argv(self.prog_name, argv[1:], command=argv[0])
[2017-11-24 04:11:08,426: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/worker.py", line 221, in run_from_argv
return self(*args, **options)
[2017-11-24 04:11:08,426: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/base.py", line 244, in __call__
ret = self.run(*args, **kwargs)
[2017-11-24 04:11:08,427: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/worker.py", line 256, in run
worker.start()
[2017-11-24 04:11:08,427: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
[2017-11-24 04:11:08,428: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
[2017-11-24 04:11:08,428: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bootsteps.py", line 370, in start
return self.obj.start()
[2017-11-24 04:11:08,429: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 320, in start
blueprint.start(self)
[2017-11-24 04:11:08,429: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
[2017-11-24 04:11:08,429: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
[2017-11-24 04:11:08,430: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/worker/loops.py", line 88, in asynloop
next(loop)
[2017-11-24 04:11:08,430: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/async/hub.py", line 354, in create_loop
cb(*cbargs)
[2017-11-24 04:11:08,431: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/transport/redis.py", line 1040, in on_readable
self.cycle.on_readable(fileno)
[2017-11-24 04:11:08,431: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/transport/redis.py", line 337, in on_readable
chan.handlers[type]()
[2017-11-24 04:11:08,431: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/transport/redis.py", line 667, in _receive
ret.append(self._receive_one(c))
[2017-11-24 04:11:08,432: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/transport/redis.py", line 678, in _receive_one
response = c.parse_response()
[2017-11-24 04:11:08,432: WARNING/MainProcess] File "/usr/local/cedar/utils/redis/client.py", line 2448, in parse_response
return self._execute(connection, connection.read_response)
[2017-11-24 04:11:08,432: WARNING/MainProcess] File "/usr/local/cedar/utils/redis/client.py", line 2437, in _execute
return command(*args)
[2017-11-24 04:11:08,433: WARNING/MainProcess] File "/usr/local/cedar/utils/redis/connection.py", line 604, in read_response
response = self._parser.read_response()
[2017-11-24 04:11:08,433: WARNING/MainProcess] File "/usr/local/cedar/utils/redis/connection.py", line 259, in read_response
response = self._buffer.readline()
[2017-11-24 04:11:08,434: WARNING/MainProcess] File "/usr/local/cedar/utils/redis/connection.py", line 185, in readline
traceback.print_stack()
This happens when a connection is reset by a timeout (a load balancer in my case). An error event is sent on the socket, but then celery tries to read from the socket. Since the socket is closed, no data is available and it hangs forever. It specifically happens on celery's pubsub redis connection, since that is a long-running connection.
A socket_timeout param will fix the blocking issue and will throw an error after the timeout, but there is an underlying issue: celery should not read from a closed socket.
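As a sketch of that workaround (using the Celery 4 lowercase setting name; the app object and the value are assumptions):
from celery import Celery

app = Celery('myapp', broker='redis://localhost:6379/0')
app.conf.broker_transport_options = {
    # Raise an error instead of blocking forever if the broker socket dies.
    'socket_timeout': 30,
}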
The imperfect fix I made was to use this patch in redis-py: https://github.com/andymccurdy/redis-py/pull/886
as well as this custom fix for pubsub: https://github.com/cedar-team/redis-py/compare/f9348968dec581830a32e8469228c74ce443bd88...cedar-team:6aae12c96b733ad0b6e896001f0cf576fa26280a
+1. Hitting this issue as well: strace shows it blocked on recvfrom, and cmd=NULL in Redis for the connection.
@ericholscher If you're using hiredis, try without it. See #4321 and #3898.
@mvaled We are -- I will try without it. Thanks.
Reporting back: removing hiredis seems to have fixed the issue. Thanks! 👍
Thanks, Eric.