Celery: Worker hangs and stops responding after an undetermined amount of time (sometimes).

Created on 10 Oct 2017 · 50 comments · Source: celery/celery

This is an ongoing investigation. What I'm putting here is what I can see so far.

I have seen this problem since 4.0.2, and now I've just spotted it in 4.1.0. These are the versions of the packages I'm using:

kombu = 4.1.0
celery = 4.1.0
amqp = 2.2.1
billiard = 3.5.0.3
redis = 2.10.6

The problem seems to happen with a worker running with -c 1 and worker_max_tasks_per_child=2000. This worker also processes tasks from a single queue, 'odoo-8.0.cdr'. Just after the 2000th task it hangs. Strangely, the issue does not happen every time the worker reaches this threshold, but when it does happen, it happens exactly at that point.
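
For context, a minimal sketch of the setup described above (the app/module name and the broker URL are illustrative assumptions, not the real project code; the worker command line is the one shown in the ps output further down):

# Minimal sketch of the configuration described above; names are illustrative.
from celery import Celery

app = Celery('odoo', broker='redis://localhost:6379/9')  # redis db 9, as in the redis-cli calls below
app.conf.worker_max_tasks_per_child = 2000  # the worker process is recycled after 2000 tasks

# The worker itself is started roughly as:
#   celery worker -l INFO -n cdr-1@%h -c1 -Q odoo-8.0.cdr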

[screenshot from 2017-10-10 12-02-08]

This is a production server so I can't really test with master branch. I'll try to reproduce on a development box later.

Stracing the worker yields:

root@mercurio:~# strace -p 9105
Process 9105 attached
recvfrom(27, 

I waited for 5+ minutes and it didn't receive anything. The file descriptors:

[celeryd: 9105 mercurio   13u  sock       0,6      0t0 273693676 can't identify protocol
[celeryd: 9105 mercurio   14u  sock       0,6      0t0 273693673 can't identify protocol
[celeryd: 9105 mercurio   15u  0000       0,9        0      4047 [eventpoll]
[celeryd: 9105 mercurio   16u  sock       0,6      0t0 273693679 can't identify protocol
[celeryd: 9105 mercurio   17u  0000       0,9        0      4047 [eventpoll]
[celeryd: 9105 mercurio   18u  0000       0,9        0      4047 [eventpoll]
[celeryd: 9105 mercurio   19u  sock       0,6      0t0 273693682 can't identify protocol
[celeryd: 9105 mercurio   20u  sock       0,6      0t0 273693685 can't identify protocol
[celeryd: 9105 mercurio   21u  sock       0,6      0t0 273693942 can't identify protocol
[celeryd: 9105 mercurio   22u  sock       0,6      0t0 273693945 can't identify protocol
[celeryd: 9105 mercurio   23u  sock       0,6      0t0 273693948 can't identify protocol
[celeryd: 9105 mercurio   24u  sock       0,6      0t0 273693951 can't identify protocol
[celeryd: 9105 mercurio   25u  IPv4 288763673      0t0       TCP localhost.localdomain:56030->localhost.localdomain:6379 (ESTABLISHED)
[celeryd: 9105 mercurio   26r  FIFO       0,8      0t0 288763672 pipe
[celeryd: 9105 mercurio   27u  IPv4 288763676      0t0       TCP localhost.localdomain:56031->localhost.localdomain:6379 (ESTABLISHED)

show that fd 27 is the redis connection, but there are several fds for which lsof can't identify the protocol. Half-closed connections?
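
As far as I know, lsof prints "can't identify protocol" when it cannot find the socket's inode in any of the /proc/net tables. A rough sketch of how to double-check that for one of the inodes above (illustrative helper, not project code):

# Sketch: list which /proc/net tables (if any) still know about a socket
# inode reported by lsof. An empty result matches lsof's
# "can't identify protocol" output, typically a socket whose peer is gone.
def socket_tables_for_inode(inode):
    layouts = {'tcp': 9, 'tcp6': 9, 'udp': 9, 'udp6': 9, 'unix': 6}  # inode column per table
    hits = []
    for table, column in layouts.items():
        try:
            with open('/proc/net/' + table) as f:
                next(f)  # skip the header line
                for line in f:
                    fields = line.split()
                    if len(fields) > column and fields[column] == str(inode):
                        hits.append(table)
        except IOError:
            continue  # table absent on this kernel (e.g. no IPv6)
    return hits

# e.g. socket_tables_for_inode(273693676) returning [] would be consistent
# with the "can't identify protocol" entries above.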

The ForkPoolWorker strace shows:

root@mercurio:~# strace -p 23943
Process 23943 attached
read(7, 

I also waited 5+ minutes here to see if anything happened. Nothing. lsof shows:

[celeryd: 23943 mercurio    6r   CHR    1,9      0t0     18535 /dev/urandom
[celeryd: 23943 mercurio    7r  FIFO    0,8      0t0 273693670 pipe
[celeryd: 23943 mercurio   10r   CHR    1,9      0t0     18535 /dev/urandom
[celeryd: 23943 mercurio   11w  FIFO    0,8      0t0 273693671 pipe
[celeryd: 23943 mercurio   12r   CHR    1,3      0t0     18531 /dev/null
[celeryd: 23943 mercurio   15u  0000    0,9        0      4047 [eventpoll]
[celeryd: 23943 mercurio   17u  0000    0,9        0      4047 [eventpoll]
[celeryd: 23943 mercurio   26w  FIFO    0,8      0t0 288763672 pipe

So fd 7 is a FIFO pipe (I don't know which one).
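
If it helps, one way to find the other end of that pipe on Linux is to match the pipe inode under /proc (a quick sketch, not project code; needs root to read other processes' fd directories):

# Sketch: find every (pid, fd) pair that holds the same pipe open as
# /proc/<pid>/fd/<fd>, by comparing the 'pipe:[inode]' link targets.
import glob
import os

def pipe_peers(pid, fd):
    target = os.readlink('/proc/%d/fd/%d' % (pid, fd))  # e.g. 'pipe:[273693670]'
    peers = []
    for link in glob.glob('/proc/[0-9]*/fd/*'):
        try:
            if os.readlink(link) == target:
                parts = link.split('/')  # ['', 'proc', '<pid>', 'fd', '<fd>']
                peers.append((int(parts[2]), int(parts[4])))
        except OSError:
            continue  # process went away or fd closed while scanning
    return peers

# pipe_peers(23943, 7) should list the writer side of the pipe fd 7 is blocked on.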

I'm sure that the queue gets a task every minute, since our application issues a new 'odoo-8.0.cdr' task per minute (like a clock). I can see the backlog:

root@mercurio:~# redis-cli -n 9  llen 'odoo-8.0.cdr'
(integer) 4586

One minute later:

root@mercurio:~# redis-cli -n 9  llen 'odoo-8.0.cdr'
(integer) 4587
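
The same check from Python, for reference (a small sketch using redis-py against db 9, the database used in the redis-cli calls above):

# Sketch: watch the queue backlog from Python instead of redis-cli.
import time
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=9)
for _ in range(3):
    print(r.llen('odoo-8.0.cdr'))  # grows by roughly one per minute while the worker hangs
    time.sleep(60)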

To resume operation I have to restart the whole worker; this processes all the backlog (tasks expire after 65s in the queue) and then, most of the time, the worker keeps working even beyond the 2000th task.

Killing the child process shows that the worker does not even reap the child:

root@mercurio:~# ps ax | grep cdr
 9105 ?        S      2:32 [celeryd: cdr-1@mercurio:MainProcess] -active- (celery worker -l INFO -n cdr-1@%h -c1 -Q odoo-8.0.cdr)
23943 ?        Z      0:00 [[celeryd: cdr-1] <defunct>
Labels: Needs Verification, Worker Hangs

Most helpful comment

Nope. But today, when I arrived at work, flower was showing the 'offline' worker again after 2000 tasks.

I vote to keep this open. Maybe I can find a little bit of time to do a deeper investigation.

All 50 comments

@mvaled Did you find any solution?

Nope. But today, when I arrived at work, flower was showing the 'offline' worker again after 2000 tasks.

I vote to keep this open. Maybe I can find a little bit of time to do a deeper investigation.

I have the same problem.

kombu = 4.0.2
celery = 4.0.2
amqp = 2.1.4
billiard = 3.5.0.2
redis = 2.10.5

My environment is -c 16 with worker_max_tasks_per_child=1.
I wonder if this happens when the worker is busy.

Hi @saorio,

Does your task create threads or greenlets? I'm performing a code review of the components involved in the task. I have found some non-daemon threads that I think are not needed (but so far no code path seems to validate my hypothesis). My code base is fairly complex, and I'm on a tight schedule these days, so I won't be able to return to this issue soon.

Sorry for replying so late.
My tasks don't use threads or greenlets. The worker is run by supervisord.

Could you install all the latest commits from the master branch and verify?

Hi @auvipy, unfortunately I'm seeing this on a production server, so I cannot upgrade to master easily. If I manage to do it I will let you know. I will try to put it on a staging server to see if it happens there.

@mvaled any update?

Hi @dralley,

Nothing yet. The bug still shows up from time to time. Today I arrived at work and found that another worker (this one has -c 3) hung after 6000 jobs (3 times the max tasks per child).

I've been unable to deploy master in the staging server, as I said I would. I will try again today.

Seeing a similar issue on celery==4.0.2 with -c 2, except mine hangs on a TCP socket that isn't bound/connected anywhere (perhaps to the master process?). The connections to redis seem fine.
Is this a broker issue?

I've just deployed in our stage server with celery at a6b3e2278ed38c153b07e38177794dd3d3e280af and kombu at celery/kombu@dba85e2d9515b9ce202bd30e8690131aa055e6bf.

I configured the max tasks per child to 10; since I have one worker that should process a task per minute, I expect the bug (if present in master) to manifest itself in a few days.

Well, I didn't have to wait too long. Today, on the staging server with the 'master' branch, I get:

[screenshot from 2018-03-13 09-10-54]

The funny thing is that the first 'offline' worker had processed a number of tasks that is not an integral multiple of 10 (the max tasks per child). The other symptoms remain:

$ strace -p 25614  # the cdr-1@stage2 MainProcess

strace: Process 25614 attached
recvfrom(28, 

Nothing happens here. lsof shows the same:

[celeryd: 25614 mercurio    0r  FIFO       0,8      0t0 321516134 pipe
[celeryd: 25614 mercurio    1w  FIFO       0,8      0t0 321516135 pipe
[celeryd: 25614 mercurio    2w  FIFO       0,8      0t0 321516136 pipe
[celeryd: 25614 mercurio    3u  0000       0,9        0      4073 [eventpoll]
[celeryd: 25614 mercurio    4u  0000       0,9        0      4073 [eventpoll]
[celeryd: 25614 mercurio    5u  0000       0,9        0      4073 [eventpoll]
[celeryd: 25614 mercurio    6r  FIFO       0,8      0t0 321517058 pipe
[celeryd: 25614 mercurio    7w  FIFO       0,8      0t0 321517058 pipe
[celeryd: 25614 mercurio    8r  FIFO       0,8      0t0 321517059 pipe
[celeryd: 25614 mercurio    9w  FIFO       0,8      0t0 321517059 pipe
[celeryd: 25614 mercurio   10r   CHR       1,9      0t0   3582641 /dev/urandom
[celeryd: 25614 mercurio   11r   CHR       1,9      0t0   3582641 /dev/urandom
[celeryd: 25614 mercurio   12r  FIFO       0,8      0t0 323120452 pipe
[celeryd: 25614 mercurio   13u  0000       0,9        0      4073 [eventpoll]
[celeryd: 25614 mercurio   14u  sock       0,6      0t0 323094902 can't identify protocol
[celeryd: 25614 mercurio   15u  sock       0,6      0t0 323094909 can't identify protocol
[celeryd: 25614 mercurio   16r   CHR       1,9      0t0   3582641 /dev/urandom
[celeryd: 25614 mercurio   17u  sock       0,6      0t0 323094922 can't identify protocol
[celeryd: 25614 mercurio   18u  0000       0,9        0      4073 [eventpoll]
[celeryd: 25614 mercurio   19u  0000       0,9        0      4073 [eventpoll]
[celeryd: 25614 mercurio   20u  sock       0,6      0t0 321517162 can't identify protocol
[celeryd: 25614 mercurio   21u  sock       0,6      0t0 323094894 can't identify protocol
[celeryd: 25614 mercurio   22u  sock       0,6      0t0 323094887 can't identify protocol
[celeryd: 25614 mercurio   23u  sock       0,6      0t0 323094929 can't identify protocol
[celeryd: 25614 mercurio   24u  sock       0,6      0t0 323094936 can't identify protocol
[celeryd: 25614 mercurio   25u  sock       0,6      0t0 323094943 can't identify protocol
[celeryd: 25614 mercurio   26u  sock       0,6      0t0 323094950 can't identify protocol
[celeryd: 25614 mercurio   27r  FIFO       0,8      0t0 323120452 pipe
[celeryd: 25614 mercurio   28u  IPv4 323120457      0t0       TCP localhost.localdomain:55904->localhost.localdomain:6379 (ESTABLISHED)

The fd 28 is that of the redis connection.

The same pattern is shown for the 'default-notifications-3' main process.

The jobs for the queue 'odoo-10.0.cdr', which is consumed exclusively by the 'cdr' worker, keep piling up:

$ redis-cli -n 9 llen 'odoo-10.0.cdr'
(integer) 347

Restarting the workers makes them work again, and the backlog decreases:

$ redis-cli -n 9 llen 'odoo-10.0.cdr'
(integer) 257

I'm also seeing this issue with celery 4.1.0. Any update here?

I changed the title of the issue because it's no longer happening only at max-tasks-per-child boundaries. Today I found the worker offline after 11727 jobs.

@mvaled Have you tried the latest releases ( 4.2.0rc1 or 4.2.0rc2 )?

Let's try 4.2rc2 and all the dependencies from GitHub master.

@auvipy Which branch is 4.2rc2 on? As I said before, I deployed a staging server with the 'master' branch of both celery and kombu and reproduced the error there as well.

@mvaled pip install -U celery

Hi @WoaDmulL, that wouldn't work for me: I use buildout to deploy this application. I could pin the version number, though.

Nevertheless, notice that you would need to execute pip install --pre -U celery in order to install a pre-release version.

But since the bug is reproducible in the master branch, I assume it's still present in 4.2.0rc2. In any case I will try with the pre-release tarball.

Could you install kombu and py-amqp from master too? Use all the code from master and let us know.

Deployed a staging server using 3fe5a20ca655aa3aeed60205b39ede2a21845df0; celery/kombu@0f6ef8c90b32ca4fbefbd8f2262efea343a4e5bc; and celery/py-amqp@2e145d2edf9865a1dfb79ea4d087a28b3c9e969b.

Let's wait some hours to see if the worker hangs.

Well, unfortunately, with those commits the workers fail to start:

AttributeError: async
  File "srv/mercurio/src/xhg/bin/xoeuf", line 126, in <module>
    sys.exit(xoeuf.cli.server.server())
  File "xoeuf/cli/server.py", line 28, in server
    main(default=DEFAULT_COMMAND)
  File "xoutil/cli/app.py", line 44, in main
    sys.exit(cmd().run(args))
  File "odoo/cli/celery.py", line 44, in run
    _main(argv=['celery', ] + cmdargs)
  File "celery/bin/celery.py", line 322, in main
    cmd.execute_from_commandline(argv)
  File "celery/bin/celery.py", line 484, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "celery/bin/base.py", line 275, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "celery/bin/celery.py", line 476, in handle_argv
    return self.execute(command, argv)
  File "celery/bin/celery.py", line 408, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "celery/bin/worker.py", line 223, in run_from_argv
    return self(*args, **options)
  File "celery/bin/base.py", line 238, in __call__
    ret = self.run(*args, **kwargs)
  File "celery/bin/worker.py", line 257, in run
    **kwargs)
  File "celery/worker/worker.py", line 101, in __init__
    self.setup_instance(**self.prepare_args(**kwargs))
  File "celery/worker/worker.py", line 124, in setup_instance
    self.should_use_eventloop() if use_eventloop is None
  File "celery/worker/worker.py", line 243, in should_use_eventloop
    self._conninfo.transport.implements.async and
  File "kombu/transport/base.py", line 125, in __getattr__
    raise AttributeError(key)

This AttributeError: async is due to a change renaming async to asynchronous in kombu but not in celery. We have to fix that, and an issue is open for it. Thanks for the report.
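
For the curious, a rough illustration of the kind of compatibility shim involved (a sketch of the idea only, not the actual fix that landed in celery/kombu):

# Illustration only: the worker checks kombu's transport "implements"
# flags, but the flag was renamed from 'async' to 'asynchronous'
# (async became a Python keyword). A defensive lookup could accept
# either spelling; the real fix lives in celery/kombu master.
def transport_supports_eventloop(transport):
    implements = transport.implements
    for name in ('asynchronous', 'async'):
        value = getattr(implements, name, None)
        if value is not None:
            return bool(value)
    return False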

Can you link the kombu issue here? I tried to find it, but only got closed issues. I could track the issue and try again when fixed.

You can get details in the last comment of https://github.com/celery/celery/issues/4500

I'm having the same problem as @mvaled using celery==4.2.0rc3:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/kombu/transport/base.py", line 123, in __getattr__
    return self[key]
KeyError: 'async'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/celery", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/site-packages/celery/__main__.py", line 16, in main
    _main()
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 322, in main
    cmd.execute_from_commandline(argv)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 484, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "/usr/local/lib/python3.5/site-packages/celery/bin/base.py", line 275, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 476, in handle_argv
    return self.execute(command, argv)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 408, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "/usr/local/lib/python3.5/site-packages/celery/bin/worker.py", line 223, in run_from_argv
    return self(*args, **options)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/base.py", line 238, in __call__
    ret = self.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/worker.py", line 257, in run
    **kwargs)
  File "/usr/local/lib/python3.5/site-packages/celery/worker/worker.py", line 101, in __init__
    self.setup_instance(**self.prepare_args(**kwargs))
  File "/usr/local/lib/python3.5/site-packages/celery/worker/worker.py", line 124, in setup_instance
    self.should_use_eventloop() if use_eventloop is None
  File "/usr/local/lib/python3.5/site-packages/celery/worker/worker.py", line 243, in should_use_eventloop
    self._conninfo.transport.implements.async and
  File "/usr/local/lib/python3.5/site-packages/kombu/transport/base.py", line 125, in __getattr__
    raise AttributeError(key)
AttributeError: async

Downgrading to kombu 4.0.2 fixed the issue. The problem was commit https://github.com/celery/celery/commit/ed0130cad13ffb08256243ff87254abe00b3c565, which updated the requirements of celery 4.1.0 to kombu 4.2.0. Please fix it, @auvipy!

Could you try 4.1.1 too? You might also wait for 4.2rc4.

Issue #4500 is now closed. I will try first with Celery 4.1.1, and if the bug is still present there, I will try with master again.

With celery 4.1.1 and kombu 4.2.0, one of the workers has become unresponsive after 50 jobs (with max tasks per child set to 10).

I'm going to test with 'master'.

Please let us know the latest status of this issue with celery 4.2rc4, @mvaled.

I'm deploying a staging server with celery 4.2rc4.

After 2 hours deployed on the staging server, all workers are still online. I have max tasks per child set to 10 to ensure frequent recycling of worker processes. I've noticed, however, that one worker has not been given any job, which is really strange.

[screenshot from 2018-05-28 14-45-39]

Interesting! So it's improving?

I'm really suspicious about it, but after 24 hours running, all workers remain online. @auvipy, what do you think? Are there any commits touching the (probably) involved components?

Hehe, quite tough; it might take time to spot them.

And after 2110 jobs, one worker went down :(

[screenshot from 2018-05-30 12-34-42]

Judging by the backlog in the queue, it has been down for about 8 hours already:

$ redis-cli -n 9  llen 'odoo-10.0.cdr'
(integer) 484

Maybe this bug is related to https://github.com/celery/celery/issues/3898. By the way, I have noticed the same behavior when a task fails with a timeout (if time_limit is specified). After removing hiredis everything works properly. Do you have hiredis installed?

I do have hiredis installed. I'm gonna try without it.
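
For reference, a quick way to check whether redis-py actually picked up the hiredis parser (a small sketch; I'm assuming the HIREDIS_AVAILABLE flag and DefaultParser attribute that the redis-py 2.10.x releases expose):

# Sketch: confirm whether the hiredis C parser is the one redis-py will use.
# HIREDIS_AVAILABLE / DefaultParser are redis.connection attributes in
# redis-py 2.10.x (an assumption worth re-checking for other versions).
import redis.connection

print('hiredis importable:', redis.connection.HIREDIS_AVAILABLE)
print('parser in use:', redis.connection.DefaultParser.__name__)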

I'd be interested in whether @mvaled still has the same issues even without hiredis. I expect, or maybe hope, that he's still seeing the same issue, as it reflects my own experience, including with Celery 4.2, even though we're on very different platforms. (I'm on FreeBSD running my celery processes in a jail. If I run redis within the jail I don't see these issues over the short time that I've left it running, which to me pointed to a network issue, but I'm having no luck tracking that down.)

I'm sorry I didn't report sooner, but the staging server was a bit messed up and I needed to redeploy it without hiredis.

I've just redeployed the staging server without hiredis. Let's wait and see.

BTW; I'm using the recently released 4.2.0.

hope for the best

... plan for the worst, for then all your surprises will be pleasant ones :)

21 hours have flown by and all workers are still alive. So, maybe it was hiredis... Tomorrow I'll check again.

looking forward

Well, another 23 hours with the workers alive. Should we close this issue and let it be dealt with in #3898?

I'd say if it's looking like the same cause then this should be closed. My issue, which seemed very similar, was eventually tracked down to our configuration management, which was restarting the firewall process on every run (every 30 mins), which led to the workers deadlocking.

Let's be optimistic and close it as part of 4.2, but we will reopen it if the error hits back.
