This is an ongoing investigation. What I'm putting here is what I can see so far.
I have seen this problem since 4.0.2, and now I've just spotted it in 4.1.0. These are the versions of the packages I'm using:
kombu = 4.1.0
celery = 4.1.0
amqp = 2.2.1
billiard = 3.5.0.3
redis = 2.10.6
The problem seems to happen with a worker running with -c 1 and worker_max_tasks_per_child=2000. This worker also processes tasks from a single queue, 'odoo-8.0.cdr'. Just after the 2000th task it hangs. Strangely, the issue does not happen every time the worker reaches this threshold, but when it does happen, it happens exactly there.
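For context, the affected worker is started roughly like the sketch below (the full command also appears in the ps output further down); the app name and broker URL are placeholders inferred from the Redis details elsewhere in this report, not copied from the real deployment:
from celery import Celery

# Placeholder app; broker db 9 on localhost is inferred from the redis-cli/lsof output below.
app = Celery('odoo', broker='redis://localhost:6379/9')
app.conf.worker_max_tasks_per_child = 2000  # recycle each child after 2000 tasks

# Single child process, single queue:
#   celery worker -l INFO -n cdr-1@%h -c 1 -Q odoo-8.0.cdr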
This is a production server so I can't really test with master branch. I'll try to reproduce on a development box later.
Running strace on the worker yields:
root@mercurio:~# strace -p 9105
Process 9105 attached
recvfrom(27,
I waited for 5+ minutes and it didn't receive anything. The file descriptors:
[celeryd: 9105 mercurio 13u sock 0,6 0t0 273693676 can't identify protocol
[celeryd: 9105 mercurio 14u sock 0,6 0t0 273693673 can't identify protocol
[celeryd: 9105 mercurio 15u 0000 0,9 0 4047 [eventpoll]
[celeryd: 9105 mercurio 16u sock 0,6 0t0 273693679 can't identify protocol
[celeryd: 9105 mercurio 17u 0000 0,9 0 4047 [eventpoll]
[celeryd: 9105 mercurio 18u 0000 0,9 0 4047 [eventpoll]
[celeryd: 9105 mercurio 19u sock 0,6 0t0 273693682 can't identify protocol
[celeryd: 9105 mercurio 20u sock 0,6 0t0 273693685 can't identify protocol
[celeryd: 9105 mercurio 21u sock 0,6 0t0 273693942 can't identify protocol
[celeryd: 9105 mercurio 22u sock 0,6 0t0 273693945 can't identify protocol
[celeryd: 9105 mercurio 23u sock 0,6 0t0 273693948 can't identify protocol
[celeryd: 9105 mercurio 24u sock 0,6 0t0 273693951 can't identify protocol
[celeryd: 9105 mercurio 25u IPv4 288763673 0t0 TCP localhost.localdomain:56030->localhost.localdomain:6379 (ESTABLISHED)
[celeryd: 9105 mercurio 26r FIFO 0,8 0t0 288763672 pipe
[celeryd: 9105 mercurio 27u IPv4 288763676 0t0 TCP localhost.localdomain:56031->localhost.localdomain:6379 (ESTABLISHED)
This shows that fd 27 is the Redis connection, but there are several fds for which lsof can't identify the protocol. Half-closed connections?
The ForkPoolWorker strace shows:
root@mercurio:~# strace -p 23943
Process 23943 attached
read(7,
I also waited 5+ minutes here to see if anything would happen. Nothing. lsof shows:
[celeryd: 23943 mercurio 6r CHR 1,9 0t0 18535 /dev/urandom
[celeryd: 23943 mercurio 7r FIFO 0,8 0t0 273693670 pipe
[celeryd: 23943 mercurio 10r CHR 1,9 0t0 18535 /dev/urandom
[celeryd: 23943 mercurio 11w FIFO 0,8 0t0 273693671 pipe
[celeryd: 23943 mercurio 12r CHR 1,3 0t0 18531 /dev/null
[celeryd: 23943 mercurio 15u 0000 0,9 0 4047 [eventpoll]
[celeryd: 23943 mercurio 17u 0000 0,9 0 4047 [eventpoll]
[celeryd: 23943 mercurio 26w FIFO 0,8 0t0 288763672 pipe
So fd 7 is a FIFO pipe (I don't know which one).
I'm sure that the queue gets a task every minute, since our application issues a new 'odoo-8.0.cdr' task per minute (like clockwork). I can see the backlog:
root@mercurio:~# redis-cli -n 9 llen 'odoo-8.0.cdr'
(integer) 4586
One minute later:
root@mercurio:~# redis-cli -n 9 llen 'odoo-8.0.cdr'
(integer) 4587
To resume operation I have to restart the whole worker. This processes the entire backlog (all tasks expire after 65s in the queue), and afterwards, most of the time, the worker keeps working even beyond the 2000th task.
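For reference, the per-minute enqueue described above would look roughly like this; the task and module names are hypothetical, only the queue name and the 65-second expiry come from this report:
from myapp.tasks import process_cdr  # hypothetical task

process_cdr.apply_async(
    queue='odoo-8.0.cdr',  # queue consumed exclusively by the cdr-1 worker
    expires=65,            # the task is discarded if not started within 65 seconds
)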
Killing the child process shows that the worker does not even reap the child:
root@mercurio:~# ps ax | grep cdr
9105 ? S 2:32 [celeryd: cdr-1@mercurio:MainProcess] -active- (celery worker -l INFO -n cdr-1@%h -c1 -Q odoo-8.0.cdr)
23943 ? Z 0:00 [[celeryd: cdr-1] <defunct>
@mvaled Did you find any solution?
Nope. But today, when I arrived at work, flower was again showing the 'offline' worker after 2000 tasks.
I vote to keep this open. Maybe I can find a little bit of time to do a deeper investigation.
I have the same problem.
kombu = 4.0.2
celery = 4.0.2
amqp = 2.1.4
billiard = 3.5.0.2
redis = 2.10.5
My environment uses -c 16 and worker_max_tasks_per_child=1. I wonder whether this happens when the worker seems to be busy.
Hi @saorio,
Does your task create threads or greenlets? I'm doing a code review of the components involved in the task. I have found some non-daemon threads that I think are not needed (but so far no code path validates my hypothesis). My code base is fairly complex and I'm on a tight schedule these days, so I won't be able to return to this issue soon.
Sorry for replying so late.
My tasks don't use threads or greenlets. The worker is run by supervisord.
Could you install all the latest commits from the master branch and verify?
Hi @auvipy, unfortunately I'm seeing this on a production server, so I cannot upgrade to master easily. If I manage to do it I will let you know. I will try to put it on a staging server to see if it happens there.
@mvaled any update?
Hi @dralley,
Nothing yet. Still, the bug shows up from time to time. Today I arrived at work and found that another worker (this one has -c 3) hung after 6000 jobs (3 times the max tasks per child).
I've been unable to deploy master in the staging server, as I said I would. I will try again today.
Seeing a similar issue on celery==4.0.2 with -c 2, except mine hangs on a TCP socket that isn't bound/connected anywhere (perhaps to the master process?). The connections in Redis seem fine.
Is this a broker issue?
I've just deployed in our stage server with celery at a6b3e2278ed38c153b07e38177794dd3d3e280af and kombu at celery/kombu@dba85e2d9515b9ce202bd30e8690131aa055e6bf.
I configured the max tasks per child to 10; since I have one worker that should process a task per minute, I expect the bug (if present in master) to manifest itself within a few days.
Well, I didn't have to wait too long. Today, on the staging server running the 'master' branch, I see 'offline' workers again.
The funny thing is that the first 'offline' worker had processed a number of jobs that is not an integral multiple of 10 (the max tasks per child). The other symptoms remain:
$ strace -p 25614 # the cdr-1@stage2 MainProcess
strace: Process 25614 attached
recvfrom(28,
Nothing happens here. lsof shows the same pattern:
[celeryd: 25614 mercurio 0r FIFO 0,8 0t0 321516134 pipe
[celeryd: 25614 mercurio 1w FIFO 0,8 0t0 321516135 pipe
[celeryd: 25614 mercurio 2w FIFO 0,8 0t0 321516136 pipe
[celeryd: 25614 mercurio 3u 0000 0,9 0 4073 [eventpoll]
[celeryd: 25614 mercurio 4u 0000 0,9 0 4073 [eventpoll]
[celeryd: 25614 mercurio 5u 0000 0,9 0 4073 [eventpoll]
[celeryd: 25614 mercurio 6r FIFO 0,8 0t0 321517058 pipe
[celeryd: 25614 mercurio 7w FIFO 0,8 0t0 321517058 pipe
[celeryd: 25614 mercurio 8r FIFO 0,8 0t0 321517059 pipe
[celeryd: 25614 mercurio 9w FIFO 0,8 0t0 321517059 pipe
[celeryd: 25614 mercurio 10r CHR 1,9 0t0 3582641 /dev/urandom
[celeryd: 25614 mercurio 11r CHR 1,9 0t0 3582641 /dev/urandom
[celeryd: 25614 mercurio 12r FIFO 0,8 0t0 323120452 pipe
[celeryd: 25614 mercurio 13u 0000 0,9 0 4073 [eventpoll]
[celeryd: 25614 mercurio 14u sock 0,6 0t0 323094902 can't identify protocol
[celeryd: 25614 mercurio 15u sock 0,6 0t0 323094909 can't identify protocol
[celeryd: 25614 mercurio 16r CHR 1,9 0t0 3582641 /dev/urandom
[celeryd: 25614 mercurio 17u sock 0,6 0t0 323094922 can't identify protocol
[celeryd: 25614 mercurio 18u 0000 0,9 0 4073 [eventpoll]
[celeryd: 25614 mercurio 19u 0000 0,9 0 4073 [eventpoll]
[celeryd: 25614 mercurio 20u sock 0,6 0t0 321517162 can't identify protocol
[celeryd: 25614 mercurio 21u sock 0,6 0t0 323094894 can't identify protocol
[celeryd: 25614 mercurio 22u sock 0,6 0t0 323094887 can't identify protocol
[celeryd: 25614 mercurio 23u sock 0,6 0t0 323094929 can't identify protocol
[celeryd: 25614 mercurio 24u sock 0,6 0t0 323094936 can't identify protocol
[celeryd: 25614 mercurio 25u sock 0,6 0t0 323094943 can't identify protocol
[celeryd: 25614 mercurio 26u sock 0,6 0t0 323094950 can't identify protocol
[celeryd: 25614 mercurio 27r FIFO 0,8 0t0 323120452 pipe
[celeryd: 25614 mercurio 28u IPv4 323120457 0t0 TCP localhost.localdomain:55904->localhost.localdomain:6379 (ESTABLISHED)
Fd 28 is the Redis connection.
The same pattern is shown for the 'default-notifications-3' main process.
The jobs for the queue 'odoo-10.0.cdr', which is consumed exclusively by the 'cdr' worker, keep piling up:
$ redis-cli -n 9 llen 'odoo-10.0.cdr'
(integer) 347
Restarting the workers makes them work again, and the backlog decreases:
$ redis-cli -n 9 llen 'odoo-10.0.cdr'
(integer) 257
I'm also seeing this issue with celery 4.1.0. Any update here?
I changed the title of the issue because it no longer happens only at the boundaries of max-tasks-per-child. Today I found the worker offline after 11727 jobs.
@mvaled Have you tried the latest releases (4.2.0rc1 or 4.2.0rc2)?
Let's try 4.2rc2 and all the dependencies from GitHub master.
@auvipy Which is the branch for 4.2rc2? As I said before, I deployed a staging server with 'master' branch for both celery and kombu and reproduced the error there as well.
@mvaled pip install -U celery
Hi @WoaDmulL, that wouldn't work for me: I use buildout to deploy this application. I could pin the version number, though.
Nevertheless, notice that you would need to execute pip install --pre -U celery in order to install a pre-release version. But since the bug is reproducible in the master branch, I assume it is still present in 4.2.0rc2. In any case, I will try with the pre-release tarball.
Could you install kombu and py-amqp from master too? Use all the code from master and let us know.
Deployed a staging server using celery at 3fe5a20ca655aa3aeed60205b39ede2a21845df0; celery/kombu@0f6ef8c90b32ca4fbefbd8f2262efea343a4e5bc; and celery/py-amqp@2e145d2edf9865a1dfb79ea4d087a28b3c9e969b.
Let's wait some hours to see if the worker hangs.
Well, unfortunately, with those commits the workers fail to start:
AttributeError: async
File "srv/mercurio/src/xhg/bin/xoeuf", line 126, in <module>
sys.exit(xoeuf.cli.server.server())
File "xoeuf/cli/server.py", line 28, in server
main(default=DEFAULT_COMMAND)
File "xoutil/cli/app.py", line 44, in main
sys.exit(cmd().run(args))
File "odoo/cli/celery.py", line 44, in run
_main(argv=['celery', ] + cmdargs)
File "celery/bin/celery.py", line 322, in main
cmd.execute_from_commandline(argv)
File "celery/bin/celery.py", line 484, in execute_from_commandline
super(CeleryCommand, self).execute_from_commandline(argv)))
File "celery/bin/base.py", line 275, in execute_from_commandline
return self.handle_argv(self.prog_name, argv[1:])
File "celery/bin/celery.py", line 476, in handle_argv
return self.execute(command, argv)
File "celery/bin/celery.py", line 408, in execute
).run_from_argv(self.prog_name, argv[1:], command=argv[0])
File "celery/bin/worker.py", line 223, in run_from_argv
return self(*args, **options)
File "celery/bin/base.py", line 238, in __call__
ret = self.run(*args, **kwargs)
File "celery/bin/worker.py", line 257, in run
**kwargs)
File "celery/worker/worker.py", line 101, in __init__
self.setup_instance(**self.prepare_args(**kwargs))
File "celery/worker/worker.py", line 124, in setup_instance
self.should_use_eventloop() if use_eventloop is None
File "celery/worker/worker.py", line 243, in should_use_eventloop
self._conninfo.transport.implements.async and
File "kombu/transport/base.py", line 125, in __getattr__
raise AttributeError(key)
This AttributeError: async is due to a change renaming async to asynchronous in kombu, but not in celery. We have to fix that, and an issue is open for it. Thanks for the report.
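To illustrate the mismatch: kombu renamed the transport capability flag from async to asynchronous, while celery at this commit still reads the old name (see worker.py line 243 in the traceback above). A version-tolerant read, shown only as a sketch and not as the actual fix, could look like:
def transport_supports_async(conn):
    # `conn` is assumed to be a kombu.Connection. Older kombu exposes
    # implements.async, newer kombu exposes implements.asynchronous;
    # reading the old attribute on new kombu raises the AttributeError above.
    implements = conn.transport.implements
    return getattr(implements, 'asynchronous',
                   getattr(implements, 'async', False))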
Can you link the kombu issue here? I tried to find it, but only got closed issues. I could track the issue and try again when fixed.
You can get the details in the last comment here: https://github.com/celery/celery/issues/4500
I'm having the same problem as @mvaled using version celery==4.2.0rc3
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/kombu/transport/base.py", line 123, in __getattr__
return self[key]
KeyError: 'async'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/celery", line 11, in
sys.exit(main())
File "/usr/local/lib/python3.5/site-packages/celery/__main__.py", line 16, in main
_main()
File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 322, in main
cmd.execute_from_commandline(argv)
File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 484, in execute_from_commandline
super(CeleryCommand, self).execute_from_commandline(argv)))
File "/usr/local/lib/python3.5/site-packages/celery/bin/base.py", line 275, in execute_from_commandline
return self.handle_argv(self.prog_name, argv[1:])
File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 476, in handle_argv
return self.execute(command, argv)
File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 408, in execute
).run_from_argv(self.prog_name, argv[1:], command=argv[0])
File "/usr/local/lib/python3.5/site-packages/celery/bin/worker.py", line 223, in run_from_argv
return self(*args, **options)
File "/usr/local/lib/python3.5/site-packages/celery/bin/base.py", line 238, in __call__
ret = self.run(*args, **kwargs)
File "/usr/local/lib/python3.5/site-packages/celery/bin/worker.py", line 257, in run
**kwargs)
File "/usr/local/lib/python3.5/site-packages/celery/worker/worker.py", line 101, in __init__
self.setup_instance(**self.prepare_args(**kwargs))
File "/usr/local/lib/python3.5/site-packages/celery/worker/worker.py", line 124, in setup_instance
self.should_use_eventloop() if use_eventloop is None
File "/usr/local/lib/python3.5/site-packages/celery/worker/worker.py", line 243, in should_use_eventloop
self._conninfo.transport.implements.async and
File "/usr/local/lib/python3.5/site-packages/kombu/transport/base.py", line 125, in __getattr__
raise AttributeError(key)
AttributeError: async
Downgrading to kombu 4.0.2 fixed the issue. The problem was commit https://github.com/celery/celery/commit/ed0130cad13ffb08256243ff87254abe00b3c565, which updated the requirements of celery 4.1.0 to kombu 4.2.0. Please fix it, @auvipy!
Could you try 4.1.1? You might also wait for 4.2rc4.
Issue #4500 is now closed. I will try first with Celery 4.1.1, and if the bug is still present there, I will try with master again.
With celery 4.1.1 and kombu 4.2.0, one of the workers has become unresponsive after 50 jobs (with max jobs per worker set to 10).
I'm going to test with 'master'.
Please let us know your latest findings on this issue with celery 4.2rc4, @mvaled.
I'm deploying a staging server with celery 4.2rc4.
After 2 hours deployed on the staging server, all workers are still online. I have max tasks per child set to 10 to ensure a high recycling of worker processes. I've noticed, however, that one worker has not been given any job, which is really strange.
Interesting! So it's improving?
I'm really suspicious about it, but after 24 hours running, all workers remain online. @auvipy, what do you think? Are there any commits touching the (probably) involved components?
Hehe, quite tough; it might take time to spot them.
And after 2110 jobs, one worker went down :(
Judging by the backlog in the queue, it has been down for about 8 hours already:
$ redis-cli -n 9 llen 'odoo-10.0.cdr'
(integer) 484
Maybe this bug is related to https://github.com/celery/celery/issues/3898. By the way, I have noticed the same behavior when a task fails with a timeout (if time_limit is specified). After removing hiredis, everything works properly. Do you have hiredis installed?
I do have hiredis installed. I'm gonna try without it.
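For anyone checking the same thing, redis-py picks its reply parser at import time based on whether the hiredis package is importable, so a quick sketch like this shows whether hiredis is actually in play:
import redis.connection

# HiredisParser when hiredis is importable, PythonParser otherwise.
print(redis.connection.DefaultParser)

try:
    import hiredis
    print('hiredis installed:', hiredis.__version__)
except ImportError:
    print('hiredis not installed')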
I'd be interested in whether @mvaled still has the same issues even without hiredis. I expect, or maybe hope, that he's still seeing the same issue, as it reflects my own experience, including with Celery 4.2, even though we're on very different platforms. (I'm on FreeBSD running my celery processes in a jail. If I run Redis within the jail I don't see these issues over the short time that I've left it running, which to me pointed to a network issue, but I'm having no luck tracking that down.)
I'm sorry I didn't report sooner, but the staging server was a bit messed up and I needed to redeploy it without hiredis.
I've just redeployed the staging server without hiredis. Let's wait and see.
BTW, I'm using the recently released 4.2.0.
hope for the best
... plan for the worst, for then all your surprises will be pleasant ones :)
21 hours have flown by and all workers are still alive. So, maybe it is hiredis... Tomorrow I'll check again.
looking forward
Well, another 23 hours with the workers alive. Should we close this issue and let it be dealt with in #3898?
I'd say that if it's looking like the same cause, then this should be closed. My issue, which seemed very similar, was eventually tracked down to our configuration management restarting the firewall process on every run (every 30 minutes), which led to the workers deadlocking.
Let's be optimistic and close it as part of 4.2, but we will reopen it if the error comes back.