This issue is a repost of an unanswered Google Groups post, _Same task runs multiple times?_
> ./bin/celery -A celery_app report
software -> celery:4.1.0 (latentcall) kombu:4.1.0 py:3.6.1
billiard:3.5.0.3 redis:2.10.6
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis://localhost:6379/2
broker_url: 'redis://localhost:6379/2'
result_backend: 'redis://localhost:6379/2'
task_serializer: 'json'
result_serializer: 'json'
accept_content: ['json']
timezone: 'Europe/Berlin'
enable_utc: True
imports: 'tasks'
task_routes: {
'tasks': {'queue': 'celery-test-queue'}}
My application schedules a single group of two, sometimes three tasks, each with its own ETA within one hour. When the ETA arrives, I see the following in my celery log:
[2017-11-20 09:55:34,470: INFO/ForkPoolWorker-2] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 33.81780316866934s: None
[2017-11-20 09:55:34,481: INFO/ForkPoolWorker-2] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 0.009824380278587341s: None
[2017-11-20 09:55:34,622: INFO/ForkPoolWorker-2] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 0.14010038413107395s: None
…
[2017-11-20 09:55:37,890: INFO/ForkPoolWorker-8] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 0.012678759172558784s: None
[2017-11-20 09:55:37,891: INFO/ForkPoolWorker-2] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 0.01177949644625187s: None
[2017-11-20 09:55:37,899: INFO/ForkPoolWorker-8] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 0.008250340819358826s: None
…
This can repeat dozens of times. Note the first task's 33-second execution time, and the use of different workers!
I have no explanation for this behavior, and would like to understand what's going on here.
Maybe this is related to visibility timeout?
@georgepsarakis Could you please elaborate on your suspicion?
As far as I know, this is a known issue for broker transports that lack the built-in acknowledgement characteristics of AMQP. The task will be assigned to a new worker if the task completion time exceeds the visibility timeout, so you may see tasks being executed in parallel.
@georgepsarakis So if the task is scheduled far ahead in the future, then I might see the above? The "visibility timeout" addresses that? From the documentation you linked:
The default visibility timeout for Redis is 1 hour.
Meaning that if the worker does not ack the task within the hour (i.e. run it?), the task is sent to another worker, which wouldn't ack it either, and so on… Indeed this seems to be the case looking at the caveats section of the documentation; this related issue https://github.com/celery/kombu/issues/337; or quoting from this blog:
But when developers just start using it, they regularly face abnormal behaviour of workers, specifically multiple execution of the same task by several workers. The reason which causes it is a visibility timeout setting.
Looks like setting the visibility_timeout to 31,540,000 seconds (roughly one year) might be a quick fix.
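For reference, a minimal sketch of how such a setting could be derived and expressed. The `visibility_timeout` key under `broker_transport_options` is the option described in the Celery documentation for the Redis transport; the helper `visibility_timeout_for` and its one-hour margin are assumptions made for illustration and are not part of Celery.

```python
from datetime import timedelta


def visibility_timeout_for(longest_eta: timedelta,
                           margin: timedelta = timedelta(hours=1)) -> int:
    """Return a visibility timeout (in seconds) longer than the longest ETA.

    Hypothetical helper: the only idea is that the timeout should exceed the
    time a worker may hold an unacknowledged ETA task.
    """
    return int((longest_eta + margin).total_seconds())


# The tasks in this thread are scheduled at most one hour ahead.
broker_transport_options = {
    "visibility_timeout": visibility_timeout_for(timedelta(hours=1)),
}

# With a Celery app object this would be applied as (not executed here):
#   app.conf.broker_transport_options = broker_transport_options
print(broker_transport_options)
```

Deriving the value from the longest ETA actually in use avoids an arbitrarily huge timeout, which would delay redelivery if a worker really dies.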
I would say that if you increase the visibility timeout to 2 hours, your tasks will be executed only once.
So if you combine:
I think what happens is:
Looking into the Redis transport implementation, you will notice that it uses Sorted Sets, passing the queued time as a score to zadd. The message is restored based on that timestamp and comparing to an interval equal to the visibility timeout.
Hope this explains a bit the internals of the Redis transport.
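The restore behavior described above can be illustrated with a small simulation. This is a simplified model and not kombu code: it assumes the worker holds an ETA task unacknowledged until the ETA, and that a periodic check redelivers any message whose delivery timestamp is older than the visibility timeout. The function name `simulate_deliveries` and the 60-second check interval are made up for the example.

```python
VISIBILITY_TIMEOUT = 3600  # seconds; the Redis default quoted in this thread


def simulate_deliveries(eta_delay, visibility_timeout=VISIBILITY_TIMEOUT,
                        check_interval=60):
    """Return the times (seconds since publish) at which the task is delivered.

    Every delivery leaves one more copy waiting in a worker until the ETA,
    so the length of the result is also the number of executions at the ETA.
    """
    unacked = {}      # delivery tag -> delivery timestamp (the "score")
    deliveries = []
    tag = 0
    now = 0

    def deliver(at):
        nonlocal tag
        tag += 1
        unacked[tag] = at
        deliveries.append(at)

    deliver(now)  # first delivery right after publishing
    while now < eta_delay:
        now += check_interval
        for old_tag, delivered_at in list(unacked.items()):
            expired = now - delivered_at > visibility_timeout
            if expired and now < eta_delay:
                # The message looks abandoned: restore and deliver it again.
                del unacked[old_tag]
                deliver(now)
    return deliveries


print(simulate_deliveries(3 * 3600))
print(simulate_deliveries(3 * 3600, visibility_timeout=4 * 3600))
```

In this model, a task with an ETA three hours ahead is delivered three times under a one-hour timeout, and only once when the timeout exceeds the ETA delay.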
@georgepsarakis, I'm now thoroughly confused. If a task's ETA is set for two months from now, why would a worker pick it up one hour after the task has been scheduled? Am I missing something?
My (incorrect?) assumption is that:
Your "_I think what happens is:_" above is quite different from my assumption.
I also encountered the same problem. Have you solved it? @jenstroeger
Thanks!
@jenstroeger that does not sound like a feasible flow. I think the worker just continuously requeues the message in order to postpone execution until the ETA condition is finally met. The concept of the queue is to distribute messages as soon as they arrive, so the worker examines the message and just requeues it.
Please note that this is my guess; I am not really aware of the internals of the ETA implementation.
@zivsu, as mentioned above I've set the visibility_timeout to a _very_ large number and that seems to have resolved the symptoms. However, as @georgepsarakis points out, that seems to be a poor approach.
I do not know the cause of the original problem nor how to address it properly.
@jenstroeger I read in a blog post that changing visibility_timeout cannot solve the problem completely, so I changed my broker to RabbitMQ.
@zivsu, can you please share the link to the blog? Did you use Redis before?
@jenstroeger I can't find the blog. I used Redis as the broker before. For scheduled tasks, I chose RabbitMQ to avoid the error happening again.
I have exactly the same issue. My config is:
django==1.11.6
celery==4.2rc2
django-celery-beat==1.0.1
settings:
CELERY_ENABLE_UTC = True
# CELERY_TIMEZONE = 'America/Los_Angeles'
And that is the only working combination of these settings. I also have to schedule my periodic tasks in the UTC timezone.
If you enable CELERY_TIMEZONE or disable CELERY_ENABLE_UTC, it starts running periodic tasks multiple times.
I have the same problem. The ETA task executes multiple times when using Redis as a broker.
Is there any way to solve this?
It looks like changing the broker from Redis to RabbitMQ solves this problem.
Using redis, there is a well-known issue when you specify a timezone other than UTC. To work around the issue, subclass the default app, and add your own timezone handling function:
from celery import Celery

class MyAppCelery(Celery):
    def now(self):
        """Return the current time and date as a datetime."""
        from datetime import datetime
        return datetime.now(self.timezone)
Hope that helps anyone else that is running into this problem.
I get this problem sometimes when frequently restarting celery jobs with beat on multicore machines. I've gotten in the habit of running ps aux | grep celery and then kill <each_pid> to resolve it.
Best advice I have is to always make sure you see the "restart DONE" message before disconnecting from the machine.
{"log":"INFO 2018-10-09 17:41:08,468 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T17:41:08.468912644Z"}
{"log":"INFO 2018-10-09 17:41:08,468 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T17:41:08.468955918Z"}
{"log":"INFO 2018-10-09 19:46:04,293 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T19:46:04.293780045Z"}
{"log":"INFO 2018-10-09 19:46:04,293 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T19:46:04.293953621Z"}
{"log":"INFO 2018-10-09 20:46:04,802 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T20:46:04.802819711Z"}
{"log":"INFO 2018-10-09 20:46:04,802 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T20:46:04.802974829Z"}
{"log":"INFO 2018-10-09 21:46:05,335 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T21:46:05.336081133Z"}
{"log":"INFO 2018-10-09 21:46:05,335 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T21:46:05.336107517Z"}
{"log":"INFO 2018-10-09 22:46:05,900 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T22:46:05.901078395Z"}
{"log":"INFO 2018-10-09 22:46:05,900 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T22:46:05.901173663Z"}
{"log":"INFO 2018-10-09 23:46:06,484 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T23:46:06.485276904Z"}
{"log":"INFO 2018-10-09 23:46:06,484 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T23:46:06.485415253Z"}
{"log":"INFO 2018-10-10 00:46:07,072 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T00:46:07.072529828Z"}
{"log":"INFO 2018-10-10 00:46:07,072 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T00:46:07.072587887Z"}
{"log":"INFO 2018-10-10 01:46:07,602 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T01:46:07.60325321Z"}
{"log":"INFO 2018-10-10 01:46:07,602 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T01:46:07.603327426Z"}
{"log":"INFO 2018-10-10 02:46:08,155 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T02:46:08.155868992Z"}
{"log":"INFO 2018-10-10 02:46:08,155 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T02:46:08.155921893Z"}
{"log":"INFO 2018-10-10 03:46:08,753 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T03:46:08.75401387Z"}
{"log":"INFO 2018-10-10 03:46:08,753 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T03:46:08.754056891Z"}
{"log":"DEBUG 2018-10-10 04:00:00,013 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:70\n","stream":"stderr","time":"2018-10-10T04:00:00.013548928Z"}
{"log":"DEBUG 2018-10-10 04:00:00,013 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:70\n","stream":"stderr","time":"2018-10-10T04:00:00.013592318Z"}
{"log":"DEBUG 2018-10-10 04:00:00,013 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:71\n","stream":"stderr","time":"2018-10-10T04:00:00.014000106Z"}
{"log":"DEBUG 2018-10-10 04:00:00,013 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:71\n","stream":"stderr","time":"2018-10-10T04:00:00.014167558Z"}
{"log":"DEBUG 2018-10-10 04:00:00,014 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:64\n","stream":"stderr","time":"2018-10-10T04:00:00.014661348Z"}
{"log":"DEBUG 2018-10-10 04:00:00,014 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:64\n","stream":"stderr","time":"2018-10-10T04:00:00.014684354Z"}
{"log":"DEBUG 2018-10-10 04:00:00,014 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:65\n","stream":"stderr","time":"2018-10-10T04:00:00.01514884Z"}
{"log":"DEBUG 2018-10-10 04:00:00,014 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:65\n","stream":"stderr","time":"2018-10-10T04:00:00.015249646Z"}
{"log":"DEBUG 2018-10-10 04:00:00,015 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:66\n","stream":"stderr","time":"2018-10-10T04:00:00.01571124Z"}
{"log":"DEBUG 2018-10-10 04:00:00,015 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:66\n","stream":"stderr","time":"2018-10-10T04:00:00.01580249Z"}
{"log":"DEBUG 2018-10-10 04:00:00,019 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:68\n","stream":"stderr","time":"2018-10-10T04:00:00.019260948Z"}
{"log":"DEBUG 2018-10-10 04:00:00,019 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:68\n","stream":"stderr","time":"2018-10-10T04:00:00.019322151Z"}
{"log":"DEBUG 2018-10-10 04:00:00,245 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:70\n","stream":"stderr","time":"2018-10-10T04:00:00.245159563Z"}
{"log":"DEBUG 2018-10-10 04:00:00,245 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:70\n","stream":"stderr","time":"2018-10-10T04:00:00.245177267Z"}
{"log":"DEBUG 2018-10-10 04:00:00,245 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:67\n","stream":"stderr","time":"2018-10-10T04:00:00.245338722Z"}
{"log":"DEBUG 2018-10-10 04:00:00,245 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:67\n","stream":"stderr","time":"2018-10-10T04:00:00.245351289Z"}
{"log":"DEBUG 2018-10-10 04:00:00,256 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:65\n","stream":"stderr","time":"2018-10-10T04:00:00.256770035Z"}
{"log":"DEBUG 2018-10-10 04:00:00,256 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:65\n","stream":"stderr","time":"2018-10-10T04:00:00.256788689Z"}
{"log":"INFO 2018-10-10 04:00:00,371 trace celery.app.trace 68 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.35710329699213617s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.371967002Z"}
{"log":"INFO 2018-10-10 04:00:00,371 trace celery.app.trace 68 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.35710329699213617s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.371983293Z"}
{"log":"INFO 2018-10-10 04:00:00,387 trace celery.app.trace 69 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.10637873200175818s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.388119538Z"}
{"log":"INFO 2018-10-10 04:00:00,387 trace celery.app.trace 69 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.10637873200175818s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.388166317Z"}
{"log":"INFO 2018-10-10 04:00:00,404 trace celery.app.trace 70 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.16254851799749304s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.404834545Z"}
{"log":"INFO 2018-10-10 04:00:00,404 trace celery.app.trace 70 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.16254851799749304s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.404862208Z"}
{"log":"INFO 2018-10-10 04:00:00,421 trace celery.app.trace 65 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.1654666289978195s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.421607856Z"}
{"log":"INFO 2018-10-10 04:00:00,421 trace celery.app.trace 65 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.1654666289978195s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.421674687Z"}
{"log":"INFO 2018-10-10 04:00:00,438 trace celery.app.trace 67 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.19588526099687442s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.438295459Z"}
{"log":"INFO 2018-10-10 04:00:00,438 trace celery.app.trace 67 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.19588526099687442s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.438311386Z"}
...
If we check the Received task timestamps, every hour the worker gets a new task with the same id. The result is that all ETA messages are sent more than 10 times. It looks like RabbitMQ is the only option if we want to use ETA.
I recently met a similar bug. Also, ps aux | grep celery showed more processes than the workers I had started, twice as many. Appending the parameter --pool gevent to the command launching the celery workers lowered the number of processes to the exact number of started workers plus celery beat. Now I'm watching my tasks' execution.
Might another solution be disabling ack emulation entirely, i.e. "broker_transport_options": {"ack_emulation": False}? Any drawbacks for short-running tasks / countdowns?
Did anyone get the fix?
Have been facing the same problem, Any solution for it?
Bumping, the same problem.
Same issue here, using Redis as broker.
$ pipenv graph --bare | grep -i "redis\|celery"
---
channels-redis==2.4.0
- aioredis [required: ~=1.0, installed: 1.3.0]
- hiredis [required: Any, installed: 1.0.0]
django-celery-beat==1.5.0
django-celery-results==1.1.2
- celery [required: >=4.3,<5.0, installed: 4.3.0]
- celery [required: >=3.1.0, installed: 4.3.0]
- pylint-celery [required: ==0.3, installed: 0.3]
redis==3.2.1
Same problem here. Celery version 4.3.0
celery = Celery('tasks', broker='pyamqp://nosd:sdsd@rabbit//', config_from_object={"broker_transport_options":{'visibility_timeout': 18000}})
The command I use to run my worker
celery -A tasks worker --pidfile=/tmp/celery_worker.pid -f=/tmp/celery_worker.log -Q celery_queue --loglevel=info --pool=solo --concurrency=1 -n worker_celery --detach --without-gossip --without-mingle --without-heartbeat
can you give celery==4.4.0rc4 a try?
Celery is receiving the same task twice, with the same task id, at the same time.
Here are the logs
[2019-11-29 08:07:35,464: INFO/MainProcess] Received task: app.jobs.booking.bookFlightTask[657985d5-c3a3-438d-a524-dbb129529443]
[2019-11-29 08:07:35,465: INFO/MainProcess] Received task: app.jobs.booking.bookFlightTask[657985d5-c3a3-438d-a524-dbb129529443]
[2019-11-29 08:07:35,471: WARNING/ForkPoolWorker-4] in booking funtion1
[2019-11-29 08:07:35,473: WARNING/ForkPoolWorker-3] in booking funtion1
[2019-11-29 08:07:35,537: WARNING/ForkPoolWorker-3] book_request_pp
[2019-11-29 08:07:35,543: WARNING/ForkPoolWorker-4] book_request_pp
Both are running simultaneously.
Using celery==4.4.0rc4, boto3==1.9.232, kombu==4.6.6 with SQS in Python Flask.
In SQS, the Default Visibility Timeout is 30 minutes, and my task has no ETA and no ack.
running worker like,
celery worker -A app.jobs.run -l info --pidfile=/var/run/celery/celery.pid --logfile=/var/log/celery/celery.log --time-limit=7200 --concurrency=8
@auvipy , any help would be great.
which broker and result back end are you using? can you try switching to another back end?
which broker and result back end are you using? can you try switching to another back end?
using SQS, result backend is MYSQL, with sqlalchemy.
details are here at SO, https://stackoverflow.com/questions/59123536/celery-is-receiving-same-task-twice-with-same-task-id-at-same-time
@auvipy can you please have a look.
@thedrow do you face this issue in bloomberg?
@nitish-itilite : what timezone are you using for celery?
@nitish-itilite : what timezone are you using for celery?
It is the default, UTC. In SQS, the region is US East (N. Virginia).
I had a similar case running celery with SQS. I ran a dummy task with countdown=60, while the visibility timeout in SQS is 30 seconds. Here's what I get:
NOTE: I've started celery with --concurrency=1, so there are two threads, right?
[2020-02-18 14:46:32 +0000] [INFO] Received task: notification[b483a22f-31cc-4335-9709-86041baa8f05] ETA:[2020-02-18 14:47:31.898563+00:00]
[2020-02-18 14:47:02 +0000] [INFO] Received task: notification[b483a22f-31cc-4335-9709-86041baa8f05] ETA:[2020-02-18 14:47:31.898563+00:00]
[2020-02-18 14:47:32 +0000] [INFO] Task notification[b483a22f-31cc-4335-9709-86041baa8f05] succeeded in 0.012232275999849662s: None
[2020-02-18 14:47:32 +0000] [INFO] Task notification[b483a22f-31cc-4335-9709-86041baa8f05] succeeded in 0.012890915997559205s: None
What happened in chronological order:
inflight mode for 30 seconds
My guess is that this is a bug in Celery (?). I think it should have checked whether the message id (b483a22f-31cc-4335-9709-86041baa8f05) has already been taken by that worker.
Maybe there could be a hash list with all message ids, so that celery could decide if a received task is valid for processing. Can celery do that?
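The idea of remembering message ids could look roughly like the sketch below. `SeenTaskIds` is a hypothetical helper, not a Celery API, and it only guards a single worker process; covering several workers would need a shared store.

```python
class SeenTaskIds:
    """Remember task ids already accepted by this worker process.

    Hypothetical helper for illustration only. The set grows without bound,
    so a real version would need some form of expiry.
    """

    def __init__(self):
        self._seen = set()

    def first_time(self, task_id: str) -> bool:
        """Return True the first time an id is offered, False afterwards."""
        if task_id in self._seen:
            return False
        self._seen.add(task_id)
        return True


seen = SeenTaskIds()
task_id = "b483a22f-31cc-4335-9709-86041baa8f05"
print(seen.first_time(task_id))  # first delivery of the message
print(seen.first_time(task_id))  # redelivery of the same id
```

A worker using such a check would skip the second delivery of the same id instead of scheduling it again.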
NOTE 2:
We can't set the visibility timeout too high, because if the worker actually dies, the message would take too long to be picked up by another worker. Setting it too low would expose this condition.
This seems to be happening to me too.
[2020-05-11 15:31:23,673: INFO/MainProcess] Received task: ee_external_attributes.tasks. recreate_specific_values[53046bd7-2a19-4f72-808f-d712eaecb0e8]
[2020-05-11 15:31:28,673: INFO/MainProcess] Received task: ee_external_attributes.tasks.recreate_specific_values[53046bd7-2a19-4f72-808f-d712eaecb0e8]
(I tweaked the task name in the logs for public posting.)
Due to uniqueness constraints, one of my workers throws an error partway through the task, and the other one succeeds.
I tried setting
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
That turned out not to help.
I'm using
celery==4.4.1
django-celery-results==1.2.1
And I'm using AWS SQS for the queue.
I do have a theory. Apparently my "Default Visibility Timeout" setting on my queue was only set to 5 seconds. It may be that the second worker pulled the job while the first was working on it, because it assumed the first worker had died. I upped the visibility timeout to 2 minutes, and it seems to be doing better. I had plenty of tasks that took 8-12 seconds, so 2 minutes may be overkill. But hopefully that solves it.
It may be that the second worker pulled the job while the first was working on it, because it assumed the first worker had died.
@JulieGoldberg, that would be a crummy way for Celery to handle jobs. A Celery worker should never start a job that another worker has pulled off the queue and is actively processing; I think that would be seriously broken. (But it's Celery, I'm surprised by nothing anymore.)
I have a similar problem with an application that is running in Kubernetes. In the Kubernetes instance, we have 10 workers (celery app instance) who consume the tasks from the Redis.
Symptoms:
The celery worker schedules an ETA task which is planned to run after 30 minutes. If the Kubernetes pod is rotated (the worker is killed by Kubernetes) or a newer version of the application is deployed (all workers are killed and new workers are created), all workers will take the scheduled task and start executing it at the defined time.
For the worker, I tried to set different values of visibility_timeout, from several hours up to one year, but the result was still the same. The same behavior occurred with the setting enable_utc = True, or with a reduction to worker_prefetch_multiplier = 1.
I don't know if this will help anyone but this was my issue:
I had tasks (report generation) that were being run when a page was loaded via GET. For some reason (something to do with favicons) Chrome would send 2 GET requests on every page load, triggering the task twice.
GET requests are supposed to be side effect free, so I turned them all into forms that you submit and the issue was resolved.
It may be that the second worker pulled the job while the first was working on it, because it assumed the first worker had died.
@JulieGoldberg, that would be a crummy way for Celery to handle jobs. A Celery worker should never start a job that another worker has pulled off the queue and is actively processing; I think that would be seriously broken. (But it's Celery, I'm surprised by nothing anymore.)
Instead of complaining, you can help us fix the issue by coming up with a solution and a PR.
I have a similar problem with an application that is running in Kubernetes. In the Kubernetes instance, we have 10 workers (celery app instance) who consume the tasks from the Redis.
Symptoms:
The celery worker schedules an ETA task which is planned to run after 30 minutes. If the Kubernetes pod is rotated (the worker is killed by Kubernetes) or a newer version of the application is deployed (all workers are killed and new workers are created), all workers will take the scheduled task and start executing it at the defined time.
@elMateso I faced similar issues with Airflow deployment on k8s (consumers on pods and redis as a queue). But I was able to make the deployment stable and working as expected, maybe those tips will help you:
https://www.polidea.com/blog/application-scalability-kubernetes/#tips-for-hpa
Facing the same here.
It doesn't seem to be a problem with any timing configuration (visibility timeout, ETA, etc.), for me at least. In my case the executions are microseconds apart. I didn't find out how Celery actually ACKs a message, but if it works perfectly with RabbitMQ, it seems to be a problem with concurrency and ACK in Redis.
I am seeing the same issue and we are using Redis as broker too. Changing to rabbitMQ is not an option for us.
Has anyone tried using a lock to ensure the task is executed only once? Could that work?
e.g. https://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html#ensuring-a-task-is-only-executed-one-at-a-time
@ErikKalkoken we ended up doing exactly that.
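from functools import wraps

def semaphore(fn):
    @wraps(fn)
    def wrapper(self_origin, *args, **kwargs):
        # UID and AgreementsRedis are not defined in this snippet; they
        # presumably come from the poster's own code base.
        cache_name = f"{UID}-{args[0].request.id}-semaphore"
        agreement_redis = AgreementsRedis()
        # SET with nx=True succeeds only if the key does not exist yet;
        # ex=30 expires the lock after 30 seconds.
        if not agreement_redis.redis.set(cache_name, "", ex=30, nx=True):
            raise Exception("...")
        try:
            return fn(self_origin, *args, **kwargs)
        finally:
            agreement_redis.redis.delete(cache_name)
    return wrapper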
The code above is not used for Celery, but guarding against multiple execution in Celery follows the same logic: you just need to get the task_id and set the cache key. So far it is working fine.
Can someone check this PR? https://github.com/vinayinvicible/kombu/commit/a755ba14def558f2983b3ff3358086ba55521dcc