Hangfire: RecurringJobManager doesn't respect queues when using multiple servers

Created on 3 Jul 2017 · 17 Comments · Source: HangfireIO/Hangfire

I have a situation where multiple Hangfire job servers share one database. Each server has a unique name and a unique queue name that it uses to queue jobs. Each server also has its own recurring jobs. Everything else works fine, but I've noticed that Hangfire tries to execute recurring jobs on the wrong job servers, ones that aren't configured to use those queues.

E.g. I have two job servers, A and B.
A is configured to use queue "queueA" and B is configured to use queue "queueB". I queue a recurring job in server A in queue "queueA" to run every 5 minutes. Every 5 minutes job server A runs the job, but also server B tries to run the job even though it hasn't been configured to do so.

This seems like a bug to me.

Relevant source:
https://github.com/HangfireIO/Hangfire/blob/master/src/Hangfire.Core/Server/RecurringJobScheduler.cs#L156

I'm getting an exception: "_Recurring job '[job name]' can not be scheduled due to job load exception. Hangfire.Common.JobLoadException: Could not load the job. See inner exception for the details. ---> System.IO.FileNotFoundException: Could not load file or assembly_"
This fails because the job is from a different assembly than the current job server. The queue check seems to happen after job loading and it should be moved to happen before loading the job to prevent this.
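The ordering problem described above can be illustrated with a small toy model. This is a minimal sketch in Python, not Hangfire's actual code: `Server`, `RecurringJob`, `JobLoadError`, and both scheduling functions are hypothetical names invented for the illustration, and the "load first, check queue second" ordering is assumed from the description in this report.

```python
from dataclasses import dataclass


class JobLoadError(Exception):
    """Raised when a job's type cannot be resolved on this server (hypothetical)."""


@dataclass(frozen=True)
class RecurringJob:
    job_id: str
    queue: str
    type_name: str  # stands in for an assembly-qualified type name


@dataclass(frozen=True)
class Server:
    name: str
    queues: frozenset
    known_types: frozenset  # types whose assemblies are deployed on this server

    def load(self, job):
        # Simulates deserializing a job whose assembly may be missing here.
        if job.type_name not in self.known_types:
            raise JobLoadError(f"could not load {job.type_name} on server {self.name}")
        return job


def schedule_load_first(server, jobs):
    """Assumed current ordering: load every job, then look at its queue."""
    scheduled, errors = [], []
    for job in jobs:
        try:
            loaded = server.load(job)
        except JobLoadError as exc:
            errors.append(str(exc))  # logged even though the job is not ours
            continue
        if loaded.queue in server.queues:
            scheduled.append(loaded.job_id)
    return scheduled, errors


def schedule_queue_first(server, jobs):
    """Proposed ordering: skip foreign queues before attempting to load."""
    scheduled, errors = [], []
    for job in jobs:
        if job.queue not in server.queues:
            continue  # never touch jobs that belong to another server
        try:
            server.load(job)
        except JobLoadError as exc:
            errors.append(str(exc))
            continue
        scheduled.append(job.job_id)
    return scheduled, errors


jobs = [
    RecurringJob("report-a", "queuea", "AppA.ReportJob"),
    RecurringJob("cleanup-b", "queueb", "AppB.CleanupJob"),
]
server_b = Server("B", frozenset({"queueb"}), frozenset({"AppB.CleanupJob"}))

print(schedule_load_first(server_b, jobs))
print(schedule_queue_first(server_b, jobs))
```

In this model both orderings schedule only `cleanup-b` on server B, but the load-first variant also records a load error for `report-a`, a job server B was never meant to handle. Checking the queue first avoids that error entirely.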

Z1ni

👍17 😕3

Most helpful comment

I have seen this behaviour as well, and it really bloated our logs until we filtered those entries out. But surely this is wrong behaviour. How else could one partition job execution in a micro-service architecture with a single monitoring dashboard?

mikanyg on 30 Aug 2017

👍9

All 17 comments

Seems that this is a duplicate of #908. I will close this if the issue gets some activity.

Z1ni on 3 Jul 2017

Yeah, there seems to be something screwed up with queues and multiple servers in HF. Things seem to work OK on my local machine but break when I move my code to our production cluster.

mikechamberlain on 8 Aug 2017

👍1

@odinserj Any thoughts on how to make HF behave well in a partitioned multi-server setup? I'm considering doing a PR to fix it, if possible, but I would like your thoughts on the issue and a possible fix first.

mikanyg on 19 Jan 2018

@odinserj is there an official opinion about this behavior?

igor-moskvitin on 27 Feb 2018

👍7

Same thing is happening for us. Seems like a clear bug. I'm wondering: if the execution class for queue A's task exists on the server for queue B, will the server go ahead and process the task from the wrong queue?

timavaza on 20 Sep 2018

This is a big issue for my company as well. It's causing many of our jobs to be delayed unnecessarily, since they only have a 50/50 chance of hitting the correct queue. Any updates on this would be appreciated.

chris1411 on 21 Feb 2019

Big issue for us as well. Same scenario as OP. The first time, the job is run by the correct server/worker (separate queues), but when it fails and is retried, half of the time it ends up trying to run on the wrong worker. We would like to keep Hangfire on one database and one dashboard. Since it's such an old case, I have little hope it will get any solution at all.

Finestmedia on 27 Mar 2019

👍1

Any update on this issue?

Codearella on 28 Oct 2019

👀1

I would have loved to use queues, but the randomness of this issue risks the job scheduler throwing and leaving the job in a permanent not-enqueued state.
The only thing that solved this problem for me was using a separate database per application.

Compl1cation on 11 Nov 2019

@odinserj could you please consider including this fix in one of the next releases? Very unexpected behaviour. We ran into it a couple of days ago in production, after more than 3 years with Hangfire.

ZlobnyiSerg on 14 May 2020

I also think https://github.com/HangfireIO/Hangfire/issues/595 is a duplicate of this.

ZlobnyiSerg on 14 May 2020

👍2

@odinserj could we at least know whether there is any plan to fix this behavior, please?

berdem on 27 May 2020

👍3

Same here :(

alefcarlos on 18 Jun 2020

And the same problem for me :(

kbialkowski on 29 Jul 2020

I have this solution running with hyper care for a few months now with ~25 jobs in 4 different queues and 2 servers not sharing code base. Seems to be working fine up to now.