Hangfire: RecurringJobManager doesn't respect queues when using multiple servers

Created on 3 Jul 2017 · 17 Comments · Source: HangfireIO/Hangfire

I have a situation where multiple Hangfire job servers use one database. Each server has a unique name and a unique queue name that it uses to queue jobs. Each server also has its own recurring jobs. Everything else works fine, but I've noticed that Hangfire tries to execute recurring jobs on the wrong job servers, ones that aren't configured to use those queues.

E.g. I have two job servers, A and B.
A is configured to use queue "queueA" and B is configured to use queue "queueB". I queue a recurring job in server A in queue "queueA" to run every 5 minutes. Every 5 minutes job server A runs the job, but also server B tries to run the job even though it hasn't been configured to do so.

This seems like a bug to me.

Relevant source:
https://github.com/HangfireIO/Hangfire/blob/master/src/Hangfire.Core/Server/RecurringJobScheduler.cs#L156

I'm getting an exception: "_Recurring job '[job name]' can not be scheduled due to job load exception. Hangfire.Common.JobLoadException: Could not load the job. See inner exception for the details. ---> System.IO.FileNotFoundException: Could not load file or assembly_"
This fails because the job is from a different assembly than the current job server. The queue check seems to happen after job loading and it should be moved to happen before loading the job to prevent this.
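The ordering problem described above can be illustrated with a toy model. This is a sketch in Python, not Hangfire's actual code: the `Server`, `RecurringJob`, and `JobLoadError` names, the `schedule_load_first` / `schedule_queue_first` methods, and the `queue_a` / `AppA.Jobs` values are all hypothetical and exist only to show why checking the queue before loading the job would avoid the exception on a server that does not have the job's assembly.

```python
from dataclasses import dataclass, field


class JobLoadError(Exception):
    """Stands in for Hangfire's JobLoadException in this toy model."""


@dataclass(frozen=True)
class RecurringJob:
    job_id: str
    queue: str
    assembly: str  # assembly that contains the job's code


@dataclass
class Server:
    name: str
    queues: set
    assemblies: set  # assemblies this server is able to load
    enqueued: list = field(default_factory=list)

    def load(self, job):
        # Loading fails when the job's assembly is not deployed on this server.
        if job.assembly not in self.assemblies:
            raise JobLoadError(f"{self.name}: could not load '{job.assembly}'")
        return job

    def schedule_load_first(self, job):
        # Order described in the report: load the job, then look at the queue.
        loaded = self.load(job)
        if loaded.queue in self.queues:
            self.enqueued.append(loaded.job_id)

    def schedule_queue_first(self, job):
        # Proposed order: skip jobs for foreign queues before loading anything.
        if job.queue not in self.queues:
            return
        self.enqueued.append(self.load(job).job_id)


def demo():
    # Job registered by server A; server B has neither the queue nor the assembly.
    job = RecurringJob("cleanup", queue="queue_a", assembly="AppA.Jobs")
    server_b = Server("B", queues={"queue_b"}, assemblies={"AppB.Jobs"})
    try:
        server_b.schedule_load_first(job)
        load_first_failed = False
    except JobLoadError:
        load_first_failed = True
    server_b.schedule_queue_first(job)  # returns without touching the job
    return load_first_failed, server_b.enqueued
```

In this model, server B raises a load error under the load-first order and silently skips the job under the queue-first order, while a server that owns the queue and has the assembly enqueues the job as usual.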

All 17 comments

Seems that this is a duplicate of #908. I will close this if the issue gets some activity.

Yeah, there seems to be something screwed up with queues and multiple servers in HF. Things seem to work OK on my local machine but break when I move my code to our production cluster.

I have seen this behaviour as well, and it really bloated our logs until we filtered the messages out. But surely this is wrong behaviour. How else could one partition job execution in a micro-service architecture with a single monitoring dashboard?

@odinserj Any thoughts on how to have HF behave well in a partitioned multi server setup? I'm considering looking into doing a PR to fix it, if possible. But would like your thoughts on the issue and possible fix.

@odinserj is there an official opinion about this behavior?

Same happening for us. Seems a clear bug. I'm wondering: if the execution class for queue A's task exists on the server for queue B, will the server go ahead and process the task from the wrong queue?

This is a big issue for my company as well. It's causing many of our jobs to be delayed unnecessarily, since they only have a 50/50 chance of hitting the correct queue. Any updates on this would be appreciated.

Big issue for us as well. Same scenario as OP. The first time, the job is run by the correct server/worker (separate queues), but when it fails and is retried, half of the time it ends up trying to run on the wrong worker. We would like to keep Hangfire on one database and one dashboard. Since it's such an old case, I have little hope it will get any other solution at all.

Any update on this issue?

I would have loved to use queues, but the randomness of this issue risks the job scheduler throwing and leaving the job in a permanent not-enqueued state.
The only thing that solved this problem for me was using a separate database per application.

@odinserj could you please consider including this fix in one of the upcoming releases? Very unexpected behaviour. We ran into it a couple of days ago in production after more than 3 years with Hangfire.

I think https://github.com/HangfireIO/Hangfire/issues/595 is also a duplicate of this.

@odinserj could we at least know whether there is any plan to fix this behavior, please?

Same here :(

And the same problem for me :(

I have this solution running with hyper care for a few months now with ~25 jobs in 4 different queues and 2 servers not sharing code base. Seems to be working fine up to now.

@GeXiaoguo I'm gonna try it! Thank you! Today I need to update job library references in all Servers, but this is easy to forget.
