Hi, just a "simple" question: how and when does bull try to process a job again?
I'm currently facing this strange behaviour: I have a node.js cluster which spawns 2 workers, with concurrency set to 2, listening to the same queue (as in the example given in the bull documentation). A job is put on the queue, and the first worker starts processing it. After a couple of seconds, the second worker starts to process the exact same job, even though the first one hasn't finished.
I see no errors, no exceptions raised... I've attached listeners to the 'error', 'completed' and 'failed' events of the queue and nothing is happening. I've attached listeners to the cluster 'exit' and 'disconnect' events and nothing is happening...
If I spawn just 1 worker, the job is executed only once as it should.
I'm currently out of ideas, so any help or hint is greatly appreciated! :)
btw I've tried with bull 0.7.0 and 0.4.0 and it still happens, so I suppose it's something on my side (or maybe in some module I use, as it seems to happen when I parse some big files with cheerio), but I really have no idea why bull is calling my handler twice...
what do you do in your handler?
If you return a promise, make sure your function doesn't take a done callback.
queue.process(function(job) {
  // do stuff, then resolve (or reject) when the work is finished
  return Promise.resolve();
});
or
queue.process(function(job, done) {
  // do stuff
  done();
});
Otherwise could you check if your redis database has a key called: bull:queuename:<jobid>:lock where <jobid> is the actual jobid. This should be there as long as the job is "processing".
Hi, thanks for the suggestions. I've checked redis database:
127.0.0.1:6379> keys *
1) "bull:worker:5"
2) "bull:worker:5:lock"
3) "bull:worker:active"
4) "bull:worker:id"
the ":lock" key is there, but after a while it disappears, so I suppose it's the reason why the job starts again. Obviously I've no idea why it happens :)
I've added this code to my worker:
queue.on 'ready', ->
  console.log '--------->', cluster.worker.id, 'ready'
.on 'error', (error) ->
  console.log '--------->', cluster.worker.id, 'error', error
.on 'active', (job, jobPromise) ->
  console.log '--------->', cluster.worker.id, 'active'
.on 'progress', (job, progress) ->
  console.log '--------->', cluster.worker.id, 'progress', progress
.on 'completed', (job, result) ->
  console.log '--------->', cluster.worker.id, 'complete', result
.on 'failed', (job, err) ->
  console.log '--------->', cluster.worker.id, 'failed', err
.on 'paused', ->
  console.log '--------->', cluster.worker.id, 'paused'
.on 'resumed', (job) ->
  console.log '--------->', cluster.worker.id, 'resumed'
.on 'cleaned', (jobs, type) ->
  console.log '--------->', cluster.worker.id, 'cleaned'
and this is what gets logged:
---------> 1 active
---------> 1 ready
---------> 2 ready
---------> 2 active
so no other events, no completed jobs, etc., and there's just that one single job in the database
inside my handler I get the done callback, so no promises involved. Since the main computation is done inside an "async.parallel()" function, I've added a "return null" at the end of the function, just in case, but it seems to make no difference...
I'll try to investigate further as to why the lock key is deleted.
ok I probably found the cause of the issue.
If I understood the code correctly, bull's lock keys expire about every 2.5 seconds. One of the HTML files I'm processing is so big that cheerio's $("a[href]").each(function() { ... }) takes so long that the Node.js event loop can't run the renew timer before the other worker takes over...
I wonder how someone is supposed to build a program whose purpose is to run long and heavy tasks if they have to make sure that no single loop lasts more than a couple of seconds :confused:
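The effect described above can be reproduced without bull at all. The following is a minimal sketch, not bull's actual renew logic: `measureTimerDelay` is a hypothetical helper that schedules a timer (standing in for the lock renew timer) and then blocks the event loop with a synchronous busy loop (standing in for the big cheerio loop). The timer cannot fire until the loop has finished.

```javascript
// Hypothetical helper for illustration only; not part of bull.
// Schedules a timer due after `timerMs`, then blocks the event loop
// synchronously for `blockMs`, and reports when the timer really fired.
function measureTimerDelay(blockMs, timerMs) {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();

    // Stand-in for a lock renew timer.
    setTimeout(() => resolve(Date.now() - scheduledAt), timerMs);

    // Stand-in for a long synchronous loop over a big document.
    const end = Date.now() + blockMs;
    while (Date.now() < end) {
      // busy wait: nothing else can run on this thread
    }
  });
}

measureTimerDelay(300, 10).then((elapsed) => {
  console.log('timer due after 10 ms actually fired after', elapsed, 'ms');
});
```

If the real renew timer is delayed like this for longer than the lock's lifetime, the lock key expires and another worker is free to pick the job up.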
anyway, making the loop asynchronous with a setTimeout() seems to work, but AFAIK this isn't the best solution performance wise...
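One way to make such a loop asynchronous is to process the items in small batches and yield to the event loop between batches. This is only a sketch under assumptions: `processInChunks` is a hypothetical helper (not part of bull or cheerio), and it uses `setImmediate` rather than `setTimeout` to yield, which avoids the minimum timer delay while still letting pending timers run.

```javascript
// Hypothetical helper for illustration only; not part of bull or cheerio.
// Applies `handleItem` to every item, `chunkSize` items at a time, and
// yields to the event loop between chunks so that pending timers
// (for example a lock renew timer) get a chance to run.
function processInChunks(items, chunkSize, handleItem) {
  return new Promise((resolve, reject) => {
    const results = [];
    let index = 0;

    function runChunk() {
      try {
        const end = Math.min(index + chunkSize, items.length);
        for (; index < end; index++) {
          results.push(handleItem(items[index], index));
        }
      } catch (err) {
        reject(err);
        return;
      }

      if (index < items.length) {
        setImmediate(runChunk); // yield, then continue with the next chunk
      } else {
        resolve(results);
      }
    }

    runChunk();
  });
}

// Usage with plain data; the chunk size of 2 is arbitrary.
processInChunks(['a', 'b', 'c'], 2, (s) => s.toUpperCase())
  .then((out) => console.log(out));
```

The chunk size is a trade-off: larger chunks mean less scheduling overhead, smaller chunks mean the event loop is never blocked for long.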
@cvlmtg you are right, and as I was reading your comments I already had an idea of the cause of the problem. The root of the problem as I see it is: how can we know that a long-running process is indeed still running and has not hung? It is quite difficult, not to say impossible, to reliably deduce whether a worker is still working or not. So it is not entirely unreasonable to require that a job must not block the event loop for more than a second or two. I am open to suggestions on how to solve this better...
@manast I understand your point of view and I don't currently have a better idea.
Anyway, I can think of two ways to solve my problem. One is to make the renew timer configurable. The default value is good enough most of the time, but for corner cases it might be useful to customize it (my program is basically a background daemon with no user interaction at all, jobs are added by an express server running in another process, so I suppose that an event loop running every 10 seconds or more might not be a big problem).
The other solution is to change a synchronous loop into an asynchronous one with the use of setTimeout :smile: This might not be elegant, but it's doable. What really bothers me is the fact that you need to care about loop timing, but as I said I currently have no better ideas.
@manast looks like there's no easy way to loop asynchronously over an array of cheerio objects. $(selector).toArray() or $(selector).get() return an array of "dumb" objects, i.e. they are not cheerio objects with all the setters etc. I could somehow loop over these dumb objects and then use their id to query the document again with $(id).attr('name', value) etc... but I think I'm starting to add just a bit too much code, which, moreover, I have to repeat for every handler that has to do some kind of processing on an HTML file, just in case it's another big one...
So I'm thinking about adding an option to bull to disable the TTL on the lock key. I understand that this means that it's now the program's responsibility to handle stuck jobs etc., but this is probably still better than having jobs that run more than once, or that need to be slower or more complicated than necessary just to give a timer the chance to run every couple of seconds...
Would you accept such a pull request? I'll obviously follow style guidelines, update the docs etc... I just wanted to know if you are ok with such a change or if you think it can be better resolved in another way.
@manast just curious, will this affect long running jobs that do not have long blocking procedures? For example, an email runner that loops through a lot of stuff
@manast why not enable the client of the api / creator of the job to specify their own TTL? they ought to have a reasonable idea how long a job should take vs. being considered hung
@evanhuan8 a longer TTL should only affect the case where a queue has started a job and then crashed: with a short TTL the job is restarted sooner than with a long one. Since this is an edge case, having a longer TTL should not be a problem. @davisford, I agree that this could be configurable.
Pretty sure I've just hit this issue where some jobs seem to be duplicated – a lot of my jobs are heavy, long-running processes so it fits with what has been described here.
What's the suggested fix for now? Disabling the TTL in Bull?
As things stand, I'm having to write my own workarounds by keeping a record of which jobs have already been run and not allowing a duplicate to occur if a previous job run can be found. It's not an ideal solution.
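For reference, a workaround of this kind could look roughly like the sketch below. It is an assumption about the general shape, not the code actually used here: `createRunOnceGuard` is a hypothetical helper, and it keeps its record in memory, so it only protects against duplicates within a single process. Workers in separate processes would need a shared store instead.

```javascript
// Hypothetical in-memory guard for illustration only; not part of bull.
// Remembers which job ids have already been started and skips repeats.
function createRunOnceGuard() {
  const seen = new Set();

  return function runOnce(jobId, work) {
    if (seen.has(jobId)) {
      // A previous run of this job was found: do not run it again.
      return { skipped: true };
    }
    seen.add(jobId);
    return { skipped: false, result: work() };
  };
}

// Usage: the second call with the same id is skipped.
const runOnce = createRunOnceGuard();
console.log(runOnce('5', () => 'first run'));
console.log(runOnce('5', () => 'second run'));
```

Note that the id is recorded before the work runs, so a job that throws is still treated as already run; whether that is desirable depends on the retry policy.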
@robhawkes you can check my fork of bull, where I've added a no_ttl branch. I've not looked into it recently for two reasons: the first one is that the server that's using my fork is used only sporadically, the second one is that it seems it has run fine so far. to sum up: it seems it works, but I cannot make any guarantee :)
@cvlmtg: Thanks for that – I tried it out and it didn't fix my problem, which was a shame. Perhaps I have a slightly different duplication bug, or disabling the TTL doesn't fix it for me.
Either way, I've since tried my exact same process but with Kue instead of Bull. It works perfectly without duplication so I'm going to use that for the time-being. :)
See https://github.com/OptimalBits/bull/pull/254 that adds a 'stalled' event that's very useful for debugging this:
queue.on('stalled', function(job){
  console.log('stalled job, restarting it again!', job.queue.name, job.data);
});
You can also set the LOCK_RENEW_TIME like this. I have mine set to a minute to give the event loop a lot of overhead, with the tradeoff that (real) stalled jobs will be picked up a minute later:
var Queue = require('bull');
var queue = new Queue('my queue');
queue.LOCK_RENEW_TIME = 60 * 1000; // 1min
@robhawkes if you can post a process function that reproduces the problems we can surely figure out what is going on.
Regarding long running processes, maybe fibers could be of some help:
https://www.npmjs.com/package/fibers
I will close this since we did not get more info from the reporter, reopen if needed.
I was seeing this issue.
When I run the code below, the test function runs multiple times and logs the same job id each time. Thanks for the tip @bradvogel.
testQueue.add({},
  {
    attempts: 2,
    timeout: 120000
  }).then(function(job) {
    console.log("ABS -- job" + job);
  });
testQueue.process(5, function(job, done){
  test(job);
});
function test(job) {
  console.log("ABS -- job: ", job.jobId);
  for( var i = 0; i < 2; i++) {
    (function (i) {
      setTimeout(function () {
        console.log("ABS -- i: " + i);
      }, 5000 * i);
    }(i));
  }
}
@manast If you have time I can show you some code with this issue on Skype.
I am running into the same issue with Bull 3.11.0.
My job processor needs to call a couple of external APIs, which take several seconds. I notice that if I comment out those API calls, the job only runs once. But if I enable those API calls, the job is always processed twice. I set queue.LOCK_RENEW_TIME = 120 * 1000; but got the same result. Any suggestions on what I should do next?
@yangju can you write a processor where the api calls are replaced by "delays" and post a bug report where the issue is reproduced?
After reviewing my code, I found that the job was not being processed twice. My code called the callback and then continued executing after the callback had been made. It was a false alarm.
Sorry about this. Please close this thread.
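For anyone who lands here with the same symptom, the pattern described above usually looks something like the following sketch. The handlers and the `url` field are made up for illustration; the point is only that calling the callback does not stop the function, so without a `return` the rest of the handler still runs.

```javascript
// Hypothetical handlers for illustration only.

// Buggy: after reporting the error, execution falls through and the
// callback is invoked a second time.
function buggyHandler(job, done) {
  if (!job.data.url) {
    done(new Error('missing url')); // no return here
  }
  done(null, 'processed');
}

// Fixed: returning at the callback ends the handler at that point.
function fixedHandler(job, done) {
  if (!job.data.url) {
    return done(new Error('missing url'));
  }
  return done(null, 'processed');
}

// Small harness that counts how often a handler invokes its callback.
function countCalls(handler, job) {
  let calls = 0;
  handler(job, () => { calls += 1; });
  return calls;
}

console.log('buggy:', countCalls(buggyHandler, { data: {} }));
console.log('fixed:', countCalls(fixedHandler, { data: {} }));
```

From the outside this can look like the job being processed twice, even though the queue only delivered it once.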