Bull: removeOnFail not working as expected

Created on 10 Jan 2019 · 11 Comments · Source: OptimalBits/bull

Description

Our setup is pretty simple. We add a simple job with the removeOnFail option set to true. Most of the time the job is removed after it fails, but sometimes the failed job sticks around. How do we fix this? (It is not easy to reproduce.)
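For reference, a minimal sketch of the kind of setup described above. The helper name addDisposableJob, the queue name and the Redis URL are illustrative assumptions and do not come from the original report; only the removeOnFail job option is taken from it.

```javascript
// Hypothetical helper: `queue` is assumed to be a Bull queue instance.
// removeOnFail: true asks Bull to delete the job once it has failed.
function addDisposableJob(queue, data) {
  return queue.add(data, { removeOnFail: true });
}

// Possible usage (requires the bull package and a running Redis):
//   const Queue = require('bull');
//   const queue = new Queue('example', 'redis://127.0.0.1:6379');
//   addDisposableJob(queue, { userId: 42 });
```

With this option set, a failed job should leave no key behind in Redis, which is the expectation the rest of the thread is about.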

Bull version

3.5.1

cannot reproduce

Source

iamgr0ot

All 11 comments

@manast - This happens when the task goes into the stuck state. Do you know why tasks go into the stuck state?

iamgr0ot on 7 Feb 2019

I have this issue as well! I have to manually remove the queue to get it going again.

fluffybunnies on 7 Feb 2019

@iamgr0ot in fact there is no stuck state. I think you mean that the job has failed because it stalled too many times. In that case, maybe the logic for removing failed jobs is not working. I will have to look into it. But I recommend that you find out why this happens: jobs that fail this way are normally caused by processors that hold the event loop for too long, which you should avoid anyway.

manast on 7 Feb 2019

@manast - I queried the job state from Bull's Redis data and this is what I get:
{"state":"stuck"}. This "stuck" state is not documented, and it is hard to figure out what happened. It is not easy to reproduce, but we hit it several times out of thousands of iterations. When a job reaches this state, the queue does not allow us to remove the job, so the job is blocked permanently. We have been isolating those jobs in their own queues, and we end up deleting the entire queue when this happens.

iamgr0ot on 7 Feb 2019

@iamgr0ot you are probably keeping the event loop busy for too long. You need to make sure that your process functions do not use too much CPU in one stretch; if they do, split the job up with calls to process.nextTick().
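A minimal sketch of the "split the job up" idea, under some assumptions. It uses setImmediate rather than the process.nextTick() mentioned above, because in Node.js nextTick callbacks run before pending I/O and timers, while setImmediate lets them run first. The names yieldToEventLoop and processInChunks, and the chunk size, are hypothetical.

```javascript
// Resolve on the next iteration of the event loop, so pending timers and I/O
// callbacks get a chance to run before the next chunk starts.
function yieldToEventLoop() {
  return new Promise((resolve) => setImmediate(resolve));
}

// Hypothetical helper: applies a synchronous, CPU-bound handleItem to every
// item, pausing after each chunk instead of blocking for the whole array.
async function processInChunks(items, handleItem, chunkSize = 100) {
  const results = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    for (const item of items.slice(i, i + chunkSize)) {
      results.push(handleItem(item));
    }
    await yieldToEventLoop();
  }
  return results;
}

// Possible usage inside a Bull processor (job.data.rows and transformRow are
// hypothetical):
//   queue.process((job) => processInChunks(job.data.rows, transformRow));
```

The chunk size is a trade-off: smaller chunks keep the event loop more responsive but add scheduling overhead.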

manast on 7 Feb 2019

Got the same bug here. I use the same jobId to prevent duplicate jobs, so this is critical for my case. This is what I got in Redis after the job got stuck:

IvanMMM on 19 Feb 2019

@IvanMMM can you provide a bit more background on your case, and if possible some code that reproduces the issue on your end?

manast on 20 Feb 2019

@manast I'm afraid not. It happens randomly with some tasks, and after flushing the tasks it works fine. I also noticed that failed tasks like the one I sent in my previous message do not show up in any list, and the only way to remove them is to delete them directly from Redis.

IvanMMM on 3 Mar 2019

👍2

Ok, I will review the code that handles failed tasks to see if there is any hazard that may explain the issue.

manast on 3 Mar 2019

❤3

As a temporary workaround, I wrote a script that searches for stuck tasks and removes them from Redis.

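// Assumes Bluebird (Promise.each / Promise.map) and a Redis client promisified
// with Bluebird, which provides keysAsync / evalAsync.
const Promise = require('bluebird');

// Lua: delete the key only if it is a hash with a "stacktrace" field,
// which is how this script recognizes a failed job.
const DELETE_IF_FAILED =
    'if redis.call("TYPE", KEYS[1])["ok"] ~= "hash" then ' +
    'return; ' +
    'end; ' +
    'if redis.call("HEXISTS", KEYS[1],"stacktrace") == 1 then ' +
    'return redis.call("DEL", KEYS[1]); ' +
    'end';

async function removeStuckJobs() {
    let removed = 0;
    let allJobs = [];
    // Collect every key that belongs to one of the queues.
    // Note: KEYS scans the whole keyspace and blocks Redis while it runs.
    await Promise.each([
        'postUpdater',
        'collector',
        'engager',
        'statistics',
        'proxy',
        'master'
    ], async (taskName) => {
        allJobs = allJobs.concat(await global.conn.redis.keysAsync(`sstiv:${taskName}:*`));
    });
    // Run the Lua check against each key and count the deletions.
    await Promise.map(allJobs, async (key) => {
        const result = await global.conn.redis.evalAsync(DELETE_IF_FAILED, 1, key);
        if (result === 1) removed++;
    });
    console.log(`Removed ${removed}/${allJobs.length} jobs`);
}

// Run once a minute; log errors instead of leaving the rejection unhandled.
setInterval(() => removeStuckJobs().catch(console.error), 1000 * 60 * 1);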

IvanMMM on 6 Mar 2019

I also noticed exactly the same issue. The main key is in Redis (and if I run q.getJob() for it, the job is returned!), but the job is not visible in any list, e.g. q.getJobs() does not return it.

This happens with failed jobs only. In the snippet above, for example, Ivan identifies such Redis keys by the "stacktrace" field in the hash as a workaround. I have never seen stuck succeeded jobs; I only see jobs that were aborted by an exception (the exception may be unrelated to Redis, as in my case).

It sounds like a non-atomic operation somewhere that leaves the main job key (i.e. its name) in Redis but wipes the job from all lists.
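A small diagnostic sketch for the symptom described above: the job is returned by q.getJob() but is missing from q.getJobs(). The helper name isOrphanedJob and the list of state names are assumptions and may need adjusting for your Bull version. Since it loads every listed job, it is meant for occasional debugging rather than production use.

```javascript
// State names passed to getJobs(); an assumption, check them against your Bull version.
const JOB_STATES = ['waiting', 'active', 'delayed', 'completed', 'failed', 'paused'];

// Hypothetical helper: true when the job hash exists but none of the state
// lists contains the job. `queue` is assumed to be a Bull queue instance.
async function isOrphanedJob(queue, jobId) {
  const job = await queue.getJob(jobId);
  if (!job) return false; // no hash at all, so nothing is stuck
  const listed = await queue.getJobs(JOB_STATES);
  return !listed.some((j) => j && String(j.id) === String(jobId));
}

// Possible usage:
//   isOrphanedJob(queue, 'my-job-id').then((orphaned) => console.log(orphaned));
```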