Framework: [5.1] [5.2] Multiple queue workers using Redis

Created on 27 Apr 2015 · 53 comments · Source: laravel/framework

Possible cause/result of #8576.

It seems that using several workers (php artisan queue:work --daemon ...) via supervisor causes the jobs to be picked up by multiple workers at the same time, causing all kinds of havoc.

In my case I have 10 workers that download a file from an FTP server, save it locally, insert a row in a database and then delete the original file from the FTP site.

I've experienced several times now that duplicate rows appear in the database, and sometimes I get an error informing me that the file on the FTP site doesn't exist, even though the row appears in the database, meaning it has already been processed. I am also pummelled with the Redis errors found in #8576 (Predis\Transaction\AbortedMultiExecException).

I should say that this doesn't happen all the time, only for some jobs...

FYI, I've temporarily reduced the number of workers to 1 and I haven't had a user error since, although I do still get the Predis errors.

I've tried to reproduce this locally but cannot, it seems that it works as expected. I don't get the Predis error either. The only difference is that I'm using supervisor in production and 10 terminals on my Mac...

bug

Most helpful comment

Maybe I missed it but it would be nice if this was documented in the docs. I did not realize what expire does until this issue :)

All 53 comments

+1

@m4tthumphrey Any luck resolving this issue? I've seen similar behavior when running a large worker farm (50+ workers).

Looks like a limitation with that driver. Perhaps use beanstalk - it's more suited to this use.

@raja Yeah, I definitely would not use Redis to power mission-critical queues, at least not within Laravel. I switched to Beanstalkd and have had a 100% success rate. Also, the author of Redis is working on a dedicated message queue system, Disque.

In the case of multiple hosts, I would not advise using beanstalkd, because if one machine goes down you lose all the jobs on that machine.

Maybe implement a locking algorithm in the RedisQueue.php driver's pop function (perhaps using blpop as described here: http://www.justincarmony.com/blog/2012/01/10/php-workers-with-redis-solo/), or use something like https://github.com/ronnylt/redlock-php.

if I come up with something I'll let you know.
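For illustration, the locking idea could look roughly like this (a sketch only, assuming Predis; the key name, token and TTL are made up and this is not Laravel's driver code):

    <?php
    // Sketch: serialize the pop with a short-lived Redis lock (SET NX EX).
    $redis = new Predis\Client();

    $token    = uniqid('', true);
    $acquired = $redis->set('queues:default:lock', $token, 'EX', 10, 'NX');

    if ($acquired) {
        try {
            $payload = $redis->lpop('queues:default'); // only the lock holder pops
            // ... reserve and process $payload ...
        } finally {
            // Naive release; a real implementation (e.g. redlock-php) compares the
            // token and deletes atomically in a Lua script.
            if ($redis->get('queues:default:lock') === $token) {
                $redis->del(['queues:default:lock']);
            }
        }
    }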

I have experienced the problem with Redis, too. Still don't know why the issue has been closed.

If it's related to Redis then it should be mentioned in the docs that multiple workers cannot be run when the queue driver is set to redis.

I had this problem; it happens for any job that runs for more than 60 seconds.

I have a thread on Laracasts that explains why this happens and how to avoid it, along with a lot of information about how Laravel handles queues, since I had to dig through the source code quite a bit to understand how it works.

Check it for more information: https://laracasts.com/discuss/channels/laravel/running-multiple-queue-workers

@henriquebim Amazing work man! I knew it was going to be to do with Laravel's implementation and not Redis itself. I think I did actually set the expire time to 0 or null and still encountered the issue.

I'm quite astounded that this has been closed though as it's definitely a bug within Laravel and at the very least should be mentioned in the documentation.

I'm still running with beanstalkd however with no problems whatsoever.

@m4tthumphrey if you set it to 0 you will make it worse: the code checks with is_null, and is_null(0) is false. So with it set to 0 virtually every job can be duplicated (any job that takes more than 0 seconds), though this depends on the number of jobs and workers running.

Setting it to null, or to a really high number (so that no job runs for longer than that many seconds), fixes the issue, BUT any job that fails will get stuck (it won't go to failed_jobs nor run again), or will take that many seconds to be retried/put on failed_jobs.

For those that can use beanstalkd, I suggest doing so.
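For those staying on Redis, the change I mean is in config/queue.php (a sketch only; note that 0 is not equivalent to null because the driver checks is_null):

    'redis' => [
        'driver'     => 'redis',
        'connection' => 'default',
        'queue'      => 'default',
        // 'expire' => 0,    // worse: is_null(0) is false, so every job is
        //                   // immediately eligible to be migrated back
        'expire'     => null, // reserved jobs are never migrated back,
                              // but failed jobs get stuck in :reserved
    ],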

Will this be fixed or .. will this issue be reopened?

I agree with @GrahamCampbell , this is not a bug.

But it is defective.

I was also wondering why Laravel was not using rpoplpush when I was porting the Laravel RedisQueue to Symfony, since the Redis documentation suggests it for a reliable queue: http://redis.io/commands/rpoplpush#pattern-reliable-queue
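For reference, the pattern from that page looks roughly like this (a sketch with Predis and made-up key names, not Laravel's implementation):

    <?php
    // RPOPLPUSH reliable-queue pattern: the move to a processing list is atomic,
    // and a crashed worker leaves its job visible in :processing for a reaper.
    $redis = new Predis\Client();

    $job = $redis->rpoplpush('queues:default', 'queues:default:processing');

    if ($job !== null) {
        // ... handle the job ...

        // Acknowledge: drop it from the processing list once it's done.
        $redis->lrem('queues:default:processing', 1, $job);
    }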

This is not a bug, but I vote to reopen + 1

Does this occur if you use another queue driver or is it particular to Redis?

I would also be curious to know the values you are using for sleep and timeout on your queue worker configuration.

@taylorotwell Thanks for reopening this.

It is particular to the Redis driver. I have been using beanstalkd ever since with no issue. The command that is being run is

php artisan queue:work --daemon --sleep=10 --tries=10 --queue=process_calls

The timeout option is only applicable when using the queue:listen command.

Hmm, does anyone have an idea of what we can do to fix the issue with the Redis driver?

@taylorotwell Looks like @phoenixgao might be on to something with rpoplpush

Yup, rpoplpush looks like a viable solution

To summarize so people don't have to read the whole Laracasts thread posted above (the most complete post there is https://laracasts.com/discuss/channels/laravel/running-multiple-queue-workers/replies/60762, and also https://laracasts.com/discuss/channels/laravel/running-multiple-queue-workers/replies/92055):
The problem is that the RedisQueue driver stores every started job in a :reserved queue, and before executing any job it migrates all jobs older than 60 seconds (by default) from the :reserved queue back to the main queue. That means that if your job runs for more than 60 seconds and you have more than one worker, the job will eventually be put back on the main queue and picked up to run again.

My fix was to set the expire value to NULL (overriding the 60-second default), which means the jobs are never migrated. The problem with this is that failed jobs will never rerun; they will be stuck in the :reserved queue forever. So it's not a fix the framework could ship in its source code (maybe a hint in the documentation).

As for a permanent fix, I don't have much of a clue. I think it would need a way to check whether the job crashed (to move it to the "retry" queue) or whether the job is still running (to avoid a rerun), so it wouldn't have to rely on time. I don't think rpoplpush would help in this situation, because I believe the jobs would still use the main, :reserved and :delayed queues in the same way.

I will check how the other drivers handle those things to see if I can come up with a solution.
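For reference, the behaviour described above boils down to something like this (heavily simplified pseudo-PHP of the 5.1/5.2 driver's pop, not the verbatim source):

    // Simplified sketch, not the actual framework code.
    public function pop($queue)
    {
        // 1. Any reserved job whose score (reserved-at + expire) is in the past
        //    is pushed back onto the main queue -- this is the duplication window.
        if (! is_null($this->expire)) {
            $this->migrateAllExpiredJobs($queue);
        }

        // 2. Take the next job and park a copy in the :reserved sorted set.
        $job = $this->getConnection()->lpop($queue);

        if (! is_null($job)) {
            $this->getConnection()->zadd($queue.':reserved', time() + $this->expire, $job);
        }

        return $job;
    }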

That means that if your job runs for more than 60 seconds and you have more than one worker, the job will eventually be put back on the main queue and picked up to run again.

Just set the timeout to be much larger, then?

I've just checked one of my apps, and I have a timeout set to 86400 rather than 60.

@GrahamCampbell The point of that value is to retry the job (and maybe something else I didn't understand), so that means if any of your jobs fail, they would wait 24h before a retry. Whether retrying a job after a day makes sense depends on the use case (if it was generating a report, for example, the user would just hit the "generate" button again and again until they give up, and later would receive a bunch of those reports).

For my case it was fine that jobs would never be retried; it was something that should either be sent at the time or never again. So I set it to null and kept an eye on the log for errors; if there was an error I would stop the queues, fix it and start again.

For other use cases, it becomes a guessing game to pick a time that is high enough that no job runs longer than it, yet low enough that failed jobs are retried quickly enough for the retry to still make sense.

The point of that value is to retry the job (and maybe something else I didn't understand), so that means if any of your jobs fail, they would wait 24h before a retry

Only if they fail with a fatal error. If they throw an exception, they get put back on the queue again, I think... Ping @taylorotwell.

@GrahamCampbell Thanks for reopening this. Keeping it open does not mean you confirm it is a bug in Laravel; it just lets people keep an eye on it and do more investigation into why some people get this "error", and it gets closed only when someone gives a solution, explanation or conclusion, so new people who hit the same problem can learn what they did wrong. It helps the community.

@henriquebim I think you misunderstand the point of timeout. It's not the time to wait before retrying a failed job (that would be --delay); timeout is the value you think a job should never run longer than. A job that has been processing for longer than the timeout without finishing is considered dead, a zombie, and will be put back onto the queue to re-run (not "retried" in the code's terms; retrying is for jobs that threw an exception). If you have a mis-set timeout value (too small for your jobs), then those jobs being picked up again and again is exactly what is expected.

If your jobs should normally run in less than 5 seconds, but sometimes for whatever reason last more than 2 minutes, you should add a timeout limitation inside the job itself, e.g. add a timeout to your curl call.

In this case, even if you have only one worker, those timed-out jobs would be picked up again and again, since they are pushed back onto the tail of the queue. I think you need to investigate more to see whether the duplicated jobs are timed-out jobs; if they are, then it is actually working (for you).
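For example, capping a long transfer inside the job itself might look like this (plain curl; the URL and limits are illustrative only):

    <?php
    // Illustrative only: make the job fail fast instead of outliving 'expire'.
    $ch = curl_init('https://example.com/big-download');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // give up connecting after 10s
    curl_setopt($ch, CURLOPT_TIMEOUT, 50);        // hard cap the transfer below the 60s expire
    $body = curl_exec($ch);

    if ($body === false) {
        $error = curl_error($ch);
        curl_close($ch);
        // Throwing makes this a failed job (retried/failed), not an expired one.
        throw new RuntimeException('Transfer failed or timed out: '.$error);
    }

    curl_close($ch);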

I was suspecting that multiple workers compete to pop the same job at the same time (which is what @m4tthumphrey described in the first report), but I also doubt it, because in theory it shouldn't happen: lpop gets the job and removes it from the list atomically.

When I mentioned rpoplpush, it is because the Laravel worker (client) pops a job and then zadds it to :reserved; it is possible for a client to die between the pop and the zadd, so the job is totally lost (not reliable enough for some people).

So back to this thread: did any of you actually see jobs being picked up by different clients at the same time?

Then let's think more about the timeout case. If client B finds that a job has expired and wants to put it back onto the working queue, it should make sure client A is not still processing that job, by killing (restarting) A or forcing A to stop that job and move on to the next one, maybe?

This is not possible at the moment in the Laravel queue because the queue manager has no knowledge of the workers (host, pid, etc.), but there is room for improvement.

I've done something in Symfony to keep a list of all the clients' info; I'll see if I can come up with anything. @GrahamCampbell or maybe you have better ideas :p

@phoenixgao actually I'm talking about another value, called "expire" (you can change it in config/queue.php under the redis driver; I didn't find a way to pass it on the command line, maybe because it's Redis-specific).

When Laravel pops a job from the main queue, it zadds it to the :reserved queue with a score of timestamp + expire. Before processing the queue, Laravel calls "migrateAllExpiredJobs", which migrates jobs that have a score lower than the current timestamp (https://github.com/laravel/framework/blob/5.2/src/Illuminate/Queue/RedisQueue.php#L133).

That's exactly what causes jobs to be duplicated: while the first worker is busy processing the job and the second worker is idle, the job sits in :reserved. If the job takes more than 60 seconds (assuming the default), the second worker will move it from :reserved back to the main queue and then also start processing it.
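One way to watch this window in practice is to peek at the :reserved sorted set while a long job is running (a throwaway Predis snippet; the key assumes the default queue name):

    <?php
    // Throwaway diagnostic: list reserved jobs whose score is already in the past,
    // i.e. jobs that the next pop() on any worker would migrate back and run again.
    $redis = new Predis\Client();

    $expired = $redis->zrangebyscore('queues:default:reserved', '-inf', time());

    foreach ($expired as $payload) {
        echo 'Would be duplicated: '.substr($payload, 0, 80).PHP_EOL;
    }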

@henriquebim I was also talking about this. expire in your config is the max time a job is supposed to run. If your job needs more than 1 minute to finish, increase that value. It does not mean you need to wait longer to retry a FAILED job; you must understand the difference between a failed job and a timed-out (expired) job.

A failed job is one that threw an exception; failed jobs are retried immediately (or after --delay) once they fail.

The expire value is used here:
https://github.com/laravel/framework/blob/5.2/src%2FIlluminate%2FQueue%2FRedisQueue.php#L139

and it is read from the config here:

https://github.com/laravel/framework/blob/5.2/src%2FIlluminate%2FQueue%2FConnectors%2FRedisConnector.php#L50

For your case, I'm sure they designed it to work like this.

When worker B comes to the queue, it finds a job that has been running for more than 60 seconds (or whatever expire time is configured). Worker B doesn't know about the existence or status of other workers, so it thinks the job has timed out, and to make sure the job is finished at least once it migrates it back to the queue. This is the expected behavior; if you have jobs that run for a long time you should increase expire.

@phoenixgao Well, there is a --timeout param; if "expire" acts like --timeout, it should copy its value.

When I set expire to null, the jobs were never retried unless I moved them manually; that's why I'm talking about it working as a retry.

Also, the code acts just like that: if "expire" is exceeded, it assumes the job failed and moves it to be retried (or adds it to failed_jobs if it exceeds the max retries), i.e. it moves it back to the main queue and tries it again. That's how I see :reserved working: keep track of the job and retry it if needed. Maybe there is a "nice" fail and a "hard" fail then, where a "nice" fail is retried immediately and a "hard" fail is retried only after expire?

@henriquebim sorry, there is no --timeout for the queue:work command.

https://github.com/laravel/framework/blob/5.2/src%2FIlluminate%2FQueue%2FConsole%2FWorkCommand.php#L160

timeout is for listen, and you are not supposed to use listen for anything other than debugging.

A job that exceeds the max time is an expired job; it doesn't act like a failed job. It will be run again and again until it is done or fails (in the context of the code, throws an exception).

(Failed job is here)
https://github.com/laravel/framework/blob/5.2/src%2FIlluminate%2FQueue%2FWorker.php#L213

An expired job is neither marked as done nor as failed; it is unknown, its track is lost, and it will be put back on the main queue. I know you dug deeply into the code, but would you mind reading it again? :D I also read every piece of it; I even wrote another Redis queue system by following most of the Laravel Redis queue.

@phoenixgao well, OK, it was some months ago that I worked with this and I don't remember some things; the app that used it was abandoned and I don't have any other software using the Queue (it wouldn't use Redis anyway if I did). I will see if I can get some time to explore the source code again, but from what you are saying, it's more a documentation fault than a code fault (as in "Note: If you are using Redis, set 'expire' to a value greater than the maximum number of seconds a job may run", and also "Note: never use listen other than for debugging purposes").

Sorry, we don't process requests/proposals here. Please see https://github.com/laravel/internals/issues.

Opened new thread in the new/proper place. https://github.com/laravel/internals/issues/30

This should never have been closed. It's not a proposal. It's a bug.

Sorry about that. I even labeled it with the bug tag too, ages ago.

If you want to be sure that a job is not handled more than once, and you don't care whether it fails or not, you can set the --tries option to 1. Except that the Redis driver does not count an attempt when a fatal error occurs, because the RedisJob class increments the attempts counter in its release function. For the MySQL driver, however, this counter is incremented via the releaseJobsThatHaveBeenReservedTooLong() function of the DatabaseQueue class, which also applies to jobs terminated by fatal errors.

If I am about to post a pull request for this, should it be for 5.1?

Yes, unless the fix is controversial.

Will these changes be OK for everyone?

[The Redis queue driver in Laravel 5.3 will] increment the attempts counter when reserving a job instead of when releasing it, so that fatal errors are counted as tries.

It means that if a worker is killed because the timeout is exceeded, the job won't be retried forever.

Has some kind of resolution / cause been found already?

I just ran into this building an SSH executor that runs for a long time. I use the redis driver and started 16 runners in daemon mode through Forge, using a timeout of 900 and with tries set to 1.

I also have this in my failed method on the Job classes:

    /**
     * Handle a job failure.
     *
     * @return void
     */
    public function failed()
    {
        $this->output .= PHP_EOL . '// Failed to execute job!';

        $this->updateStatus(Job::$STATUS_FINISHED_WITH_ERRORS, $this->output);

        $this->job->delete();
    }

So this should rule out that failed jobs run multiple times.

So I made this bash script, to be executed from the Job, which should take around 120 seconds to run:

echo "Sleeping job $JOB for 60 seconds..."
sleep 60
echo "Hello there $JOB!"

echo "Hello from $JOB" >> /tmp/test.log

echo "Sleeping job $JOB for 60 seconds..."
sleep 60
echo "Hello there $JOB!"

echo "Done!"

Executing this once results in /tmp/test.log containing:

    Hello from 47dd5355-1d52-4fdf-94d7-1c979e21cae0
    Hello from 47dd5355-1d52-4fdf-94d7-1c979e21cae0
    Hello from 47dd5355-1d52-4fdf-94d7-1c979e21cae0

So the job is executed 3 times... I am puzzled! Any clues?

Looking into this... is it possible it has something to do with the expire/ttr setting for the drivers?

        'redis' => [
            'driver'     => 'redis',
            'connection' => 'default',
            'queue'      => 'default',
            'expire'     => 60,
        ],

The default is set to 60, so if a job runs longer than that (because your timeout allows it), redis/beanstalk/db will release it back to the pool of open jobs. Correct?

After changing the expire value to match my timeout, it seems I can no longer reproduce the error in production.

Yes. The expire time should be adjusted based on how long you expect your jobs to reasonably run.

Note that in Laravel 5.3 the daemon runner supports timeouts too, so you might want to sync those two times up. ;)
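For example, something along these lines (illustrative values only; the config key is expire here and retry_after on newer versions):

    'redis' => [
        'driver'     => 'redis',
        'connection' => 'default',
        'queue'      => 'default',
        'expire'     => 900, // longer than the longest job you expect
    ],

with a worker timeout just under it (assuming a 5.3-style queue:work that accepts --timeout):

    php artisan queue:work --timeout=890 --sleep=3 --tries=3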

Maybe I missed it but it would be nice if this was documented in the docs. I did not realize what expire does until this issue :)

In my view, a job having an expire time means that if the job has been executing for an unexpectedly long time, it should throw a timeout exception and be force-stopped.

That would be easy to understand and to handle.

If you want to be sure that a job is not handled more than once, and you don't care whether it fails or not, you can set the --tries option to 1. Except that the Redis driver does not count an attempt when a fatal error occurs, because the RedisJob class increments the attempts counter in its release function. For the MySQL driver, however, this counter is incremented via the releaseJobsThatHaveBeenReservedTooLong() function of the DatabaseQueue class, which also applies to jobs terminated by fatal errors.

If I am about to post a pull request for this, should it be for 5.1?

That isn't the bug being described here; the bug is that 2 Redis queue workers will work 1 job at the exact same time (both being first attempts).

Looking into this... is it possible it has something to do with the expire/ttr setting for the drivers?

I doubt this, because my 2nd queue worker is starting the job at the same time as the first one. That could still be a separate bug though.


Has this been fixed in any version of laravel? I'm not on the latest 5.1, but I'm about to update to the latest 5.2. It's been hard to decipher because a lot of people have shown up to report slightly different bugs (or not even read the original bug description). If this has not been fixed, then I'll summarize that the only working solution right now is "use beanstalkd instead of redis for queues". Let me know if I'm wrong.

@InstanceOfMichael I don't believe, from reading the thread here, that it's been fixed. I think the solution you have is correct at the moment: "use beanstalkd or RabbitMQ or something other than Redis when running multiple workers on the same queue".

I found out that Laravel 5.3 is using LuaScripts.php for the Redis pop command.

According to the Redis docs, I assume this problem (multiple queue workers working on the same job) is resolved?

According to the Redis docs, I assume this problem (multiple queue workers working on the same job) is resolved?

Maybe it is resolved in Laravel 5.3 then.

Also Redis guarantees that a script is executed in an atomic way: no other script or Redis command will be executed while a script is being executed
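For illustration, a much simplified version of such an atomic pop-and-reserve (not the framework's actual Lua script; assuming Predis, with made-up key names and expire value):

    <?php
    // The LPOP and the ZADD run inside one EVAL, so no second worker can grab the
    // same payload in between them.
    $redis = new Predis\Client();

    $script = <<<'LUA'
    local job = redis.call('lpop', KEYS[1])
    if job ~= false then
        redis.call('zadd', KEYS[2], ARGV[1], job)
    end
    return job
    LUA;

    $job = $redis->eval($script, 2, 'queues:default', 'queues:default:reserved', time() + 90);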

I take it there have been no more reports of the issue since Oct 17 (or some of the people moved to beanstalkd?). I'm interested in using Redis for a queue and just making sure I'm not shooting myself in the foot. Thanks.

Just started using a Redis worker, and I swear there is absolutely nothing for this worker to do. It is running on a fairly new Forge instance on PHP 7.1. There have been no queued jobs and the worker is running in daemon mode, and yet it is using 100% CPU non-stop. This is on a $5 DO instance, but still!! It totally crashed my server. Gonna stop it now and try beanstalkd instead.

Thanks for the comment, @vesper8 . I guess I would have shot myself in the foot. I'm using Forge, so I went the easier beanstalkd path.

to fix this I just switched to beanstalkd.

I was using the database driver and got this problem even when NOT using supervisord... it should be added to the docs that the database driver has this problem when running multiple workers.

Okay, so we were kinda facing the same thing here, I guess I'll share my experience:

1 Redis database, 2 queue workers on separate servers (each with only one process). Some of our jobs have a $timeout setting set directly on the job class; however, other jobs, like notifications or mailable classes, did not. Our retry_after was set to the default of 90 seconds, but there was no timeout set on the CLI command run by supervisord. What I assume was happening is that some job had an issue and kept processing for more than 90 seconds. When that time passes, the retry_after setting only places the job back into the queue again; it does not kill the still-running one (that's what timeout is for), which resulted in duplicates.

I'll monitor the situation and see what happens in the next few days, but I presume that's what was happening for all the people in this thread who had to migrate to Beanstalkd.
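For illustration, such a combination might look like this (numbers are illustrative; the point is that retry_after stays larger than any job or worker timeout):

    // config/queue.php
    'redis' => [
        'driver'      => 'redis',
        'connection'  => 'default',
        'queue'       => 'default',
        'retry_after' => 180, // larger than any timeout below
    ],

and a per-job cap on jobs that did not have one (class name is made up):

    <?php

    namespace App\Jobs;

    use Illuminate\Bus\Queueable;
    use Illuminate\Contracts\Queue\ShouldQueue;

    class SomeLongJob implements ShouldQueue
    {
        use Queueable;

        // The worker kills the job after 120s, well before retry_after (180s)
        // would put a second copy back on the queue.
        public $timeout = 120;

        public function handle()
        {
            // ... long-running work ...
        }
    }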

@Numline1 I've migrated from Redis to Beanstalkd and still have a problem. The job does not fail (at least failed() is not triggered) and it duplicates the job until it reaches the --tries limit and creates a new record in the failed_jobs table:

Illuminate\Queue\MaxAttemptsExceededException: App\Job\RegistrationProcess has been attempted too many times or run too long. The job may have previously timed out. 

@pihlersimon did you set up the retry_after and timeout as I've mentioned in my comment? We haven't had this issue since.
