today we use the redis command BRPOPLPUSH, which allows us to move a job atomically from the wait list to the active list. The advantages are:
However, there are a number of disadvantages which, in my opinion are too much of a burden:
So I have been thinking if we can have a new mechanism that includes the advantages without the named disadvantages. This is what I propose:
Basically when a worker starts, the first think he does is set the subscription to the "add job" event. After that it checks if there are jobs to process in the wait list, if not, he keeps idling until an "add job" event has been received. If the connection to redis is lost, start the process again by polling once the wait list.
When the worker finds a job in the queue, he can move it atomically from wait to active as well as to set the lock on the job.
If there is a guarantee that redis delivers all the published messages as long as there is a valid connection, then this mechanism should work. According to redis documentation:
"Because Redis Pub/Sub is fire and forget currently there is no way to use this feature if you application demands reliable notification of events, that is, if your Pub/Sub client disconnects, and reconnects later, all the events delivered during the time the client was disconnected are lost."
So the question is if we can "reliably" decide that a client has been disconnected, if so we can poll on the next reconnect and everything should be fine.
Big +1 on this.
If we fix this and are finally able to move jobs from to active and lock them in a single atomic call (which we can't currently due to https://github.com/OptimalBits/bull/issues/258), we could:
1) greatly simplify checking for stalled jobs. This would allow us to rename moveUnlockedJobsToWait back to checkForStalledJobs since we won't have to check for Case B here: https://github.com/OptimalBits/bull/blob/master/lib/scripts.js#L442. We would no longer need to store lockAcquired in Redis. So checkForStalledJobs can simply just scan for all unlocked jobs in the active queue, and assume they're actually stalled.
2) Remove the need for ensureActive here. That only exists to make sure that when a job is moved (atomically) from wait to active, that it's STILL active by the time the worker takes the lock. In that time it could have gotten picked up by the stalled job processor.
ok, then I will make it happen :D
Awesome. What do you think about starting in a new codebase? I think the changes will be substantial enough that it's probably worth starting over fresh and copying over needed code.
yeah, it is tempting to start over, but on the other hand, so much development time has been invested already in the current codebase, that starting over would probably require a lot of extra time and effort. I prefer to do incremental improvements, cleaning and refactoring the code, but always having a product in good shape so that it can be released often. Nevertheless, for this feature I will aim to a major release, so that I do not need to keep 100% backwards compatibility, and also to let code mature before most users upgrade to this new version.
One thing to note is the following: If we aim to support redis replication, for high availability setups, we will need to keep using redlock, which by its nature cannot be atomic. Still, we can solve most of the issues with the current implementation by taking the lock before moving the job, which is something that cannot be today with the brpoplpush command.
I can imagine a solution where redlock is only used if you have a setup with several redis master and slaves, whereas for single redis instances we can use a much faster atomic lock+move operation.
Most helpful comment
yeah, it is tempting to start over, but on the other hand, so much development time has been invested already in the current codebase, that starting over would probably require a lot of extra time and effort. I prefer to do incremental improvements, cleaning and refactoring the code, but always having a product in good shape so that it can be released often. Nevertheless, for this feature I will aim to a major release, so that I do not need to keep 100% backwards compatibility, and also to let code mature before most users upgrade to this new version.