Bull: [Feature] re-architecting the non-polling mechanism

Created on 30 Mar 2017  路  5Comments  路  Source: OptimalBits/bull

today we use the redis command BRPOPLPUSH, which allows us to move a job atomically from the wait list to the active list. The advantages are:

  • very low latency to start processing a job (with empty queue).
  • atomically move a job from one list to another.
  • low CPU usage since no polling is required.

However, there are a number of disadvantages which, in my opinion are too much of a burden:

  • cannot perform the move operation and other operations (such as job lock) atomically.
  • requires a dedicated extra connection to redis.

So I have been thinking if we can have a new mechanism that includes the advantages without the named disadvantages. This is what I propose:

  • poll only once after processing a job.
  • use pubsub to notify workers that new jobs have been added.
  • listen to close/connect events checking if there are elements to process in the wait list.

Basically when a worker starts, the first think he does is set the subscription to the "add job" event. After that it checks if there are jobs to process in the wait list, if not, he keeps idling until an "add job" event has been received. If the connection to redis is lost, start the process again by polling once the wait list.

When the worker finds a job in the queue, he can move it atomically from wait to active as well as to set the lock on the job.

If there is a guarantee that redis delivers all the published messages as long as there is a valid connection, then this mechanism should work. According to redis documentation:
"Because Redis Pub/Sub is fire and forget currently there is no way to use this feature if you application demands reliable notification of events, that is, if your Pub/Sub client disconnects, and reconnects later, all the events delivered during the time the client was disconnected are lost."
So the question is if we can "reliably" decide that a client has been disconnected, if so we can poll on the next reconnect and everything should be fine.

Most helpful comment

yeah, it is tempting to start over, but on the other hand, so much development time has been invested already in the current codebase, that starting over would probably require a lot of extra time and effort. I prefer to do incremental improvements, cleaning and refactoring the code, but always having a product in good shape so that it can be released often. Nevertheless, for this feature I will aim to a major release, so that I do not need to keep 100% backwards compatibility, and also to let code mature before most users upgrade to this new version.

All 5 comments

Big +1 on this.

If we fix this and are finally able to move jobs from to active and lock them in a single atomic call (which we can't currently due to https://github.com/OptimalBits/bull/issues/258), we could:
1) greatly simplify checking for stalled jobs. This would allow us to rename moveUnlockedJobsToWait back to checkForStalledJobs since we won't have to check for Case B here: https://github.com/OptimalBits/bull/blob/master/lib/scripts.js#L442. We would no longer need to store lockAcquired in Redis. So checkForStalledJobs can simply just scan for all unlocked jobs in the active queue, and assume they're actually stalled.
2) Remove the need for ensureActive here. That only exists to make sure that when a job is moved (atomically) from wait to active, that it's STILL active by the time the worker takes the lock. In that time it could have gotten picked up by the stalled job processor.

ok, then I will make it happen :D

Awesome. What do you think about starting in a new codebase? I think the changes will be substantial enough that it's probably worth starting over fresh and copying over needed code.

yeah, it is tempting to start over, but on the other hand, so much development time has been invested already in the current codebase, that starting over would probably require a lot of extra time and effort. I prefer to do incremental improvements, cleaning and refactoring the code, but always having a product in good shape so that it can be released often. Nevertheless, for this feature I will aim to a major release, so that I do not need to keep 100% backwards compatibility, and also to let code mature before most users upgrade to this new version.

One thing to note is the following: If we aim to support redis replication, for high availability setups, we will need to keep using redlock, which by its nature cannot be atomic. Still, we can solve most of the issues with the current implementation by taking the lock before moving the job, which is something that cannot be today with the brpoplpush command.
I can imagine a solution where redlock is only used if you have a setup with several redis master and slaves, whereas for single redis instances we can use a much faster atomic lock+move operation.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

weeco picture weeco  路  3Comments

adamsoffer picture adamsoffer  路  4Comments

NicolasDuran picture NicolasDuran  路  4Comments

pigaov10 picture pigaov10  路  3Comments

rodrigoords picture rodrigoords  路  4Comments