Ruby 2.4.2
Sidekiq 5.0.4 / Pro 3.5.1 / Ent 1.6.0
Hi,
Is it possible to add a third callback type that fires only when all jobs have truly completed, _including_ retries?
The current "complete" callback fires as soon as every job has run once, even if some of the jobs failed and are waiting to retry. This is almost never the desired behavior for us. In some cases we can deal with it, but in other cases we really need to know when the whole batch is done. We've accomplished that by implementing workers that manage their own retries. For example:
class MyWorker
  include Sidekiq::Worker

  MAX_RETRIES = 10

  # Disable Sidekiq's native retries; this worker schedules its own.
  sidekiq_options :retry => false

  def perform(foo, bar, retry_count = 0)
    # do the actual work here
  rescue StandardError => e
    try_again_later(foo, bar, retry_count) if retry_count < MAX_RETRIES
    raise e
  end

  private

  def try_again_later(foo, bar, retry_count)
    retry_at = ... # implement backoff algorithm similar to what Sidekiq uses natively
    # Re-open the current batch so the retry job is counted as part of it.
    batch.jobs do
      self.class.perform_at(retry_at, foo, bar, retry_count + 1)
    end
  end
end
There are a few obvious downsides:
1) We need to maintain our own retry and backoff logic, which Sidekiq already does excellently.
2) It pollutes the perform method signature with retry data.
3) It is difficult to enqueue the retry job on the correct queue (we use Sidekiq::Client.push a lot to enqueue the same worker class on different queues). We ended up having to implement middleware to make the queue available in each worker.
4) It doesn't play nicely with the statistics or job error tracking on the batch status object that you receive in the callback.
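To illustrate downside 1, here is a rough sketch of the backoff step that the worker above leaves elided. The module and method names are hypothetical. The formula is modeled on the delay Sidekiq 5 computes internally for native retries, as far as I know, so treat it as an approximation rather than Sidekiq's API:

```ruby
# Hypothetical helper, not part of Sidekiq.
module RetryBackoff
  # Seconds to wait before retry number `count` (0-based).
  # Modeled on Sidekiq's native delay: (count ** 4) + 15 + (rand(30) * (count + 1)).
  # `jitter` is injectable so the result can be made deterministic.
  def self.delay_for(count, jitter: rand(30))
    (count**4) + 15 + (jitter * (count + 1))
  end

  # Absolute time at which retry number `count` should run.
  def self.retry_at(count, now: Time.now, jitter: rand(30))
    now + delay_for(count, jitter: jitter)
  end
end
```

With something like this, the elided line could become `retry_at = RetryBackoff.retry_at(retry_count)`, but it is still logic we have to own and keep in sync ourselves.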
Here's an example of why we needed to do this kind of thing: Our users can perform operations in bulk, e.g. send 100 emails. Each email is sent in its own job, and all of the jobs belong to a single batch. When the batch is done we'd like to report back to the user that the bulk operation is complete (or failed). However, transient errors that recover quickly are not uncommon (e.g. our third-party email service times out). Using the current "complete" callback means we'd be telling our customer too soon that some of the emails could not be sent, when in reality we are still going to retry and they will probably succeed. And using the "success" callback may mean that we never let the customer know what happened at all if one of the emails has a fatal flaw and its job never succeeds.
Sounds like you want on(:success). There's no other callback. You could write some reconciliation code to work on Batches that are older than one week.
http://www.mikeperham.com/2014/05/27/the-reconciliation-step/
Hi Mike, thanks for your response. I don't think I want on(:success), because I know that some or maybe even all of the jobs in the batch may never succeed. And I am ok with them not succeeding. I would just like to know when they are done making retry attempts.
For instance, considering this page: https://github.com/mperham/sidekiq/wiki/Really-Complex-Workflows-with-Batches
Imagine jobs B, C, D, and E succeed, but F raises an error on its first attempt. Using on(:complete), G would start running right away, even though F is going to retry in, say, 30 seconds. I would like the ability to wait until F succeeds OR exhausts its retries before deciding whether I want to move on to G or not. I don't have that flexibility with Sidekiq today, and for us it would make working with batches simpler and more useful than they already are.
I hope I have explained myself well. Is it not technically possible or desirable to add such a callback?
There's no way to do that today as the retry subsystem is orthogonal to the batch subsystem. To implement what you ask for, we'd need much tighter coupling between the two.
I see. Thanks for the clarification.
So yeah, to finish my thought: capabilities flow directly from API design. I want to give you as many callbacks for various events as possible, but retry is in Sidekiq and batch is in Sidekiq Pro. I have to keep a clean separation between the two.
There are various ways around this: it might be possible to add a global sidekiq_retries_exhausted handler which triggers a batch check, etc. Comment here if you get something that works.
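As a starting point for that kind of workaround, here is a minimal in-memory sketch of the bookkeeping a "batch check" would need. Everything in it (the `FinishedTracker` class and its method names) is hypothetical and is not Sidekiq or Sidekiq Pro API. A real version would presumably keep this state in Redis, mark successes from server middleware, and mark dead jobs from a `sidekiq_retries_exhausted` hook:

```ruby
require "set"

# Hypothetical tracker: fires a callback once every job has either
# succeeded or exhausted its retries. Not thread-safe; illustration only.
class FinishedTracker
  def initialize(job_ids, &on_finished)
    @pending = Set.new(job_ids)
    @exhausted = []
    @on_finished = on_finished
    @fired = false
  end

  # Call when a job succeeds, on any attempt.
  def succeeded(jid)
    settle(jid)
  end

  # Call when a job has used its final retry and will not run again.
  def retries_exhausted(jid)
    @exhausted << jid if @pending.include?(jid)
    settle(jid)
  end

  # A failed attempt that will be retried leaves the job pending.
  def failed_will_retry(_jid); end

  private

  def settle(jid)
    @pending.delete(jid)
    return if @fired || !@pending.empty?

    @fired = true
    @on_finished.call(@exhausted.dup)
  end
end

# Usage mirroring the B..F example: F fails once, then runs out of retries.
tracker = FinishedTracker.new(%w[B C D E F]) do |dead|
  puts "batch finished; jobs that exhausted retries: #{dead.inspect}"
end
%w[B C D E].each { |jid| tracker.succeeded(jid) }
tracker.failed_will_retry("F")  # nothing fires yet
tracker.retries_exhausted("F")  # callback fires here
```

The callback receives the list of jobs that died, so the caller can decide whether to move on to the next step or report a partial failure.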