Hello Mike,
What do you think of having a declarative class method for saying "do not retry if the error class is this and that"? A typical use case is something like this:
```ruby
class ThumbnailWorker
  include Sidekiq::Worker
  sidekiq_options retry: 5
  # do_not_retry_if Sequel::NoMatchingRow,
  #                 RemoteFetcher::ResourceNotFoundError

  def perform(image_id)
    # Find the DB record with URLs.
    image = Image.with_pk!(image_id)
    # Download the image, resize it, do whatever with it.
    Thumbnailer.new(image).generate
  rescue Sequel::NoMatchingRow
    # There is no point in retrying since the record does not exist.
  rescue RemoteFetcher::ResourceNotFoundError
    # Amazon S3 said 404, no point in retrying.
    # However, we DO want to try again if it was a timeout.
  end
end
```
The point is to not retry if it doesn't make sense.
Thanks,
Ollie
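A standalone sketch of how such a do_not_retry_if declaration could be built in plain Ruby (hypothetical; none of these names are Sidekiq APIs, and the Sidekiq::Worker include is commented out so the example is self-contained). Swallowing a listed error makes Sidekiq see a successful job, so no retry is scheduled:

```ruby
# Hypothetical do_not_retry_if implementation; NOT a Sidekiq API.
module DoNotRetry
  def self.included(base)
    base.extend(ClassMethods)
    base.prepend(Guard)
  end

  module ClassMethods
    def do_not_retry_if(*error_classes)
      @non_retriable = error_classes
    end

    def non_retriable
      @non_retriable || []
    end
  end

  module Guard
    def perform(*args)
      super
    rescue StandardError => e
      # Re-raise anything not declared non-retriable.
      raise unless self.class.non_retriable.any? { |klass| e.is_a?(klass) }
    end
  end
end

class ThumbnailWorker
  # include Sidekiq::Worker   # in a real app
  include DoNotRetry
  do_not_retry_if KeyError    # stand-in for Sequel::NoMatchingRow, etc.

  def perform(image_id)
    raise KeyError, "no such image" if image_id.nil?
    :generated
  end
end
```

Because the module prepends its Guard into the worker's ancestry, the rescue wraps perform without the worker having to repeat the rescue block itself.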
As you show, your worker should explicitly catch and ignore any errors. I'm not taking feature requests for the retry logic anymore.
So much for politeness. :-1:
How is my reply impolite?
@ollie check out https://github.com/govdelivery/sidekiq-retries
That response, and closing the issue, actually feels like shoving something off the table. It comes across as arrogant.
Thanks @substars, I'll take a look.
How about the ActiveJob::DeserializationError with Rails 4.2? Does anyone know how we can set it so it is not retried? It never seems to enter the perform method, so I am not sure how to deal with this.
@pomartel What is the actual problem? Why is that error raised? Sidekiq does not have any knowledge about Rails and I really don't want to add Rails-specific code paths if possible.
I could see how this could be helpful depending on the application. There may be application logic that wants to suppress retries based on one issue or another. For example, a worker accessing an API might be throttled and receive a throughput-exceeded exception, or hit a record-not-found exception.
The code may look like:
```ruby
def initialize(options = {})
  @max_retries = options.fetch(:max_retries, DEFAULT_MAX_RETRY_ATTEMPTS)
  @ignored_exceptions = options.fetch(:ignored_exceptions, [])
end

def call(worker, msg, queue)
  yield
rescue Sidekiq::Shutdown
  # ignore, will be pushed back onto queue during hard_shutdown
  raise
rescue Exception => e
  raise Sidekiq::Shutdown if exception_caused_by_shutdown?(e)
  raise e unless msg['retry']
  raise e if @ignored_exceptions.include?(e.class)
  attempt_retry(worker, msg, queue, e)
end
```
@mperham The DeserializationError happens because of a record-not-found exception. This typically happens when I queue a job to run at a later time and meanwhile the object the job was set to run on is deleted. An example scenario:
1. A new user registers with the application and a welcome email is set to be delivered 1 day from now.
2. The user deletes his account.
3. A deserialization error is raised when the job runs because the record cannot be found (it was deleted).
I just saw that Active Job provides a rescue_from method. I will try to see if I can catch the exception with that method: http://edgeguides.rubyonrails.org/active_job_basics.html#exceptions
On retries, I don't think I would want such a failure to prevent retries. I would want it to continue to error out to force manual intervention. If someone is deleting records that were expected to be used by a job, it's possible that it was still required to process the job successfully. In other cases, it may be that the job can be safely deleted. There should definitely be an easy way to control that in the application though.
@pomartel That is exactly why I told the Rails team that GlobalID is a bad idea and why Sidekiq doesn't do it already. Looking up records means dealing with exceptions and various edge cases (e.g. transaction boundaries) that the application really should be responsible for. I was overruled in the matter.
Just so you know, this works:
```ruby
rescue_from(ActiveJob::DeserializationError) do |exception|
  # The user record has been deleted. Ignore this.
end
```
It's not ideal since I might be catching deserialization errors that are not related to records not found but it will work for my use case. Thanks for your help!
I didn't really see anyone explaining this the way I see it, so I'm trying now: some exceptions just don't make sense to retry at all; the cliché example is ActiveRecord::RecordNotFound. Applications that interact with third-party services on behalf of users might face exceptions like SomeService::ProfileNotFound that likewise make no sense to retry.
So, for cases like this, it feels error-prone to reimplement the same rescue block in multiple workers.
Also, in many cases where I could simply do SomeClass.delay.method and want to skip a common exception from being retried, I'll have to wrap it in a custom worker class.
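The custom worker class wrapper mentioned above could look roughly like this (a sketch; SafeInvoker and its error list are illustrative names of mine, and the Sidekiq::Worker include is commented out so the example stands alone):

```ruby
class SafeInvoker
  # include Sidekiq::Worker   # in a real app
  # sidekiq_options retry: 5

  # Stand-ins for SomeService::ProfileNotFound, ActiveRecord::RecordNotFound, etc.
  NON_RETRIABLE = [KeyError].freeze

  def perform(class_name, method_name, *args)
    Object.const_get(class_name).public_send(method_name, *args)
  rescue *NON_RETRIABLE
    # Swallow known non-retriable errors so the job is not retried.
    nil
  end
end
```

Instead of SomeClass.delay.method(args), you would enqueue SafeInvoker with ("SomeClass", "method", args).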
This behavior is used in gems such as Airbrake and Rollbar:

_I don't think we should/need to have a default list of skipped exceptions like Airbrake. I like that Sidekiq won't be having a lot of Rails-specific code_
@mperham It would be a pleasure to write a PR if you think you can accept that. Thanks.
I'm still struggling to see the use case.
I tend to actually care about ActiveRecord::RecordNotFound. I almost always want an issue with a missing record to be anything but skipped over. Instead I want it in a list of retry jobs, so I can identify the problem, because it could be a sign of an underlying bug. For the cases where I expect something to potentially not exist, using methods that return nil instead of raising lets you return early.
For something like an intermittent third party API that claims there's a missing record, I want the retry because I've seen it correct itself. But again, it could be a legitimate issue with an incorrect unique ID or something. I wouldn't want to paint over problems by skipping retries.
Is this something that a custom exception handler can implement along with a declarative interface at the job level?
@aprescott Thanks for your attention here.
Adding a global non-retry list with the scenarios I described doesn't really mean we would silently fail. I still implement the Sidekiq global exception handler to notify me on Rollbar properly when I know it's something I should be aware of.
My point is: I'm currently about to refactor a fair amount of code to stop using Model.delay.method, because when we get a RecordNotFound there, it's safe to say I shouldn't retry. E.g.:

As you can see in this image, the record was persisted (it had an id) but it couldn't be found, so there's really no reason to retry this. It's quite awkward that it happened, though, so the simplest way I see to handle this is to not retry even a single time and notify the team via Rollbar (with the global exception handler).
Retrying RecordNotFound is critical to deal with this issue:
https://github.com/mperham/sidekiq/wiki/FAQ#why-am-i-seeing-a-lot-of-cant-find-modelname-with-id12345-errors-with-sidekiq
For every error you want to ignore, there's almost always a case where you want to retry it.
Lots of issues are caused by bad practices and wrong usage. Retrying AR::RecordNotFound encourages developers to continue enqueueing jobs even in before_save: Sidekiq will silently retry them and they will work at some point. In one of the projects I worked on recently, I ignored AR::RecordNotFound manually in every worker that did Model.find(), because we were certain we wouldn't schedule a job for a record that wasn't persisted yet. When this exception is thrown in that product, it's a sign that someone deleted the record before the job got executed by Sidekiq, which is perfectly acceptable.
So I really don't see a reason not to add this feature out of concern for developers who are using bad practices, practices that are even advised in gems' documentation.
_When this exception is thrown in this product, it's a sign that someone deleted the record before the job got executed by Sidekiq - which is perfectly acceptable._
If that's an expected use case, you should use find_by_id so it returns nil in that case, not find. Raising an error should indicate there's a bug in the system.
The whole point is to have this flexibility while doing ARModel.delay.operation.
What really are the disadvantages of allowing a developer to specify a global list of exceptions that shouldn't be retried?
@rafaelsales ah, the case is where ARModel.operation should ordinarily raise, but where adding .delay should not retry if it does raise? It does seem fairly niche. Does the custom exception handler idea in a third party gem work?
@aprescott Yes. There really are some cases where ar_model.operation would work but ar_model.delay.operation won't; you can see the image I posted above, e.g. when the job takes something like 30 minutes to get executed and the record is removed in the meantime. There are other scenarios and exceptions that I've ignored in nearly every job and would put in a global retry-ignore list if possible.
I'm seeing a lot of Net::SMTPServerBusy: 401 4.1.3 Bad recipient address syntax errors because the email address doesn't exist, so it would be really useful if I could ignore these.
Just to throw my 2c in, I'm currently building a system on top of Sidekiq which uses Capybara to scrape a partner's website (don't ask, it's a harrowing tale). If they change the markup of their product pages then Capybara is going to throw an exception - that's definitely something I want to be notified of, and I'd like the job in the dead job queue so I can retry it later, but there's no point in retrying until the code has been updated to do the right thing.
@jellybob that sounds different. Is it not handled by a configuration of retry: 0 on the worker?
From the wiki:
Only jobs configured with 0 or greater retries will go to the Dead Job Queue.
Interesting. I'm sure I read it as zero retry jobs not going to the DJQ. Anyway, the particular job can fail in a variety of ways, some of which do warrant retrying.
I guess I could break it into two jobs, but it would complicate matters quite a bit for the sake of doing this.
Mostly thinking out loud since I'm not at a computer, but can I explicitly push a job to the DJQ myself? That would seem like a decent compromise, I'll have to look into it tomorrow.
Just throwing in my 2 cents here. Why don't you just write a simple custom middleware that handles the exceptions you want to prevent retries for?
Basic example:
```ruby
def call(worker, msg, queue)
  yield
rescue ActiveRecord::RecordNotFound
  msg['retry'] = false
  raise
end
```
Maybe even do this:
```ruby
def call(worker, msg, queue)
  yield
rescue ActiveRecord::RecordNotFound
  msg['retry'] = false
  raise
rescue Exception => e
  if worker.respond_to?(:handle_error)
    worker.handle_error(e)
  else
    raise
  end
end
```
Because I hadn't realised it was possible, that looks perfect to me!
And if you don't want it to end up on the dead list either, you can set msg['dead'] = false.
```ruby
module MyApp
  class NonRetriableJobs
    def call(worker, msg, queue)
      yield
    rescue ActiveRecord::RecordNotFound
      msg['retry'] = false
      msg['dead'] = false
      raise
    end
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add MyApp::NonRetriableJobs
  end
end
```
In case people land here trying to achieve the same thing: the technique mentioned in https://github.com/mperham/sidekiq/issues/2072#issuecomment-160832087 no longer works in Sidekiq 5+. It used to work when Sidekiq had a RetryJobs middleware that looked at the job hash passed to it, but that is no longer the case in Sidekiq 5+.
To stop a job from retrying, simply return within the rescue block. You can include any error reporting in the same rescue block before returning, or if your error reporting is handled by another middleware (like Bugsnag::Sidekiq), then nothing needs to be done as long as that middleware runs first.
Please correct me if I'm wrong though.
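A sketch of the Sidekiq 5+ behavior described in that comment: in a server middleware, swallowing the exception (returning normally from the rescue) means the job counts as done and no retry is scheduled. The error list and the reporting call here are placeholders of mine, not Sidekiq APIs:

```ruby
class SwallowNonRetriable
  NON_RETRIABLE = [KeyError].freeze # e.g. ActiveRecord::RecordNotFound

  def call(_worker, msg, _queue)
    yield
  rescue *NON_RETRIABLE => e
    # Report first (or let an earlier middleware such as Bugsnag::Sidekiq
    # handle it), then return normally so Sidekiq does not retry.
    warn "dropping job #{msg['jid']}: #{e.class}"
  end
end
```

Errors not in the list still propagate out of the middleware, so Sidekiq's normal retry handling applies to them.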
@d9su Thanks so much for bringing this up! We just upgraded to Sidekiq 5 and I did not realize that this behavior changed.
@rafaelsales Note that those of us using databases with read-replicas in real time have to deal with eventual consistency, where RecordNotFound may be related to replication delay.
We can just set sidekiq_options retry: 0 in the rescue block for that specific exception and then reset it to the default in the normal case:
```ruby
def perform(*args)
  self.class.sidekiq_options retry: default_value
  # ... do the work ...
rescue SomeClass::Error => error
  self.class.sidekiq_options retry: 0
  raise
end
```
Hey @d9su, thanks for pointing out that 5+ does not support this middleware approach. Do you know any workaround in 5+ to achieve the following?
Let's say we have this job:
```ruby
class TestRetryJob
  class RetryException < StandardError; end

  include Sidekiq::Worker
  sidekiq_options retry: 0

  def perform
    puts "Working..."
    sleep(5)
    raise RetryException
  end
end
```
With the "old" solution I would add the following middleware to retry when RetryException happens and simply mark the job as failed (with no retry) when other errors occur:
```ruby
class NonRetriableJobs
  def call(_worker, msg, _queue)
    yield
  rescue TestRetryJob::RetryException
    msg["retry"] = (msg["retry"] || 0) + 1
    msg["dead"] = false
    raise
  end
end
```
Do you have any idea how to do the same thing in Sidekiq 5+? I'd like to avoid re-enqueueing the job manually, because it'll mess with the batching mechanism a bit.
@Bajena If you look at my comment above:
To stop a job from retrying simply return within the rescue block. You can include any error reporting in the same rescue block before returning, or if your error reporting is handled by another middleware (like
Bugsnag::Sidekiq), then nothing needs to be done as long as that middleware runs first.
Have you tried that?
Hey @d9su, thanks for the quick response! The problem with this approach, I think, is that the job won't be marked as failed in case of a non-retriable error:
```ruby
class TestRetryJob
  class RetryException < StandardError; end

  include Sidekiq::Worker
  sidekiq_options retry: 999999

  def perform
    do_work
  rescue RetryException
    puts "Retrying..."
    raise
  rescue StandardError
    # If we do nothing here, the job will be marked as successful
  end
end
```
@Bajena IIRC, jobs marked as failed are meant to be automatically retried, the two concepts are inseparable. I think the issue is the sidekiq job lifecycle is pretty opinionated and not very flexible for manually sending things around the pipeline. It's understandable since sidekiq relies on these automatic behaviors to function as it is.
So if you want anything custom, it's better to build your own error reporting mechanism.
For example:
```ruby
class NonRetriableJobs
  def call(_worker, msg, _queue)
    yield
  rescue TestRetryJob::RetryException
    MyLogger.info('Non retriable error occurred.')
  end
end
```
Thank you again @d9su! So it seems it won't be possible to achieve the same thing as in Sidekiq 4. I'll have to come up with another solution then. Thanks anyway for the super quick answers :)
You don't need middleware to achieve what you want. You can write a module with a perform method in it and then include that module instead of Sidekiq::Worker. When included, the module would wrap the perform method by prepending itself into the inheritance hierarchy.
Once there, you could grab the job instance and modify its singleton class with custom retry options after the call. You might need to override the retry_options class method too, and call it on the job's singleton class, to override retry logic as you see fit.
My 2c.
Thanks for the idea @kigster! Isn't such an approach thread-unsafe, though? Won't changing class options affect other workers that may be processing jobs at the same time?
@bajena I’m glad someone is thinking of thread safety!
When Sidekiq runs a job it instantiates a worker and disposes of the instance after the job completes. If you are able to intercept the job instance in the module (or in your middleware), you can modify its private "class", called the singleton class, which is attached only to that one instance and shares its lifecycle. What I meant is that you'd need to apply your retry option changes to this class and not to the shared worker class. Whether this works or not unfortunately depends on how Sidekiq looks up retry options, but in Ruby you can overwrite almost anything by reopening it and modifying it.
Read this for more info:
https://www.google.com/amp/leohetsch.com/demystifying-ruby-singleton-classes/amp/
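A minimal plain-Ruby illustration of the singleton-class idea from that link (retry_limit here is a made-up method, not a Sidekiq API): defining a method on one object's singleton class changes that single instance without touching the shared class other threads see.

```ruby
class Worker
  def retry_limit
    5
  end
end

a = Worker.new
b = Worker.new

# Reopen only `a`'s singleton class:
a.singleton_class.define_method(:retry_limit) { 0 }

a.retry_limit # => 0, overridden for this instance only
b.retry_limit # => 5, other instances keep the class definition
```

This is why the trick is thread-safe in principle: the shared Worker class is never mutated, only the one instance Sidekiq hands to the current job.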
@kigster that's a pretty smart use of duck typing and monkey patching 😛! A bit contrived, but it may be what's needed if you just want to take matters into your own hands.
Thanks for the explanation guys! It seems that singleton class trick should do the job here :)
@Bajena would you be able to post a concrete example of what this looks like? Would help a lot!