Sidekiq: Job retries on Sidekiq startup

Created on 28 Dec 2016  路  16Comments  路  Source: mperham/sidekiq

Ruby version:
Sidekiq / Pro / Enterprise version(s): 4.2.6 / 3.4.1 / -

If relevant, please include your initializer and any error message with the full backtrace.

require 'resolv-replace'

defined?(Sidekiq) and Sidekiq.configure_server do |config|
    # This needs to be within the configure_server block
    config.reliable_fetch!
    config.reliable_scheduler!
    config.redis = { :namespace => ENV['APP_NAME'] }
    # Poll for scheduled jobs every 5 seconds
    config.average_scheduled_poll_interval = 5
    size = Sidekiq.options[:concurrency]
    Platform::Database.reconnect(size)
  end

defined?(Sidekiq) and Sidekiq.configure_client do |config|
    config.redis = { :namespace => ENV['APP_NAME'] }
  end

defined?(Sidekiq) and Sidekiq::Client.reliable_push!

if defined?(Sidekiq)
  # We change the DelayedMailer of sidekiq to use mailer queue
  class Sidekiq::Extensions::DelayedMailer
    sidekiq_options :queue => :mailer
  end
end

Hey,

I am facing a very weird issue which I suspect to be related to reliable fetch.

Say I have the following code:

class BadWorker
  include Sidekiq::Worker
  def perform(some_json)
    # parse event_json. This raises JSON:ParserError
    JSON.parse(some_json)
  end
end

BadWorker.perform_async('{ "a":NaN }')

The worker just parses a given json. Since we had a case in which a work was queued with invalid json, Sidekiq retries the worker using the retry mechanism, which is great. But the job never moved to the job queue, probably becasue

The thing is that Sidekiq retries the job every time Sidekiq process starts. Is this by design?

Also, I have no way to delete these jobs (which will never be fixed as the problem is with the enqueued argument).

Thanks

Most helpful comment

We have been struggling with the same symptoms. ( Sidekiq v4.2.7 / Sidekiq Pro v3.4.1 )

A large job (1.5mb+) is completed (and email is sent via our postmark account) and then is held on to somewhere we can't find in the Pro ui (or via the api as of yet).

When sidekiq is stopped, we see the job being put back in the queue:

Dec 27 09:46:27 sidekiq.log:  2016-12-27T15:46:27.809Z 8734 TID-ou1wd3sp4 INFO: Shutting down
Dec 27 09:46:27 sidekiq.log:  2016-12-27T15:46:27.809Z 8734 TID-ou1wd3sp4 INFO: Terminating quiet workers
Dec 27 09:46:27 sidekiq.log:  2016-12-27T15:46:27.809Z 8734 TID-ou1x2qbsk INFO: Scheduler exiting...
Dec 27 09:46:27 sidekiq.log:  2016-12-27T15:46:27.910Z 8734 TID-ou1wd3sp4 INFO: Pausing to allow workers to finish...
Dec 27 09:46:29 sidekiq.log:  2016-12-27T15:46:29.712Z 8734 TID-ou1wd3sp4 INFO: Moving job from queue:sq_ip-10-92-5-144:8734:4f110c8ea59c_default back to queue:default
Dec 27 09:46:29 sidekiq.log:  2016-12-27T15:46:29.716Z 8734 TID-ou1wd3sp4 INFO: Moving job from queue:sq_ip-10-92-5-144:8734:4f110c8ea59c_mailers back to queue:mailers
Dec 27 09:46:29 sidekiq.log:  2016-12-27T15:46:29.722Z 8734 TID-ou1wd3sp4 INFO: Moving job from queue:sq_ip-10-92-5-144:8734:4f110c8ea59c_mailers back to queue:mailers
Dec 27 09:46:29 sidekiq.log:  2016-12-27T15:46:29.725Z 8734 TID-ou1wd3sp4 INFO: Moving job from queue:sq_ip-10-92-5-144:8734:4f110c8ea59c_mailers back to queue:mailers
Dec 27 09:46:29 sidekiq.log:  2016-12-27T15:46:29.727Z 8734 TID-ou1wd3sp4 INFO: Bye!

So when another server comes along, the-jobs are picked up & rerun, but still held at the end.

when the queue is stopped, we can see them in the web-ui, but deleting them fails due to timeout issues (upping it to 5 seconds hasn't helped, we'll see about more later).

All 16 comments

We have been struggling with the same symptoms. ( Sidekiq v4.2.7 / Sidekiq Pro v3.4.1 )

A large job (1.5mb+) is completed (and email is sent via our postmark account) and then is held on to somewhere we can't find in the Pro ui (or via the api as of yet).

When sidekiq is stopped, we see the job being put back in the queue:

Dec 27 09:46:27 sidekiq.log:  2016-12-27T15:46:27.809Z 8734 TID-ou1wd3sp4 INFO: Shutting down
Dec 27 09:46:27 sidekiq.log:  2016-12-27T15:46:27.809Z 8734 TID-ou1wd3sp4 INFO: Terminating quiet workers
Dec 27 09:46:27 sidekiq.log:  2016-12-27T15:46:27.809Z 8734 TID-ou1x2qbsk INFO: Scheduler exiting...
Dec 27 09:46:27 sidekiq.log:  2016-12-27T15:46:27.910Z 8734 TID-ou1wd3sp4 INFO: Pausing to allow workers to finish...
Dec 27 09:46:29 sidekiq.log:  2016-12-27T15:46:29.712Z 8734 TID-ou1wd3sp4 INFO: Moving job from queue:sq_ip-10-92-5-144:8734:4f110c8ea59c_default back to queue:default
Dec 27 09:46:29 sidekiq.log:  2016-12-27T15:46:29.716Z 8734 TID-ou1wd3sp4 INFO: Moving job from queue:sq_ip-10-92-5-144:8734:4f110c8ea59c_mailers back to queue:mailers
Dec 27 09:46:29 sidekiq.log:  2016-12-27T15:46:29.722Z 8734 TID-ou1wd3sp4 INFO: Moving job from queue:sq_ip-10-92-5-144:8734:4f110c8ea59c_mailers back to queue:mailers
Dec 27 09:46:29 sidekiq.log:  2016-12-27T15:46:29.725Z 8734 TID-ou1wd3sp4 INFO: Moving job from queue:sq_ip-10-92-5-144:8734:4f110c8ea59c_mailers back to queue:mailers
Dec 27 09:46:29 sidekiq.log:  2016-12-27T15:46:29.727Z 8734 TID-ou1wd3sp4 INFO: Bye!

So when another server comes along, the-jobs are picked up & rerun, but still held at the end.

when the queue is stopped, we can see them in the web-ui, but deleting them fails due to timeout issues (upping it to 5 seconds hasn't helped, we'll see about more later).

There's a couple things that can be done better here for sure.

  1. Sidekiq's job format is JSON. If you supply invalid JSON, you've created an invalid job and all bets are off. I think that the JSON API is buggy in creating this bad JSON. If you can give a simple reproduction, I'd open a JSON issue because JSON.dump(obj) should raise an error and never return invalid JSON as a successful result.

  2. Sidekiq parses the job before running it through middleware and executing it. Since the retry behavior is a middleware, any exception in the parsing will cause Sidekiq to drop the job. Since you both are using a reliable fetch mechanism, Sidekiq will pull the job from Redis and re-execute the job again and again until it processes successfully (i.e. never). There's no mechanism for destroying these jobs (which I call "poison pills"). You need to dive into Redis, find the job within the private queue and delete it.

I just got an idea!

If Sidekiq rescues a JSON parse error, it should immediately move the job to the Dead queue. This allows the developer to see the bad job and delete it or reschedule a fixed job easily. What do you think?

I feel that my issue isn't related to the OP. Only some similarity in the fact that jobs are moved into [the worker's private queue] and are not visible, only to be moved back out for another worker to fight with.

Perhaps the solution is the same, but since the jobs which are being re-run are arguably made of valid json (they run without error & are deleted when super_fetch is disabled) but our cause is different.

If the job were moved to the dead queue for inspection along with the reason for the action, it would certainly be an improvement.

@mperham Interesting. I wasn't thinking that the job was stuck due to parse error of Sidekiq itself :).

An example Hash that is passed as an argument to Sidekiq:

a = {}
# That's the bug that was fixed in the code AFTER a job was already submitted to Sidekiq
a[:poison1] = 1.0/0.0 # Infinity
a[:poison2] = 0.0/0.0 # NaN
# Send to Sidekiq
BadWorker.perform_async(a)

I like your idea of moving the "poising pills" into the dead queue! This will definitely solve the issue on one hand and give us the ability to track these down and handle manually.

@YesThatAllen If this issue is not related to yours, please open a new issue. An open issue is much easier for all to track than a series of emails.

flori was kind enough to point out the problem. I suspect you are creating the JSON by calling JSON.dump but parsing it with JSON.parse. These two APIs are mismatched. You need to use the JSON.dump/load or JSON.generate/parse pairs.

http://ruby-doc.org/stdlib-2.0.0/libdoc/json/rdoc/JSON.html#method-i-generate

@mperham Thanks!

In my case I don't do the JSON serializations but pass a Hash to Sidekiq's worker perform_async call:

a = {}
a['inf'] = 1.0/0.0
a['nan'] = 0.0/0.0

Worker.perform_async a

This reproduces the infinite job retries with reliable fetch.

It looks like you have something else in your app breaking the JSON cycle. I can't reproduce with bare Ruby 2.4.0 and Sidekiq 4.2.7:

$ ruby -v break.rb 
ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-darwin16]
/Users/mikeperham/.gem/ruby/2.4.0/gems/redis-3.3.2/lib/redis/client.rb:459: warning: constant ::Fixnum is deprecated
/Users/mikeperham/.rubies/ruby-2.4.0/lib/ruby/2.4.0/json/common.rb:224:in `generate': 862: Infinity not allowed in JSON (JSON::GeneratorError)
    from /Users/mikeperham/.rubies/ruby-2.4.0/lib/ruby/2.4.0/json/common.rb:224:in `generate'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/sidekiq-4.2.7/lib/sidekiq.rb:180:in `dump_json'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:196:in `block in atomic_push'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:194:in `map'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:194:in `atomic_push'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:179:in `block (2 levels) in raw_push'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/redis-3.3.2/lib/redis.rb:2303:in `block in multi'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/redis-3.3.2/lib/redis.rb:58:in `block in synchronize'
    from /Users/mikeperham/.rubies/ruby-2.4.0/lib/ruby/2.4.0/monitor.rb:214:in `mon_synchronize'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/redis-3.3.2/lib/redis.rb:58:in `synchronize'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/redis-3.3.2/lib/redis.rb:2296:in `multi'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:178:in `block in raw_push'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/connection_pool-2.2.1/lib/connection_pool.rb:64:in `block (2 levels) in with'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `handle_interrupt'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `block in with'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `handle_interrupt'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `with'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:177:in `raw_push'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:68:in `push'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/sidekiq-4.2.7/lib/sidekiq/worker.rb:116:in `client_push'
    from /Users/mikeperham/.gem/ruby/2.4.0/gems/sidekiq-4.2.7/lib/sidekiq/worker.rb:60:in `perform_async'
    from break.rb:12:in `<main>'
require 'json'
require 'sidekiq'

class SomeWorker
  include Sidekiq::Worker
end

a = { 'inf' => 1.0 / 0.0, 'nan' => 0.0 / 0.0 }
#str = JSON.generate(a)
#JSON.parse(str)

p SomeWorker.perform_async(a)

@mperham - Thanks again. I found the culprit. Seems like the hash was wrapped with Rails' (3.2) HashWithIndifferentAccess.

require 'json'
require 'sidekiq'

class SomeWorker
  include Sidekiq::Worker
end

a = { 'inf' => 1.0 / 0.0, 'nan' => 0.0 / 0.0 }
a = ActiveSupport::HashWithIndifferentAccess.new(a)
#str = JSON.generate(a) 
#JSON.parse(str)

p SomeWorker.perform_async(a)

This will cause Sidekiq to enqueue the job serializing the object as Json that will fail to parse.

@motymichaely I need a very simple working script outside of your application which reproduces your problem. Your script does not work as is (it's missing the AS require) and fails for me on the last line as I would expect. There are so many things gems can do to subtly change the semantics of how Ruby does JSON that it's quite possible this is another gem's fault.

$ ruby -v break2.rb 
ruby 2.3.2p217 (2016-11-15 revision 56796) [x86_64-darwin16]
/Users/mikeperham/.gem/ruby/2.3.2/gems/json-2.0.2/lib/json/common.rb:224:in `to_json': 862: Infinity not allowed in JSON (JSON::GeneratorError)
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/json-2.0.2/lib/json/common.rb:224:in `generate'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/json-2.0.2/lib/json/common.rb:224:in `generate'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/sidekiq-4.2.7/lib/sidekiq.rb:180:in `dump_json'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:196:in `block in atomic_push'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:194:in `map'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:194:in `atomic_push'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:179:in `block (2 levels) in raw_push'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/redis-3.3.2/lib/redis.rb:2303:in `block in multi'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/redis-3.3.2/lib/redis.rb:58:in `block in synchronize'
    from /Users/mikeperham/.rubies/ruby-2.3.2/lib/ruby/2.3.0/monitor.rb:214:in `mon_synchronize'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/redis-3.3.2/lib/redis.rb:58:in `synchronize'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/redis-3.3.2/lib/redis.rb:2296:in `multi'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:178:in `block in raw_push'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/connection_pool-2.2.1/lib/connection_pool.rb:64:in `block (2 levels) in with'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `handle_interrupt'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `block in with'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `handle_interrupt'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `with'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:177:in `raw_push'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/sidekiq-4.2.7/lib/sidekiq/client.rb:68:in `push'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/sidekiq-4.2.7/lib/sidekiq/worker.rb:116:in `client_push'
    from /Users/mikeperham/.gem/ruby/2.3.2/gems/sidekiq-4.2.7/lib/sidekiq/worker.rb:60:in `perform_async'
    from break2.rb:14:in `<main>'

@mperham - Sorry for my late reply, had busy days (who doesn't?).

Seems like the issue is with ActiveSupport 3.2. You can follow this gist to reproduce it.

Looks like Rails 4.1.0 moved away from their own custom JSON code to use stdlib.

Removed the old pure-Ruby JSON encoder and switched to a new encoder based on the built-in JSON gem.

Support for encoding BigDecimal as a JSON number, as well as defining custom encode_json methods to control the JSON output has been removed from core. The new encoder will always encode BigDecimals as Strings and ignore any custom encode_json methods.

The old encoder has been extracted into the activesupport-json_encoder gem. Installing that gem will bring back the ability to encode BigDecimals as numbers as well as encode_json support.

Setting the related configuration ActiveSupport.encode_big_decimal_as_string without the activesupport-json_encoder gem installed will raise an error.

ruby-install can't even install Ruby 2.0. I guess it's time to drop Ruby 2.0 and Rails 3.2 support in Sidekiq 5.0.

Sidekiq is calling JSON.generate/parse. I'm not sure what else I can do if other libraries muck with JSON processing. Any thoughts/ideas?

@mperham it's not an issue with Rails 3.2 but with ActiveSupport::HashWithIndifferentAccess JSON generation. So deprecating support for Rails 3.2 doesn't make sense to me.

For instance, the way I was able to get the exact culprit was by using tracer:

require 'tracer'
require 'json'

a = { 'inf' => 1.0 / 0.0, 'nan' => 0.0 / 0.0 }
a = ActiveSupport::HashWithIndifferentAccess.new(a)

Tracer.on
str = JSON.generate(a) 
Tracer.off

I looked for to_json within the trace and found which gem was causing the exception.

In my case, I would love to get an easier way to remove the poison pill rather than trying to handle the JSON generation. (Actually, I wasn't able to remove the bad message yet :().

Can you be more specific? Which gem was causing a problem?

There's no easy way to get rid of poison pills in 4.x, you'll need to dig out the private queue name and remove the job from the list manually. This is slated to improve, #3296 covers how Sidekiq 5.0 will deal with malformed jobs.

Was this page helpful?
0 / 5 - 0 ratings