Sidekiq: Sidekiq batches are not completing.

Created on 6 Mar 2017  Â·  11Comments  Â·  Source: mperham/sidekiq

Sidekiq Version: Sidekiq 4.2.9 / Pro 3.4.3 / Ent 1.5.1
Redis Version: 3.2.7
Ruby: ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin15]

While executing some tests on batches I've noticed that some are not completing and jobs seems to be disappearing.
All queues are cleared, all retries are executed, and batch still shows on UI with a pending/error count bigger than zero.

Example application:
https://github.com/raivil/sidekiq_issue_3312/, master branch using super_fetch and branch sidekiq_v4.2.9-timed_fetch using timed_fetch.
https://github.com/raivil/sidekiq_issue_3312/tree/master/app/models/jobs/temp
https://github.com/raivil/sidekiq_issue_3312/blob/master/config/initializers/sidekiq.rb

Scenario:
Jobs::Temp::ExampleBatchJob is a job that creates a batch with many subjobs. The number of subjobs is provided as parameter.
Jobs::Temp::ExampleBatchJob code:

class Jobs::Temp::ExampleBatchJob
  include Sidekiq::Worker
  sidekiq_options queue: :default, unique_for: 1.hour
  def perform(number_of_subjobs)
    batch = Sidekiq::Batch.new
    batch.description = "#{self.class.name} - #{number_of_subjobs}"
    batch.on(:success, self.class, "model_id" => number_of_subjobs)
    batch.jobs do
      1.upto(number_of_subjobs).each do |i|
        Jobs::Temp::ExampleSubJob.perform_async(number_of_subjobs, i)
      end
    end
  end
  def on_success(_status, options)
    Rails.logger.info("Batch processing done!")
  end
end

Each subjob will sleep for up to 9 seconds to simulate processing.
A small percentage (10%) will raise exception in order to simulate a failure and be re-executed.
Jobs::Temp::ExampleSubBatchJob code:

class Jobs::Temp::ExampleSubJob
  include Sidekiq::Worker
  sidekiq_options queue: :default, unique_for: 24.hours

  def perform(parent_total_jobs, id)
    prng = Random.new
    time = prng.rand(9)
    sleep time # fake processing delay...
    Rails.logger.info("Running job #{id} of #{parent_total_jobs} for #{time} seconds and failing: #{should_fail?(time)}")
    raise(Exception, "Random Exception raised 10% of time") if should_fail?(time) #fail 10% of time
  end

  private

  def should_fail?(time)
    (time % 9) == 1
  end

end

Small and large batches will complete sometimes, but the majority will not:

screen shot 2017-03-03 at 15 11 34
screen shot 2017-03-03 at 15 11 22
screen shot 2017-03-03 at 15 21 33

Enqueueing a job:

Jobs::Temp::ExampleBatchJob.perform_async(30031)

Running sidekiq, rails and rails console:

COUNT=8 bundle exec sidekiqswarm -q critical -q default -q medium -q low
bundle exec rails s
bundle exec rails c

Sidekiq was not restarted while any batch/job was executing.

Sidekiq config:
https://github.com/raivil/sidekiq_issue_3312/blob/master/config/initializers/sidekiq.rb

require_relative "redis"
Sidekiq.configure_server do |config|
  config.redis = { url: ENV["REDIS_PROVIDER"] }
  config.super_fetch! # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Server
  config.reliable_scheduler! # https://github.com/mperham/sidekiq/wiki/Reliability#scheduler
  database_url = ENV["DATABASE_URL"]
  if database_url.present?
    ENV["DATABASE_URL"] = "#{database_url}?pool=25"
    ActiveRecord::Base.establish_connection
  end
end
Sidekiq.configure_client do |config|
  config.redis = { url: ENV["REDIS_PROVIDER"] }
end
Sidekiq.default_worker_options = { "backtrace" => true }
Sidekiq::Client.reliable_push! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Client
Sidekiq::Enterprise.unique! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Ent-Unique-Jobs

Testing with older version
Running the same jobs with a previous version of Sidekiq results in all batches completing successfully.
Sidekiq Version: Sidekiq 4.1.3 / Pro 3.3.2 / Ent 1.3.2
Redis Version: 3.2.7
Ruby: ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin15]

Example application:
https://github.com/raivil/sidekiq_issue_3312/, branch "sidekiq_v4.1.3".
Gemfile.lock https://github.com/raivil/sidekiq_issue_3312/commit/cd63de1167161d7c04fed66aaab6357ec9719c6b

Most helpful comment

Sidekiq Pro 3.4.5 has been released with this fix. Thank you for your hard work in reproducing this issue!

All 11 comments

Can you reproduce with the latest Pro? It has an optimization which I think might apply here.

On Mar 6, 2017, at 04:17, Ronaldo Raivil notifications@github.com wrote:

Sidekiq Version: Sidekiq 4.2.9 / Pro 3.4.3 / Ent 1.5.1
Redis Version: 3.2.7
Ruby: ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin15]

While executing some tests on batches I've noticed that some are not completing and jobs seems to be disappearing.
All queues are cleared, all retries are executed, and batch still shows on UI with a pending/error count bigger than zero.

Example application:
https://github.com/raivil/sidekiq_issue_3312/, master branch using super_fetch and branch sidekiq_v4.2.9-timed_fetch using timed_fetch.
https://github.com/raivil/sidekiq_issue_3312/tree/master/app/models/jobs/temp
https://github.com/raivil/sidekiq_issue_3312/blob/master/config/initializers/sidekiq.rb

Scenario:
Jobs::Temp::ExampleBatchJob is a job that creates a batch with many subjobs. The number of subjobs is provided as parameter.
Jobs::Temp::ExampleBatchJob code:

class Jobs::Temp::ExampleBatchJob
include Sidekiq::Worker
sidekiq_options queue: :default, unique_for: 1.hour
def perform(number_of_subjobs)
batch = Sidekiq::Batch.new
batch.description = "#{self.class.name} - #{number_of_subjobs}"
batch.on(:success, self.class, "model_id" => number_of_subjobs)
batch.jobs do
1.upto(number_of_subjobs).each do |i|
Jobs::Temp::ExampleSubJob.perform_async(number_of_subjobs, i)
end
end
end
def on_success(_status, options)
Rails.logger.info("Batch processing done!")
end
end
Each subjob will sleep for up to 9 seconds to simulate processing.
A small percentage (10%) will raise exception in order to simulate a failure and be re-executed.
Jobs::Temp::ExampleSubBatchJob code:

class Jobs::Temp::ExampleSubJob
include Sidekiq::Worker
sidekiq_options queue: :default, unique_for: 24.hours

def perform(parent_total_jobs, id)
prng = Random.new
time = prng.rand(9)
sleep time # fake processing delay...
Rails.logger.info("Running job #{id} of #{parent_total_jobs} for #{time} seconds and failing: #{should_fail?(time)}")
raise(Exception, "Random Exception raised 10% of time") if should_fail?(time) #fail 10% of time
end

private

def should_fail?(time)
(time % 9) == 1
end

end
Small and large batches will complete sometimes, but the majority will not:

Enqueueing a job:

Jobs::Temp::ExampleBatchJob.perform_async(30031)
Running sidekiq, rails and rails console:

COUNT=8 bundle exec sidekiqswarm -q critical -q default -q medium -q low
bundle exec rails s
bundle exec rails c
Sidekiq was not restarted while any batch/job was executing.

Sidekiq config:
https://github.com/raivil/sidekiq_issue_3312/blob/master/config/initializers/sidekiq.rb

require_relative "redis"
Sidekiq.configure_server do |config|
config.redis = { url: ENV["REDIS_PROVIDER"] }
config.super_fetch! # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Server
config.reliable_scheduler! # https://github.com/mperham/sidekiq/wiki/Reliability#scheduler
database_url = ENV["DATABASE_URL"]
if database_url.present?
ENV["DATABASE_URL"] = "#{database_url}?pool=25"
ActiveRecord::Base.establish_connection
end
end
Sidekiq.configure_client do |config|
config.redis = { url: ENV["REDIS_PROVIDER"] }
end
Sidekiq.default_worker_options = { "backtrace" => true }
Sidekiq::Client.reliable_push! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Client
Sidekiq::Enterprise.unique! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Ent-Unique-Jobs
Testing with older version
Running the same jobs with a previous version of Sidekiq results in all batches completing successfully.
Sidekiq Version: Sidekiq 4.1.3 / Pro 3.3.2 / Ent 1.3.2
Redis Version: 3.2.7
Ruby: ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin15]

Example application:
https://github.com/raivil/sidekiq_issue_3312/, branch "sidekiq_v4.1.3".
Gemfile.lock raivil/sidekiq_issue_3312@cd63de1

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.

Yes, same results with Sidekiq 4.2.9 / Pro 3.4.4 / Ent 1.5.2.

Some smaller batches with up to 1k jobs completed.

screen shot 2017-03-06 at 11 27 35

Can you reproduce with the latest running timed_fetch? IOW, is the problem the version bump or super_fetch?

Does the problem go away if you remove the unique_for option in the Worker?

I have observed this behavior with both timed_fetch and super_fetch. for Sidekiq 4.2.9 / Pro 3.4.3 / Ent 1.5.1.

On a previous version (Sidekiq 4.1.3 / Pro 3.3.2 / Ent 1.3.2) using timed_fetch I dont see this issue.

I'll make some tests without unique_for and let you know.

@mperham
Executed same tests again but this time without unique_for. Most of batches were not completed.

Would you gist me the output of redis-cli info against your Redis where the jobs are disappearing?

So far, all of the batches I've created with your Rails app have finished as expected. I still can't reproduce the behavior.

I was able to reproduce the problem with a 1,000 job batch. When I change the non-standard Exception to a standard RuntimeError, the batch worked. I'm not sure why yet.

http://www.mikeperham.com/2012/03/03/the-perils-of-rescue-exception/

--- a/app/models/jobs/temp/example_sub_job.rb
+++ b/app/models/jobs/temp/example_sub_job.rb
@@ -7,7 +7,7 @@ class Jobs::Temp::ExampleSubJob
     time = prng.rand(9)
     sleep time # fake processing delay...
     Rails.logger.info("Running job #{id} of #{parent_total_jobs} for #{time} seconds and failing: #{should_fail?(t
-    raise(Exception, "Random Exception raised 11% of time") if should_fail?(time) #fail 11% of time
+    raise("Random Exception raised 11% of time") if should_fail?(time) #fail 11% of time
   end

Hmm, try disabling reliable scheduler. I've got a bad feeling it has a subtle timing bug.

Sidekiq Pro 3.4.5 has been released with this fix. Thank you for your hard work in reproducing this issue!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

paul-ylz picture paul-ylz  Â·  4Comments

aglushkov picture aglushkov  Â·  3Comments

fatcatt316 picture fatcatt316  Â·  4Comments

sandstrom picture sandstrom  Â·  3Comments

nikhilm492 picture nikhilm492  Â·  4Comments