Sidekiq Version: Sidekiq 4.2.9 / Pro 3.4.3 / Ent 1.5.1
Redis Version: 3.2.7
Ruby: ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin15]
While executing some tests on batches I've noticed that some are not completing and jobs seems to be disappearing.
All queues are cleared, all retries are executed, and batch still shows on UI with a pending/error count bigger than zero.
Example application:
https://github.com/raivil/sidekiq_issue_3312/, master branch using super_fetch and branch sidekiq_v4.2.9-timed_fetch using timed_fetch.
https://github.com/raivil/sidekiq_issue_3312/tree/master/app/models/jobs/temp
https://github.com/raivil/sidekiq_issue_3312/blob/master/config/initializers/sidekiq.rb
Scenario:
Jobs::Temp::ExampleBatchJob is a job that creates a batch with many subjobs. The number of subjobs is provided as parameter.
Jobs::Temp::ExampleBatchJob code:
class Jobs::Temp::ExampleBatchJob
include Sidekiq::Worker
sidekiq_options queue: :default, unique_for: 1.hour
def perform(number_of_subjobs)
batch = Sidekiq::Batch.new
batch.description = "#{self.class.name} - #{number_of_subjobs}"
batch.on(:success, self.class, "model_id" => number_of_subjobs)
batch.jobs do
1.upto(number_of_subjobs).each do |i|
Jobs::Temp::ExampleSubJob.perform_async(number_of_subjobs, i)
end
end
end
def on_success(_status, options)
Rails.logger.info("Batch processing done!")
end
end
Each subjob will sleep for up to 9 seconds to simulate processing.
A small percentage (10%) will raise exception in order to simulate a failure and be re-executed.
Jobs::Temp::ExampleSubBatchJob code:
class Jobs::Temp::ExampleSubJob
include Sidekiq::Worker
sidekiq_options queue: :default, unique_for: 24.hours
def perform(parent_total_jobs, id)
prng = Random.new
time = prng.rand(9)
sleep time # fake processing delay...
Rails.logger.info("Running job #{id} of #{parent_total_jobs} for #{time} seconds and failing: #{should_fail?(time)}")
raise(Exception, "Random Exception raised 10% of time") if should_fail?(time) #fail 10% of time
end
private
def should_fail?(time)
(time % 9) == 1
end
end
Small and large batches will complete sometimes, but the majority will not:



Enqueueing a job:
Jobs::Temp::ExampleBatchJob.perform_async(30031)
Running sidekiq, rails and rails console:
COUNT=8 bundle exec sidekiqswarm -q critical -q default -q medium -q low
bundle exec rails s
bundle exec rails c
Sidekiq was not restarted while any batch/job was executing.
Sidekiq config:
https://github.com/raivil/sidekiq_issue_3312/blob/master/config/initializers/sidekiq.rb
require_relative "redis"
Sidekiq.configure_server do |config|
config.redis = { url: ENV["REDIS_PROVIDER"] }
config.super_fetch! # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Server
config.reliable_scheduler! # https://github.com/mperham/sidekiq/wiki/Reliability#scheduler
database_url = ENV["DATABASE_URL"]
if database_url.present?
ENV["DATABASE_URL"] = "#{database_url}?pool=25"
ActiveRecord::Base.establish_connection
end
end
Sidekiq.configure_client do |config|
config.redis = { url: ENV["REDIS_PROVIDER"] }
end
Sidekiq.default_worker_options = { "backtrace" => true }
Sidekiq::Client.reliable_push! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Client
Sidekiq::Enterprise.unique! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Ent-Unique-Jobs
Testing with older version
Running the same jobs with a previous version of Sidekiq results in all batches completing successfully.
Sidekiq Version: Sidekiq 4.1.3 / Pro 3.3.2 / Ent 1.3.2
Redis Version: 3.2.7
Ruby: ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin15]
Example application:
https://github.com/raivil/sidekiq_issue_3312/, branch "sidekiq_v4.1.3".
Gemfile.lock https://github.com/raivil/sidekiq_issue_3312/commit/cd63de1167161d7c04fed66aaab6357ec9719c6b
Can you reproduce with the latest Pro? It has an optimization which I think might apply here.
On Mar 6, 2017, at 04:17, Ronaldo Raivil notifications@github.com wrote:
Sidekiq Version: Sidekiq 4.2.9 / Pro 3.4.3 / Ent 1.5.1
Redis Version: 3.2.7
Ruby: ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin15]While executing some tests on batches I've noticed that some are not completing and jobs seems to be disappearing.
All queues are cleared, all retries are executed, and batch still shows on UI with a pending/error count bigger than zero.Example application:
https://github.com/raivil/sidekiq_issue_3312/, master branch using super_fetch and branch sidekiq_v4.2.9-timed_fetch using timed_fetch.
https://github.com/raivil/sidekiq_issue_3312/tree/master/app/models/jobs/temp
https://github.com/raivil/sidekiq_issue_3312/blob/master/config/initializers/sidekiq.rbScenario:
Jobs::Temp::ExampleBatchJob is a job that creates a batch with many subjobs. The number of subjobs is provided as parameter.
Jobs::Temp::ExampleBatchJob code:class Jobs::Temp::ExampleBatchJob
include Sidekiq::Worker
sidekiq_options queue: :default, unique_for: 1.hour
def perform(number_of_subjobs)
batch = Sidekiq::Batch.new
batch.description = "#{self.class.name} - #{number_of_subjobs}"
batch.on(:success, self.class, "model_id" => number_of_subjobs)
batch.jobs do
1.upto(number_of_subjobs).each do |i|
Jobs::Temp::ExampleSubJob.perform_async(number_of_subjobs, i)
end
end
end
def on_success(_status, options)
Rails.logger.info("Batch processing done!")
end
end
Each subjob will sleep for up to 9 seconds to simulate processing.
A small percentage (10%) will raise exception in order to simulate a failure and be re-executed.
Jobs::Temp::ExampleSubBatchJob code:class Jobs::Temp::ExampleSubJob
include Sidekiq::Worker
sidekiq_options queue: :default, unique_for: 24.hoursdef perform(parent_total_jobs, id)
prng = Random.new
time = prng.rand(9)
sleep time # fake processing delay...
Rails.logger.info("Running job #{id} of #{parent_total_jobs} for #{time} seconds and failing: #{should_fail?(time)}")
raise(Exception, "Random Exception raised 10% of time") if should_fail?(time) #fail 10% of time
endprivate
def should_fail?(time)
(time % 9) == 1
endend
Small and large batches will complete sometimes, but the majority will not:Enqueueing a job:
Jobs::Temp::ExampleBatchJob.perform_async(30031)
Running sidekiq, rails and rails console:COUNT=8 bundle exec sidekiqswarm -q critical -q default -q medium -q low
bundle exec rails s
bundle exec rails c
Sidekiq was not restarted while any batch/job was executing.Sidekiq config:
https://github.com/raivil/sidekiq_issue_3312/blob/master/config/initializers/sidekiq.rbrequire_relative "redis"
Sidekiq.configure_server do |config|
config.redis = { url: ENV["REDIS_PROVIDER"] }
config.super_fetch! # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Server
config.reliable_scheduler! # https://github.com/mperham/sidekiq/wiki/Reliability#scheduler
database_url = ENV["DATABASE_URL"]
if database_url.present?
ENV["DATABASE_URL"] = "#{database_url}?pool=25"
ActiveRecord::Base.establish_connection
end
end
Sidekiq.configure_client do |config|
config.redis = { url: ENV["REDIS_PROVIDER"] }
end
Sidekiq.default_worker_options = { "backtrace" => true }
Sidekiq::Client.reliable_push! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Client
Sidekiq::Enterprise.unique! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Ent-Unique-Jobs
Testing with older version
Running the same jobs with a previous version of Sidekiq results in all batches completing successfully.
Sidekiq Version: Sidekiq 4.1.3 / Pro 3.3.2 / Ent 1.3.2
Redis Version: 3.2.7
Ruby: ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin15]Example application:
https://github.com/raivil/sidekiq_issue_3312/, branch "sidekiq_v4.1.3".
Gemfile.lock raivil/sidekiq_issue_3312@cd63de1—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
Yes, same results with Sidekiq 4.2.9 / Pro 3.4.4 / Ent 1.5.2.
Some smaller batches with up to 1k jobs completed.

Can you reproduce with the latest running timed_fetch? IOW, is the problem the version bump or super_fetch?
Does the problem go away if you remove the unique_for option in the Worker?
I have observed this behavior with both timed_fetch and super_fetch. for Sidekiq 4.2.9 / Pro 3.4.3 / Ent 1.5.1.
On a previous version (Sidekiq 4.1.3 / Pro 3.3.2 / Ent 1.3.2) using timed_fetch I dont see this issue.
I'll make some tests without unique_for and let you know.
@mperham
Executed same tests again but this time without unique_for. Most of batches were not completed.
Would you gist me the output of redis-cli info against your Redis where the jobs are disappearing?
So far, all of the batches I've created with your Rails app have finished as expected. I still can't reproduce the behavior.
I was able to reproduce the problem with a 1,000 job batch. When I change the non-standard Exception to a standard RuntimeError, the batch worked. I'm not sure why yet.
http://www.mikeperham.com/2012/03/03/the-perils-of-rescue-exception/
--- a/app/models/jobs/temp/example_sub_job.rb
+++ b/app/models/jobs/temp/example_sub_job.rb
@@ -7,7 +7,7 @@ class Jobs::Temp::ExampleSubJob
time = prng.rand(9)
sleep time # fake processing delay...
Rails.logger.info("Running job #{id} of #{parent_total_jobs} for #{time} seconds and failing: #{should_fail?(t
- raise(Exception, "Random Exception raised 11% of time") if should_fail?(time) #fail 11% of time
+ raise("Random Exception raised 11% of time") if should_fail?(time) #fail 11% of time
end
Hmm, try disabling reliable scheduler. I've got a bad feeling it has a subtle timing bug.
Sidekiq Pro 3.4.5 has been released with this fix. Thank you for your hard work in reproducing this issue!
Most helpful comment
Sidekiq Pro 3.4.5 has been released with this fix. Thank you for your hard work in reproducing this issue!