Ruby version: 2.5.7, 2.6.6
Sidekiq / Pro / Enterprise version(s): sidekiq (6.1.1) sidekiq-pro (5.1.1)
We faced intermittent DNS issues with one of our Redis instances. When that happened, jobs that should have been sent to Sidekiq A got sent to Sidekiq B, with reliable_push enabled in Sidekiq Pro.
It seems that somewhere down the line the local_queue lost track of the correct @redis_pool assigned to each job. I can't dig any deeper right now :(
I made a reproduction script that creates two Sidekiq clients. Each client pushes to a different Redis DB and queue, and one of them simulates intermittent 'connection issues'. The failure also reproduces with two separate Redis instances.
In the end I expect that jobs pushed through client B never end up on client A's Redis. That should work as stated here, right?
# frozen_string_literal: true

require 'bundler/inline'

gemfile(true) do
  source "https://rubygems.org"
  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  gem 'minitest'
  gem 'redis'
  gem 'byebug'

  source "https://gems.contribsys.com/" do
    gem 'sidekiq-pro', '< 6'
  end
end

# Redis subclass that lets only every third LPUSH through,
# simulating intermittent connection issues.
class RedisWithIssues < Redis
  @@count = 0

  def lpush(*args)
    @@count += 1
    if @@count % 3 == 0
      super(*args)
    else
      raise Redis::BaseError.new('Intermittent connection issues')
    end
  end
end

REDIS_A = ConnectionPool.new { RedisWithIssues.new(db: 1) } # flaky instance
REDIS_B = ConnectionPool.new { Redis.new(db: 2) }           # healthy instance

Sidekiq::Client.reliable_push!

class MyWorker; include Sidekiq::Worker; end

require "minitest/autorun"

class BugTest < Minitest::Test
  def setup
    @db_one = Redis.new(db: 1)
    @db_two = Redis.new(db: 2)

    cleanup_keys = %w(queues queue:a_queue queue:b_queue)
    @db_one.del(cleanup_keys)
    @db_two.del(cleanup_keys)

    @client_A = Sidekiq::Client.new(REDIS_A)
    @client_B = Sidekiq::Client.new(REDIS_B)
  end

  def test_pushing_jobs_for_two_no_threads
    assert_equal [], @db_one.smembers('queues')
    assert_equal [], @db_two.smembers('queues')

    10.times { @client_A.push('queue' => 'a_queue', 'class' => 'AWork', 'args' => [1, 2, 3]) }
    10.times { @client_B.push('queue' => 'b_queue', 'class' => 'BWork', 'args' => [1, 2, 3]) }

    assert_equal ["a_queue"], @db_one.smembers('queues')
    assert_equal ["b_queue"], @db_two.smembers('queues')
  end
end
Fetching source index from https://gems.contribsys.com/
Fetching gem metadata from https://rubygems.org/.......
Fetching source index from https://rubygems.org/
Resolving dependencies.....
Using bundler 2.1.4
Using byebug 11.1.3
Using connection_pool 2.2.3
Using minitest 5.14.1
Using rack 2.2.3
Using redis 4.2.1
Using sidekiq 6.1.1
Using sidekiq-pro 5.1.1
Run options: --seed 9098
# Running:
2020-08-15T19:49:34.699Z pid=24419 tid=oxrtihsa3 WARN: [ReliablePush] Enqueueing job locally due to Redis::BaseError: Intermittent connection issues
2020-08-15T19:49:34.700Z pid=24419 tid=oxrtihsa3 INFO: [ReliablePush] Connectivity restored, draining local queue
2020-08-15T19:49:34.700Z pid=24419 tid=oxrtihsa3 INFO: [ReliablePush] Connectivity restored, draining local queue
2020-08-15T19:49:34.701Z pid=24419 tid=oxrtihsa3 INFO: [ReliablePush] Connectivity restored, draining local queue
2020-08-15T19:49:34.701Z pid=24419 tid=oxrtihsa3 INFO: [ReliablePush] Connectivity restored, draining local queue
2020-08-15T19:49:34.702Z pid=24419 tid=oxrtihsa3 INFO: [ReliablePush] Connectivity restored, draining local queue
2020-08-15T19:49:34.703Z pid=24419 tid=oxrtihsa3 INFO: [ReliablePush] Connectivity restored, draining local queue
2020-08-15T19:49:34.704Z pid=24419 tid=oxrtihsa3 INFO: [ReliablePush] Connectivity restored, draining local queue
2020-08-15T19:49:34.704Z pid=24419 tid=oxrtihsa3 INFO: [ReliablePush] Connectivity restored, draining local queue
2020-08-15T19:49:34.704Z pid=24419 tid=oxrtihsa3 INFO: [ReliablePush] Connectivity restored, draining local queue
F
Finished in 0.008914s, 112.1831 runs/s, 336.5492 assertions/s.
1) Failure:
BugTest#test_pushing_jobs_for_two_no_threads [issue.rb:62]:
Expected: ["a_queue"]
Actual: ["a_queue", "b_queue"]
1 runs, 3 assertions, 1 failures, 0 errors, 0 skips
Running MONITOR against Redis, I saw LPUSH being called on DB 1 when it should have gone to DB 2:
1597521041.377776 [0 127.0.0.1:56186] "select" "1"
1597521041.377903 [1 127.0.0.1:56186] "del" "queues" "queue:a_queue" "queue:b_queue"
1597521041.378255 [0 127.0.0.1:56187] "select" "2"
1597521041.378427 [2 127.0.0.1:56187] "del" "queues" "queue:a_queue" "queue:b_queue"
1597521041.378603 [1 127.0.0.1:56186] "smembers" "queues"
1597521041.378748 [2 127.0.0.1:56187] "smembers" "queues"
1597521041.379812 [0 127.0.0.1:56188] "select" "1"
1597521041.379932 [1 127.0.0.1:56188] "multi"
1597521041.380020 [1 127.0.0.1:56188] "sadd" "queues" "a_queue"
1597521041.380032 [1 127.0.0.1:56188] "lpush" "queue:a_queue" "{\"retry\":true,\"queue\":\"a_queue\",\"class\":\"AWork\",\"args\":[1,2,3],\"jid\":\"9fc06a546eca74a0c9fc2c88\",\"created_at\":1597521041.379431,\"enqueued_at\":1597521041.379452}"
1597521041.380060 [1 127.0.0.1:56188] "exec"
1597521041.380720 [1 127.0.0.1:56188] "multi"
1597521041.380730 [1 127.0.0.1:56188] "sadd" "queues" "a_queue"
1597521041.380734 [1 127.0.0.1:56188] "lpush" "queue:a_queue" "{\"retry\":true,\"queue\":\"a_queue\",\"class\":\"AWork\",\"args\":[1,2,3],\"jid\":\"620cc9befdc6fb33ffe6057c\",\"created_at\":1597521041.380595,\"enqueued_at\":1597521041.3806121}"
1597521041.380755 [1 127.0.0.1:56188] "exec"
1597521041.381151 [1 127.0.0.1:56188] "multi"
1597521041.381409 [1 127.0.0.1:56188] "sadd" "queues" "a_queue"
1597521041.381415 [1 127.0.0.1:56188] "lpush" "queue:a_queue" "{\"retry\":true,\"queue\":\"a_queue\",\"class\":\"AWork\",\"args\":[1,2,3],\"jid\":\"d39924e7f26640d34e8de41d\",\"created_at\":1597521041.381037,\"enqueued_at\":1597521041.381054}"
1597521041.381449 [1 127.0.0.1:56188] "exec"
1597521041.382053 [1 127.0.0.1:56188] "multi"
1597521041.382080 [1 127.0.0.1:56188] "sadd" "queues" "a_queue"
1597521041.382084 [1 127.0.0.1:56188] "lpush" "queue:a_queue" "{\"retry\":true,\"queue\":\"a_queue\",\"class\":\"AWork\",\"args\":[1,2,3],\"jid\":\"3be62352be1fe6b13eed3d14\",\"created_at\":1597521041.381906,\"enqueued_at\":1597521041.381925}"
1597521041.382104 [1 127.0.0.1:56188] "exec"
1597521041.383085 [0 127.0.0.1:56189] "select" "2"
1597521041.383260 [2 127.0.0.1:56189] "multi"
1597521041.383345 [2 127.0.0.1:56189] "sadd" "queues" "b_queue"
1597521041.383353 [2 127.0.0.1:56189] "lpush" "queue:b_queue" "{\"retry\":true,\"queue\":\"b_queue\",\"class\":\"BWork\",\"args\":[1,2,3],\"jid\":\"7afa41d7d9595e503e977062\",\"created_at\":1597521041.382675,\"enqueued_at\":1597521041.382745}"
1597521041.383405 [2 127.0.0.1:56189] "exec"
1597521041.383749 [1 127.0.0.1:56188] "multi"
1597521041.383789 [1 127.0.0.1:56188] "sadd" "queues" "a_queue"
1597521041.383794 [1 127.0.0.1:56188] "lpush" "queue:a_queue" "{\"retry\":true,\"queue\":\"a_queue\",\"class\":\"AWork\",\"args\":[1,2,3],\"jid\":\"0aca677f324890975a5ab8d8\",\"created_at\":1597521041.3793511,\"enqueued_at\":1597521041.3835871}"
1597521041.383839 [1 127.0.0.1:56188] "exec"
1597521041.384521 [1 127.0.0.1:56188] "multi"
1597521041.384544 [1 127.0.0.1:56188] "sadd" "queues" "b_queue"
1597521041.384549 [1 127.0.0.1:56188] "lpush" "queue:b_queue" "{\"retry\":true,\"queue\":\"b_queue\",\"class\":\"BWork\",\"args\":[1,2,3],\"jid\":\"8c057027cea84d5057db52a5\",\"created_at\":1597521041.384383,\"enqueued_at\":1597521041.3844}"
1597521041.384569 [1 127.0.0.1:56188] "exec"
1597521041.385134 [1 127.0.0.1:56188] "multi"
1597521041.385469 [1 127.0.0.1:56188] "sadd" "queues" "b_queue"
1597521041.385475 [1 127.0.0.1:56188] "lpush" "queue:b_queue" "{\"retry\":true,\"queue\":\"b_queue\",\"class\":\"BWork\",\"args\":[1,2,3],\"jid\":\"7b9402f290776446b29f2744\",\"created_at\":1597521041.385003,\"enqueued_at\":1597521041.3850188}"
1597521041.385496 [1 127.0.0.1:56188] "exec"
1597521041.386031 [1 127.0.0.1:56188] "multi"
1597521041.386073 [1 127.0.0.1:56188] "sadd" "queues" "b_queue"
1597521041.386091 [1 127.0.0.1:56188] "lpush" "queue:b_queue" "{\"retry\":true,\"queue\":\"b_queue\",\"class\":\"BWork\",\"args\":[1,2,3],\"jid\":\"9a64a9ad44b497c83e923566\",\"created_at\":1597521041.3859258,\"enqueued_at\":1597521041.385946}"
1597521041.386131 [1 127.0.0.1:56188] "exec"
1597521041.386783 [1 127.0.0.1:56188] "multi"
1597521041.386808 [1 127.0.0.1:56188] "sadd" "queues" "b_queue"
1597521041.386813 [1 127.0.0.1:56188] "lpush" "queue:b_queue" "{\"retry\":true,\"queue\":\"b_queue\",\"class\":\"BWork\",\"args\":[1,2,3],\"jid\":\"2b63fe085ca95cad63a35a8d\",\"created_at\":1597521041.386692,\"enqueued_at\":1597521041.386709}"
1597521041.386860 [1 127.0.0.1:56188] "exec"
1597521041.387225 [1 127.0.0.1:56186] "smembers" "queues"
1597521041.388060 [1 127.0.0.1:56188] "multi"
1597521041.388086 [1 127.0.0.1:56188] "sadd" "queues" "a_queue"
1597521041.388090 [1 127.0.0.1:56188] "lpush" "queue:a_queue" "{\"retry\":true,\"queue\":\"a_queue\",\"class\":\"AWork\",\"args\":[1,2,3],\"jid\":\"0f553b15288f766457764d43\",\"created_at\":1597521041.380961,\"enqueued_at\":1597521041.387948}"
1597521041.388138 [1 127.0.0.1:56188] "exec"
This is a wonderfully thorough issue, thank you for your hard work in submitting it. I was able to reproduce locally, determine the issue and fix it. During the drain process, a second Redis exception could cause the wrong pool to be associated with the client. So B's client would get A's pool.
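To illustrate that failure mode, here is a heavily simplified, hypothetical model — FakePool, ReliableClient, and LOCAL_QUEUE are illustrations, not Sidekiq Pro's actual internals. It shows how a second exception while draining a buffer of (pool, job) pairs can leave a client bound to another client's pool, so all of its later jobs land on the wrong Redis:

```ruby
# LOCAL_QUEUE buffers [pool, job] pairs when a push fails,
# as a reliable-push implementation might.
LOCAL_QUEUE = []

# Stand-in for a Redis connection pool; fails the first N pushes.
class FakePool
  attr_reader :name, :jobs

  def initialize(name, failures: 0)
    @name = name
    @jobs = []
    @failures = failures
  end

  def lpush(job)
    if @failures > 0
      @failures -= 1
      raise "connection error to #{@name}"
    end
    @jobs << job
  end
end

class ReliableClient
  def initialize(pool)
    @pool = pool
  end

  def push(job)
    drain
    @pool.lpush(job)
  rescue RuntimeError
    LOCAL_QUEUE << [@pool, job] # remember the job's intended pool
  end

  def drain
    until LOCAL_QUEUE.empty?
      buffered_pool, buffered_job = LOCAL_QUEUE.first
      # Modeled bug: the client rebinds itself to the buffered job's pool
      # and never restores its own, so if a second failure happens here,
      # this client keeps using the wrong Redis from now on.
      @pool = buffered_pool
      @pool.lpush(buffered_job)
      LOCAL_QUEUE.shift
    end
  end
end

pool_a = FakePool.new("A", failures: 2)
pool_b = FakePool.new("B")
client_a = ReliableClient.new(pool_a)
client_b = ReliableClient.new(pool_b)

client_a.push("a1")  # fails, buffered with A's pool
client_b.push("b1")  # drain hits a second failure; client B now holds A's pool
client_b.push("b2")  # everything drains and pushes through A

puts pool_a.jobs.inspect  # ["a1", "b1", "b2"] -- B's jobs ended up on A
puts pool_b.jobs.inspect  # []
```

This mirrors the MONITOR trace above, where BWork payloads get LPUSHed on DB 1 through A's connection.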
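As a general defensive pattern (a sketch under my own naming, not the actual sidekiq-pro patch): when draining a buffer of (pool, job) pairs, push each entry through its own pool without ever reassigning the client's pool reference, so a mid-drain failure cannot re-associate jobs or clients with the wrong Redis:

```ruby
# Stand-in for a Redis connection pool; fails the first N pushes.
class FakePool
  attr_reader :jobs

  def initialize(failures: 0)
    @jobs = []
    @failures = failures
  end

  def lpush(job)
    if @failures > 0
      @failures -= 1
      raise "connection error"
    end
    @jobs << job
  end
end

LOCAL_QUEUE = []

# Push each buffered job through the pool it was buffered with.
# On failure the entry stays in place, its pool association intact.
def drain(local_queue)
  until local_queue.empty?
    pool, job = local_queue.first
    pool.lpush(job)    # may raise; nothing has been rebound
    local_queue.shift
  end
rescue RuntimeError
  # Drain stops early; retry later with associations preserved.
end

pool_a = FakePool.new(failures: 1)
pool_b = FakePool.new

LOCAL_QUEUE << [pool_a, "a1"] << [pool_b, "b1"]

drain(LOCAL_QUEUE)  # a1 fails once; drain stops, queue unchanged
drain(LOCAL_QUEUE)  # retry: a1 goes to A, b1 goes to B

puts pool_a.jobs.inspect  # ["a1"]
puts pool_b.jobs.inspect  # ["b1"]
```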