Ruby version: 2.3.3
Sidekiq version: 4.2.10
Sidekiq Pro version: 3.4.5
We run a large number of Sidekiq processes (Sidekiq::ProcessSet.new.size returns a number > 100). Expected behaviour: the jobs in the RetrySet and ScheduledSet are worked regularly.
Immediately after restarting, pending jobs in the RetrySet and ScheduledSet are worked. Following that, no forward progress is made on the RetrySet or ScheduledSet for 20 - 30 minutes. The admin page will show Next Retry as "20 minutes ago". After this 20 - 30 minute period, the RetrySet and ScheduledSet have their jobs regularly worked again.
The issue seems to be caused by Poller#random_poll_interval, which is defined as:
```ruby
# Calculates a random interval that is ±50% the desired average.
def random_poll_interval
  poll_interval_average * rand + poll_interval_average.to_f / 2
end
```
When the app restarts, the following occurs:
1. Each worker waits `Poller#initial_wait` (5 - 15 seconds).
2. After `initial_wait`, all workers check for work on the RetrySet and ScheduledSet.
3. Each worker then waits at least `poll_interval_average / 2` seconds before checking for work again.

By default, `poll_interval_average` is set to `15 * ProcessSet.size`, so if you have more than 100 workers, workers wait at least `100 * 15 / 2 = 750` seconds before checking the RetrySet and ScheduledSet again.
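The floor imposed by the ±50% formula can be seen with a small calculation (a sketch; `process_count` is a hypothetical stand-in for `Sidekiq::ProcessSet.new.size`):

```ruby
# Sketch: the minimum wait imposed by the ±50% implementation.
# process_count stands in for Sidekiq::ProcessSet.new.size.
process_count = 100
poll_interval_average = 15 * process_count  # Sidekiq's default scaling

# random_poll_interval = avg * rand + avg / 2, so the smallest possible
# interval (when rand returns 0) is avg / 2:
min_interval = poll_interval_average / 2.0
puts min_interval  # => 750.0 seconds, i.e. 12.5 minutes with no polling at all
```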
Here's a graph showing the frequency of checks to Redis with 1000 workers.

The large spike at the beginning is all workers checking within initial_wait. Note the gap immediately afterwards where Redis is not checked at all. Ignore the drop-off at the tail; we set up our test so that each worker polls exactly twice.
We've remedied the issue in our own app by monkey patching random_poll_interval as follows:
```ruby
# Calculates a random interval that is ±100% the desired average.
def random_poll_interval
  2 * poll_interval_average.to_f * rand
end
```
This implementation allows each worker to wait poll_interval_average seconds on average, though with greater variance. A benefit of this greater variance is that it removes the floor of poll_interval_average / 2, which eliminates the gap in the graph above. Here's the distribution of calls to Redis with the above implementation:

With this implementation, both the RetrySet and ScheduledSet are checked regularly.
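The difference between the two formulas can be checked with a quick simulation (a sketch; `avg` is a hypothetical per-process `poll_interval_average` of 15 seconds):

```ruby
# Sketch: sample both implementations and compare their ranges and means.
avg = 15.0
srand(1234) # fixed seed so the sketch is reproducible

half_var = Array.new(10_000) { avg * rand + avg / 2 } # ±50%:  [avg/2, 1.5*avg)
full_var = Array.new(10_000) { 2 * avg * rand }       # ±100%: [0, 2*avg)

mean = ->(xs) { xs.sum / xs.size }
puts format('min=%.2f mean=%.2f (half variance)', half_var.min, mean.(half_var))
puts format('min=%.2f mean=%.2f (full variance)', full_var.min, mean.(full_var))
# Both means hover around 15, but only the ±100% version can fire near 0,
# which is what removes the floor (and the gap) after a cold start.
```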
Note that just setting the poll_interval_average config option wouldn't work. If we set it low enough that there's no noticeable gap in checking, each worker checks that often, which slams Redis. A nice property of the ProcessSet-based implementation is that it considers the overall frequency of checks to Redis, not the frequency of checks made per worker.
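That property can be made concrete with a little arithmetic (illustrative numbers only):

```ruby
# Sketch: why scaling poll_interval_average by process count keeps the
# aggregate load on Redis constant regardless of fleet size.
[1, 10, 100, 1000].each do |n|
  per_process_avg = 15.0 * n               # each process polls this often on average
  aggregate_interval = per_process_avg / n # fleet-wide: one poll per this many seconds
  puts "#{n} processes: one Redis poll every #{aggregate_interval}s overall"
end
# The fleet as a whole always averages one poll every 15 seconds.
```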
I looked at the history and found https://github.com/mperham/sidekiq/issues/2317. It looks like this 2 * rand implementation was the original, but was replaced with the ±50% implementation to tighten the variance and remove outliers. While that's important, I claim that having no progress made on the RetrySet / ScheduledSet is a worse behaviour than outliers in the wait time.
Some options I'm happy to write a PR for:
1. Change random_poll_interval to the implementation above of 2 * poll_interval_average.to_f * rand.
2. Add a config option for the variance around poll_interval_average. I'd set it to 100%, @cainlevy could set it to 50% for his use case. I do think that seems like a lot of sophistication, and I don't think it's an easily understood config parameter.

Looking for any advice on what patch to submit upstream.
Yikes. So this is a problem with cold starts when all the workers are synchronized? It takes one or two cycles for the workers to settle out into a staggered polling pattern?
I wonder if that could be addressed more directly. Fixing the polling gap on cold starts could be complementary to the ongoing reliability of the ±50% implementation. Choosing between the two sounds like a tough choice to force on any developer, and an even tougher choice for a developer to discover is one that matters for them.
No more configuration options. The end result should be a poll every N seconds on average. We should be able to math our way past this issue. I'll be happy to see any PR with improvements.
> Yikes. So this is a problem with cold starts when all the workers are synchronized? It takes one or two cycles for the workers to settle out into a staggered polling pattern?
Yes.
> I wonder if that could be addressed more directly. Fixing the polling gap on cold starts could be complementary to the ongoing reliability of the ±50% implementation.
Before changing random_poll_interval, we also tried spreading out the checks in the initial period by having workers wait a random amount of time in initial_wait (between 0 and poll_interval_average / 2, instead of 5 - 15 seconds). Unfortunately, this had the effect of blocking workers from checking their own queues after the app restarted.
I like @mperham's approach of mathing our way out of this and not exposing any more config options. One other idea that just came to mind is to have the variance reduce over time based on the number of polls. e.g., start at ±100% for the first poll, ±95% on the second poll, and trend down to ±50% (or even ±25%). Slightly more complicated, but solves both the initial gap issue and the large variance issue. I'll run some simulations with that to validate that it results in a smooth distribution of a poll every N seconds on average and get a PR out.
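The decaying-variance idea could be sketched roughly as follows. This is hypothetical; `DecayingPoller`, its method names, and the 5-points-per-poll decay rate are all illustrative, not Sidekiq's API:

```ruby
# Hypothetical sketch of the decaying-variance idea. Variance starts at
# ±100% and shrinks 5 percentage points per poll until it settles at ±50%.
class DecayingPoller
  def initialize(poll_interval_average)
    @avg = poll_interval_average.to_f
    @polls = 0
  end

  # Half-width of the random window, as a fraction of the average.
  def variance
    [1.0 - 0.05 * @polls, 0.5].max
  end

  # Uniform on [avg * (1 - v), avg * (1 + v)), so the mean is always avg.
  def random_poll_interval
    v = variance
    @polls += 1
    @avg * (1 - v) + 2 * v * @avg * rand
  end
end

poller = DecayingPoller.new(15)
first = poller.random_poll_interval # drawn from [0, 30): can fire immediately
```

With variance 1.0 this reduces to the original `2 * avg * rand` formula (no floor on a cold start), and with variance 0.5 it reduces to the current ±50% formula, so the two behaviours blend over the first several polls.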
I suppose it's unavoidable that relying on randomness will have more frequent worst case scenarios (e.g. everything randomly choosing +100%) with smaller numbers of processes. Thankfully those scenarios are also less costly, on the order of seconds rather than minutes.
@kenrose I'm closing this for now. Please open a PR if you come up with something better.