Sidekiq: RetrySet and ScheduledSet not processed when using a large number of processes

Created on 12 Apr 2017 · 5 comments · Source: mperham/sidekiq

Ruby version: 2.3.3
Sidekiq version: 4.2.10
Sidekiq Pro version: 3.4.5

Problem Description

Steps to reproduce

  1. Set up Sidekiq with 100+ workers (so that Sidekiq::ProcessSet.new.size returns a number > 100)
  2. Create a few scheduled jobs or jobs that require retry
  3. Restart your application

Expected Result

The jobs in the RetrySet and ScheduledSet are worked.

Actual Result

Immediately after restarting, pending jobs in the RetrySet and ScheduledSet are worked. Following that, no forward progress is made on the RetrySet or ScheduledSet for 20 - 30 minutes. The admin page will show Next Retry as "20 minutes ago". After this 20 - 30 minute period, the RetrySet and ScheduledSet have their jobs regularly worked again.

Analysis

The issue seems to be caused by Poller#random_poll_interval, which is defined as:

# Calculates a random interval that is ±50% the desired average.
def random_poll_interval
  poll_interval_average * rand + poll_interval_average.to_f / 2
end

When the app restarts, the following occurs:

  1. All workers wait for Poller#initial_wait (5 - 15 seconds)
  2. After initial_wait, all workers check for work on the RetrySet and ScheduledSet.
  3. All workers then wait at least poll_interval_average / 2 seconds before checking for work again.

By default, poll_interval_average is set to 15 * ProcessSet.size, so if you have more than 100 workers, each one waits at least 100 * 15 / 2 = 750 seconds (12.5 minutes) before checking the RetrySet and ScheduledSet again.
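To make that floor concrete, here's a minimal back-of-the-envelope sketch (ours, not code from Sidekiq) simulating the first two polls for 100 processes under the default poll_interval_average:

# Sketch: simulate the first two polls per process after a cold start,
# assuming the default poll_interval_average of 15 * process_count.
process_count = 100
poll_interval_average = 15.0 * process_count # 1500 seconds

# The ±50% implementation draws uniformly from [avg / 2, 1.5 * avg].
random_poll_interval = -> { poll_interval_average * rand + poll_interval_average / 2 }

poll_times = Array.new(process_count) do
  first  = 5 + 10 * rand # initial_wait: 5 - 15 seconds
  second = first + random_poll_interval.call
  [first, second]
end

puts poll_times.map(&:first).max # => under 15.0: every process polls right after boot
puts poll_times.map(&:last).min  # => at least ~755.0: then nothing for 750+ seconds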

Here's a graph showing the frequency of checks to Redis with 1000 workers.
[graph: frequency of Redis checks over time with the ±50% implementation]

The large spike at the beginning is all workers checking within initial_wait. Note the gap immediately afterwards where Redis is not checked at all. Ignore the trail at the tail... we set up our test to poll exactly twice.

A Solution

We've remedied the issue in our own app by monkey patching random_poll_interval as follows:

# Calculates a random interval that is ±100% the desired average.
def random_poll_interval
  2 * poll_interval_average.to_f * rand
end
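For reference, here's a sketch of how such a monkey patch might be loaded from an app initializer. The Sidekiq::Scheduled::Poller constant and the private poll_interval_average helper match Sidekiq 4.x as we understand it, but verify against the version you run; the file path is just a suggestion.

# e.g. config/initializers/sidekiq_poller_patch.rb (hypothetical path)
require 'sidekiq/scheduled'

module Sidekiq
  module Scheduled
    class Poller
      private

      # Calculates a random interval that is ±100% the desired average.
      def random_poll_interval
        2 * poll_interval_average.to_f * rand
      end
    end
  end
end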

This implementation allows each worker to wait poll_interval_average seconds on average, though with greater variance. A benefit of this greater variance is that it removes the floor of poll_interval_average / 2, which eliminates the gap in the graph above. Here's the distribution of calls to Redis with the above implementation:

[graph: frequency of Redis checks over time with the ±100% implementation]

With this implementation, both the RetrySet and ScheduledSet are checked regularly.

Note that just setting the poll_interval_average config option wouldn't work. If we set it low enough that there's no noticeable gap in checking, each worker checks that often, which slams Redis. A nice property of the ProcessSet-based implementation is that it holds the overall frequency of checks to Redis constant, rather than fixing the frequency of checks made per worker.
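As a quick illustration of that property (our arithmetic, assuming the default of 15 * process_count):

# With poll_interval_average = 15 * process_count, the cluster-wide
# check rate is constant regardless of cluster size.
[1, 10, 100, 1000].each do |process_count|
  poll_interval_average = 15.0 * process_count
  puts format('%4d processes -> one Redis check every %.0f seconds cluster-wide',
              process_count, poll_interval_average / process_count)
end
# Prints "one Redis check every 15 seconds cluster-wide" at every size.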

History

I looked at the history and found https://github.com/mperham/sidekiq/issues/2317. It looks like this 2 * rand implementation was the original, but it was replaced with the ±50% implementation to tighten the variance and remove outliers. While that's important, I'd claim that making no progress on the RetrySet / ScheduledSet is worse behaviour than outliers in the wait time.

Some options I'm happy to write a PR for:

  1. Redefine random_poll_interval to the 2 * poll_interval_average.to_f * rand implementation above.
  2. Add a new parameter defining the bounds around poll_interval_average. I'd set it to 100%; @cainlevy could set it to 50% for his use case. That said, it adds a lot of sophistication, and I don't think it would be an easily understood config parameter.

Looking for any advice on what patch to submit upstream.

All 5 comments

Yikes. So this is a problem with cold starts when all the workers are synchronized? It takes one or two cycles for the workers to settle out into a staggered polling pattern?

I wonder if that could be addressed more directly. Fixing the polling gap on cold starts could be complementary to the ongoing reliability of the ±50% implementation. Choosing between the two sounds like a tough choice to force on any developer, and it's even tougher for a developer to discover that the choice matters for them in the first place.

No more configuration options. The end result should be a poll every N seconds on average. We should be able to math our way past this issue. I'll be happy to see any PR with improvements.

> Yikes. So this is a problem with cold starts when all the workers are synchronized? It takes one or two cycles for the workers to settle out into a staggered polling pattern?

Yes.

> I wonder if that could be addressed more directly. Fixing the polling gap on cold starts could be complementary to the ongoing reliability of the ±50% implementation. Choosing between the two sounds like a tough choice to force on any developer, and it's even tougher for a developer to discover that the choice matters for them in the first place.

Before changing random_poll_interval, we also tried spreading out the checks in the initial period by having workers wait a random amount of time in initial_wait (between 0 and poll_interval_average / 2, instead of 5 - 15 seconds). Unfortunately, this had the effect of blocking workers from checking their own queues after the app restarted.

I like @mperham's approach of mathing our way out of this and not exposing any more config options. One other idea that just came to mind is to have the variance shrink over time based on the number of polls: e.g., start at ±100% for the first poll, ±95% on the second, and trend down to ±50% (or even ±25%). Slightly more complicated, but it solves both the initial gap issue and the large-variance issue. I'll run some simulations to validate that it produces a smooth distribution of a poll every N seconds on average, and get a PR out.
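A rough sketch of that decaying-variance idea (illustrative only; the 5-percentage-point decay per poll and the @poll_count counter are our assumptions, not anything in Sidekiq, and poll_interval_average is Sidekiq's existing private helper):

# Hypothetical: start at ±100% variance and tighten by 5 percentage
# points per poll until settling at ±50%. The mean stays at
# poll_interval_average throughout.
def random_poll_interval
  @poll_count = (@poll_count || 0) + 1
  spread = [1.0 - 0.05 * (@poll_count - 1), 0.5].max # 1.0 -> 0.5
  # Uniform over [avg * (1 - spread), avg * (1 + spread)].
  poll_interval_average * (1.0 - spread + 2 * spread * rand)
end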

I suppose it's unavoidable that relying on randomness will have more frequent worst case scenarios (e.g. everything randomly choosing +100%) with smaller numbers of processes. Thankfully those scenarios are also less costly, on the order of seconds rather than minutes.

@kenrose I'm closing this for now. Please open a PR if you come up with something better than what we have now.
