(Taking this opportunity to thank you and everyone else that has contributed to Sidekiq! Very happy user over many years.)
There is an edge case I've diagnosed twice in the last 6 months in an app I help maintain. I thought it might be useful to at least document the problem for other people encountering this, but I'm actually curious whether there's any interest in better error messaging from Sidekiq itself.
The edge case may also be relevant in future if Sidekiq (or Rails) ever stages jobs to a database table to benefit from database transaction/rollback guarantees.
In summary: `Redis::CommandError: ERR invalid expire time in set` errors may appear if a job is (a) unique, (b) scheduled, (c) staged in a database table instead of pushed to Redis, and (d) not pushed to Redis within a certain time threshold. (Part (c) is not Sidekiq's own functionality.)
Suppose all of the following about a job:

- It is unique, with `unique_for` set to U minutes.
- It is scheduled with `perform_at` or `perform_in`, S minutes from now.
- It is staged in a database table instead of being pushed directly to Redis, and a separate publisher process later pushes it to Redis.

In this setup, there may be a delay in pushing the jobs from the database table to Redis. Suppose a network problem occurs or there's an issue with the publisher process. If the delay for a job lasts longer than U + S minutes, the push attempt for that job will start failing with an error (the ddtrace lines aren't relevant):
```
Redis::CommandError: ERR invalid expire time in set
from redis/client.rb:127:in `call'
from ddtrace/contrib/redis/patcher.rb:46:in `block in call'
from ddtrace/tracer.rb:283:in `trace'
from ddtrace/contrib/redis/patcher.rb:40:in `call'
from redis.rb:846:in `block in set'
from redis.rb:69:in `block in synchronize'
from monitor.rb:235:in `mon_synchronize'
from redis.rb:69:in `synchronize'
from redis.rb:844:in `set'
from sidekiq-ent/unique.rb:96:in `block in call'
```
The issue is with the calculation of the SET command's `px` option within Sidekiq Ent:

```ruby
px: ((ttl - now) * 1000).to_i
```
Because the job failed to push for U + S minutes, `ttl - now` goes negative (since `ttl` = the scheduled time + the uniqueness expiry duration), and Redis sees a SET with an invalid option. Since the staged job in the database has fixed timestamps, the job will never successfully push to Redis.
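To make the arithmetic concrete, here's a minimal repro of the calculation (the variable names and values are illustrative, not Sidekiq Ent's actual internals):

```ruby
now            = Time.now.to_f
scheduled_time = now - (7 * 60) # scheduled for S = 5 minutes out, but pushed 12 minutes late
unique_for     = 60             # U = 1 minute of uniqueness

# ttl is the scheduled time plus the uniqueness window; once the push is
# delayed past U + S minutes, ttl lands in the past relative to `now`.
ttl = scheduled_time + unique_for
px  = ((ttl - now) * 1000).to_i
# => -360000, a negative PX that Redis rejects with "ERR invalid expire time in set"
```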
There are a few obvious options, including increasing the uniqueness threshold (U) or the schedule threshold (S), or fixing the issue in the third-party staging code with retries or a clearer exception.
One I'd like to propose, though: clearer error messaging from within Sidekiq if sidekiq-ent would ever call `conn.set` with a negative `px`. Perhaps something about the job's relevant timestamps being in the past, so that someone encountering this bug has a better chance of knowing where the problem area is. Currently the error is very vague and doesn't provide much insight into the possible source.
That said, I would totally understand a "this is a third-party problem" won't fix outcome!
For completeness:
Thanks for the kind words, Adam!
I'm happy to handle rare errors like this as long as it doesn't add significant overhead. Should the client middleware raise an error and break the publish process or should it log an error and return nil?
I think it makes sense for this case to generate a raised exception, lowering the chances of silently dropping work. Maybe with exception messaging along the lines of "unique job expiration time must be in the future". (It occurs to me that with a <= check it could also catch misconfiguration like unique_for: 0, which based on #3753 at least isn't considered valid.)
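Roughly something like this, maybe (a sketch; `lock_key`/`payload` and the surrounding structure are assumed, not actual sidekiq-ent internals):

```ruby
px = ((ttl - now) * 1000).to_i
if px <= 0
  # <= rather than < also catches misconfiguration like unique_for: 0
  raise ArgumentError, "unique job expiration time must be in the future (px: #{px})"
end
conn.set(lock_key, payload, nx: true, px: px)
```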
It did occur to me that it _might_ be acceptable to log a warning about the negative px value, and then essentially ignore the uniqueness condition and push the job to Redis anyway? App/library code that is dealing with deferred pushes could certainly handle that case explicitly to prevent/handle the exception coming out of Sidekiq. But maybe, in the general case, "you're asking me to make this job unique for -5 seconds, I'm going to pretend there is no uniqueness condition" behavior is too lax for Sidekiq itself?
I'd err on the side of simplicity and log an error rather than raise. Uniqueness is already documented as best effort. This is another edge case where uniqueness might fail.
When the error is logged, would the job still push to Redis, or would the whole push attempt be cut off? (I had read your "log an error and return nil" as the latter, but maybe you meant the former. I'm certainly on board if the job push still happens!)
The fix logs an error and does not push the job payload. The idea being that if you are publishing in a loop, I don't want one bad job to lead to further job pushes failing. wdyt?
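e.g. with a hypothetical staged-jobs publisher loop like this (`StagedJob` and `payload` are app-level names, not Sidekiq's), a raised exception kills the whole loop, while log+skip only loses the one bad job:

```ruby
StagedJob.find_each do |staged|
  # staged.payload is assumed to be a Sidekiq job hash ("class", "args", etc).
  # If push raises for one stale unique job, every later staged job in this
  # loop is also blocked until someone intervenes.
  Sidekiq::Client.push(staged.payload)
  staged.delete
end
```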
With log+skip it feels like it would be harder to understand why jobs weren't executed when it seems like they were (unless you know what logging to look for). With an exception (as today), a mini stuck backlog forms that requires manual intervention, which isn't great. With log+push-anyway the best-effort uniqueness is followed and job processing auto-corrects after the intermittent problem is over.
Log+push-anyway is basically the manual fix we were going to go with anyway, and I don't think our publish loop is affected in that case. Is there something I'm overlooking with the publish loop aspect?
Put another way: if there's a Redis connectivity problem for 10 minutes that causes this failure mode, it seems _much_ better for jobs to show up late than for the work to be dropped.
Well if you have a job that is unique for a minute and you’ve been down for 60 minutes you could possibly have 60 jobs to be published all at once and get quite a stack of duplicate jobs.
Does that fall under the umbrella of best effort? Any actual duplicate work (from the 1 to 59 excess jobs) that happens would, to me, be a sign that the job isn't structured properly anyway and needs something more than just best-effort uniqueness. It definitely seems really surprising to have a failure mode that favors a 0-job outcome over duplication.
To anchor the example in my head for when there _isn't_ a big stack of duplicate jobs: with `FooJob.perform_at(5.minutes.from_now, order_id)` and `unique_for: 5.minutes`, in the face of >10 minutes of Redis connection failure it seems preferable to have that job arrive late than be dropped.
Fair enough. I'll update it to push without the unique lock.
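Something along these lines (a sketch of the shape of the change, not the final sidekiq-ent code; `lock_key`, `jid`, and the yield-based middleware structure are approximations):

```ruby
px = ((ttl - now) * 1000).to_i
if px <= 0
  Sidekiq.logger.error { "Unique lock expires in the past, pushing job without a lock (px: #{px})" }
  yield # push the job anyway, skipping the unique lock entirely
elsif conn.set(lock_key, jid, nx: true, px: px)
  yield # lock acquired; continue the push as usual
end
```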
Sounds great, thank you!