(Taking this opportunity to thank you and everyone else that has contributed to Sidekiq! Very happy user over many years.)
There is an edge case I've diagnosed twice in the last 6 months in an app I help maintain. I thought it might be useful to at least document the problem for other people encountering this, but I'm actually curious whether there's any interest in better error messaging from Sidekiq itself.
The edge case may also be relevant in future if Sidekiq (or Rails) ever stages jobs to a database table to benefit from database transaction/rollback guarantees.
In summary: `Redis::CommandError: ERR invalid expire time in set` errors may appear if a job is (a) unique, (b) scheduled, (c) staged in a database table instead of pushed to Redis, and (d) not pushed to Redis within a certain time threshold. (Part (c) is not Sidekiq's own functionality.)
Suppose all of the following about a job:

- It is unique, with `unique_for` set to U minutes.
- It is scheduled with `perform_at` or `perform_in`, S minutes from now.
- It is staged in a database table instead of being pushed directly to Redis, and a separate publisher process later pushes it to Redis.

In this setup, there may be a delay in pushing the jobs from the database table to Redis. Suppose a network problem occurs or there's an issue with the publisher process. If the delay for a job lasts longer than U + S minutes, the push attempt for that job will start failing with an error (the ddtrace lines aren't relevant):
```
Redis::CommandError: ERR invalid expire time in set
from redis/client.rb:127:in `call'
from ddtrace/contrib/redis/patcher.rb:46:in `block in call'
from ddtrace/tracer.rb:283:in `trace'
from ddtrace/contrib/redis/patcher.rb:40:in `call'
from redis.rb:846:in `block in set'
from redis.rb:69:in `block in synchronize'
from monitor.rb:235:in `mon_synchronize'
from redis.rb:69:in `synchronize'
from redis.rb:844:in `set'
from sidekiq-ent/unique.rb:96:in `block in call'
```
The issue is with the calculation of the SET command's `px` option within Sidekiq Ent:

```ruby
px: ((ttl - now) * 1000).to_i
```
Because the job failed to push for U + S minutes, `ttl - now` goes negative (since `ttl` = the scheduled time + the uniqueness expiry duration), and Redis sees a SET with an invalid option. Since the staged job in the database has fixed timestamps, the job will never successfully push to Redis.
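To make the arithmetic concrete, here's a minimal repro of the calculation (the variable names and values are illustrative, not Sidekiq Ent's actual internals):

```ruby
now            = Time.now.to_f
scheduled_time = now - (7 * 60) # scheduled for S = 5 minutes out, but pushed 12 minutes late
unique_for     = 60             # U = 1 minute of uniqueness

# ttl is the scheduled time plus the uniqueness window; once the push is
# delayed past U + S minutes, ttl lands in the past relative to `now`.
ttl = scheduled_time + unique_for
px  = ((ttl - now) * 1000).to_i
# => -360000, a negative PX that Redis rejects with "ERR invalid expire time in set"
```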
There are a few obvious options, including increasing the uniqueness threshold (U) or the schedule threshold (S), or fixing the issue in the third-party staging code with retries or a clearer exception.
One I'd like to propose, though: clearer error messaging from within Sidekiq if sidekiq-ent would ever call `conn.set` with a negative `px`. Perhaps something about the job's relevant timestamps being in the past, so that someone encountering this bug has a better chance of knowing where the problem area is. Currently the error is very vague and doesn't provide much insight into the possible source.
That said, I would totally understand a "this is a third-party problem" won't fix outcome!
For completeness:
Thanks for the kind words, Adam!
I'm happy to handle rare errors like this as long as it doesn't add significant overhead. Should the client middleware raise an error and break the publish process or should it log an error and return nil?
I think it makes sense for this case to generate a raised exception, lowering the chances of silently dropping work. Maybe with exception messaging along the lines of "unique job expiration time must be in the future". (It occurs to me that with a <= check it could also catch misconfiguration like unique_for: 0, which based on #3753 at least isn't considered valid.)
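Roughly something like this, maybe (a sketch; `lock_key`/`payload` and the surrounding structure are assumed, not actual sidekiq-ent internals):

```ruby
px = ((ttl - now) * 1000).to_i
if px <= 0
  # <= rather than < also catches misconfiguration like unique_for: 0
  raise ArgumentError, "unique job expiration time must be in the future (px: #{px})"
end
conn.set(lock_key, payload, nx: true, px: px)
```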
It did occur to me that it _might_ be acceptable to log a warning about the negative px value, and then essentially ignore the uniqueness condition and push the job to Redis anyway? App/library code that is dealing with deferred pushes could certainly handle that case explicitly to prevent/handle the exception coming out of Sidekiq. But maybe, in the general case, "you're asking me to make this job unique for -5 seconds, I'm going to pretend there is no uniqueness condition" behavior is too lax for Sidekiq itself?
I'd err on the side of simplicity and log an error rather than raise. Uniqueness is already documented as best effort. This is another edge case where uniqueness might fail.
When the error is logged, would the job still push to Redis, or would the whole push attempt be cut off? (I had read your "log an error and return nil" as the latter, but maybe you meant the former. I'm certainly on board if the job push still happens!)
The fix logs an error and does not push the job payload. The idea being that if you are publishing in a loop, I don't want one bad job to lead to further job pushes failing. wdyt?
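e.g. with a hypothetical staged-jobs publisher loop like this (`StagedJob` and `payload` are app-level names, not Sidekiq's), a raised exception kills the whole loop, while log+skip only loses the one bad job:

```ruby
StagedJob.find_each do |staged|
  # staged.payload is assumed to be a Sidekiq job hash ("class", "args", etc).
  # If push raises for one stale unique job, every later staged job in this
  # loop is also blocked until someone intervenes.
  Sidekiq::Client.push(staged.payload)
  staged.delete
end
```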
With log+skip it feels like it would be harder to understand why jobs weren't executed when it seems like they were (unless you know what logging to look for). With an exception (as today), a mini stuck backlog forms that requires manual intervention, which isn't great. With log+push-anyway the best-effort uniqueness is followed and job processing auto-corrects after the intermittent problem is over.
Log+push-anyway is basically the manual fix we were going to go with anyway, and I don't think our publish loop is affected in that case. Is there something I'm overlooking with the publish loop aspect?
Put another way: if there's a Redis connectivity problem for 10 minutes that causes this failure mode, it seems _much_ better for jobs to show up late than for the work to be dropped.
Well if you have a job that is unique for a minute and you’ve been down for 60 minutes you could possibly have 60 jobs to be published all at once and get quite a stack of duplicate jobs.
Does that fall under the umbrella of best effort? Any actual duplicate work (from the 1 to 59 excess jobs) that happens would, to me, be a sign that the job isn't structured properly anyway and needs something more than just best-effort uniqueness. It definitely seems really surprising to have a failure mode that favors a 0-job outcome over duplication.
To anchor the example in my head for when there _isn't_ a big stack of duplicate jobs: with `FooJob.perform_at(5.minutes.from_now, order_id)` and `unique_for: 5.minutes`, in the face of >10 minutes of Redis connection failure it seems preferable to have that job arrive late than be dropped.
Fair enough. I'll update it to push without the unique lock.
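Something along these lines (a sketch of the shape of the change, not the final sidekiq-ent code; `lock_key`, `jid`, and the yield-based middleware structure are approximations):

```ruby
px = ((ttl - now) * 1000).to_i
if px <= 0
  Sidekiq.logger.error { "Unique lock expires in the past, pushing job without a lock (px: #{px})" }
  yield # push the job anyway, skipping the unique lock entirely
elsif conn.set(lock_key, jid, nx: true, px: px)
  yield # lock acquired; continue the push as usual
end
```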
Sounds great, thank you!