Sidekiq: Job started / retry date is in 100+ years from now

Created on 30 Nov 2016  ·  46 Comments  ·  Source: mperham/sidekiq

UPDATE (https://github.com/mperham/sidekiq/issues/3265#issuecomment-271681275): It was my fault, of course. Future devs, take note!

Ruby version: 2.3.1
Sidekiq / Pro / Enterprise version(s): 4.1.4

If relevant, please include your initializer and any error message with the full backtrace.

I've started getting RequestTimeTooSkewed errors from S3 in my sidekiq jobs. I tried the Fog "sync_clock" option in my carrierwave initializer, but no dice.

I noticed my job started at times and retry times are being set to this date:

[screenshot: job started/retry timestamps showing a far-future date]

Sometimes the year is 2769. Update: another one came in for the year 2059. I can't find any info through googling about what is happening here. Please help; this is happening on my production server on Heroku, which reports the correct current time in console :(

Most helpful comment

I don't take donations but I'm not demanding a purchase either. 🍰 You can always make the case for the next company you work for; I'll still be here years from now.

All 46 comments

Update: So far I cannot reproduce this problem in my local environment with my production branch.

[screenshot: correct timestamps in the local environment]

The machine's clock is screwed up, possibly a hardware problem. I'd destroy the dyno and get a new one. Sidekiq and Ruby just use whatever the OS returns for the current time so there's nothing to be done with them.

This doesn't explain why, in the console, Rails tells me the correct current time, and the logs show the correct time. Sometimes the Sidekiq jobs do show the correct started time, but then the retry date is 100+ years later. Also, only Sidekiq seems to be having problems.

When I run the jobs "inline" they work fine with S3 operations.

I contacted Heroku support and they haven't answered almost 2 hours later, even though I indicated that this is a production server issue.

Are you sure there isn't any debugging I can do to look into this more? I'm at an absolute loss. I'm stuck. I'm the only developer on this project, I have no one to ask for help, and I don't know what to do.

The only other problem would be if you are using sidekiq_retry_in and your code is returning a bad value. https://github.com/mperham/sidekiq/wiki/Error-Handling#configuration

Nah, I'm not altering the default, and that string does not appear in my code. :(

Are you using any other sidekiq gems? bundle | grep side

I do have the retry count set to 3, but it's been like that for months; this problem started today. I understand it may be related to Heroku dynos, but I can't grasp how Time.current can be correct, log times can be correct, and inline jobs can have no problem with S3 or with showing the year in our emails (we use the current year in some of them).

technovation-app master % bundle | grep side
Using sidekiq 4.1.4
technovation-app master %
# config/initializers/sidekiq.rb

Sidekiq.default_worker_options = {
  backtrace: true,
  retry: 3,
}

To get the retry time, Sidekiq adds seconds to the current time. I don't see how this can screw up unless you've got something fundamentally broken but I don't have any further ideas about what that might be. If this behavior just started, you need to track down what changed.

https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/middleware/server/retry_jobs.rb#L126
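For context, the retry time in that file is just a delay added to the current epoch time. A minimal sketch of the calculation, mirroring the default backoff formula documented in Sidekiq's Error Handling wiki (not the library code itself):

```ruby
# A sketch of Sidekiq's default retry backoff: the delay is a small
# polynomial in the retry count, added to the current epoch time.
def seconds_to_delay(count)
  (count ** 4) + 15 + (rand(30) * (count + 1))
end

count = 0                                 # first retry
delay = seconds_to_delay(count)
retry_at = Time.now.to_f + delay

puts delay              # between 15 and 44 seconds for the first retry
puts Time.at(retry_at)  # minutes from now, never centuries
```

A bad return value from a custom sidekiq_retry_in would inflate the delay term, which is why Mike asks about it above; otherwise only the Time.now term can push the result centuries out.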

Sure. And as I said, sometimes the "started" time is in the far future, too, not only the retry time.

Since I got one of our emails saying "Welcome to Technovation 2063" I guess it's probably not sidekiq's fault, right? Can sidekiq somehow set the time for the code being executed? Or is it all in the worker process?

Sounds like Ruby's Time API is returning the wrong date in the Sidekiq process. The job started timestamp is just Time.now.to_i.

https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/processor.rb#L171

In rails console on production, Time.now.to_i is the correct epoch time :-/ I really appreciate you giving this your time and effort.

I'm convinced a problem is happening on heroku. Would be nice if their support team was more prompt for production server issues.

I guess their disclaimer is 1 business day. Heroku support engineers who visit this link and see my bitching about your response time: Fair enough. I'm sorry.

You can start an irb console on Heroku and see what it returns but I wonder if this isn't a problem with the specific dyno your Sidekiq process is running on.

As I said, I'd ps:scale down to 0 and then back up to 1. Get a new dyno and see what happens.

Oh. Yeah. For sure. I did that (through the web UI for dynos and workers). The problem persisted. So then I also restarted dynos. The problem appeared to go away. Some jobs worked. And then the problem came right back, and jobs and times were messed up again.

I just did the scaling with the Heroku CLI instead and it appears to be working properly again. I'll report back after some time has passed, or if it starts to fail again. Not saying it's on you to help me fix it at this point, but for the sake of anyone else who might have this issue and googles for it.

Final update After re-scaling the web and worker dynos with the Heroku CLI (and apparently not the Web UI), my Sidekiq jobs have continued, hours later, to perform correctly.

Thanks again for your time @mperham I recognize and appreciate that you had no responsibility for this issue. Let me know if I can make it up to you in some way. Cheers.

At some point in the future, consider buying Pro or Enterprise for your company. I can provide this support because people buy my products. :-)

As we are a non profit I imagine making that case would be a challenge.

However, can I donate to your org?

I don't take donations but I'm not demanding a purchase either. 🍰 You can always make the case for the next company you work for; I'll still be here years from now.

@joemsak Sorry to hear about this. Personally, I've never had time-based issues with Sidekiq, so I'm curious what's going on. Do you have an open ticket with Heroku? If so, what's the ID? If you come across it again, @-me on twitter (@jmccartie) so we can catch it in the act.

-Jon
Heroku Support

Hey @mperham --

I was helping Joe on this today and observed the issue, while the underlying server time returned fine (both in Ruby and Bash). We also checked the underlying instance for NTP drift, but found no evidence of that.

I was able to enqueue a job (both through ActiveJob and directly through Sidekiq), which resulted in a "2168" year in the web console. Here's one sitting in the retry queue:

[screenshot: job in the retry queue with a year-2168 timestamp]

Ever seen this before?

Nope, I've never seen that before. What does the job payload look like in Redis?

@mperham

I just saw this happen again. Job was enqueued ok. Timestamp looked fine under "Busy". Here was payload:

{"queue"=>"default", "payload"=>{"class"=>"ActiveJob::QueueAdapters::SidekiqAdapter::JobWrapper", "wrapped"=>"ExportJob", "queue"=>"default", "args"=>[{"job_class"=>"ExportJob", "job_id"=>"52239a1a-bf63-4ab7-88df-f29353ff648f", "queue_name"=>"default", "arguments"=>[{"_aj_globalid"=>"gid://techapp/AdminProfile/2"}]}], "retry"=>3, "backtrace"=>true, "jid"=>"a26d9485a3564cc5583e5d3c", "created_at"=>1480638705.8035896, "enqueued_at"=>1480638705.8038297}, "run_at"=>1480638705}

Job failed. Now in the retry queue (not sure where to get the payload here):

=> {"job_class"=>"ExportJob", "job_id"=>"52239a1a-bf63-4ab7-88df-f29353ff648f", "queue_name"=>"default", "arguments"=>[{"_aj_globalid"=>"gid://technovation-app/AdminProfile/2"}], "locale"=>"en"}

Now the job is in the Retry queue with a retry time of 2640-08-01 08:00:20 UTC. I'm having @joemsak update Sidekiq to latest (4.2.7) to make sure it isn't something that was already fixed (he's on 4.1.1).

Sidekiq::RetrySet.new.each do |job|
  puts job.value
end

@mperham Relevant bits:

"job_id":"52239a1a-bf63-4ab7-88df-f29353ff648f","queue_name":"default"
...
,"retry":3,"backtrace":true,"jid":"a26d9485a3564cc5583e5d3c","created_at":1480638705.8035896,"enqueued_at":1480638705.8038297,"

[screenshot: job payload shown in the Web UI]

Looks fine to me, find a payload that is displaying the bad values and show me the exact payload data in Redis via job.value. Maybe the timestamp is being corrupted somehow (perhaps getting an additional digit?).

irb(main):002:0> Time.at(1480638705.8035896)
=> 2016-12-01 16:31:45 -0800
irb(main):003:0> Time.at(1480638705.8038297)
=> 2016-12-01 16:31:45 -0800

The above is from running Sidekiq::RetrySet.new.first.value. It spits out

=> "{\"class\":\"ActiveJob::QueueAdapters::SidekiqAdapter::JobWrapper\",\"wrapped\":\"ExportJob\",\"queue\":\"default\",\"args\":[{\"job_class\":\"ExportJob\",\"job_id\":\"52239a1a-bf63-4ab7-88df-f29353ff648f\",\"queue_name\":\"default\",\"arguments\"

...(redacted params to preserve customer data)...

\"retry\":3,\"backtrace\":true,\"jid\":\"a26d9485a3564cc5583e5d3c\",\"created_at\":1480638705.8035896,\"enqueued_at\":1480638705.8038297,\"error_message\"

...(error message from S3)...

... (backtrace) ...

Customer has updated to 4.2.7, btw

~ $ cat Gemfile | grep sidekiq
gem 'sidekiq', '~> 4.2.7'

Ah, ok, you're missing the actual retry date. Get it like so:

job = Sidekiq::RetrySet.new.first
puts Time.at(job.score)

If that value is malformed, it's possible the user's app has modified Float#to_s to output invalid values.

Confirmed:

irb(main):002:0> puts Time.at(job.score)
2640-08-01 08:00:20 +0000

@mperham So your guess is that @joemsak monkey-patched Float#to_s somewhere?

Checked into this in the customer's app and couldn't confirm:

> (123.4).method(:to_s).source_location
=> ["/app/vendor/bundle/ruby/2.3.0/gems/activesupport-4.2.6/lib/active_support/core_ext/numeric/conversions.rb", 124]

Thanks for all your help here, Mike.

What is the actual value of score?


irb(main):006:0> j.score
=> 21161548820.0
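Decoding that score as an epoch timestamp reproduces the date shown in the Web UI, and shows the value carries no fractional microseconds (a point raised below). A quick sanity check:

```ruby
# Decode the suspicious sorted-set score as an epoch timestamp.
score = 21161548820.0

puts Time.at(score).utc     # the far-future date shown in the Web UI (year 2640)
puts score == score.floor   # true: no microsecond fraction, unlike Time.now.to_f
```

A clean, fraction-free number centuries ahead looks less like random float corruption and more like a "now" that was itself wrong when the score was computed.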

If Time.now.to_f is fine, could a config value throw off delay_for? https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/middleware/server/retry_jobs.rb#L124

Man, @jmccartie and @mperham both I really appreciate all of this investigation. You'll find the only sidekiq config I'm working with is here https://github.com/mperham/sidekiq/issues/3265#issuecomment-263965449

and my Procfile as @jmccartie can see is

worker: bundle exec sidekiq -e $RACK_ENV -c $SIDEKIQ_SERVER_LIMIT -q default -q mailers -q elasticsearch

You'll have to determine how that odd score came to be. Note that times usually have microseconds on them; the fact that the score has no microseconds is very suspicious to me.

> Time.now.to_f + 10
=> 1480694933.076567

It just so happens that I log the delay value:

logger.debug { "Failure! Retry #{count} in #{delay} seconds" }

You can turn on -v, generate a retry and see what the calculated delay is.

Thanks @mperham

Welp, predictably enough, this one was 100% purely my fault. I used Timecop in production to check users' ages at a future cut-off date, which manipulated the time for multi-threaded workers.

Even though I understood Timecop with a block argument to return time to normal at the end of the block (in fact, I understood it to affect time ONLY within the block), it still manipulates time for every thread in the process, including multi-threaded workers.
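That matches how time-travel gems work: they monkey-patch Time's class methods process-wide, so every thread in the process (including Sidekiq worker threads) sees the travelled clock; the block form only scopes *when* the patch is reverted. A minimal hand-rolled sketch of the mechanism (not Timecop itself; the `travelled_to` knob is illustrative):

```ruby
# Sketch: a process-wide patch on Time.now, the same mechanism time-travel
# gems use. A concurrent "worker" thread still observes the frozen clock
# even though the travel is "scoped" in the main thread's control flow.
class Time
  class << self
    alias_method :real_now, :now
    attr_accessor :travelled_to   # illustrative knob, not Timecop's API

    def now
      travelled_to || real_now
    end
  end
end

seen_by_worker = nil
Time.travelled_to = Time.real_now + (100 * 365 * 24 * 3600)  # ~100 years ahead
worker = Thread.new { seen_by_worker = Time.now }            # a "Sidekiq thread"
worker.join
Time.travelled_to = nil                                      # patch reverted

puts seen_by_worker.year   # ~100 years in the future, despite the "scoped" travel
```

Any job that captured a timestamp while the patch was active would enqueue with a far-future value, exactly the symptom in this thread.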

Lesson I should have already learned, was learned the exceptionally hard way.

I'm very sorry to have used up your guys' time due to my sloppy work. In the future I will slow down and do a more careful review of my own code before posting support issues.

@joemsak Glad you got it figured out!
