Sidekiq: Job started / retry date is in 100+ years from now

Created on 30 Nov 2016  ·  46 Comments  ·  Source: mperham/sidekiq

UPDATE (https://github.com/mperham/sidekiq/issues/3265#issuecomment-271681275): It was my fault, of course. Future devs, take note!

Ruby version: 2.3.1
Sidekiq / Pro / Enterprise version(s): 4.1.4

If relevant, please include your initializer and any error message with the full backtrace.

I've started getting RequestTimeTooSkewed errors from S3 in my sidekiq jobs. I tried the Fog "sync_clock" option in my carrierwave initializer, but no dice.

I noticed my job started at times and retry times are being set to this date:

[screenshot: job started/retry timestamps showing a far-future date]

Sometimes the year is 2769. Update: another one came in for the year 2059. I can't find any info through googling about what is happening here. Please help; this is happening on my production server on Heroku, which reports the correct current time in console :(

Most helpful comment

I don't take donations but I'm not demanding a purchase either. 🍰 You can always make the case for the next company you work for; I'll still be here years from now.

All 46 comments

Update: So far I cannot reproduce this problem in my local environment with my production branch.

[screenshot: correct timestamps in the local environment]

The machine's clock is screwed up, possibly a hardware problem. I'd destroy the dyno and get a new one. Sidekiq and Ruby just use whatever the OS returns for the current time so there's nothing to be done with them.

This doesn't explain why, in the console, Rails tells me the correct current time, and the logs show the correct time. Sometimes the Sidekiq jobs do show the correct started time, but then the retry date is 100+ years later. Also, only Sidekiq seems to be having problems.

When I run the jobs "inline" they work fine with S3 operations.

I contacted Heroku support and they haven't answered almost 2 hours later, even though I indicated that this is a production server issue.

Are you sure there isn't any debugging I can do to look into this more? I'm at an absolute loss. I'm stuck. I'm the only developer on this project, I have no one to ask for help, and I don't know what to do.

The only other problem would be if you are using sidekiq_retry_in and your code is returning a bad value. https://github.com/mperham/sidekiq/wiki/Error-Handling#configuration

Nah, I'm not altering the default, and that string does not appear in my code. :(

Are you using any other sidekiq gems? bundle | grep side

I do have the retry count set to 3, but it's been like that for months; this problem started today. I understand it may be related to Heroku dynos, but I can't grasp how Time.current can be correct, log times can be correct, and inline jobs can have no problem with S3 or with showing the year in our emails (we use the current year in some of them).

technovation-app master % bundle | grep side
Using sidekiq 4.1.4
technovation-app master %
# config/initializers/sidekiq.rb

Sidekiq.default_worker_options = {
  backtrace: true,
  retry: 3,
}

To get the retry time, Sidekiq adds seconds to the current time. I don't see how this can screw up unless you've got something fundamentally broken but I don't have any further ideas about what that might be. If this behavior just started, you need to track down what changed.

https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/middleware/server/retry_jobs.rb#L126
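For context, the retry time in that file is just a delay added to the current epoch time. A minimal sketch of the calculation, mirroring the default backoff formula documented in Sidekiq's Error Handling wiki (not the library code itself):

```ruby
# A sketch of Sidekiq's default retry backoff: the delay is a small
# polynomial in the retry count, added to the current epoch time.
def seconds_to_delay(count)
  (count ** 4) + 15 + (rand(30) * (count + 1))
end

count = 0                                 # first retry
delay = seconds_to_delay(count)
retry_at = Time.now.to_f + delay

puts delay              # between 15 and 44 seconds for the first retry
puts Time.at(retry_at)  # minutes from now, never centuries
```

A bad return value from a custom sidekiq_retry_in would inflate the delay term, which is why Mike asks about it above; otherwise only the Time.now term can push the result centuries out.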

Sure. And as I said, sometimes the "started" time is in the far future, too, not only the retry time.

Since I got one of our emails saying "Welcome to Technovation 2063" I guess it's probably not sidekiq's fault, right? Can sidekiq somehow set the time for the code being executed? Or is it all in the worker process?

Sounds like Ruby's Time API is returning the wrong date in the Sidekiq process. The job started timestamp is just Time.now.to_i.

https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/processor.rb#L171

In rails console on production, Time.now.to_i is the correct epoch time :-/ I really appreciate you giving this your time and effort.

I'm convinced a problem is happening on heroku. Would be nice if their support team was more prompt for production server issues.

I guess their disclaimer is 1 business day. Heroku support engineers who visit this link and see my bitching about your response time: Fair enough. I'm sorry.

You can start an irb console on Heroku and see what it returns but I wonder if this isn't a problem with the specific dyno your Sidekiq process is running on.

As I said, I'd ps:scale down to 0 and then back up to 1. Get a new dyno and see what happens.

Oh. Yeah. For sure. I did that (through the web UI for dynos and workers). The problem persisted. So then I also restarted dynos. The problem appeared to go away. Some jobs worked. And then the problem came right back, and jobs and times were messed up again.

I just did the scaling with the Heroku CLI instead and it appears to be working properly again. I'll report back after some time has passed, or if it starts to fail again. Not saying it's on you to help me fix it at this point, but for the sake of anyone else who might have this issue and googles for it.

Final update After re-scaling the web and worker dynos with the Heroku CLI (and apparently not the Web UI), my Sidekiq jobs have continued, hours later, to perform correctly.

Thanks again for your time @mperham I recognize and appreciate that you had no responsibility for this issue. Let me know if I can make it up to you in some way. Cheers.

At some point in the future, consider buying Pro or Enterprise for your company. I can provide this support because people buy my products. :-)

As we are a non profit I imagine making that case would be a challenge.

However, can I donate to your org?

I don't take donations but I'm not demanding a purchase either. 🍰 You can always make the case for the next company you work for; I'll still be here years from now.

@joemsak Sorry to hear about this. Personally, I've never had time-based issues with Sidekiq, so I'm curious what's going on. Do you have an open ticket with Heroku? If so, what's the ID? If you come across it again, @-me on twitter (@jmccartie) so we can catch it in the act.

-Jon
Heroku Support

Hey @mperham --

I was helping Joe on this today and observed the issue, while the underlying server time returned fine (both in Ruby and Bash). We also checked the underlying instance for NTP drift, but found no evidence of that.

I was able to enqueue a job (both through ActiveJob and directly through Sidekiq), which resulted in a "2168" year in the web console. Here's one sitting in the retry queue:

[screenshot: job in the retry queue with a year-2168 timestamp]

Ever seen this before?

Nope, I've never seen that before. What does the job payload look like in Redis?

@mperham

I just saw this happen again. Job was enqueued ok. Timestamp looked fine under "Busy". Here was payload:

{"queue"=>"default", "payload"=>{"class"=>"ActiveJob::QueueAdapters::SidekiqAdapter::JobWrapper", "wrapped"=>"ExportJob", "queue"=>"default", "args"=>[{"job_class"=>"ExportJob", "job_id"=>"52239a1a-bf63-4ab7-88df-f29353ff648f", "queue_name"=>"default", "arguments"=>[{"_aj_globalid"=>"gid://techapp/AdminProfile/2"}]}], "retry"=>3, "backtrace"=>true, "jid"=>"a26d9485a3564cc5583e5d3c", "created_at"=>1480638705.8035896, "enqueued_at"=>1480638705.8038297}, "run_at"=>1480638705}

Job failed. Now in the retry queue (not sure where to get the payload here):

=> {"job_class"=>"ExportJob", "job_id"=>"52239a1a-bf63-4ab7-88df-f29353ff648f", "queue_name"=>"default", "arguments"=>[{"_aj_globalid"=>"gid://technovation-app/AdminProfile/2"}], "locale"=>"en"}

Now the job is in the Retry queue with a retry time of 2640-08-01 08:00:20 UTC. I'm having @joemsak update Sidekiq to latest (4.2.7) to make sure it isn't something that was already fixed (he's on 4.1.1).

Sidekiq::RetrySet.new.each do |job|
  puts job.value
end

@mperham Relevant bits:

"job_id":"52239a1a-bf63-4ab7-88df-f29353ff648f","queue_name":"default"
...
,"retry":3,"backtrace":true,"jid":"a26d9485a3564cc5583e5d3c","created_at":1480638705.8035896,"enqueued_at":1480638705.8038297,"

[screenshot: job payload shown in the Web UI]

Looks fine to me, find a payload that is displaying the bad values and show me the exact payload data in Redis via job.value. Maybe the timestamp is being corrupted somehow (perhaps getting an additional digit?).

irb(main):002:0> Time.at(1480638705.8035896)
=> 2016-12-01 16:31:45 -0800
irb(main):003:0> Time.at(1480638705.8038297)
=> 2016-12-01 16:31:45 -0800

The above is from running Sidekiq::RetrySet.new.first.value. It spits out

=> "{\"class\":\"ActiveJob::QueueAdapters::SidekiqAdapter::JobWrapper\",\"wrapped\":\"ExportJob\",\"queue\":\"default\",\"args\":[{\"job_class\":\"ExportJob\",\"job_id\":\"52239a1a-bf63-4ab7-88df-f29353ff648f\",\"queue_name\":\"default\",\"arguments\"

...(redacted params to preserve customer data)...

\"retry\":3,\"backtrace\":true,\"jid\":\"a26d9485a3564cc5583e5d3c\",\"created_at\":1480638705.8035896,\"enqueued_at\":1480638705.8038297,\"error_message\"

...(error message from S3)...

... (backtrace) ...

Customer has updated to 4.2.7, btw

~ $ cat Gemfile | grep sidekiq
gem 'sidekiq', '~> 4.2.7'

Ah, ok, you're missing the actual retry date. Get it like so:

job = Sidekiq::RetrySet.new.first
puts Time.at(job.score)

If that value is malformed, it's possible the user's app has modified Float#to_s to output invalid values.

Confirmed:

irb(main):002:0> puts Time.at(job.score)
2640-08-01 08:00:20 +0000

@mperham So your guess is that @joemsak monkey-patched Float#to_s somewhere?

Checked into this in the customer's app and couldn't confirm:

> (123.4).method(:to_s).source_location
=> ["/app/vendor/bundle/ruby/2.3.0/gems/activesupport-4.2.6/lib/active_support/core_ext/numeric/conversions.rb", 124]

Thanks for all your help here, Mike.

What is the actual value of score?


irb(main):006:0> j.score
=> 21161548820.0
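Decoding that score as an epoch timestamp reproduces the date shown in the Web UI, and shows the value carries no fractional microseconds (a point raised below). A quick sanity check:

```ruby
# Decode the suspicious sorted-set score as an epoch timestamp.
score = 21161548820.0

puts Time.at(score).utc     # the far-future date shown in the Web UI (year 2640)
puts score == score.floor   # true: no microsecond fraction, unlike Time.now.to_f
```

A clean, fraction-free number centuries ahead looks less like random float corruption and more like a "now" that was itself wrong when the score was computed.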

If Time.now.to_f is fine, could a config value throw off delay_for? https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/middleware/server/retry_jobs.rb#L124

Man, @jmccartie and @mperham both I really appreciate all of this investigation. You'll find the only sidekiq config I'm working with is here https://github.com/mperham/sidekiq/issues/3265#issuecomment-263965449

and my Procfile as @jmccartie can see is

worker: bundle exec sidekiq -e $RACK_ENV -c $SIDEKIQ_SERVER_LIMIT -q default -q mailers -q elasticsearch

You'll have to determine how that odd score came to be. Note that times usually have microseconds on them; the fact that the score has no microseconds is very suspicious to me.

> Time.now.to_f + 10
=> 1480694933.076567

It just so happens that I log the delay value:

logger.debug { "Failure! Retry #{count} in #{delay} seconds" }

You can turn on -v, generate a retry and see what the calculated delay is.

Thanks @mperham

Welp, predictably enough, this one was 100% purely my fault. I used Timecop in production to check users' ages at a future cut-off date, which manipulated the time for multi-threaded workers.

Even though I understood Timecop with a block argument to return time to normal at the end of the block (in fact, I understood it to affect time ONLY within the block), it still manipulates time for every thread in the process, including multi-threaded workers.
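That matches how time-travel gems work: they monkey-patch Time's class methods process-wide, so every thread in the process (including Sidekiq worker threads) sees the travelled clock; the block form only scopes *when* the patch is reverted. A minimal hand-rolled sketch of the mechanism (not Timecop itself; the `travelled_to` knob is illustrative):

```ruby
# Sketch: a process-wide patch on Time.now, the same mechanism time-travel
# gems use. A concurrent "worker" thread still observes the frozen clock
# even though the travel is "scoped" in the main thread's control flow.
class Time
  class << self
    alias_method :real_now, :now
    attr_accessor :travelled_to   # illustrative knob, not Timecop's API

    def now
      travelled_to || real_now
    end
  end
end

seen_by_worker = nil
Time.travelled_to = Time.real_now + (100 * 365 * 24 * 3600)  # ~100 years ahead
worker = Thread.new { seen_by_worker = Time.now }            # a "Sidekiq thread"
worker.join
Time.travelled_to = nil                                      # patch reverted

puts seen_by_worker.year   # ~100 years in the future, despite the "scoped" travel
```

Any job that captured a timestamp while the patch was active would enqueue with a far-future value, exactly the symptom in this thread.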

Lesson I should have already learned, was learned the exceptionally hard way.

I'm very sorry to have used up your guys' time due to my sloppy work. In the future I will slow down and do a more careful review of my own code before posting support issues.

@joemsak Glad you got it figured out!
