Ruby version: 2.2.4
Sidekiq / Pro / Enterprise version(s): Sidekiq 4.1.2 (switched to Sidekiq Pro 3.3.3, which does not seem to be affected)
For the last few weeks, we have been having problems with disappearing Sidekiq jobs. We’re using Rails v4.2.7 and Sidekiq 4.1.2.
We have implemented tracking of jobs through the database to help us identify the issue. We create a new record when a job is enqueued and then update it when the job executes. This is implemented using ActiveJob's before_enqueue and around_perform callbacks, roughly as sketched below.
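For reference, a minimal sketch of what that tracking can look like (the TrackedJob/BackgroundJobStatus names and the performed_at column are illustrative; the enqueue-side columns match the background_job_statuses INSERT in the Rails log below):

# Simplified ActiveJob-based tracking. A row is created on enqueue and
# updated once the job actually runs, so a job that was enqueued but
# never performed stands out in the table.
class TrackedJob < ActiveJob::Base
  before_enqueue do |job|
    arg = job.arguments.first
    BackgroundJobStatus.create!(
      job_id:            job.job_id,
      job_name:          job.class.name,
      job_argument_id:   arg.respond_to?(:id) ? arg.id : nil,
      job_argument_type: arg.class.name,
      enqueued_at:       Time.current
    )
  end

  around_perform do |job, block|
    block.call
    # performed_at is an illustrative column name; any "executed" marker works.
    BackgroundJobStatus.where(job_id: job.job_id).update_all(performed_at: Time.current)
  end
end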
From 13 September to 16 September, we logged 7 cases of disappearing jobs (out of over 10,000 tracked jobs). In the Rails log, we can see that the job was successfully enqueued; however, the Sidekiq log has no mention of that job ID anywhere.
Rails Log:
Sep 13 09:56:28 [166]: Started POST "/api/webhooks/google_calendar_push_notification" for x.x.x.x. at 2016-09-13 14:56:28 +0000
Sep 13 09:56:28 [166]: Processing by API::WebhooksController#google_calendar_push_notification as */*
Sep 13 09:56:28 [166]: Calendar Load (0.3ms) SELECT "calendars".* FROM "calendars" WHERE "calendars"."id" = $1 LIMIT 1 [["id", 163]]
Sep 13 09:56:28 [166]: [ActiveJob] (0.2ms) BEGIN
Sep 13 09:56:28 [166]: [ActiveJob] SQL (0.3ms) INSERT INTO "background_job_statuses" ("job_id", "job_name", "job_argument_id", "job_argument_type", "enqueued_at", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6, $7) RETURNING "id" [["job_id", "7b124d00-69bc-410d-a52c-4a3c2e0b7f3f"], ["job_name", "SyncWithCalendarChangesJob"], ["job_argument_id", 163], ["job_argument_type", "Calendar"], ["enqueued_at", "2016-09-13 14:56:28.948052"], ["created_at", "2016-09-13 14:56:28.949949"], ["updated_at", "2016-09-13 14:56:28.949949"]]
Sep 13 09:56:28 [166]: [ActiveJob] (1.6ms) COMMIT
Sep 13 09:56:28 [166]: [ActiveJob] Enqueued SyncWithCalendarChangesJob (Job ID: 7b124d00-69bc-410d-a52c-4a3c2e0b7f3f) to Sidekiq(default) with arguments: gid://top-scheduler/Calendar/163
Sep 13 09:56:28 [166]: Completed 200 OK in 12ms (ActiveRecord: 2.4ms)
Sidekiq Log:
Sep 13 09:56:21 [9]: [ActiveJob] [SyncWithCalendarChangesJob] [9d0ddaac-d883-482b-be73-a55b52b9ed97] Performed SyncWithCalendarChangesJob from Sidekiq(default) in 1136.37ms
Sep 13 09:56:45 [9]: [Rollbar] Sending item
Sep 13 09:56:45 [9]: [Rollbar] Success
Sep 13 09:56:47 [9]: Calendar Load (0.3ms) SELECT "calendars".* FROM "calendars" WHERE "calendars"."id" = $1 LIMIT 1 [["id", 180]]
Sep 13 09:56:47 [9]: [ActiveJob] [SyncWithCalendarChangesJob] [45a6dafc-1ee2-4ccd-8b34-ed1292157f48] Performing SyncWithCalendarChangesJob from Sidekiq(default) with arguments: gid://top-scheduler/Calendar/180
We can see the job IDs of jobs executed before and after the disappeared job, but there is simply no trace of the disappeared one. The Redis log shows no errors at that time. We have Redis configured with noeviction, but there are no evictions logged and it hasn't used more than 3% of allocated memory (32mb out of 1gb) in the last 45 days.
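For anyone who wants to run the same checks, these are stock redis-cli commands:

redis-cli CONFIG GET maxmemory-policy     # expect: noeviction
redis-cli INFO stats | grep evicted_keys  # expect: evicted_keys:0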
We switched to Sidekiq Pro on 16 September and enabled reliable push and timed fetch. Since then we haven't had any lost jobs, but we would still love to know what exactly happened on the regular version, since no errors were logged anywhere and yet the jobs were lost.
I don't know what you want to achieve with this issue. How could we determine what happened to your jobs?
I was hoping that some of the core team members, who have much more detailed knowledge of Sidekiq internals, could give us some pointers on where to look. The problem seems to be fixed with Sidekiq Pro, and we were planning to switch to it in any case, so we're good as far as our product goes. However, we'd like to determine, or at least have a reasonable theory about, what was going on previously. Is there any way you can help us?
Well, there are lots of things that can go wrong, up and down the stack. I generally don't comment when the situation is nebulous with little hard data to go on. For example, if you had a crashing Sidekiq process, it's possible a job got pulled from Redis but the process crashed before the logging middleware logged the JID as starting...
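For context, a simplified view of why that crash scenario loses jobs with the standard (non-Pro) fetch; the queue name here is illustrative:

# Sidekiq's default BasicFetch does a blocking pop, which removes the
# job payload from Redis the moment it is fetched:
queue, payload = Sidekiq.redis { |conn| conn.brpop("queue:default", timeout: 2) }
# If the process dies at this point, before the job runs or the logging
# middleware records the JID, the payload is gone from Redis and exists
# nowhere else. Pro's reliable/timed fetch is designed to close this window.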
We are encountering the same problem.
Ruby 2.3.0
Sidekiq 4.1.2
Rails 4.2.6
We have a stable application running in production where we process ~50 jobs per day with Sidekiq. Just yesterday, we noticed one job that was queued in Sidekiq and never executed. It simply disappeared from Sidekiq, and Sidekiq jumped to the next one and continued normally.
There is no trace for the job's ID in Sidekiq's logs. The Rails, Sidekiq and Redis logs are all clean without any reported problems. All processes were up and running at that time.
I do understand that such an intermittent issue is very hard to track without any further details and that it might be occurring anywhere in the stack. However, any guidance on troubleshooting this issue at Sidekiq's side would be much appreciated.
I've got the same problem.
The same problem. It looks like .perform_async silently fails to enqueue some jobs for some reason.
Same problem. I spent some time debugging this issue, and I found that jobs disappear at the time they get processed. Job scheduling and enqueuing work without a hitch. The problem occurs in the main processor call, inside the reloader's call block. For testing, I commented out the following lines in the Reloader class, and everything started to work normally:
# The Reloader's call method, with the Rails reloader wrap disabled for testing:
def call
  # @app.reloader.wrap do
  yield
  # end
end
So it seems there is some issue with Rails's Reloader (I use Rails 5.0.0.1). Also, for me this whole problem is only reproducible when Sidekiq is launched from an Upstart script. If I start it manually in the project folder, the Reloader works just fine. Hope this is helpful.
We have the same problem with .perform_async. As @koss-lebedev mentions, this started happening when we changed Sidekiq to be loaded via Upstart. Our Upstart job is nothing exotic; see the sketch below.
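Roughly what the Upstart job looks like (paths, user, and app location are illustrative):

# /etc/init/sidekiq.conf
description "Sidekiq worker"

start on runlevel [2345]
stop on runlevel [06]
respawn

setuid deploy
chdir /var/www/app/current
env RAILS_ENV=production

exec bundle exec sidekiq -e production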
+1s and "me too" aren't helping, folks. If you want to help, create a minimal app that reproduces the behavior.
This isn't meant to be a "me too" answer, but this issue happens in a very specific manner and I was just confirming that upstart is definitely related. That being said, here are some more notes from our testing; they might be different for others, but after two weeks of watching this issue we see a distinct pattern:
1. The issue only seems to happen on our Ubuntu 14.04 servers. When we test this on our Macs we cannot reproduce the issue. In production we have Redis on a separate server, which may be related, but because of #2 below I don't think it is. We've verified that the Redis server is not even contacted when the job is lost.
2. The issue only happens with jobs that are generated via cron or upstart scripts, though not every execution loses jobs. If we run the exact same code in the rails console or manually run the script in the terminal via rake, all jobs queue as expected.
3. It seems to happen when we are generating a moderate number of jobs (50-100 per minute). The jobs that never make it to Redis are lost in groups, so we will see 20 get enqueued, then 5 disappear, and then another 15 will enqueue.
Is there some logging that we could add that could potentially help? I can try to make a minimal app, but I don't think it will be something you can just run without replicating the environment with cron/upstart, etc.
I have behavior only when using reliable push, which will swallow enqueued jobs if the process enqueuing t
Ugh, phone hit send while typing. Anyway: try turning off reliable push and see if the issue goes away.
reliable push always logs a warning that contains ReliablePush in the string when it detects a problem. Grep your logs for that to see if it activated and saved jobs locally.
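For example (log path illustrative):

grep ReliablePush /var/log/sidekiq.log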
I forked sidekiq and removed "@app.reloader.wrap do" as @koss-lebedev mentioned above. I left my app running for the weekend and had over 20,000 jobs enqueue and process without a single one going missing.
I know nothing about Rails Reloader, other than what I read about it today, but I'd say it is definitely related. I will try to look at trying the ReliablePush change tomorrow.
This is troubling. I wonder if non-development environments should skip using the reloader completely.
I'd be in favor of that change. We never want reloading in production.
I would think that the Reloader would effectively be a no-op in production but that seems not to be the case.
wdyt?
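Something along these lines; this is a sketch of the idea, not the actual patch:

# Hypothetical guard: only wrap job execution in the Rails reloader when
# code reloading is actually enabled (i.e. cache_classes is off).
def call
  if ::Rails.application.config.cache_classes
    yield                          # production-style boot: run the job directly
  else
    @app.reloader.wrap { yield }   # development: allow code reloading
  end
end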
That fix is now available in Sidekiq 4.2.3 but only affects Rails 5. I can't explain how jobs go missing in Rails 4. I'm closing this issue as there's nothing to go on; please open a new issue if someone gets hard data that might lead to the bug.
This is actually happening for me now when I run in development mode with Ruby 2.7.1, Rails 6.0.3, and Sidekiq Pro 4.0.5 / Sidekiq Enterprise 1.8.1.
When I run with RAILS_ENV=production bundle exec sidekiq, the problem goes away.
Just wanted to add this here in case anyone else happens upon it.