Versions:
sidekiq (3.2.4)
sidekiq-middleware (0.3.0)
Process seems alive and working:
$ pgrep -fl sidekiq
21956 sidekiq 3.2.4 tradeapp [23 of 55 busy]
But the Web UI shows 0 busy workers.
sidekiq.yml
:concurrency: 55
staging:
  :concurrency: 55
production:
  :concurrency: 55
:queues:
  - [default, 10]
  - [mailer, 5]
  - [slow, 1]
This happens once or twice a day. Nothing specific in the logs.
Do you guys have any idea how to tackle this?
Thanks in advance
This happened in previous versions if the heartbeat timer died, usually due to extreme Redis latency. #1884 was fixed in 3.2.2. People are still reporting this so there must still be an issue in there.
Correct me if I'm wrong, but this is not exactly the same issue: in my case Sidekiq stops processing any new jobs. The "23 of 55" figure from pgrep is bogus; the logs show no activity and jobs are not popped from the queues.
Can you get me the TTIN signal log output when a Sidekiq process is dead? Do you see any messages in the logs with the phrase "fetch died"?
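For reference, TTIN makes Sidekiq write a backtrace for every live thread through its logger. Roughly the equivalent of this sketch (not Sidekiq's exact handler):

require "sidekiq"

# Rough sketch of what TTIN triggers: dump a backtrace for every live thread
# through Sidekiq's logger, which shows where the worker threads are stuck.
Thread.list.each do |thread|
  Sidekiq.logger.info "Thread TID-#{thread.object_id.to_s(36)}"
  Sidekiq.logger.info(thread.backtrace ? thread.backtrace.join("\n") : "<no backtrace>")
end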
No 'fetch died' in the logs. Nothing new in the logs after kill -TTIN $pid.
Thanks for the help!
Additional info: we run two processes, but they process different queues.
The second process runs with concurrency == 1 and does not have such problems.
This problem started to happen when I increased concurrency from 20 to 55.
It's possible you have threads which are starved for connections somehow. I don't recommend concurrency that high, the default 25 is typically all you need. Add more processes if you need more work done in parallel.
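One common way threads end up starved for connections, offered here as an assumption rather than a diagnosis of this app: the ActiveRecord pool is smaller than Sidekiq's concurrency, so most threads block waiting for a connection checkout. A sketch with hypothetical values:

require "active_record"

# Hypothetical settings, not taken from this app: every Sidekiq thread that
# touches the database needs its own ActiveRecord connection, so the pool
# must be at least as large as :concurrency or threads block on checkout and
# sit there looking idle. This is normally set via the pool: key in database.yml.
ActiveRecord::Base.establish_connection(
  adapter:  "postgresql",
  database: "tradeapp_production",   # hypothetical database name
  pool:     55                       # >= Sidekiq :concurrency
)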
OK, I will test this. But given that only 23 workers are "busy", shouldn't the rest of the threads be processing jobs?
Can you shut down the process? If you see nothing in your logs upon TTIN, that means you've probably set the Rails log level to higher than :info.
I've got log_level == :info for this environment.
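For anyone checking the same thing: the TTIN output goes through Sidekiq's logger, so a quick way to make sure nothing filters it out is to force the level in an initializer. A sketch, assuming a standard Rails setup:

require "sidekiq"
require "logger"

# config/initializers/sidekiq.rb (sketch): force Sidekiq's logger to :info so
# the TTIN dump and job start/done lines aren't filtered out by a higher
# Rails log level.
Sidekiq.configure_server do |_config|
  Sidekiq.logger.level = Logger::INFO
end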
I've split this into 2 processes with 25 threads each; let's see if this helps.
But it got stuck when only around 10-15 workers were active, so I don't know if the concurrency could really be the reason.
Anyway, thanks a lot for your effort. I will update this thread with progress.
What is the version of celluloid in your Gemfile.lock? Update to the latest version of Sidekiq.
Hello,
I think I have the same issue: I have a Sidekiq process running, but it does not "take" jobs from the queue.
Process: (output not preserved)
In queue: (output not preserved)
For information, I have a lot of Sidekiq processes (around 30) and around 200 busy jobs at the same time.
Any idea why?
@seuros it happens on 0.15.2 and 0.16.
We deployed it to production and it has not gotten stuck for 3 days so far (it was getting stuck twice a day on staging). No idea why.
What did you deploy to production? I am interested!
The app I'm working on. We had this problem on the staging server, but on the fresh production one it disappeared.
We have a rather minimal setup compared to yours: 2 processes with concurrency 25 each. I've never seen more than 20 jobs running simultaneously, though.
I meant, which gem did you update to get it working?
Sidekiq 3.2.5 just locks celluloid to 0.15.2. That is the celluloid version you should be using; 0.16 has some locking issues.
You should be using 3.2.5, and kill any remaining processes manually before redeploying.
You can't shut them down correctly if they are running with celluloid 0.16.
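A sketch of the Gemfile pins being discussed (versions as stated above; the rest of the Gemfile is assumed):

source "https://rubygems.org"

gem "sidekiq",   "3.2.5"    # 3.2.5 itself locks celluloid, but the pin makes it explicit
gem "celluloid", "0.15.2"   # 0.16 has the locking issues mentioned above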
@jumski @seuros I am facing a similar issue (https://github.com/mperham/sidekiq/issues/2003) wondering if you managed to find a fix on your end?
It started working on a fresh production server; it was failing on staging.
Don't really have any clues :(
I have this problem too.
If I add a custom queue on my worker, it does not work.
I have to remove this line of code to make it work:
sidekiq_options queue: :billing_notification
I think a couple of months ago my code worked perfectly; now, with all the dependencies the same, it does not work.
Here is my gem list:
celluloid (0.15.2)
sidekiq (3.2.1)
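For what it's worth, the usual cause of that exact symptom is that the custom queue isn't listed among the queues the Sidekiq process polls; that is an assumption about the setup above, not a confirmed diagnosis. A minimal sketch (the worker name is made up):

require "sidekiq"

# Hypothetical worker showing the routing: a job pushed to
# :billing_notification is only picked up if that queue is in the process's
# queue list (the :queues entry in sidekiq.yml, or a -q flag); otherwise the
# job just sits in Redis and the worker looks idle.
class BillingNotificationWorker
  include Sidekiq::Worker
  sidekiq_options queue: :billing_notification

  def perform(account_id)
    # send the billing notification for account_id (omitted here)
  end
end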
Got the same issue. I am using sidekiq (3.2.6) with celluloid locked to 0.15.2. In addition, I am using sidetiq (0.6.3) for scheduled jobs; is sidetiq the issue?
@kxhitiz Sidetiq randomly stops processing jobs - unrelated entirely to this sidekiq issue. See https://github.com/tobiassvn/sidetiq/issues/116
@mperham We have some networking issues occurring around 4am nightly on a recurring basis; the hosting company seems incapable of fixing it, alas. Because of this, even with 3.2.5, we're seeing the system simply stop processing. Is it possible the timeouts are still too low for the Redis workaround you introduced in 3.2.2? We'd prefer it to just never stop trying :)
@mperham Ignore that last one; I noticed there is a duplicate of this issue which provides various suggestions. I'll try 3.2.6 / TTIN / resolv-replace. If it's still broken I will comment again.
@lypanov Coincidentally, the networking issues you describe at 3-5am nightly also happen to us around that time. Sidekiq stops processing any jobs and the queue just keeps getting larger, even though the Sidekiq process is still running in the background.
However, our Redis server is hosted on AWS ElastiCache. Do you know by chance if your hosting company is using AWS as well?
I am referencing the issue I opened. May be related: https://github.com/mperham/sidekiq/issues/2003
@krzkrzkrz With sidetiq, IIRC, we were seeing it on a nightly basis. Without it, it happens maybe once every 2 weeks.
Not AWS, no.
The latest version is 3.3.3, maybe you should upgrade.
Unfortunately you haven't given us any info to diagnose the problem. We need a thread dump.
On Apr 1, 2015, at 07:44, Jake Hoffner [email protected] wrote:
I am having this issue as well. Jobs stop processing about once a week, typically in the middle of the night. I'm running sidekiq 3.2.6 on Heroku. I even have an autoscaler implemented to help scale dynos if the job queue builds up. What happens is that the main worker will stop processing altogether, the scaled workers will start up and work fine, process the queue, and then when it's back to the single dyno, jobs stop being processed again. A process restart fixes the issue.
I have enqueued a job to parse a spreadsheet with JRuby and the POI library. I got an error in the Sidekiq log from the Java side, and then Sidekiq stopped picking up jobs from the queue. But the process is still alive when I check with
ps aux
Here are the versions of Sidekiq and the other components I am using:
sidekiq-4.1.1
JRuby-1.7.19
Is there any way to reset Sidekiq so that it starts processing jobs again automatically?
I just experienced this with sidekiq 4.1.4 on Heroku. It looks like the master process stopped responding but didn't crash. I had to scale down and up again.
Here are the relevant logs:
Nov 14 13:59:37 my-app heroku/worker.1: State changed from up to down
Nov 14 13:59:40 my-app heroku/worker.1: Stopping all processes with SIGTERM
Nov 14 13:59:41 my-app heroku/api: Scaled to console@0:Hobby rake@0:Hobby web@1:Hobby worker@1:Hobby by [email protected]
Nov 14 14:00:09 my-app heroku/worker.1: Error R12 (Exit timeout) -> At least one process failed to exit within 30 seconds of SIGTERM
Nov 14 14:00:09 my-app heroku/worker.1: Stopping remaining processes with SIGKILL
Nov 14 14:00:10 my-app heroku/worker.1: Process exited with status 137
Nov 14 14:00:15 my-app heroku/worker.1: Starting process with command `bundle exec sidekiq -C config/sidekiq.yml`
Nov 14 14:00:16 my-app heroku/worker.1: State changed from starting to up
Then it started taking jobs again.
@jdurand That logging doesn't help. You need to follow the directions noted on the Problems and Troubleshooting wiki page.
In case anyone ends up here with a similar problem, exit status 137 means your process was killed by Linux’s OOM (Out of Memory) killer.
@mperham: sorry for the trouble, but I think documenting this here might help others Googling this, and incidentally reduce the number of tickets being opened related to memory bloat issues.
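If you want to catch the bloat before the kernel does, a crude watchdog is enough. The following is a sketch assuming a Linux host with /proc, not a built-in Sidekiq feature:

require "sidekiq"

# Sketch of a crude memory watchdog: log the worker's RSS once a minute so
# growth is visible in the logs before the OOM killer sends SIGKILL, which
# is what exit status 137 means.
Sidekiq.configure_server do |config|
  config.on(:startup) do
    Thread.new do
      loop do
        rss_kb = File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i
        Sidekiq.logger.info("RSS: #{rss_kb / 1024} MB")
        sleep 60
      end
    end
  end
end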