Resque: htop reports 250+ threads appearing over a period of hours for 4 workers

Created on 10 Aug 2015 · 7 comments · Source: resque/resque

Background

The resque README mentions that it has no memory leaks, and I was hunting for a memory leak in our code base when I noticed in htop that Resque had at least 250 threads open. I have another project that also uses Resque, and there htop reports just 12 threads/processes for 12 workers.

Confused, I came here after asking on Stack Overflow.

Gemfile

$> ag resq Gemfile.lock
274:    resque (1.25.2)
280:    resque_mailer (2.2.6)
432:  resque (~> 1.25.2)
433:  resque_mailer
$ ps aux | grep resq
root       876  0.0  0.0  61676     0 ?        Ss   Aug08   0:00 su - deploy -c cd /var/www/releases/20150806222346; export PORT=5000; export RAILS_ENV=production;  export QUEUES=critical,high,low;  export COUNT=2;  bundle exec rake environment resque:work >> /var/www/shared/upstart_log/worker-1.log 2>&1
root       878  0.0  0.0  61676     0 ?        Ss   Aug08   0:00 su - deploy -c cd /var/www/releases/20150806222346; export PORT=5001; export RAILS_ENV=production;  export QUEUES=critical,high,low;  export COUNT=2;  bundle exec rake environment resque:work >> /var/www/shared/upstart_log/worker-2.log 2>&1
root       879  0.0  0.0  61676     0 ?        Ss   Aug08   0:00 su - deploy -c cd /var/www/releases/20150806222346; export PORT=5002; export RAILS_ENV=production;  export QUEUES=critical,high,low;  export COUNT=2;  bundle exec rake environment resque:work >> /var/www/shared/upstart_log/worker-3.log 2>&1
deploy     880  0.0  0.0  20864     4 ?        Ss   Aug08   0:00 -su -c cd /var/www/releases/20150806222346; export PORT=5000; export RAILS_ENV=production;  export QUEUES=critical,high,low;  export COUNT=2;  bundle exec rake environment resque:work >> /var/www/shared/upstart_log/worker-1.log 2>&1
root       883  0.0  0.0  61676     0 ?        Ss   Aug08   0:00 su - deploy -c cd /var/www/releases/20150806222346; export PORT=5003; export RAILS_ENV=production;  export QUEUES=critical,high,low;  export COUNT=2;  bundle exec rake environment resque:work >> /var/www/shared/upstart_log/worker-4.log 2>&1
deploy     887  0.0  0.0  20864     4 ?        Ss   Aug08   0:00 -su -c cd /var/www/releases/20150806222346; export PORT=5001; export RAILS_ENV=production;  export QUEUES=critical,high,low;  export COUNT=2;  bundle exec rake environment resque:work >> /var/www/shared/upstart_log/worker-2.log 2>&1
deploy     892  0.0  0.0  20864     4 ?        Ss   Aug08   0:00 -su -c cd /var/www/releases/20150806222346; export PORT=5002; export RAILS_ENV=production;  export QUEUES=critical,high,low;  export COUNT=2;  bundle exec rake environment resque:work >> /var/www/shared/upstart_log/worker-3.log 2>&1
deploy     896  0.0  0.0  20864     4 ?        Ss   Aug08   0:00 -su -c cd /var/www/releases/20150806222346; export PORT=5003; export RAILS_ENV=production;  export QUEUES=critical,high,low;  export COUNT=2;  bundle exec rake environment resque:work >> /var/www/shared/upstart_log/worker-4.log 2>&1
deploy    1087  0.9  8.9 4812772 183464 ?      Sl   Aug08  22:00 resque-1.25.2: Waiting for critical,high,low                           
deploy    1113  0.9  8.9 4845800 182512 ?      Sl   Aug08  21:45 resque-1.25.2: Waiting for critical,high,low                           
deploy    1186  0.9 10.1 4785708 207892 ?      Sl   Aug08  22:04 resque-1.25.2: Waiting for critical,high,low                           
deploy    1223  0.8  9.5 4559812 195652 ?      Sl   Aug08  20:36 resque-1.25.2: Waiting for critical,high,low

On htop, there are hundreds of lines like the following sample:

 PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command 
22013 deploy     20   0 4700M  179M  2776 S  0.0  9.0  0:00.93 resque-1.25.2: Waiting for critical,high,low
22366 deploy     20   0 4700M  179M  2776 S  0.0  9.0  0:01.00 resque-1.25.2: Waiting for critical,high,low
22487 deploy     20   0 4700M  179M  2776 S  0.0  9.0  0:00.94 resque-1.25.2: Waiting for critical,high,low
22523 deploy     20   0 4700M  179M  2776 S  0.0  9.0  0:01.09 resque-1.25.2: Waiting for critical,high,low
..... and on.....

While I know each of these threads doesn't count separately towards memory usage, the memory that matters is roughly the total of the four Sl lines in the ps aux output above. When each of those workers reaches about 20%, that's 80% of our RAM; then one of the bigger jobs gets triggered on the site, pushing the box into swap and eventually killing the job midway through.

resque_mailer is not actually used properly in our app, so it's really just for show. I was hoping it was the cause of this behaviour, but alas it is not.

netstat -ntp shows only about 4 connections to the Redis DB on another box.
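
For what it's worth, here's a rough way to cross-check that count from the Ruby side by asking the Redis server itself (just a sketch; the URL is a placeholder for the other box), using redis-rb's INFO command:

require "redis"

# Placeholder URL for the Redis box; INFO reports how many clients are connected.
redis = Redis.new(url: "redis://redis-host:6379")
puts redis.info["connected_clients"]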

ruby -v
ruby 2.1.0p0 (2013-12-25 revision 44422) [x86_64-linux]

5 day memory profile from Boundary:

[screenshot: Boundary memory graph, 2015-08-10 7:59 AM]

Each sharp decline is a restart either by human or the system.

Questions

  1. Can others running Resque in production verify that this happens on their servers as well?
  2. Is this memory pattern usual for Resque, or where can I look to diagnose why these threads are staying open?
  3. I'm open to hints that can help me track down this problem and rule out Resque.
Bug · Needs reproduction

All 7 comments

We use that same version of resque in production and do not experience this issue.

This may be helpful in determining where the memory consumption is happening in your code: http://blog.skylight.io/hunting-for-leaks-in-ruby/
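
For example, here is a minimal sketch of the heap-dump approach described in that post (Ruby 2.1+ with the objspace stdlib; the output path is just a placeholder):

require "objspace"

# Record allocation sites (file/line) so the dump below includes them.
ObjectSpace.trace_object_allocations_start

# ... let the worker run a few jobs ...

# Write every live object out as JSON lines for offline analysis.
ObjectSpace.dump_all(output: File.open("/tmp/worker-heap.json", "w"))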

Resque's model for processing jobs is deliberately crude in order to prevent memory leaks: each job runs in a _fork_ of the main worker _process_. The parent process acquires a job, forks, and waits for the newly-created child to complete, relying on your system to clean up the child process's memory. Because of this model, memory allocated by one job _cannot_ be leaked into the parent (long-lived worker) process.
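
In very rough terms (a simplified sketch, not Resque's actual source; reserve_job_from_redis is a made-up name), the loop looks like this:

loop do
  job = reserve_job_from_redis           # parent pops the next job off Redis
  if child_pid = fork
    Process.waitpid(child_pid)           # parent only waits; no job code runs here
  else
    job.perform                          # child runs the job...
    exit!                                # ...then exits, so the OS reclaims all its memory
  end
end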

That said, since your application is fully loaded in the parent process, it is entirely possible that something in your code is being activated _before_ the worker enters its work loop (e.g., during app initialization), and that _that_ code is spawning threads unreasonably. To figure out where those threads are coming from, it may be helpful to attach gdb to your long-lived worker process and do a thread dump.
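
If gdb is inconvenient, a cruder Ruby-side option (a sketch, not an official Resque recipe) is to log the parent's threads from a before_fork hook, which runs in the parent process just before each job is forked:

# Sketch: log every Ruby-level thread in the parent worker before each fork.
Resque.before_fork = proc do |_job|
  Thread.list.each do |t|
    $stdout.puts "[threads] #{t.inspect} #{(t.backtrace || []).first(2).join(' <- ')}"
  end
end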

Yes, this seems like a bug. But it might not be in Resque. Is there any way we can get a reproduction?

Without a way to reproduce this, I will not be able to investigate. If you can come up with one, please let me know, and I would be happy to re-open. Thank you!

@pjammer did you ever find a solution to this?

OK, for me this came down to still having a call to ActiveRecord::Base.establish_connection in a Resque.before_fork hook after upgrading from Rails 4 to Rails 5.2.x (see https://github.com/rails/rails/pull/31241). After removing the before_fork hook, htop shows one thread/process per worker, and my Errno::ENOMEM (Cannot allocate memory - fork(2)) error is gone.

https://github.com/rails/rails/pull/33094
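
For anyone else hitting this, the offending setup looked roughly like the following (the initializer path is just an example):

# config/initializers/resque.rb (example path)
# The leftover hook from the Rails 4 days:
Resque.before_fork = proc { ActiveRecord::Base.establish_connection }
# Deleting this line was the fix: htop then shows one thread/process per
# worker, and the Errno::ENOMEM fork errors stop.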
