Sidekiq 5.1 (using Sidekiq Pro 4) with Ruby 2.4.1.
The same issue occurs with Sidekiq 4.2.10 (without Pro).
We have an issue where one of our jobs leaks memory when running on Sidekiq.
The use case is as follows: we have a BI system with millions of records in our DB. Once in a while, a customer wants to export a subset of the data to a CSV (which can be 500K records).
Therefore, we have a CSV worker that serialises the 500K records to a file using pagination, then uploads it to private storage and sends the customer the link to the file.
This job works fine when run locally on the server (opening a rails console and running CSVWorker.new.perform(some_collection)). It takes a while but finishes, and memory stays approximately static; it does not go up during the pagination.
Then, when we try to run the job via Sidekiq (CSVWorker.new.perform_async(some_collection)), we start seeing memory utilisation charts like this:

[memory utilisation chart]
Eventually, the OOM killer kills our Sidekiq process and our Docker container dies. To be extra careful with our test, we ran the worker on an isolated Sidekiq process (its own server with no other workers running on it).
Any ideas? I know it's not related to our code, as we are running the same worker and it works via the console. What can be different in this case?
Thanks in advance,
Dan
Other components, like ActiveRecord, might behave differently in a console versus in a Sidekiq worker.
ActiveRecord's caching feature was just enabled in 5.1: https://github.com/mperham/sidekiq/pull/3718, perhaps try turning it off? I dunno.
There's not much to go on, so maybe add some debugging logs to that job to isolate the culprit.
Are you testing fine on OSX and leaking on Linux? Huge diff in native extensions there.
Hey @mperham, all testing is done on the same Linux server, so the only real difference is between running it via a Sidekiq worker and running it in the console via Worker.new.perform(params).
Hey @mikegee, what kind of logging would you add? The leak happens both in Sidekiq 5.1 and Sidekiq 4.2.10, so it can't be attributed to this feature.
what kind of logging would you add?
This gem might help narrow down memory consumption: https://github.com/SamSaffron/memory_profiler
But when I wrote that, I was thinking just log the current process's memory consumption as reported by the OS. I would sprinkle logging statements throughout that worker's code.
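For example, something like this sprinkled through the worker (a sketch that assumes Linux; the log_rss helper name is made up here):

# Log the process's resident set size as reported by the kernel
# (Linux-only; reads /proc/self/status).
def log_rss(label)
  rss_kb = File.read('/proc/self/status')[/VmRSS:\s+(\d+)/, 1].to_i
  puts "MEM #{label}: RSS #{rss_kb / 1024} MB"
end

log_rss('before page')
# ... one page of work ...
log_rss('after page')

If the RSS jumps at one particular step, you have your culprit.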
I know memory_profiler well. That won't help us hunt a memory leak that happens only when running through a Sidekiq worker.
This is our module in charge of paginating the collection. I've added the logging we monitored to see if it crashes on a specific set of data (it doesn't; it crashes at different points in the collection):
require 'securerandom'

class CsvNotAllowed < StandardError; end

module Csvable
  extend ActiveSupport::Concern

  DEFAULT_LIMIT = 1000
  DEFAULT_OFFSET = 0

  protected

  def user(user_id = nil)
    @user ||= User.find(user_id)
    raise CsvNotAllowed unless @user.csv_allowed?
    @user
  end

  # entry point method
  def generate_and_upload(collection_sql, klass)
    puts "CSV_EXPORT #{jid}: start export for #{klass}"
    file_path = generate_csv_stats(collection_sql, klass.constantize)
    puts "CSV_EXPORT #{jid}: generated #{file_path}"
    url = upload_to_storage(file_path, klass)
    puts "CSV_EXPORT #{jid}: upload finished to #{url}"
    url
  ensure
    puts "CSV_EXPORT #{jid}: removing tmp file #{file_path}"
    File.delete(file_path) if file_path && File.exist?(file_path)
    puts "CSV_EXPORT #{jid}: finished export for #{klass}"
  end

  def upload_to_storage(file_path, klass)
    random_string = SecureRandom.hex
    Google::Gcloud::Storage.new(bucket_name: CSV_STATS_BUCKET).upload_file(
      file_path,
      "#{random_string}_#{user.id}_#{user.platform.name.downcase}_#{klass.to_s.downcase.pluralize}.csv",
      private: false
    ).url
  end

  def generate_csv_stats(collection_sql, klass, limit: DEFAULT_LIMIT, offset: DEFAULT_OFFSET)
    collection = []
    file_path = nil
    puts "CSV_EXPORT #{jid}: start of loop for #{collection_sql}"
    loop do
      puts "CSV_EXPORT #{jid}: current offset #{offset}"
      collection = klass.find_by_sql_with_limit(collection_sql, offset, limit)
      puts "CSV_EXPORT #{jid}: collection count #{collection.count}"
      break unless collection.present?
      offset += limit
      puts "CSV_EXPORT #{jid}: csv method #{csv_method(offset, limit)}"
      file_path = call_serializer(collection, csv_method(offset, limit), file_path, csv_export_settings(user, klass))
    end
    puts "CSV_EXPORT #{jid}: finished loop"
    file_path
  end

  def call_serializer(collection, method, file_path, csv_export_settings)
    ActiveModelSerializers::SerializableResource.new(
      collection,
      adapter: ActiveModelSerializers::Adapter::Csv,
      namespace: CsvSerializers,
      user: user
    ).adapter.to_csv(serializer_args(method, file_path), csv_export_settings)
  end

  def csv_method(offset, limit)
    offset == limit ? :open : :append
  end

  def serializer_args(method, file_path = nil)
    args = { csv_method: method, col_sep: user.platform.csv_delimiter }
    args.merge!(file_path: file_path) if method == :append
    args
  end

  private

  def csv_export_settings(user, klass)
    @csv_export_settings ||= user.csv_export_settings.find { |export_setting| setting_fits_to_klass?(export_setting, klass) }&.settings
  end

  def setting_fits_to_klass?(export_setting, klass)
    export_setting.serializer_name.casecmp(klass.name.to_s).zero?
  end
end
That wont help us with hunting a memory leak that happens only when running through sidekiq worker.
Not with that attitude. lol
If you can duplicate this problem, I don't see why you can't get memory_profiler running while it's happening. Is it because the process is killed before the profiler has a chance to report? If so, maybe it can repro with a smaller data set?
Anywho, another random thought: maybe stick a GC.start at the bottom of each of those loops.
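Something like this at the bottom of generate_csv_stats's loop, say (a sketch; GC.stat keys vary a bit across Ruby versions):

  GC.start # force a full mark/sweep after each page
  puts "CSV_EXPORT #{jid}: live slots #{GC.stat(:heap_live_slots)}"

If live slots keep climbing even after forced GC, something really is retaining those objects.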
Hey @mikegee, I'm actually planning on re-testing the same logic with GC.start at the end of each loop iteration. I'll keep you posted regarding the results.
Also, regarding your thoughts about memory_profiler: actually, if I can print a memory profiler report, say, every 100K rows, it can also highlight where our issue lies.
I'll keep you posted regarding these results as well.
Thanks!
I don't have much to say here because there's not much to say; we don't have great memory profiling tools for Ruby so memory usage is really hard to track. memory_profiler is the best out there today, hopefully it will shed some light.
Quick update: we added GC.start at the end of each iteration of the loop in the code above (the generate_csv_stats method).
It didn't resolve the issue. Now I'll add a memory profiler that prints at the end of every loop iteration where
(offset % 100_00).zero?
holds, so we can see what is retained.
I'll keep you posted.
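Roughly what I have in mind, inside the pagination loop (a sketch using memory_profiler; it profiles a single page and dumps the report to a file):

require 'memory_profiler'

report = MemoryProfiler.report do
  collection = klass.find_by_sql_with_limit(collection_sql, offset, limit)
  file_path = call_serializer(collection, csv_method(offset, limit), file_path, csv_export_settings(user, klass))
end
report.pretty_print(to_file: "csv_export_#{jid}_#{offset}.txt")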
Have you tried disabling the query cache? It may be the source of the leak, dunno.
You can test this in the rails console:
ActiveRecord::Base.cache { Worker.new.perform(args) }
If that replicates the problem, the query cache is guilty.
AFAIK, the AR query cache doesn't have a size limit, so it may consume tons of MBs if you have tons of queries happening inside one query cache block.
This is in fact a Rails issue (assuming that's the source of the memory consumption), and Sidekiq has no way to fix it by itself.
What you can do is disable the query cache for this particular worker (again, if that's the source of the memory consumption).
See #3756 for more info.
In fact, the newer version of Sidekiq added the middleware, but it seems to affect only Rails 4; Rails 5 has the query cache enabled by default due to the executor/reloader API.
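If you want to confirm the cache is what's growing, you could log its size between pages, something like this (a sketch; query_cache is an internal Rails 5 API, so verify it on your version):

conn = ActiveRecord::Base.connection
puts "query cache: enabled=#{conn.query_cache_enabled} entries=#{conn.query_cache.size}"

If that entry count climbs with every page, you've found your leak.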
Hey @sobrinho, thanks for the response.
How would I go about disabling the query cache specifically for this worker? This issue does not replicate when running via the rails console, only via Sidekiq.
Also, if this feature was only introduced in Sidekiq 5, does it even have an effect in this case, which was tested on both Sidekiq 4 and 5?
Anyhow, I'm still working on the memory profiling to see what is retained between loops.
You can try disabling the caching by doing this:
class SomeWorker
  include Sidekiq::Worker

  def perform(*args)
    ActiveRecord::Base.uncached do
      # your code here
    end
  end
end
What we found in #3756 is that Rails 5 enables query caching by default in production, but not in development.
Also be aware that we reverted the middleware that enabled query caching in Rails 4; you now have to enable it explicitly if you want it on Rails 4.
If you still see the leak while using the uncached block, the leak is probably in your app code or other third-party code.
So, I have some good news. The issue is, as @sobrinho mentioned, related to the AR query cache. Running our worker "uncached" kept our memory stable and the process was able to complete!
Now I'm contemplating what's the best approach here. Of course I don't want to turn off the query cache in prod (nor turn it off completely for all my Sidekiq pods).
We have many "runner workers" whose job is to run over a large collection and do some action. I can turn the cache off in each of those runners, but that seems a bit backwards to me.
Any thoughts or best practices to overcome this issue?
I know that Rails is going to fix this here.
Until then you must disable it manually; there is no other way, AFAIK.
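If wrapping each runner individually feels backwards, one idea is a server middleware that lets workers opt out with a flag (an untested sketch; DisableQueryCacheMiddleware and the query_cache option are names I made up, not a Sidekiq feature):

class DisableQueryCacheMiddleware
  def call(worker, job, queue)
    # Workers opt out with `sidekiq_options query_cache: false`;
    # everything else runs with the cache untouched.
    if worker.class.get_sidekiq_options['query_cache'] == false
      ActiveRecord::Base.uncached { yield }
    else
      yield
    end
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add DisableQueryCacheMiddleware
  end
end

That way each runner worker only has to set one option instead of wrapping its perform body.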
Hey @sobrinho, unfortunately that fix won't solve our issue here (we use find_by_sql).
In general, I'm wondering what other people with long-running jobs that touch large collections do. Do they all disable the query cache?
@erated can't you disable the query cache for find_by_sql?
ActiveRecord::Base.uncached do
  models = Model.find_by_sql(...)
end
I see no other way to work around the memory leak; the uncached block is exactly for these situations.
The core team is going to fix the most common batch operations here, but find_by_sql isn't necessarily a batch query, so it doesn't make sense to skip caching for it by default.
@sobrinho I don't think that's the problem at hand, actually.
Of course, if find_each retains memory until it finishes, that's an issue (and it is resolved by the PR you referenced).
But isn't the real problem here that a worker doesn't clear its query cache throughout its run?
It does clear for you, and if it doesn't, the leak is on the Rails side and we should investigate that.
What Sidekiq does in the currently released version is wrap the entire job execution in an ActiveRecord::Base.cache { ... } call.
_The master version reverted that a bit in #3756, but the same concept is applied using the executor/reloader API; essentially we don't support Rails 4 anymore, and Rails 5 needs a second job execution to fix itself in dev mode. More on this in #3756._
What you can see here is that the query cache is cleared after the cache call.
Maybe the executor/reloader API isn't doing the right thing right now. See the hooks here, the implementation here and the disabling here.
Apparently the code is right, but we can't say for sure without profiling.
I would say this doesn't seem to be a leak in Sidekiq or Rails; otherwise we would be seeing reports like this for the request cycle, the Sidekiq job execution cycle, or both. But again, we can't say for sure without profiling.
And like @mperham said, there is no good tool for profiling memory leaks in Ruby; you can see that in articles like this, this and this.
I wish you good luck hunting this down and if you find the cause at Sidekiq or Rails, I will be glad to code review.
@dbenjamin57 Any updates or end results you might want to share to help people like me facing the same kind of issue?
@dbenjamin57 I found this issue because we have a similar one. Could you share your solution? I'd be curious what caused your memory problem; maybe we're dealing with the same thing.
Hey all, our solution was to disable the cache in the specific actions that caused the memory leak:
def perform(csv_params, limit, offset)
  ActiveRecord::Base.uncached do
    generate_and_upload(@user_id, @collection_sql, @klass, @export_params, limit, offset)
  end
end
@dbenjamin57 if this fixed the issue, you really should report it against Rails to check what's going on; Rails is not supposed to leak in the AR query cache.