Sidekiq 5.1 (using Sidekiq Pro 4) with Ruby 2.4.1.
The same issue occurs with Sidekiq 4.2.10 (without Pro).
We have an issue where one of our jobs leaks memory when running on Sidekiq.
The use case is as follows: we have a BI system with millions of records in our DB. Once in a while, a customer wants to export a subset of the data to a CSV (which can be 500K records).
Therefore, we have a CSV worker that serialises the 500K records to a file using pagination, then uploads it to private storage and sends the customer the link to the file.
This job works fine when run locally on the server (opening a rails console and running CSVWorker.new.perform(some_collection)). It takes a while but finishes, and memory stays approximately static; it does not go up during the pagination.
Then, when we try to run the job via Sidekiq (CSVWorker.new.perform_async(some_collection)), we start seeing memory utilisation charts like this:

[memory utilisation chart]
Eventually, the OOM killer kills our Sidekiq process and our Docker container dies. To be extra careful with our test, we ran the worker on an isolated Sidekiq process (its own server with no other workers running on it).
Any ideas? I know it's not related to our code, as we are running the same worker and it works via the console. What can be different in this case?
Thanks in advance,
Dan
Other components, like ActiveRecord, might behave differently in a console versus in a Sidekiq worker.
ActiveRecord's caching feature was just enabled in 5.1: https://github.com/mperham/sidekiq/pull/3718, perhaps try turning it off? I dunno.
There's not much to go on, so maybe add some debugging logs to that job to isolate the culprit.
Are you testing fine on OSX and leaking on Linux? Huge diff in native extensions there.
Hey @mperham, all testing is done on the same Linux server, so the only real difference is between running it via a Sidekiq worker and running it in the console via Worker.new.perform(params).
Hey @mikegee, what kind of logging would you add? The leak happens both in Sidekiq 5.1 and Sidekiq 4.2.10, so it can't be attributed to this feature.
what kind of logging would you add?
This gem might help narrow down memory consumption: https://github.com/SamSaffron/memory_profiler
But when I wrote that, I was thinking just log the current process's memory consumption as reported by the OS. I would sprinkle logging statements throughout that worker's code.
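For example, something like this sprinkled through the worker (a sketch that assumes Linux; the log_rss helper name is made up here):

# Log the process's resident set size as reported by the kernel
# (Linux-only; reads /proc/self/status).
def log_rss(label)
  rss_kb = File.read('/proc/self/status')[/VmRSS:\s+(\d+)/, 1].to_i
  puts "MEM #{label}: RSS #{rss_kb / 1024} MB"
end

log_rss('before page')
# ... one page of work ...
log_rss('after page')

If the RSS jumps at one particular step, you have your culprit.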
I know memory_profiler well. That won't help us hunt a memory leak that happens only when running through a Sidekiq worker.
This is our module in charge of paginating the collection. I've added the logging we monitored to see if it crashes on a specific set of data (it doesn't; it crashes at different points in the collection):
require 'securerandom'

class CsvNotAllowed < StandardError; end

module Csvable
  extend ActiveSupport::Concern

  DEFAULT_LIMIT = 1000
  DEFAULT_OFFSET = 0

  protected

  def user(user_id = nil)
    @user ||= User.find(user_id)
    raise CsvNotAllowed unless @user.csv_allowed?
    @user
  end

  # entry point method
  def generate_and_upload(collection_sql, klass)
    puts "CSV_EXPORT #{jid}: start export for #{klass}"
    file_path = generate_csv_stats(collection_sql, klass.constantize)
    puts "CSV_EXPORT #{jid}: generated #{file_path}"
    url = upload_to_storage(file_path, klass)
    puts "CSV_EXPORT #{jid}: upload finished to #{url}"
    url
  ensure
    puts "CSV_EXPORT #{jid}: removing tmp file #{file_path}"
    File.delete(file_path) if file_path && File.exist?(file_path)
    puts "CSV_EXPORT #{jid}: finished export for #{klass}"
  end

  def upload_to_storage(file_path, klass)
    random_string = SecureRandom.hex
    Google::Gcloud::Storage.new(bucket_name: CSV_STATS_BUCKET).upload_file(
      file_path,
      "#{random_string}_#{user.id}_#{user.platform.name.downcase}_#{klass.to_s.downcase.pluralize}.csv",
      private: false
    ).url
  end

  def generate_csv_stats(collection_sql, klass, limit: DEFAULT_LIMIT, offset: DEFAULT_OFFSET)
    collection = []
    file_path = nil
    puts "CSV_EXPORT #{jid}: start of loop for #{collection_sql}"
    loop do
      puts "CSV_EXPORT #{jid}: current offset #{offset}"
      collection = klass.find_by_sql_with_limit(collection_sql, offset, limit)
      puts "CSV_EXPORT #{jid}: collection count #{collection.count}"
      break unless collection.present?
      offset += limit
      puts "CSV_EXPORT #{jid}: csv method #{csv_method(offset, limit)}"
      file_path = call_serializer(collection, csv_method(offset, limit), file_path, csv_export_settings(user, klass))
    end
    puts "CSV_EXPORT #{jid}: finished loop"
    file_path
  end

  def call_serializer(collection, method, file_path, csv_export_settings)
    ActiveModelSerializers::SerializableResource.new(
      collection,
      adapter: ActiveModelSerializers::Adapter::Csv,
      namespace: CsvSerializers,
      user: user
    ).adapter.to_csv(serializer_args(method, file_path), csv_export_settings)
  end

  def csv_method(offset, limit)
    offset == limit ? :open : :append
  end

  def serializer_args(method, file_path = nil)
    args = { csv_method: method, col_sep: user.platform.csv_delimiter }
    args.merge!(file_path: file_path) if method == :append
    args
  end

  private

  def csv_export_settings(user, klass)
    @csv_export_settings ||= user.csv_export_settings.find { |export_setting| setting_fits_to_klass?(export_setting, klass) }&.settings
  end

  def setting_fits_to_klass?(export_setting, klass)
    export_setting.serializer_name.casecmp(klass.name.to_s).zero?
  end
end
That wont help us with hunting a memory leak that happens only when running through sidekiq worker.
Not with that attitude. lol
If you can duplicate this problem, I don't see why you can't get memory_profiler running while it's happening. Is it because the process is killed before the profiler has a chance to report? If so, maybe it can repro with a smaller data set?
Anywho, another random thought: maybe stick a GC.start at the bottom of each of those loops.
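Something like this at the bottom of generate_csv_stats's loop, say (a sketch; GC.stat keys vary a bit across Ruby versions):

  GC.start # force a full mark/sweep after each page
  puts "CSV_EXPORT #{jid}: live slots #{GC.stat(:heap_live_slots)}"

If live slots keep climbing even after forced GC, something really is retaining those objects.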
Hey @mikegee, I'm actually planning on re-testing the same logic with GC.start at the end of each loop iteration. I'll keep you posted regarding the results.
Also, regarding your thoughts about memory_profiler: actually, if I can print a memory profiler report, say, every 100K rows, it can also highlight where our issue lies.
I'll keep you posted regarding these results as well.
Thanks!
I don't have much to say here because there's not much to say; we don't have great memory profiling tools for Ruby so memory usage is really hard to track. memory_profiler is the best out there today, hopefully it will shed some light.
Quick update: we added GC.start at the end of each iteration of the loop in the code above (the generate_csv_stats method).
It didn't resolve the issue. Now I'll add a memory profiler that prints at the end of every loop iteration where
(offset % 100_00).zero?
holds, so we can see what is retained.
I'll keep you posted.
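Roughly what I have in mind, inside the pagination loop (a sketch using memory_profiler; it profiles a single page and dumps the report to a file):

require 'memory_profiler'

report = MemoryProfiler.report do
  collection = klass.find_by_sql_with_limit(collection_sql, offset, limit)
  file_path = call_serializer(collection, csv_method(offset, limit), file_path, csv_export_settings(user, klass))
end
report.pretty_print(to_file: "csv_export_#{jid}_#{offset}.txt")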
Have you tried disabling the query cache? It may be the source of the leak, dunno.
You can test this in the rails console:
ActiveRecord::Base.cache { Worker.new.perform(args) }
If that replicates the problem, the query cache is guilty.
AFAIK, the AR query cache doesn't have a size limit, so it may consume tons of MBs if you have tons of queries happening inside one query cache block.
This is in fact a Rails issue (assuming that's the source of the memory consumption), and Sidekiq has no way to fix it by itself.
What you can do is disable the query cache for this particular worker (again, if that's the source of the memory consumption).
See #3756 for more info.
In fact, the newer version of Sidekiq added the middleware, but it seems to affect only Rails 4; Rails 5 has the query cache enabled by default due to the executor/reloader API.
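If you want to confirm the cache is what's growing, you could log its size between pages, something like this (a sketch; query_cache is an internal Rails 5 API, so verify it on your version):

conn = ActiveRecord::Base.connection
puts "query cache: enabled=#{conn.query_cache_enabled} entries=#{conn.query_cache.size}"

If that entry count climbs with every page, you've found your leak.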
Hey @sobrinho, thanks for the response.
How would I go about disabling the query cache specifically for this worker? This issue does not replicate when running via the rails console, only via Sidekiq.
Also, if this feature was only introduced in Sidekiq 5, does it even have an effect in this case, which was tested on both Sidekiq 4 and 5?
Anyhow, I'm still working on the memory profiling to see what is retained between loops.
You can try disabling the caching by doing this:
class SomeWorker
  include Sidekiq::Worker

  def perform(*args)
    ActiveRecord::Base.uncached do
      # your code here
    end
  end
end
What we found in #3756 is that Rails 5 enables query caching by default in production, but not in development.
Also be aware that we reverted the middleware that enabled query caching in Rails 4; you now have to enable it explicitly if you want it on Rails 4.
If you still see the leak while using the uncached block, the leak is probably in your app code or other third-party code.
So, I have some good news. The issue is, as @sobrinho mentioned, related to the AR query cache. Running our worker "uncached" kept our memory stable and the process was able to complete!
Now I'm contemplating what's the best approach here. Of course I don't want to turn off the query cache in prod (nor turn it off completely for all my Sidekiq pods).
We have many "runner workers" whose job is to run over a large collection and do some action. I can turn the cache off in each of those runners, but that seems a bit backwards to me.
Any thoughts or best practices to overcome this issue?
I know that Rails is going to fix this here.
Until then you must disable it manually; there is no other way, AFAIK.
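If wrapping each runner individually feels backwards, one idea is a server middleware that lets workers opt out with a flag (an untested sketch; DisableQueryCacheMiddleware and the query_cache option are names I made up, not a Sidekiq feature):

class DisableQueryCacheMiddleware
  def call(worker, job, queue)
    # Workers opt out with `sidekiq_options query_cache: false`;
    # everything else runs with the cache untouched.
    if worker.class.get_sidekiq_options['query_cache'] == false
      ActiveRecord::Base.uncached { yield }
    else
      yield
    end
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add DisableQueryCacheMiddleware
  end
end

That way each runner worker only has to set one option instead of wrapping its perform body.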
Hey @sobrinho, unfortunately that fix won't solve our issue here (we use find_by_sql).
In general, I'm wondering what other people with long-running jobs that touch large collections do. Do they all disable the query cache?
@erated can't you disable the query cache for find_by_sql?
ActiveRecord::Base.uncached do
  models = Model.find_by_sql(...)
end
I see no other way to work around the memory leak; the uncached block is exactly for these situations.
The core team is going to fix the most common batch operations here, but find_by_sql isn't necessarily a batch query, so it doesn't make sense to skip caching for it by default.
@sobrinho I don't think that's the problem at hand, actually.
Of course, if find_each retains memory until it finishes, that's an issue (and it is resolved by the PR you referenced).
But isn't the real problem here that a worker doesn't clear its query cache throughout its run?
It does clear for you, and if it doesn't, the leak is on the Rails side and we should investigate that.
What Sidekiq does in the currently released version is wrap the entire job execution in an ActiveRecord::Base.cache { ... } call.
_The master version reverted that a bit in #3756, but the same concept is applied using the executor/reloader API; essentially we don't support Rails 4 anymore, and Rails 5 needs a second job execution to fix itself in dev mode. More on this in #3756._
What you can see here is that the query cache is cleared after the cache call.
Maybe the executor/reloader API isn't doing the right thing right now. See the hooks here, the implementation here and the disabling here.
Apparently the code is right, but we can't say for sure without profiling.
I would say this doesn't seem to be a leak in Sidekiq or Rails; otherwise we would be seeing reports like this for the request cycle, the Sidekiq job execution cycle, or both. But again, we can't say for sure without profiling.
And like @mperham said, there is no good tool for profiling memory leaks in Ruby; you can see that in articles like this, this and this.
I wish you good luck hunting this down and if you find the cause at Sidekiq or Rails, I will be glad to code review.
@dbenjamin57 Any updates or end results you might want to share to help people like me facing the same kind of issue?
@dbenjamin57 I found this issue because we have a similar one. Could you share your solution? I'd be curious what caused your memory problem; maybe we're dealing with the same thing.
Hey all, our solution was to disable the cache in the specific actions that caused the memory leak:
def perform(csv_params, limit, offset)
  ActiveRecord::Base.uncached do
    generate_and_upload(@user_id, @collection_sql, @klass, @export_params, limit, offset)
  end
end
@dbenjamin57 if this fixed the issue, you really should report it against Rails to check what's going on; Rails is not supposed to leak in the AR query cache.