Google-cloud-ruby: Memory leak with datastore using grpc version 1.2.2

Created on 13 Apr 2017 · 21 Comments · Source: googleapis/google-cloud-ruby

During a recent bundle update of our app, the version of grpc changed from 1.2.0 to 1.2.2. With the new version of grpc I am seeing runaway memory use. Here is a screenshot from a test I ran on Heroku. The initial flat spot is the Rails app running happily after deploy, and the large ramp was started by a single datastore query. I then left the app alone and didn't make any more requests. Memory use continued to climb until 8:10 am, when Heroku killed the dyno. The app restarted and once again ran happily until I made another datastore request at 11:30 am, and the ramp started over. The deploy of v223 contained a single change, adding gem 'grpc', '1.2.0' to the Gemfile, and the app has been running happily since.

[Screenshot: Heroku memory usage graph]

I can also reproduce this locally using the derailed gem. Has anything changed with version 1.2.2 of the grpc gem that needs a change to the datastore code? Or should this issue be moved to the grpc repo?

p1 acknowledged bug

bmclean

All 21 comments

@bmclean Thanks for opening the issue.

@swcloud Can we get someone from the grpc team to look at this?

blowmage on 13 Apr 2017

Might be related: https://github.com/grpc/grpc/issues/10658

bmclean on 14 Apr 2017

The 'require grpc' happens when the first datastore query occurs (shown above).

bmclean on 14 Apr 2017

Adding @apolcyn

swcloud on 15 Apr 2017

I do think this is running into the same issue as in https://github.com/grpc/grpc/issues/10658.

If there is a fork at any point after a 'require grpc' with grpc version 1.2.2, then unfortunately I'd expect these issues right now. There is a background thread in 1.2.2 involved in grpc channel lifecycles, and if it isn't present in the child process after a fork, then garbage collection of grpc channels will start to fail.
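A minimal sketch of the general Ruby behavior behind this, in plain Ruby rather than grpc's actual code: only the thread that calls fork exists in the child process. The worker thread and the method name thread_counts_across_fork are hypothetical stand-ins, and the sketch assumes a platform where fork is available.

```ruby
# Counts live Ruby threads in the parent and in a forked child.
# The worker below is a hypothetical stand-in for a long-lived
# background thread; it is not grpc's implementation.
def thread_counts_across_fork
  worker = Thread.new { loop { sleep 0.1 } }

  reader, writer = IO.pipe
  pid = fork do
    # In the child, only the thread that called fork is still running.
    reader.close
    writer.write(Thread.list.size.to_s)
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end

  writer.close
  child_count = reader.read.to_i
  reader.close
  Process.wait(pid)

  parent_count = Thread.list.size
  worker.kill
  worker.join

  { parent: parent_count, child: child_count }
end

counts = thread_counts_across_fork
puts "threads in parent: #{counts[:parent]}, threads in child: #{counts[:child]}"
```

Any cleanup that depends on such a thread would have nothing to run it in the child, which is consistent with the failure described here.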

apolcyn on 15 Apr 2017

^ comment above is still a guess of what the app here is doing though, I'm not certain this is the same as in grpc/grpc#10658

apolcyn on 15 Apr 2017

I am not certain either. The Rails app used in the Heroku test was using Puma configured with 1 worker and 5 threads. So it could not have forked another process. The ruby client delays loading grpc until it is used (see here) and the memory starts ramping up immediately after the first datastore query (when grpc was loaded).

bmclean on 15 Apr 2017

I have created a simplified Rails 5 app that uses the derailed gem to show the memory usage. It uses an actual cloud datastore instance (so needs a project ID), authenticated locally through gcloud auth login. Clone the grpc-1.2.2-memory branch from here. Note that when running derailed to profile the memory usage all tests are run without a webserver, as it uses Rack directly. The Rails app is set to use grpc 1.2.0 initially.

cd test/support/datastore_example_rails_app
bundle

Check if the app is connecting to datastore by running:

RAILS_ENV=production GCLOUD_PROJECT=project-id-goes-here rails server

Navigate to localhost:3000 with a browser and create a few users.
Execute the following command to run the derailed benchmark (substitute your datastore project ID).

DERAILED_SKIP_ACTIVE_RECORD=true PATH_TO_HIT=/users GCLOUD_PROJECT=project-id-goes-here TEST_COUNT=5000 bundle exec derailed exec perf:mem_over_time

The profile will take about 15 minutes.
Then change the version of grpc to 1.2.2 in the Gemfile.

bundle update

Run the derailed command again to compare.

bmclean on 15 Apr 2017

You should end up with results that when graphed look something like this:

[Graph: memory usage over time from the derailed benchmark, grpc 1.2.0 vs. 1.2.2]

bmclean on 15 Apr 2017

Still inconclusive, but I'm looking into this and seem to be getting similar results. Thanks for the repro, this is really helpful!

apolcyn on 15 Apr 2017

@bmclean to be certain about the initial problem, you noticed runaway memory usage after only one datastore call? I'm not sure exactly which query it was, do we know if this translates to only one grpc call?

I'm seeing similar results to the benchmark memory comparisons in the graph above, but I'm not sure whether these differences necessarily come from the bug in the original issue (I'm trying to reproduce that).

apolcyn on 17 Apr 2017

@apolcyn Correct. An ancestor query with an equality filter on one property:

query = CloudDatastore.dataset.query 'User'
query.ancestor(ancestor_key)
query.where('disabled', '=', false)
entities = CloudDatastore.dataset.run query

You have a valid point that the memory benchmark isn't exactly the same. It is performing the query repeatedly. I didn't see a way for derailed to hit the url only once but keep monitoring the memory.

So I have added a display of the process memory to the example app. The index page of the app does not perform any queries. Start at the index page, then click to view the users (which performs a query), and then go back to the index page. Refresh the index page every few minutes: with 1.2.0 the memory eventually stays constant, but with 1.2.2 it keeps climbing.
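For reference, one way to read the process memory from Ruby is sketched below. It is not the example app's actual code: the helper names rss_kib_to_mib and current_rss_mib are hypothetical, and it assumes a Unix-like system where ps reports the resident set size in KiB.

```ruby
# Converts the output of `ps -o rss=` (resident set size in KiB) to MiB.
def rss_kib_to_mib(ps_output)
  ps_output.to_s.strip.to_i / 1024.0
end

# Returns the current process's resident set size in MiB,
# or nil if ps is not available on this system.
def current_rss_mib
  rss_kib_to_mib(`ps -o rss= -p #{Process.pid}`)
rescue Errno::ENOENT
  nil
end

rss = current_rss_mib
puts rss ? format('RSS: %.1f MiB', rss) : 'RSS: unavailable (ps not found)'
```

Calling current_rss_mib from a controller or view on each refresh would give the kind of reading described here.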

bmclean on 17 Apr 2017

You can start the Rails server locally with:
RAILS_ENV=production GCLOUD_PROJECT=project-id-goes-here rails server

bmclean on 17 Apr 2017

So it looks like this is definitely a pure grpc problem. It can actually be reproduced easily with a grpc example tweaked to:

  def main
    stub = Helloworld::Greeter::Stub.new('localhost:50051', :this_channel_is_insecure)
    user = ARGV.size > 0 ? ARGV[0] : 'world'
    # Make a single RPC, then leave the channel idle.
    message = stub.say_hello(Helloworld::HelloRequest.new(name: user)).message
    p "Greeting: #{message}"

    # Print this process's resident set size in MB every 30 seconds.
    loop do
      sleep 30
      p "#{(`ps -o rss= -p #{Process.pid}`.to_i * 1024).to_f / 2**20}"
    end

    # Never reached, since the loop above does not exit.
    stub.inspect
  end

  main

(It looks like the background thread that was added in v1.2.x to fix connectivity-related failures is accumulating a lot of memory. It makes a check on a timer, and memory accumulates in proportion to how often that check runs. For example, setting https://github.com/grpc/grpc/blob/master/src/ruby/ext/grpc/rb_channel.c#L425 to a lower value can cause almost all heap allocations to come from within grpc_channel_watch_connectivity_state.)

apolcyn on 18 Apr 2017

Thanks @apolcyn!

bmclean on 18 Apr 2017

A heads up that a fix for this is in progress but won't be immediate.

apolcyn on 19 Apr 2017

In the meantime, going to release a 1.2.5 gem that reverts the connectivity fix (in https://github.com/grpc/grpc/pull/9986), which is causing this issue.

apolcyn on 20 Apr 2017

@apolcyn I deployed grpc version 1.3.4 to our staging environment today and so far everything looks fine.

bmclean on 25 May 2017

24 hours later and memory is still holding steady. Nice work @apolcyn! Are we ok to close this issue?

bmclean on 25 May 2017

Great news @bmclean! Feel free to close this issue if you feel it is resolved.

blowmage on 26 May 2017

Thanks for the updates @bmclean, glad to hear that you're seeing the issue fixed!

apolcyn on 26 May 2017