gRPC version: 1.27.2 (Python)
Environment: python:3.8 Docker image (Debian buster), Python 3.8.1
We are running a gRPC API in Kubernetes and noticed the memory usage of the pods increases almost linearly with time. After ruling out a bunch of other stuff, it looks like there might be a memory leak in the grpcio library.
Locally I can reproduce the issue with the code from examples/python/helloworld and a script that watches the RSS of the server (https://gist.github.com/hackedd/b3a79fc49a76a9fa96945e1118da8190).
I've tried different versions of Python (3.6, 3.7) and of grpcio (1.26.0, 1.27.1, 1.27.2).
Stable memory usage over time.
Increasing memory usage over time.
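For reference, a minimal sketch of the kind of watcher script used here (the real one is in the gist above); it assumes psutil is installed and the server's PID is passed on the command line:

# rss_watch.py - sketch only; assumes psutil and an already running greeter server
import sys
import time

import psutil

def watch(pid, interval=1.0):
    proc = psutil.Process(pid)
    while True:
        rss = proc.memory_info().rss  # resident set size, in bytes
        print("RSS: %.1f MiB" % (rss / 1024 / 1024))
        time.sleep(interval)

if __name__ == "__main__":
    watch(int(sys.argv[1]))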
@yashykt @vjpai any info or leads so that we can look into solving this ourselves?
@zegerius @hackedd I tried the script with Python 3.7 and grpcio 1.27.2. I can see the RSS (physical memory) increase by around 1 MB and then level off. It doesn't look like a memory leak to me.
The extra 1 MB could be used by the Python interpreter or by the 10 threads spawned by the gRPC server.
I have a similar issue with the Datastore client library, which was also noticed because Kubernetes was evicting pods.
A simple test that fetches 1 key uses an extra ~1MiB per request when the process RSS is measured, but Python's mprof shows no increase.
[+] Iteration 0, memory usage 38.9 MiB bytes
[+] Iteration 1, memory usage 45.9 MiB bytes
[+] Iteration 2, memory usage 46.8 MiB bytes
[+] Iteration 3, memory usage 47.6 MiB bytes
[+] Iteration 4, memory usage 48.7 MiB bytes
[+] Iteration 5, memory usage 49.8 MiB bytes
..
[+] Iteration 98, memory usage 136.3 MiB bytes
[+] Iteration 99, memory usage 137.1 MiB bytes
I have created a gist with the PoC code. I can't be certain this relates to the grpc library, so this is for information only, but this was tested on Windows using version grpcio==1.28.1.
I have started a StackOverflow question here about my specific problem. Because I can't guarantee this relates to grpc I kept it separate from this issue. However, it contains more information.
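The gist has the exact PoC; the loop below is only a rough sketch of the pattern described, assuming google-cloud-datastore and psutil are installed and credentials are configured (the kind and key name are made up):

import os

import psutil
from google.cloud import datastore

def rss_mib():
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024

for i in range(100):
    client = datastore.Client()            # a fresh client per iteration
    key = client.key("Kind", "some-name")  # illustrative kind/name
    client.get(key)                        # fetch a single entity
    print("[+] Iteration %d, memory usage %.1f MiB" % (i, rss_mib()))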
After some debugging, I found that my problem with datastore.Client seems to disappear when the GOOGLE_CLOUD_DISABLE_GRPC environment variable is set. So far I have only tested this locally, but I have left a larger application running overnight in Google Kubernetes Engine to confirm.
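A sketch of that workaround, assuming the variable only needs to be set to a non-empty value before the client library is imported (it switches the client from gRPC to the HTTP/JSON transport):

import os

os.environ["GOOGLE_CLOUD_DISABLE_GRPC"] = "true"

from google.cloud import datastore  # imported after the variable is set

client = datastore.Client()  # should now use the REST transport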
Details including valgrind traces are available in this datastore ticket.
If I can provide more useful data then further guidance for debugging would be appreciated.
Potentially related: https://github.com/grpc/grpc/issues/22603
@edeca Thanks for providing the reproduction example. I can reproduce the error, and did some digging. Here is the diff of Python objects across the 100 iterations (RSS grew from 30 MiB to 140 MiB):
types | # objects | total size
============================= | =========== | ============
dict | 4642 | 932.88 KB
list | 7615 | 714.05 KB
str | 8790 | 636.80 KB
collections.deque | 400 | 246.88 KB
int | 2347 | 183.79 KB
collections.OrderedDict | 400 | 159.38 KB
tuple | 1615 | 99.02 KB
set | 225 | 49.47 KB
builtin_function_or_method | 632 | 44.44 KB
function (<lambda>) | 200 | 26.56 KB
urllib3.poolmanager.PoolKey | 100 | 23.44 KB
threading.Condition | 300 | 16.41 KB
_thread.RLock | 300 | 14.06 KB
weakref | 163 | 12.73 KB
ssl.SSLContext | 100 | 11.72 KB
The total size of the additional Python objects (3.26 MiB) cannot account for the 100 MiB increase in RSS, so there could be a leak in the C extension.
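The diff above looks like the table produced by pympler's summary tracker; a sketch of how such a diff can be collected, assuming the pympler package:

from pympler import tracker

def run_one_iteration():
    pass  # placeholder for the PoC body, e.g. one Datastore fetch

tr = tracker.SummaryTracker()
for _ in range(100):
    run_one_iteration()
tr.print_diff()  # prints: types | # objects | total size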
Using a similar mechanism, I found that even a simple gRPC Python insecure_channel might leak if the close method is not invoked. The leak rate is much slower though (10000 iterations for 100 MiB). Using credentials consumes more memory in each iteration, which amplifies this bug.
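A sketch of the churn pattern being measured here; the only variable is whether close() is called on each channel:

import grpc

for _ in range(10000):
    channel = grpc.insecure_channel("localhost:50051")
    # ... create a stub and make calls here ...
    channel.close()  # without this call, RSS grows slowly over the iterations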
In the Datastore library, the client objects are freed without explicitly closing the underlying gRPC channel. This issue troubled us two years ago as well (see https://github.com/grpc/grpc/issues/17515); I can't recall the rationale.
With a local patch that closes the underlying C-Core channel when the Channel object is deallocated, the leak stopped. The patch is submitted as PR https://github.com/grpc/grpc/pull/22855.
I also encountered this problem a few days ago. It is true that there is a memory leak in the gRPC channel. My workaround is to override the __del__ method of the class that holds the channel.
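A sketch of that workaround; the wrapper class and its name are illustrative, not part of grpcio:

import grpc

class ManagedChannel:
    """Holds a gRPC channel and closes it when the wrapper is garbage collected."""

    def __init__(self, target):
        self.channel = grpc.insecure_channel(target)

    def __del__(self):
        self.channel.close()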
Thanks for the fix @lidizheng
@lidizheng Do I need to explicitly call del channel, or is channel.close() good enough?
@DonnieKim411 Since the fix in v1.30.0, there should be no need for an explicit close to prevent the memory leak, but it does require the application to mind the life span of the channel object. For simplicity and clarity, with grpc.secure_channel(...) as channel: and channel.close() are still good options.
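Both options sketched for a secure channel (the target and credentials are placeholders):

import grpc

credentials = grpc.ssl_channel_credentials()

# Option 1: the context manager closes the channel on exit.
with grpc.secure_channel("example.com:443", credentials) as channel:
    pass  # create a stub and make calls here

# Option 2: close explicitly when done with the channel.
channel = grpc.secure_channel("example.com:443", credentials)
try:
    pass  # create a stub and make calls here
finally:
    channel.close()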