gRPC version: 1.27.2 (Python)
Environment: python:3.8 Docker image (Debian buster), Python 3.8.1
We are running a gRPC API in Kubernetes and noticed the memory usage of the pods increases almost linearly with time. After ruling out a bunch of other stuff, it looks like there might be a memory leak in the grpcio library.
Locally I can reproduce the issue with the code from examples/python/helloworld and a script that watches the RSS of the server (https://gist.github.com/hackedd/b3a79fc49a76a9fa96945e1118da8190).
I've tried different versions of Python (3.6, 3.7) and of grpcio (1.26.0, 1.27.1, 1.27.2).
Stable memory usage over time.
Increasing memory usage over time.
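For reference, a minimal sketch of the kind of watcher script used here (the real one is in the gist above); it assumes psutil is installed and the server's PID is passed on the command line:

# rss_watch.py - sketch only; assumes psutil and an already running greeter server
import sys
import time

import psutil

def watch(pid, interval=1.0):
    proc = psutil.Process(pid)
    while True:
        rss = proc.memory_info().rss  # resident set size, in bytes
        print("RSS: %.1f MiB" % (rss / 1024 / 1024))
        time.sleep(interval)

if __name__ == "__main__":
    watch(int(sys.argv[1]))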
@yashykt @vjpai any info or leads so that we can look into solving this ourselves?
@zegerius @hackedd I tried the script with Python 3.7 and grpcio 1.27.2. I can see the RSS (physical memory) increase by around 1 MB and then level off. It doesn't look like a memory leak to me.
The extra 1 MB could be used by the Python interpreter or by the 10 threads spawned by the gRPC server.
I have a similar issue with the Datastore client library, which was also noticed because Kubernetes was evicting pods.
A simple test that fetches 1 key uses an extra ~1MiB per request when the process RSS is measured, but Python's mprof shows no increase.
[+] Iteration 0, memory usage 38.9 MiB bytes
[+] Iteration 1, memory usage 45.9 MiB bytes
[+] Iteration 2, memory usage 46.8 MiB bytes
[+] Iteration 3, memory usage 47.6 MiB bytes
[+] Iteration 4, memory usage 48.7 MiB bytes
[+] Iteration 5, memory usage 49.8 MiB bytes
..
[+] Iteration 98, memory usage 136.3 MiB bytes
[+] Iteration 99, memory usage 137.1 MiB bytes
I have created a gist with the PoC code. I can't be certain this relates to the grpc library, so this is for information only, but this was tested on Windows using version grpcio==1.28.1.
I have started a StackOverflow question here about my specific problem. Because I can't guarantee this relates to grpc I kept it separate from this issue. However, it contains more information.
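The gist has the exact PoC; the loop below is only a rough sketch of the pattern described, assuming google-cloud-datastore and psutil are installed and credentials are configured (the kind and key name are made up):

import os

import psutil
from google.cloud import datastore

def rss_mib():
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024

for i in range(100):
    client = datastore.Client()            # a fresh client per iteration
    key = client.key("Kind", "some-name")  # illustrative kind/name
    client.get(key)                        # fetch a single entity
    print("[+] Iteration %d, memory usage %.1f MiB" % (i, rss_mib()))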
After some debugging, I found that my problem with datastore.Client seems to disappear when the GOOGLE_CLOUD_DISABLE_GRPC environment variable is set. So far I have only tested this locally, but I have left a larger application running overnight in Google Kubernetes Engine to confirm.
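A sketch of that workaround, assuming the variable only needs to be set to a non-empty value before the client library is imported (it switches the client from gRPC to the HTTP/JSON transport):

import os

os.environ["GOOGLE_CLOUD_DISABLE_GRPC"] = "true"

from google.cloud import datastore  # imported after the variable is set

client = datastore.Client()  # should now use the REST transport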
Details including valgrind traces are available in this datastore ticket.
If I can provide more useful data then further guidance for debugging would be appreciated.
Potentially related: https://github.com/grpc/grpc/issues/22603
@edeca Thanks for providing the reproduction example. I can reproduce the error, and did some digging. Here is the diff of Python objects across the 100 iterations (RSS grew from 30 MiB to 140 MiB):
types | # objects | total size
============================= | =========== | ============
dict | 4642 | 932.88 KB
list | 7615 | 714.05 KB
str | 8790 | 636.80 KB
collections.deque | 400 | 246.88 KB
int | 2347 | 183.79 KB
collections.OrderedDict | 400 | 159.38 KB
tuple | 1615 | 99.02 KB
set | 225 | 49.47 KB
builtin_function_or_method | 632 | 44.44 KB
function (<lambda>) | 200 | 26.56 KB
urllib3.poolmanager.PoolKey | 100 | 23.44 KB
threading.Condition | 300 | 16.41 KB
_thread.RLock | 300 | 14.06 KB
weakref | 163 | 12.73 KB
ssl.SSLContext | 100 | 11.72 KB
The total size of the additional Python objects (3.26 MiB) cannot account for the 100 MiB increase in RSS, so there could be a leak in the C extension.
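The diff above looks like the table produced by pympler's summary tracker; a sketch of how such a diff can be collected, assuming the pympler package:

from pympler import tracker

def run_one_iteration():
    pass  # placeholder for the PoC body, e.g. one Datastore fetch

tr = tracker.SummaryTracker()
for _ in range(100):
    run_one_iteration()
tr.print_diff()  # prints: types | # objects | total size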
Using a similar mechanism, I found that even a simple gRPC Python insecure_channel might leak if the close method is not invoked. The leak rate is much slower though (10000 iterations for 100 MiB). Using credentials consumes more memory in each iteration, which amplifies this bug.
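A sketch of the churn pattern being measured here; the only variable is whether close() is called on each channel:

import grpc

for _ in range(10000):
    channel = grpc.insecure_channel("localhost:50051")
    # ... create a stub and make calls here ...
    channel.close()  # without this call, RSS grows slowly over the iterations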
In the Datastore library, the client objects are freed without explicitly closing the underlying gRPC channel. This issue troubled us two years ago as well (see https://github.com/grpc/grpc/issues/17515); I can't recall the rationale.
With a local patch that closes the underlying C-Core channel when the Channel object is deallocated, the leak stopped. The patch is submitted as PR https://github.com/grpc/grpc/pull/22855.
I also encountered this problem a few days ago. It is true that there is a memory leak in the gRPC channel. My workaround is to override the __del__ method of the class that holds the channel.
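A sketch of that workaround; the wrapper class and its name are illustrative, not part of grpcio:

import grpc

class ManagedChannel:
    """Holds a gRPC channel and closes it when the wrapper is garbage collected."""

    def __init__(self, target):
        self.channel = grpc.insecure_channel(target)

    def __del__(self):
        self.channel.close()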
Thanks for the fix @lidizheng
@lidizheng Do I need to explicitly call del channel, or is channel.close() good enough?
@DonnieKim411 Since the fix in v1.30.0, there should be no need for an explicit close to prevent the memory leak, but it does require the application to mind the life span of the channel object. For simplicity and clarity, with grpc.secure_channel(...) as channel: and channel.close() are still good options.
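Both options sketched for a secure channel (the target and credentials are placeholders):

import grpc

credentials = grpc.ssl_channel_credentials()

# Option 1: the context manager closes the channel on exit.
with grpc.secure_channel("example.com:443", credentials) as channel:
    pass  # create a stub and make calls here

# Option 2: close explicitly when done with the channel.
channel = grpc.secure_channel("example.com:443", credentials)
try:
    pass  # create a stub and make calls here
finally:
    channel.close()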