Ray: Possible async memory leak

Created on 25 Jun 2020 · 10 comments · Source: ray-project/ray

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): Python 3.7.4, latest Ray master (specifically commit 80bcbe20c70dd284e0d93ea94fe97d1b34b81433)

It seems that large objects returned from awaited remote tasks are not cleaned up properly.

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

from fastapi import FastAPI
import ray

ray.init(address="auto")

app = FastAPI()

@ray.remote
def f():
    # ~1 MB payload per call ("hello" * 200000 is 1,000,000 characters)
    return {"lines": "hello" * 200000}

@app.get("/")
async def root():
    # The awaited result should go out of scope once the request returns.
    x = await f.remote()
    return {"x": x["lines"]}

Put the script into `test.py` and start it with `uvicorn test:app`.

Hitting the endpoint repeatedly via `curl http://127.0.0.1:8000` fills the object store without cleaning up objects that are no longer referenced (see the "memory" tab in the dashboard).
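
For convenience, here is a small load generator for the script above (a sketch using only the Python standard library; it assumes the default uvicorn address of 127.0.0.1:8000):

# Hammer the endpoint so the leak becomes visible quickly.
import urllib.request

for _ in range(1000):
    with urllib.request.urlopen("http://127.0.0.1:8000/") as resp:
        resp.read()  # each request materializes a ~1 MB object in the object store

While this runs, object store usage should grow in the dashboard's "memory" tab (or in the `ray memory` CLI output) even though no references are held by the client.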

If we cannot run your script, we cannot fix your issue.

  • [x] I have verified my script runs in a clean environment and reproduces the issue.
  • [ ] I have verified the issue also occurs with the latest wheels.
P0 bug

All 10 comments

@pcmoritz does this only happen when using FastAPI? I'm wondering if it's holding a reference internally somehow.

The following script, which doesn't use FastAPI, also crashes:

import asyncio
import time

import ray

ray.init(address="auto")

@ray.remote
def f():
    return {"lines": "hello" * 200000}

async def g():
    x = await f.remote()
    return {"x": x["lines"]}

async def h():
    for _ in range(100):
        for _ in range(1000):
            await g()
        time.sleep(5.0)  # pause between batches (note: this blocks the event loop)

asyncio.run(h())

failing with this output:

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0625 13:53:09.543361 28278 151260608 global_state_accessor.cc:25] Redis server address = 192.168.1.105:6379, is test flag = 0
I0625 13:53:09.544366 28278 151260608 redis_client.cc:141] RedisClient connected.
I0625 13:53:09.546486 28278 151260608 redis_gcs_client.cc:83] RedisGcsClient Connected.
I0625 13:53:09.547072 28278 151260608 service_based_gcs_client.cc:75] ServiceBasedGcsClient Connected.

(pid=27691) E0625 13:54:52.043798 27691 332230080 plasma_store_provider.cc:108] Failed to put object e387d0b9ad0c93cfffffffff030000c001000000 in object store because it is full. Object size is 1000021 bytes.
(pid=27691) Waiting 1000ms for space to free up...
2020-06-25 13:54:52,168 INFO (unknown file):0 -- gc.collect() freed 110 refs in 0.05274486199999728 seconds
(pid=27680) 2020-06-25 13:54:52,157 INFO (unknown file):0 -- gc.collect() freed 20 refs in 0.050736543999988726 seconds
(pid=27690) 2020-06-25 13:54:52,161 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.046222410000012815 seconds
(pid=27688) 2020-06-25 13:54:52,159 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.05371089199999801 seconds
(pid=27687) 2020-06-25 13:54:52,152 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.04291282099998739 seconds
(pid=27679) 2020-06-25 13:54:52,160 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.052180585999991536 seconds
(pid=27681) 2020-06-25 13:54:52,152 INFO (unknown file):0 -- gc.collect() freed 20 refs in 0.04495509299999867 seconds
(pid=27684) 2020-06-25 13:54:52,160 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.05456545900000265 seconds
(pid=27686) 2020-06-25 13:54:52,155 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.04797645499999703 seconds
(pid=27694) 2020-06-25 13:54:52,157 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.05108316799999102 seconds
(pid=27689) 2020-06-25 13:54:52,159 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.050501905999993824 seconds
(pid=27695) 2020-06-25 13:54:52,161 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.04647622799998885 seconds
(pid=27683) 2020-06-25 13:54:52,156 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.04888435800000934 seconds
(pid=27682) 2020-06-25 13:54:52,152 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.04332379699999933 seconds
(pid=27685) 2020-06-25 13:54:52,153 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.04627601999999342 seconds
(pid=27693) 2020-06-25 13:54:52,160 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.044878916000016034 seconds
(pid=27691) 2020-06-25 13:54:52,154 INFO (unknown file):0 -- gc.collect() freed 15 refs in 0.04764199500002064 seconds
(pid=27691) E0625 13:54:53.050061 27691 332230080 plasma_store_provider.cc:108] Failed to put object e387d0b9ad0c93cfffffffff030000c001000000 in object store because it is full. Object size is 1000021 bytes.
(pid=27691) Waiting 2000ms for space to free up...
(pid=27691) E0625 13:54:55.054432 27691 332230080 plasma_store_provider.cc:108] Failed to put object e387d0b9ad0c93cfffffffff030000c001000000 in object store because it is full. Object size is 1000021 bytes.
(pid=27691) Waiting 4000ms for space to free up...
(pid=27691) E0625 13:54:59.058993 27691 332230080 plasma_store_provider.cc:108] Failed to put object e387d0b9ad0c93cfffffffff030000c001000000 in object store because it is full. Object size is 1000021 bytes.
(pid=27691) Waiting 8000ms for space to free up...
(pid=27691) E0625 13:55:07.063592 27691 332230080 plasma_store_provider.cc:108] Failed to put object e387d0b9ad0c93cfffffffff030000c001000000 in object store because it is full. Object size is 1000021 bytes.
(pid=27691) Waiting 16000ms for space to free up...
Traceback (most recent call last):
  File "repo.py", line 22, in <module>
    asyncio.run(h())
  File "/Users/pcmoritz/anaconda3/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/Users/pcmoritz/anaconda3/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
    return future.result()
  File "repo.py", line 19, in h
    await g()
  File "repo.py", line 13, in g
    x = await f.remote()
ray.exceptions.RayTaskError: ray::__main__.f() (pid=27691, ip=192.168.1.105)
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1172, in ray._raylet.CoreWorker.store_task_outputs
  File "python/ray/_raylet.pyx", line 132, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object e387d0b9ad0c93cfffffffff030000c001000000 in object store because it is full. Object size is 1000021 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Try increasing the object store memory available with ray.init(object_store_memory=<bytes>). You can also try setting an option to fallback to LRU eviction when the object store is full by calling ray.init(lru_evict=True). See also: https://docs.ray.io/en/latest/memory-management.html.
(pid=27691) E0625 13:55:23.067600 27691 332230080 plasma_store_provider.cc:118] Failed to put object e387d0b9ad0c93cfffffffff030000c001000000 after 6 attempts. Plasma store status:
(pid=27691) num clients with quota: 0
(pid=27691) quota map size: 0
(pid=27691) pinned quota map size: 0
(pid=27691) allocated bytes: 11083288158
(pid=27691) allocation limit: 11084256460
(pid=27691) pinned bytes: 11083288158
(pid=27691) (global lru) capacity: 11084256460
(pid=27691) (global lru) used: 0%
(pid=27691) (global lru) num objects: 0
(pid=27691) (global lru) num evictions: 0
(pid=27691) (global lru) bytes evicted: 0
(pid=27691) ---
(pid=27691) --- Tip: Use the `ray memory` command to list active objects in the cluster.
(pid=27691) ---
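
The error message itself suggests two mitigations (a sketch; both options are quoted from the message above, and the store size is illustrative):

import ray

# Option 1: give the object store more headroom (20 GB here is illustrative).
ray.init(object_store_memory=20 * 1024**3)

# Option 2: fall back to LRU eviction when the store fills up, at the risk of
# evicting objects that are still in scope:
# ray.init(lru_evict=True)

Neither addresses the leak itself; they only delay the ObjectStoreFullError.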

What's the status of this?

I'll look into it in the next week. Is it blocking for you?

Thanks for looking into it!

It just means we have to restart our Ray cluster every couple of days at the moment :D

Oh I see haha. I'll try to get to it soon!

I believe this was closed by #9228. @pcmoritz, can you check?

@stephanie-wang @pcmoritz Can I close this issue?

Yes, should be fixed by #9228!
