Hi, thanks for Ray. Amazing code.
However, I just launched a learning job in a distributed way (on a machine with 32 cores and 256 GiB of memory). It works fine, but after around 2000 iterations I start getting multiple Ray errors:
There is not enough space to create this object, so evicting xxx objects to free up yyy bytes. The number of bytes in use (before this eviction) is zzz.
You can find an example of a plasma error file (located in /tmp/ray):
plasma_store_server.ec2.internal.ec2-user.log (4).txt
These errors repeat indefinitely.
How can I avoid this?
Thanks a lot
This means your plasma store (object store) is full, so objects are being evicted in an LRU manner to free up memory. You can increase the object store memory size using this option: https://ray.readthedocs.io/en/latest/package-ref.html#cmdoption-ray-start-object-store-memory
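For example, something along these lines (the 20 GB figure is only an illustration; the value is in bytes, and the same setting is available on the CLI as the ray start --object-store-memory flag linked above):

```python
import ray

# Reserve roughly 20 GB of shared memory for the plasma store (value in bytes).
ray.init(object_store_memory=20 * 1024 ** 3)
```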
I'd also like to provide more context. When you pass around objects bigger than 100 KB with Ray (e.g., via ray.put() or as arguments to remote tasks), these objects are stored in the plasma store (object store), which uses the shared memory of your machine (usually mounted as /dev/shm on Linux). The plasma store frees space when it is full by evicting objects in an LRU manner.
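To illustrate, here is a minimal sketch (the array and the task are just placeholders):

```python
import numpy as np
import ray

ray.init()

# A ~8 MB array, well above the ~100 KB threshold, so it is stored in the
# plasma store (shared memory, typically /dev/shm on Linux).
large_array = np.zeros((1000, 1000))
obj_id = ray.put(large_array)

@ray.remote
def column_sums(arr):
    return arr.sum(axis=0)

# The worker reads the array directly from shared memory; ray.get fetches
# the (small) result back to the driver.
result = ray.get(column_sums.remote(obj_id))
```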
Thanks a lot for your informative answer.
I already tried setting the object store memory to a very high value (70% of the total RAM), but my deep learning training still ends up reaching that limit at some point.
So, there are several possible reasons why you see that error. The first thing to ask is: are you actually using that amount of memory? Ray uses a reference counting mechanism under the hood, and objects that are in use are pinned and never evicted from the plasma store. If you store object IDs somewhere in your driver and they are never released (not GC'ed from Python), they are not GC'ed from the plasma store either. This can waste a huge amount of memory. For example,
object_ids = []
for _ in range(10000):
    # Each stored ID pins its object in the plasma store for as long as
    # the ID is reachable from Python, so none of these can be evicted.
    object_ids.append(ray.put(something))
# ... and you just keep running other parts of your code
This wastes memory if you never use those object IDs later. Anything that is pinned by a Python reference is also pinned in the plasma store and will never be evicted unless you set lru_evict=True when you launch the Ray node.
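By contrast, here is a minimal sketch of a pattern that avoids the problem (the payload is just a stand-in for your real data):

```python
import numpy as np
import ray

ray.init()

for _ in range(10000):
    obj_id = ray.put(np.zeros((1000, 1000)))  # stand-in for the real payload
    result = ray.get(obj_id).sum()
    # obj_id is rebound on the next iteration, so the previous object is no
    # longer referenced from Python and the plasma store is free to evict it.
```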
This means you can likely solve your problem by enabling lru_evict. You can specify it via ray.init(lru_evict=True). With this setting, when the plasma store becomes full, the least recently used objects are evicted from storage. (Yes, this can lead to errors, because some of the evicted objects might be ones your application still needs.) But it works fine in most cases.
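A minimal sketch, for the Ray versions discussed here (around 0.8.x) where lru_evict is a ray.init argument:

```python
import ray

# When the plasma store fills up, evict the least recently used objects
# instead of refusing to create new ones. An evicted object that your
# application still needs will cause an error when it is fetched later.
ray.init(lru_evict=True)
```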
If the problem is not your memory management, but your application simply requires a lot of memory, then you should use a multi-node cluster.
Also, here is good documentation on Ray memory management:
https://docs.ray.io/en/latest/memory-management.html
Again, thanks so much for your answer Sang-Cho!
_lru_evict_ worked, but a simple upgrade from ray==0.6.1 to ray==0.8.4 also did the job, without using the _lru_evict_ parameter.
Would be interesting to figure out why the update fixed it though.
However, I now see hundreds of lines in /tmp/ray/session_xxx/logs/python-worker_yyy saying:
"Trying to put an object that already existed in plasma"
For context, I am training an agent using deep learning and parameter servers to distribute the learning across multiple workers.
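Roughly, the setup looks like this (a heavily simplified sketch; the model, gradient computation, and worker count are placeholders, not my real training code):

```python
import numpy as np
import ray

ray.init()

@ray.remote
class ParameterServer:
    def __init__(self, dim):
        self.weights = np.zeros(dim)

    def apply_gradients(self, grad, lr=0.01):
        self.weights -= lr * grad

    def get_weights(self):
        return self.weights

@ray.remote
def compute_gradient(weights):
    # Stand-in for the real per-worker gradient computation.
    return np.random.randn(*weights.shape)

ps = ParameterServer.remote(10)

for _ in range(2000):  # the eviction errors showed up after ~2000 iterations
    weights = ray.get(ps.get_weights.remote())
    grads = ray.get([compute_gradient.remote(weights) for _ in range(4)])
    for grad in grads:
        ps.apply_gradients.remote(grad)
```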
Let me know if you have an idea how to avoid this warning.
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity within the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
"Trying to put an object that already existed in plasma" => This is just warning. I think this might indicate there's some system load to the plasma store, but it should just work.
Btw, I will close this issue since it seems to be resolved! But feel free to create a new issue, or reopen this one, if there is anything you'd like to keep discussing!