Hi, thanks for Ray. Amazing code.
However, I just launched a learning job in a distributed way (on a machine with 32 cores and 256 GiB of memory). It works fine, but after around 2000 iterations I start getting multiple Ray errors:
There is not enough space to create this object, so evicting xxx objects to free up yyy bytes. The number of bytes in use (before this eviction) is zzz.
You can find an example of a plasma error file (located in /tmp/ray):
plasma_store_server.ec2.internal.ec2-user.log (4).txt
These errors repeat indefinitely.
How can I avoid this?
Thanks a lot
This means your plasma store (object store) is full, so objects are being evicted in an LRU manner to free up memory. You can increase the object store memory size using this option: https://ray.readthedocs.io/en/latest/package-ref.html#cmdoption-ray-start-object-store-memory
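For example, something along these lines (the 20 GB figure is only an illustration; the value is in bytes, and the same setting is available on the CLI as the ray start --object-store-memory flag linked above):

```python
import ray

# Reserve roughly 20 GB of shared memory for the plasma store (value in bytes).
ray.init(object_store_memory=20 * 1024 ** 3)
```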
I'd also like to provide more context. When you pass around objects bigger than 100 KB with Ray (e.g., via ray.put() or as arguments to remote tasks), these objects are stored in the plasma store (object store), which uses the shared memory of your machine (usually mounted as /dev/shm on Linux). The plasma store frees space when it is full by evicting objects in an LRU manner.
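To illustrate, here is a minimal sketch (the array and the task are just placeholders):

```python
import numpy as np
import ray

ray.init()

# A ~8 MB array, well above the ~100 KB threshold, so it is stored in the
# plasma store (shared memory, typically /dev/shm on Linux).
large_array = np.zeros((1000, 1000))
obj_id = ray.put(large_array)

@ray.remote
def column_sums(arr):
    return arr.sum(axis=0)

# The worker reads the array directly from shared memory; ray.get fetches
# the (small) result back to the driver.
result = ray.get(column_sums.remote(obj_id))
```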
Thanks a lot for your informative answer.
I already tried setting the object store memory to a very high value (70% of the total RAM), but my deep learning training still ends up reaching that limit at some point.
So, there are several possible reasons why you see that error. The first thing to ask is: are you actually using that amount of memory? Ray uses a reference counting mechanism under the hood, and objects that are in use are pinned and never evicted from the plasma store. If you store object IDs somewhere in your driver and they are never released (not GC'ed from Python), they are not GC'ed from the plasma store either. This can waste a huge amount of memory. For example,
object_ids = []
for _ in range(10000):
    # Each stored ID pins its object in the plasma store for as long as
    # the ID is reachable from Python, so none of these can be evicted.
    object_ids.append(ray.put(something))
# ... and you just keep running other parts of your code
This wastes memory if you never use those object IDs later. Anything that is pinned by a Python reference is also pinned in the plasma store and will never be evicted unless you set lru_evict=True when you launch the Ray node.
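By contrast, here is a minimal sketch of a pattern that avoids the problem (the payload is just a stand-in for your real data):

```python
import numpy as np
import ray

ray.init()

for _ in range(10000):
    obj_id = ray.put(np.zeros((1000, 1000)))  # stand-in for the real payload
    result = ray.get(obj_id).sum()
    # obj_id is rebound on the next iteration, so the previous object is no
    # longer referenced from Python and the plasma store is free to evict it.
```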
This means you can likely solve your problem by enabling lru_evict. You can specify it via ray.init(lru_evict=True). With this setting, when the plasma store becomes full, the least recently used objects are evicted from storage. (Yes, this can lead to errors, because some of the evicted objects might be ones your application still needs.) But it works fine in most cases.
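A minimal sketch, for the Ray versions discussed here (around 0.8.x) where lru_evict is a ray.init argument:

```python
import ray

# When the plasma store fills up, evict the least recently used objects
# instead of refusing to create new ones. An evicted object that your
# application still needs will cause an error when it is fetched later.
ray.init(lru_evict=True)
```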
If the problem is not your memory management, but your application simply requires a lot of memory, then you should use a multi-node cluster.
Also, here is good documentation on Ray memory management:
https://docs.ray.io/en/latest/memory-management.html
Again, thanks so much for your answer Sang-Cho!
_lru_evict_ worked, but a simple upgrade from ray==0.6.1 to ray==0.8.4 also did the job, without using the _lru_evict_ parameter.
Would be interesting to figure out why the update fixed it though.
However, I now see hundreds of lines in /tmp/ray/session_xxx/logs/python-worker_yyy saying:
"Trying to put an object that already existed in plasma"
For context, I am training an agent using deep learning and parameter servers to distribute the learning across multiple workers.
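Roughly, the setup looks like this (a heavily simplified sketch; the model, gradient computation, and worker count are placeholders, not my real training code):

```python
import numpy as np
import ray

ray.init()

@ray.remote
class ParameterServer:
    def __init__(self, dim):
        self.weights = np.zeros(dim)

    def apply_gradients(self, grad, lr=0.01):
        self.weights -= lr * grad

    def get_weights(self):
        return self.weights

@ray.remote
def compute_gradient(weights):
    # Stand-in for the real per-worker gradient computation.
    return np.random.randn(*weights.shape)

ps = ParameterServer.remote(10)

for _ in range(2000):  # the eviction errors showed up after ~2000 iterations
    weights = ray.get(ps.get_weights.remote())
    grads = ray.get([compute_gradient.remote(weights) for _ in range(4)])
    for grad in grads:
        ps.apply_gradients.remote(grad)
```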
Let me know if you have an idea how to avoid this warning.
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity within the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
"Trying to put an object that already existed in plasma" => This is just warning. I think this might indicate there's some system load to the plasma store, but it should just work.
Btw, I will close this issue since it seems to be resolved! But feel free to create a new issue, or reopen this one, if there is anything you'd like to keep discussing!