Ray: [rllib] Training interrupted by RayOutOfMemoryError

Created on 20 Dec 2019 · 9 comments · Source: ray-project/ray

I'm running a PPOTrainer with a custom environment I wrote. After some number of iterations (usually ~2k), the training stops with a RayOutOfMemoryError:

Traceback (most recent call last):
  File "ppo.py", line 37, in <module>
    trainer.train()
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 418, in train
    raise e
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 407, in train
    result = Trainable.train(self)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/tune/trainable.py", line 176, in train
    result = self._train()
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 129, in _train
    fetches = self.optimizer.step()
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/optimizers/multi_gpu_optimizer.py", line 140, in step
    self.num_envs_per_worker, self.train_batch_size)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/optimizers/rollout.py", line 29, in collect_samples
    next_sample = ray_get_and_free(fut_sample)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/utils/memory.py", line 33, in ray_get_and_free
    result = ray.get(object_ids)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/worker.py", line 2121, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray_RolloutWorker:sample() (pid=5283, host=UBUNTU)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/memory_monitor.py", line 130, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node UBUNTU is used (14.9 / 15.67 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
5181    6.83GiB python ppo.py
4889    0.51GiB /home/devid/.vscode/extensions/ms-python.python-2019.11.50794/languageServer.0.5.10/Microsoft.Python
1778    0.24GiB /usr/bin/gnome-shell
5283    0.23GiB ray_RolloutWorker:sample()
5276    0.18GiB ray_worker
5290    0.18GiB ray_worker
5289    0.18GiB ray_worker
5282    0.18GiB ray_worker
5286    0.18GiB ray_worker

In addition, up to 2.12 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.
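For reference, the limits mentioned in the message are meant to be passed to ray.init() before building the trainer. A minimal sketch; the byte values below are only illustrative and would need tuning for a ~16 GB machine:

import ray

ray.init(
    object_store_memory=2 * 1024**3,  # cap the shared-memory object store at ~2 GiB (illustrative value)
    redis_max_memory=1 * 1024**3,     # cap Redis at ~1 GiB (illustrative value)
)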

This is how I initialize and run my trainer:

from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.models import ModelCatalog

import rllib_wrapper.callbacks as cb
from rllib_wrapper.flatland_wrapper import FlatlandEnv
from rllib_wrapper.custom_preprocessor import TreeObsPreprocessor

ModelCatalog.register_custom_preprocessor("tree_obs_prep", TreeObsPreprocessor)
trainer = PPOTrainer(env=FlatlandEnv, config={
    "num_workers": 1,
    "train_batch_size": 4000,
    "model": {
        "custom_preprocessor": "tree_obs_prep"
    },
    "callbacks": {
        "on_episode_end": cb.on_episode_end,
        "on_train_result": cb.on_train_result,
    },
    "log_level": "ERROR"
})

for i in range(100000 + 2):
    trainer.train()
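To see whether the growth happens on the driver side (the 6.83 GiB python ppo.py process in the error above), one rough diagnostic is to log the driver's resident memory each iteration. A sketch, assuming psutil is installed:

import os
import psutil

proc = psutil.Process(os.getpid())
for i in range(100000 + 2):
    trainer.train()
    # Print the driver's resident set size to check whether it grows without bound.
    rss_gib = proc.memory_info().rss / 1024**3
    print(f"iteration {i}: driver RSS = {rss_gib:.2f} GiB")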

This is my custom environment:

import gym
import numpy as np
from ray import rllib
from flatland.envs.rail_env import RailEnv  # RailEnv comes from the flatland-rl package


class FlatlandEnv(rllib.env.MultiAgentEnv):
    def __init__(self, env_config):
        self.env = RailEnv(...)
        self.action_space = gym.spaces.Discrete(5)
        # Each agent observes a flat vector of length 231 (normalized by TreeObsPreprocessor below).
        self.observation_space = np.zeros((1, 231))

    def reset(self):
        self.agents_done = []
        obs = self.env.reset()
        return obs[0]

    def step(self, action_dict):
        obs, rewards, dones, infos = self.env.step(action_dict)
        d = dict()
        r = dict()
        o = dict()
        i = dict()
        # Only return entries for agents that have not finished yet.
        for a in range(len(self.env.agents)):
            if a not in self.agents_done:
                o[a] = obs[a]
                r[a] = rewards[a]
                d[a] = dones[a]
                i[a] = '...'
        d['__all__'] = dones['__all__']

        # Remember which agents are done so they are skipped on later steps.
        for agent, done in dones.items():
            if done and agent != '__all__':
                self.agents_done.append(agent)

        return o, r, d, i

And this is my preprocessor:

import numpy as np
from ray.rllib.models.preprocessors import Preprocessor


class TreeObsPreprocessor(Preprocessor):
    def _init_shape(self, obs_space, options):
        self.step_memory = 2  # TODO: read from options["custom_options"]["step_memory"]
        self.tree_depth = 2
        # Flattened observation shape, e.g. (231,) for the (1, 231) space above.
        return (sum(space.shape[0] for space in obs_space),)

    def transform(self, obs):
        if obs:
            ret = normalize_observation(obs, self.tree_depth, observation_radius=10)
        else:
            ret = np.zeros(231)

        return ret

The full code is available here, and this is my system information:

OS:  Ubuntu 18.04 x86_64
ray:  0.7.6
tensorflow:  2.0.0
python:  3.7.0

I already tried these solutions, but none of them worked:

  1. Lowering the train_batch_size
  2. Setting the memory parameters as suggested by the error message:

     ray.init(
         memory=8000000000,
         redis_max_memory=8000000000,
         object_store_memory=8000000000,
     )

I'm not experienced with Ray or RL in general; could you help me understand why this error happens and how to fix it?
Thanks in advance

Labels: question, rllib, stale

All 9 comments

I'm trying with A3C now and I'll let you know the results :)

PS: @eugenevinitsky I've received your answer as an email but I can't see it on GitHub 🤔

On Sat, Dec 21, 2019 at 2:32 AM Eugene Vinitsky (notifications@github.com) wrote:

@misterdev, do you experience the same leak if you switch from PPO to A3C? I'm curious if this leak is restricted to PPO; that's what I'm seeing.


Ah yeah, I deleted it because it definitely happens in A3C too.

Actually, with A3C I'm at 14k iterations and it's still training 😅

Please let me know if you have any ideas on what to try to fix this, or what could be causing the leak.

I'm just going through and trying to remove things from the algorithm one at a time until I find where the leak is. I'll let you know if I figure it out.

@misterdev for me this issue went away on TF 1.15.0 after upgrading from 1.14.0. It's possible there's a new leak that got introduced between 1.15.0 and 2.0.0.

@misterdev Have you tried this with the requirement ray[rllib]==0.8.0, and separately with the nightly wheel? It might have already been fixed.

I've only tried using TF 1.14.0 as suggested by Eugene, and it works. I can try what you're suggesting in the next few days.

Hi, I'm a bot from the Ray team :)

To help human contributors focus on the most relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.
