Ray: memory leaks during rllib training

Created on 16 May 2020 · 5 comments · Source: ray-project/ray

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS):
OS: Docker on CentOS
Ray: 0.8.4
Python: 3.6

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

  • [x] I have verified my script runs in a clean environment and reproduces the issue.
  • [x] I have verified the issue also occurs with the latest wheels.

Recently, we found that an RL model trained with RLlib depletes memory and throws an OOM error. When I run an RLlib DQN model as below, memory usage grows over time.

rllib train --run=DQN --env=Breakout-v0 --config='{"output": "dqn_breakout_1M/", "output_max_file_size": 50000000,"num_workers":3}' --stop='{"timesteps_total": 1000000}' 

Memory grows as time goes on:

(image: plot of memory usage rising over time)

Hope someone can give some help.

P2 bug rllib

All 5 comments

Does it still happen if you set the buffer size really small, or don't use the output option?

@ericl Sorry for not replying for so long; I have been busy with other work recently.
After a few trials, I found that the rollout worker may be the root cause of the memory leak.
The script below only removes "num_workers": 3 from the config, and without rollout workers there is no sign of a memory leak after running for a while.

rllib train --run=DQN --env=Breakout-v0 --config='{"output": "dqn_breakout_100M", "output_max_file_size": 50000000 }' --stop='{"timesteps_total": 1000000}' 

(image: plot of memory usage staying flat over time)

Besides, we found that during offline data training, a memory leak occurs when we set num_workers > 1, while there is no sign of a leak with num_workers=0.

So I guess there may be a bug in the rollout worker. I will try to locate it more precisely.
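One lightweight way to narrow this down is to sample the driver process's resident memory between training iterations; a value that rises steadily across iterations is consistent with a leak. This is only a sketch using the standard library and measures the current process, not the rollout workers themselves (inspecting worker PIDs would need something like psutil, not shown here):

```python
import resource
import time


def rss_mb():
    """Return this process's peak resident set size in MB.

    Note: ru_maxrss is reported in kilobytes on Linux but in bytes on
    macOS; the conversion below assumes Linux, matching the CentOS
    Docker setup described above.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def log_memory(interval_s=60, steps=5):
    """Sample peak RSS periodically and return the samples.

    Call this (or just rss_mb()) between trainer.train() iterations;
    monotonically increasing readings point to a leak.
    """
    samples = []
    for _ in range(steps):
        samples.append(rss_mb())
        print(f"peak RSS: {samples[-1]:.1f} MB")
        time.sleep(interval_s)
    return samples
```

Comparing these readings for the num_workers=3 and num_workers=0 runs would show whether the growth is confined to the distributed setup.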

Oh, that might be because setting num_workers enables distributed mode. We've fixed some memory leaks in 0.8.5, so it's worth upgrading to see if that helps.

I experience the same problem with APEX-DQN running in local mode with multiple workers. Memory usage rises linearly, and the experiments eventually fail with RayOutOfMemoryError.

I have tried setting buffer_size to a smaller value, though even after some investigation in the docs I could not figure out what exactly the number means (is it a count of samples, or bytes?), and it did not stop the memory error.

The traceback shows RolloutWorker occupying 56 of 64 GB. Feels like a memory leak to me.

Running on 0.8.5
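On the units question: RLlib's DQN buffer_size counts stored transitions (timesteps), not bytes. A rough back-of-the-envelope estimate can translate that into memory. The frame shape and stack depth below are the standard Atari DQN preprocessing values and are assumptions about this particular setup, and the result is an uncompressed upper bound (RLlib compresses observations by default, so real usage is typically much lower):

```python
# Rough replay-buffer memory estimate for Atari DQN, assuming
# buffer_size counts stored transitions (timesteps), not bytes.
FRAME_BYTES = 84 * 84 * 4  # 84x84 uint8 frames, 4-frame stack
# Each transition stores obs + next_obs plus a small amount of
# action/reward/done overhead (16 bytes is a guess, not measured).
BYTES_PER_TRANSITION = 2 * FRAME_BYTES + 16


def replay_buffer_gb(buffer_size):
    """Uncompressed upper bound on replay memory, in GB."""
    return buffer_size * BYTES_PER_TRANSITION / 1024**3


# e.g. a buffer of 50,000 transitions:
print(f"{replay_buffer_gb(50_000):.2f} GB")  # ~2.6 GB uncompressed
```

Since a buffer that size is bounded at a few GB, a RolloutWorker holding 56 GB cannot be explained by the replay buffer alone, which supports the leak hypothesis.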

I have the same issue. I tried decreasing the buffer size, but memory still grows. If you solve this problem, could you please share the solution?

