Stable-baselines: Increasing memory usage throughout run time

Created on 31 Jan 2019 · 7 Comments · Source: hill-a/stable-baselines

Describe the bug
Memory usage keeps increasing while training runs. Each process uses a similar amount of memory, and that amount grows linearly with run time, eventually causing slow performance or system instability on long runs.

Code example

import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import A2C

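# One environment per worker process.
n_cpu = 8
env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for _ in range(n_cpu)])

model = A2C(MlpPolicy, env)
# Effectively unbounded, so the memory growth has time to show.
model.learn(total_timesteps=int(1e10))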

This appears to be the minimal code needed to reproduce the memory issue.
It becomes more obvious if you increase the number of processes or use larger environments.
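
To put numbers on the growth instead of watching a system monitor, something like the following could be called inside each process. It is a minimal sketch, assuming a Unix system (the standard-library `resource` module is not available on Windows). `peak_rss_mib` and `sample_peak_rss` are hypothetical helper names, not part of stable-baselines, and the list-appending lambda is only a stand-in for a leaking workload.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of the calling process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def sample_peak_rss(step, iterations, every):
    """Call step() `iterations` times, sampling peak RSS every `every` calls."""
    samples = []
    for i in range(1, iterations + 1):
        step()
        if i % every == 0:
            samples.append((i, peak_rss_mib()))
    return samples


# Stand-in workload (an assumption for illustration): keep 1 MiB per step alive.
held = []
samples = sample_peak_rss(lambda: held.append(b"x" * (1024 * 1024)), 20, 5)
for iteration, mib in samples:
    print("iteration %d: peak RSS %.1f MiB" % (iteration, mib))
```

This only measures the process that calls it, so with SubprocVecEnv it would have to run inside each worker as well as in the main process. Peak RSS is a high-water mark, so it shows growth but never a decrease.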

I first ran into this on some custom environments from https://github.com/rubenrtorrado/GVGAI_GYM. I also tried DQN and a CnnPolicy, and both still showed the problem.

System Info
Describe the characteristics of your environment:

  • Tried on an Ubuntu desktop and a MacBook Pro
  • Library installed via pip (tensorflow too)
  • GTX 1080 in the desktop, no GPU in the laptop
  • Python 3.6.8
  • TensorFlow 1.12.0 (GPU version on the desktop)

Most helpful comment

I have also encountered this issue, as has someone else on the openai repo, https://github.com/openai/baselines/issues/804.

I have determined this to be a bug with the latest version of numpy, in particular 1.16.0. Downgrading to numpy==1.15.4 fixed the issue for me.

Below is a minimal, complete, and verifiable example of the bug:

from multiprocessing import Process, Pipe
from os import getpid
import numpy as np

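def f(conn):
    # Child: send a fresh array, then wait for the parent's reply, forever.
    print("Sub pid:", getpid())
    while True:
        a = np.zeros(shape=(64, 64, 8), dtype=np.uint8)
        conn.send(a)  # the array is pickled to cross the pipe
        conn.recv()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    print("Main pid:", getpid())
    p.start()
    # Parent: receive each array and acknowledge it so the child sends the next.
    while True:
        a = parent_conn.recv()
        parent_conn.send('')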

On numpy==1.16.0 this quickly runs out of memory (OOM); on 1.15.4 it is fine. Their release notes say they changed to a different pickling protocol, which I guess is the source of the bug.

edit: https://github.com/numpy/numpy/issues/12896

All 7 comments

Thank you very much. I can confirm that downgrading fixed it.

Other than pinning the numpy requirement to version 1.15.4, I don't see a permanent fix for now.

@hill-a @erniejunior any thoughts on that? Maybe a warning in the doc?

EDIT: @Kuldr @ArthurFirmino does this still happen with 1.16.1? (released 13 hours ago)

Unless we are using numpy incorrectly, I would advise against implementing workarounds for numpy bugs and would just mention it in the doc.

I haven't tested it with numpy 1.16.1 yet. I can try later on.

@hill-a @Kuldr Confirmed numpy 1.16.1 fixes the issue
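
For anyone who wants a start-up check in their own scripts rather than a doc note, a guard along these lines might do. It is a minimal sketch based only on what was reported in this thread (1.16.0 leaks, 1.15.4 and 1.16.1 do not). `is_affected_numpy` and `warn_if_affected` are hypothetical names, not part of numpy or stable-baselines, and pre-release or dev version strings are deliberately left unjudged.

```python
import re
import warnings

# Assumption taken from this thread: only the 1.16.0 release leaks;
# 1.15.4 and 1.16.1 were reported as fine.
AFFECTED_RELEASE = (1, 16, 0)


def is_affected_numpy(version):
    """Return True if `version` is exactly the release reported to leak."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        # Pre-releases and dev builds (e.g. "1.16.0rc1") are not judged here.
        return False
    return tuple(int(part) for part in match.groups()) == AFFECTED_RELEASE


def warn_if_affected(version):
    """Warn about the affected release; return True if a warning was issued."""
    if is_affected_numpy(version):
        warnings.warn(
            "numpy %s was reported to leak memory when arrays are sent "
            "between processes; 1.15.4 and 1.16.1 were reported as fine."
            % version,
            RuntimeWarning,
        )
        return True
    return False
```

Typical use would be `import numpy as np` followed by `warn_if_affected(np.__version__)` before creating the SubprocVecEnv.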

Thanks for testing it =)
So there is no need to update the requirements (1.16.0 was out for only 19 days).
Closing the issue.