Describe the bug
Memory usage grows continuously during training. Each process uses a similar amount of memory, and that amount increases linearly with run time, eventually causing slow performance or system instability on long runs.
Code example
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import A2C
n_cpu = 8
env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for _ in range(n_cpu)])
model = A2C(MlpPolicy, env)
model.learn(total_timesteps=int(1e10))
This appears to be the minimal code needed to view the memory issues.
It becomes more obvious if you increase the number of processes, or in larger environments.
I initially ran into this on some custom environments from https://github.com/rubenrtorrado/GVGAI_GYM, and also tried DQN and a CnnPolicy, both of which still had the problem.
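To put a number on the "increases linearly" observation, one option is to sample each worker's memory at intervals and fit a straight line. The helper below is only a sketch: `growth_rate` is a hypothetical name, and how the samples are collected (for example from `ps` or a process monitor) is left open.

```python
def growth_rate(samples):
    """Least-squares slope of memory over time.

    samples: list of (seconds, bytes) pairs. Returns bytes per second.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    # Slope = covariance(t, m) / variance(t)
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one time point")
    return cov / var


# Three samples growing by 100 bytes each second.
print(growth_rate([(0, 100), (1, 200), (2, 300)]))  # 100.0
```

A slope that stays clearly positive over a long run, with a similar value for every worker process, would match the behaviour described in this report.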
I have also encountered this issue, as has someone else on the openai repo, https://github.com/openai/baselines/issues/804.
I have determined this to be a bug with the latest version of numpy, in particular 1.16.0. Downgrading to numpy==1.15.4 fixed the issue for me.
Below is a minimal, complete, and verifiable example of the bug:
from multiprocessing import Process, Pipe
from os import getpid
import numpy as np

def f(conn):
    print("Sub pid:", getpid())
    while True:
        # Send a fresh array every iteration; the pipe pickles it on send.
        a = np.zeros(shape=(64,64,8), dtype=np.uint8)
        conn.send(a)
        conn.recv()  # wait for the parent's reply before sending again

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    print("Main pid:", getpid())
    p.start()
    while True:
        a = parent_conn.recv()
        parent_conn.send('')
On numpy==1.16.0 this quickly runs out of memory (OOM); on 1.15.4 it is fine. The numpy release notes say they changed to a different pickling protocol, which I guess is the source of the bug.
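If the pickling guess is right, the growth should be reproducible in a single process, without the pipe. The sketch below is a bounded variant that round-trips objects through `pickle` (which is what `Connection.send` and `recv` do internally) and reports the net allocation growth seen by `tracemalloc`. The name `traced_growth` is hypothetical, and whether the 1.16.0 leak is actually visible to `tracemalloc` is an assumption I have not verified.

```python
import pickle
import tracemalloc


def traced_growth(make_payload, n_rounds=200):
    """Net bytes still allocated after n_rounds pickle round trips."""
    # Warm-up round so one-time caches are not counted as growth.
    pickle.loads(pickle.dumps(make_payload(), protocol=pickle.HIGHEST_PROTOCOL))

    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(n_rounds):
            payload = make_payload()
            # One simulated message: serialize, then deserialize and discard.
            pickle.loads(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
            del payload
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Same payload size as the array in the example above, but using a
# plain bytearray so the sketch runs without numpy installed.
print(traced_growth(lambda: bytearray(64 * 64 * 8)))
```

To test numpy itself, pass `lambda: np.zeros(shape=(64,64,8), dtype=np.uint8)` as `make_payload` and compare the result across numpy versions. A result that scales with `n_rounds` would point at the serialization path rather than at the pipe.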
Thank you very much, I can confirm that the fix of downgrading worked.
Other than setting the numpy requirement to be version 1.15.4 I don't see a permanent fix for now.
@hill-a @erniejunior any thoughts about that? Maybe a warning in the doc?
EDIT: @Kuldr @ArthurFirmino is this still an issue with 1.16.1? (released 13 hours ago)
Unless we are using numpy in a wrong way, I would advise against implementing workarounds for numpy bugs and just mention it in the doc.
I haven't tested it with numpy 1.16.1 yet; I can try later on.
@hill-a @Kuldr Confirmed numpy 1.16.1 fixes the issue
Thanks for testing it =)
So no need to update the requirements (it was out for only 19 days).
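For anyone stuck on a pinned environment, a small startup check is one way to surface the problem early. This is only a sketch, not something added to stable-baselines: the names `AFFECTED_NUMPY_VERSIONS`, `is_affected_numpy` and `warn_if_affected` are hypothetical, and the version set reflects only what was reported in this thread (1.16.0 leaks; 1.15.4 and 1.16.1 do not).

```python
import warnings

# Assumption: only the exact 1.16.0 release is affected, per this thread.
AFFECTED_NUMPY_VERSIONS = {"1.16.0"}


def is_affected_numpy(version):
    """Return True if the given numpy version string is a known-bad release."""
    return version.strip() in AFFECTED_NUMPY_VERSIONS


def warn_if_affected(version):
    """Emit a RuntimeWarning for a known-bad numpy version; return whether it fired."""
    if is_affected_numpy(version):
        warnings.warn(
            "numpy %s was reported to leak memory when arrays are sent "
            "between processes; consider 1.15.4 or 1.16.1." % version,
            RuntimeWarning,
        )
        return True
    return False
```

Typical use would be `warn_if_affected(np.__version__)` right after `import numpy as np`, before creating a SubprocVecEnv.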
Closing the issue.
Related numpy issue: https://github.com/numpy/numpy/issues/12896