Stable-baselines: Converting OpenAI vectorized env to stable baselines vectorized env?

Created on 28 Jan 2020  ·  16 Comments  ·  Source: hill-a/stable-baselines

I am trying to take an OpenAI baselines environment (the vectorized procgen env) and convert it to a stable-baselines vectorized environment. Below is my naive attempt. I get the following error, which suggests that a function is expected, but I am not sure how to resolve it. Any help is appreciated.

```
from procgen import ProcgenEnv

from baselines.common.vec_env import (
    VecExtractDictObs,
    VecMonitor,
    VecFrameStack,
    VecNormalize,
)
# DummyVecEnv comes from stable-baselines (see the traceback), not from baselines
from stable_baselines.common.vec_env import DummyVecEnv

venv = ProcgenEnv(num_envs=200, env_name="coinrun")
venv = VecExtractDictObs(venv, "rgb")
venv = VecMonitor(venv=venv, filename=None, keep_buf=200)
venv = VecNormalize(venv=venv, ob=False)

# This is the line that raises the TypeError shown below
venv = DummyVecEnv(venv)
```

```
TypeError Traceback (most recent call last)
in
16 venv = VecNormalize(venv=venv, ob=False)
17
---> 18 venv = DummyVecEnv(venv)

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/vec_env/dummy_vec_env.py in __init__(self, env_fns)
18
19 def __init__(self, env_fns):
---> 20 self.envs = [fn() for fn in env_fns]
21 env = self.envs[0]
22 VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)

TypeError: 'VecNormalize' object is not iterable
```
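The traceback stops at `self.envs = [fn() for fn in env_fns]`, so the constructor iterates over its argument and calls each item. The sketch below reproduces only that one line in a stand-in class to show the expected input shape. `TinyDummyVecEnv` and `FakeEnv` are hypothetical names used for illustration and are not part of stable-baselines.

```python
class FakeEnv:
    """Hypothetical single environment, used only for illustration."""

    def __init__(self, name):
        self.name = name


class TinyDummyVecEnv:
    """Stand-in that copies the constructor line shown in the traceback."""

    def __init__(self, env_fns):
        # Each item must be a zero-argument callable returning one env.
        self.envs = [fn() for fn in env_fns]
        self.num_envs = len(self.envs)


# Works: a list of callables, one per environment.
vec = TinyDummyVecEnv([lambda: FakeEnv("coinrun") for _ in range(4)])
print(vec.num_envs)

# Fails like the issue: a single object is not an iterable of callables.
try:
    TinyDummyVecEnv(FakeEnv("coinrun"))
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

In other words, the argument is expected to be a list of environment factories rather than an environment that is already vectorized.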

custom gym env enhancement question

Most helpful comment

You have to create a wrapper similar to VecExtractDictObs that extracts the "rgb" item from the observation dictionary for each environment individually.

This already exists: https://github.com/openai/gym/blob/master/gym/wrappers/filter_observation.py

All 16 comments

Hello,
I think you should use the gym.Env (created with gym.make, cf. the README) instead of the ProcgenEnv.
And why do you want to convert a VecNormalize to a DummyVecEnv? (VecNormalize is already a VecEnv...)

I am trying to convert it because when I try to train a PPO agent directly on the venv I get the following error:
```
Wrapping the env in a DummyVecEnv.

ValueError Traceback (most recent call last)
in
5
6 model = PPO2(MlpPolicy, venv, verbose=1)
----> 7 model.learn(total_timesteps=10000)

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/ppo2/ppo2.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps)
317 self._setup_learn()
318
--> 319 runner = Runner(env=self.env, model=self, n_steps=self.n_steps, gamma=self.gamma, lam=self.lam)
320 self.episode_reward = np.zeros((self.n_envs,))
321

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/ppo2/ppo2.py in __init__(self, env, model, n_steps, gamma, lam)
447 :param lam: (float) Factor for trade-off of bias vs variance for Generalized Advantage Estimator
448 """
--> 449 super().__init__(env=env, model=model, n_steps=n_steps)
450 self.lam = lam
451 self.gamma = gamma

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/runners.py in __init__(self, env, model, n_steps)
17 self.batch_ob_shape = (n_env*n_steps,) + env.observation_space.shape
18 self.obs = np.zeros((n_env,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
---> 19 self.obs[:] = env.reset()
20 self.n_steps = n_steps
21 self.states = model.initial_state

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/vec_env/dummy_vec_env.py in reset(self)
51 for env_idx in range(self.num_envs):
52 obs = self.envs[env_idx].reset()
---> 53 self._save_obs(env_idx, obs)
54 return self._obs_from_buf()
55

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/vec_env/dummy_vec_env.py in _save_obs(self, env_idx, obs)
70 for key in self.keys:
71 if key is None:
---> 72 self.buf_obs[key][env_idx] = obs
73 else:
74 self.buf_obs[key][env_idx] = obs[key]

ValueError: could not broadcast input array from shape (200,64,64,3) into shape (64,64,3)
```
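One plausible reading of this traceback: the log line "Wrapping the env in a DummyVecEnv." suggests that the vectorized env was treated as a single environment, so `_save_obs` tries to store a whole batch of 200 observations in the slot reserved for one. The NumPy sketch below reproduces only the failing assignment. The variable names are illustrative, and the buffer layout is an assumption based on the shapes in the error message.

```python
import numpy as np

num_envs_seen_by_wrapper = 1   # assumption: the wrapper believes it holds one env
obs_shape = (64, 64, 3)        # per-environment shape from the error message

# One slot per environment, mirroring buf_obs[key][env_idx] = obs.
buf_obs = np.zeros((num_envs_seen_by_wrapper,) + obs_shape)

# The already-vectorized env returns a full batch from reset().
batched_obs = np.zeros((200,) + obs_shape)

try:
    buf_obs[0] = batched_obs
except ValueError as err:
    error_message = str(err)
    print(error_message)
```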

Two things:

  • The environment you are trying to learn uses images, so you need CnnPolicy, not MlpPolicy
  • You (probably) do not have to wrap your environment in new VecEnvs because, as araffin mentioned, the ProcGen environment is already vectorized and you can pass it to the model as it is.

There is a chance that the VecEnv coming out of ProcGen does not work in stable-baselines as it is, which should probably be treated as a bug and fixed (it being such a nice environment).

I get the same error with CnnPolicy:

ValueError: could not broadcast input array from shape (200,64,64,3) into shape (64,64,3)

Also, calling gym.make on a procgen environment actually calls ProcgenEnv in gym.

Here I forced an error with gym.make by giving it a bad keyword and it shows that it calls ProcgenEnv at the end:
```
~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/gym/envs/registration.py in make(id, **kwargs)
154
155 def make(id, **kwargs):
--> 156 return registry.make(id, **kwargs)
157
158 def spec(id):

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/gym/envs/registration.py in make(self, path, **kwargs)
99 logger.info('Making new env: %s', path)
100 spec = self.spec(path)
--> 101 env = spec.make(**kwargs)
102 # We used to have people override _reset/_step rather than
103 # reset/step. Set _gym_disable_underscore_compat = True on

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/gym/envs/registration.py in make(self, **kwargs)
71 else:
72 cls = load(self.entry_point)
---> 73 env = cls(**_kwargs)
74
75 # Make the enviroment aware of which spec it came from.

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/procgen/gym_registration.py in make_env(**kwargs)
16
17 def make_env(**kwargs):
---> 18 venv = ProcgenEnv(num_envs=1, num_threads=0, **kwargs)
19 env = Scalarize(venv)
20 env = RemoveDictObs(env, key="rgb")

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/procgen/env.py in __init__(self, num_envs, env_name, center_agent, options, use_generated_assets, paint_vel_info, distribution_mode, **kwargs)
182 }
183 )
--> 184 super().__init__(num_envs, env_name, options, **kwargs)
```

PS: as mentioned in the issue template, please use the markdown code blocks
for both code and stack traces.

EDIT: Did you check the env with the env_checker? (This is also mentioned in the issue template.)

And to clarify @araffin's comment: Did you remove the extra VecEnvs? What happens if you do just

venv = ProcgenEnv(num_envs=200, env_name="coinrun")
model = PPO2(CnnPolicy, venv, verbose=1)

In both cases (removing extra VecEnvs and keep them) I get the following error when using env_checker:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-38-3559b4aab3c7> in <module>
----> 1 check_env(venv)

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/env_checker.py in check_env(env, warn, skip_render_check)
    179         True by default (useful for the CI)
    180     """
--> 181     assert isinstance(env, gym.Env), ("You environment must inherit from gym.Env class "
    182                                       " cf https://github.com/openai/gym/blob/master/gym/core.py")
    183 

AssertionError: You environment must inherit from gym.Env class  cf https://github.com/openai/gym/blob/master/gym/core.py

If I remove the extra calls, I get the same error:

```

Wrapping the env in a DummyVecEnv.

ValueError Traceback (most recent call last)
in
7 venv = VecExtractDictObs(venv, "rgb")
8 model = PPO2(CnnPolicy, venv, verbose=1)
----> 9 model.learn(total_timesteps=10000)

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/ppo2/ppo2.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps)
317 self._setup_learn()
318
--> 319 runner = Runner(env=self.env, model=self, n_steps=self.n_steps, gamma=self.gamma, lam=self.lam)
320 self.episode_reward = np.zeros((self.n_envs,))
321

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/ppo2/ppo2.py in __init__(self, env, model, n_steps, gamma, lam)
447 :param lam: (float) Factor for trade-off of bias vs variance for Generalized Advantage Estimator
448 """
--> 449 super().__init__(env=env, model=model, n_steps=n_steps)
450 self.lam = lam
451 self.gamma = gamma

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/runners.py in __init__(self, env, model, n_steps)
17 self.batch_ob_shape = (n_env*n_steps,) + env.observation_space.shape
18 self.obs = np.zeros((n_env,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
---> 19 self.obs[:] = env.reset()
20 self.n_steps = n_steps
21 self.states = model.initial_state

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/vec_env/dummy_vec_env.py in reset(self)
51 for env_idx in range(self.num_envs):
52 obs = self.envs[env_idx].reset()
---> 53 self._save_obs(env_idx, obs)
54 return self._obs_from_buf()
55

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/vec_env/dummy_vec_env.py in _save_obs(self, env_idx, obs)
70 for key in self.keys:
71 if key is None:
---> 72 self.buf_obs[key][env_idx] = obs
73 else:
74 self.buf_obs[key][env_idx] = obs[key]

ValueError: could not broadcast input array from shape (200,64,64,3) into shape (64,64,3)
```

If I also remove the VecExtractDictObs call, I get this error instead:

```

Wrapping the env in a DummyVecEnv.

NotImplementedError Traceback (most recent call last)
in
5
6 venv = ProcgenEnv(num_envs=200, env_name="coinrun")
----> 7 model = PPO2(CnnPolicy, venv, verbose=1)
8 model.learn(total_timesteps=10000)

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/ppo2/ppo2.py in __init__(self, policy, env, gamma, n_steps, ent_coef, learning_rate, vf_coef, max_grad_norm, lam, nminibatches, noptepochs, cliprange, cliprange_vf, verbose, tensorboard_log, _init_setup_model, policy_kwargs, full_tensorboard_log, seed, n_cpu_tf_sess)
102
103 if _init_setup_model:
--> 104 self.setup_model()
105
106 def _get_pretrain_placeholders(self):

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/ppo2/ppo2.py in setup_model(self)
132
133 act_model = self.policy(self.sess, self.observation_space, self.action_space, self.n_envs, 1,
--> 134 n_batch_step, reuse=False, **self.policy_kwargs)
135 with tf.variable_scope("train_model", reuse=True,
136 custom_getter=tf_util.outer_scope_getter("train_model")):

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/policies.py in __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse, **_kwargs)
599 def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs):
600 super(CnnPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse,
--> 601 feature_extraction="cnn", **_kwargs)
602
603

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/policies.py in __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse, layers, net_arch, act_fun, cnn_extractor, feature_extraction, **kwargs)
538 act_fun=tf.tanh, cnn_extractor=nature_cnn, feature_extraction="cnn", **kwargs):
539 super(FeedForwardPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse,
--> 540 scale=(feature_extraction == "cnn"))
541
542 self._kwargs_check(feature_extraction, kwargs)

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/policies.py in __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse, scale)
219 def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, scale=False):
220 super(ActorCriticPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse,
--> 221 scale=scale)
222 self._pdtype = make_proba_dist_type(ac_space)
223 self._policy = None

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/policies.py in __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse, scale, obs_phs, add_action_ph)
115 with tf.variable_scope("input", reuse=False):
116 if obs_phs is None:
--> 117 self._obs_ph, self._processed_obs = observation_input(ob_space, n_batch, scale=scale)
118 else:
119 self._obs_ph, self._processed_obs = obs_phs

~/Documents/drl/research/venv_proc/lib/python3.6/site-packages/stable_baselines/common/input.py in observation_input(ob_space, batch_size, name, scale)
49 else:
50 raise NotImplementedError("Error: the model does not support input space of type {}".format(
---> 51 type(ob_space).__name__))

NotImplementedError: Error: the model does not support input space of type Dict
```

If I just use gym.make as suggested it works. The problem is that I can't specify the number of environments I want to train on (say 200 like before) since in the gym code it calls ProcgenEnv with num_envs=1.

env = gym.make("procgen:procgen-maze-v0", distribution_mode='easy')
model = PPO2(CnnPolicy, env, verbose=1)
model.learn(total_timesteps=10000)
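Since each `gym.make` call yields a single environment here, one possible workaround is to build one factory function per environment and hand the list to a `DummyVecEnv`-style constructor, which calls each factory once. The sketch below shows only the factory pattern. `make_single_env` is a hypothetical stand-in for `gym.make`, and the per-environment `seed` argument is an assumption added for illustration.

```python
def make_single_env(env_id, seed):
    """Hypothetical stand-in for gym.make(env_id); returns a plain dict."""
    return {"id": env_id, "seed": seed}


def make_env_fn(env_id, seed):
    """Return a zero-argument callable that builds one environment."""

    def _init():
        # env_id and seed are captured per call, so each factory is distinct.
        return make_single_env(env_id, seed)

    return _init


num_envs = 8  # the issue uses 200; kept small for the example
env_fns = [make_env_fn("procgen:procgen-maze-v0", seed) for seed in range(num_envs)]

# A DummyVecEnv-style constructor would now call each factory once.
envs = [fn() for fn in env_fns]
print(len(envs), envs[0]["seed"], envs[-1]["seed"])
```

Using a helper such as `make_env_fn` rather than a bare `lambda` inside the loop avoids every factory closing over the same loop variable. Note that stepping many single environments in sequence gives up procgen's native batching, so it may well be slower.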

Ah right, because by default the environment works on Dicts. Am I correct to assume that this one did not work either?

venv = ProcgenEnv(num_envs=200, env_name="coinrun")
venv = VecExtractDictObs(venv, "rgb")
model = PPO2(CnnPolicy, venv, verbose=1)

If that does not work you can still create the environments manually. You have to create a wrapper similar to VecExtractDictObs that extracts the "rgb" item from the observation dictionary for each environment individually.
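A minimal sketch of such a per-environment wrapper is shown here, written without a gym dependency so that it runs on its own. `ExtractDictObs` and `FakeDictEnv` are hypothetical names. A real version would likely subclass `gym.ObservationWrapper` and would also need to replace the Dict observation space with the space of the extracted item.

```python
class ExtractDictObs:
    """Hypothetical per-environment wrapper: returns obs[key], not the dict."""

    def __init__(self, env, key):
        self.env = env
        self.key = key

    def reset(self):
        # Unwrap the dictionary observation returned by the inner env.
        return self.env.reset()[self.key]

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs[self.key], reward, done, info


class FakeDictEnv:
    """Stand-in for a single env whose observations are dictionaries."""

    def reset(self):
        return {"rgb": [[0, 0, 0]]}

    def step(self, action):
        return {"rgb": [[1, 1, 1]]}, 0.0, False, {}


env = ExtractDictObs(FakeDictEnv(), "rgb")
first_obs = env.reset()
next_obs, reward, done, info = env.step(0)
print(first_obs, next_obs)
```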

That's correct, I get the same error as before (when including those extra venv calls):

ValueError: could not broadcast input array from shape (200,64,64,3) into shape (64,64,3)

I see. I will attempt to create the wrapper. Thank you for the suggestion.

Thanks for trying this out! Sounds like a bug-ish thing we should fix, or at least offer tools to avoid redoing work of that environment.

You have to create a wrapper similar to VecExtractDictObs that extracts the "rgb" item from the observation dictionary for each environment individually.

This already exists: https://github.com/openai/gym/blob/master/gym/wrappers/filter_observation.py
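One caveat, as far as the linked source shows: `FilterObservation` keeps a subset of keys but still returns a dictionary, so the observation space would presumably remain a Dict, which is the space type rejected in the NotImplementedError above. The plain-Python sketch below contrasts filtering with extracting. Both helper functions and the `"extra"` key are hypothetical and only illustrate the difference.

```python
def filter_observation(obs, filter_keys):
    """Keep only the listed keys; the result is still a dictionary."""
    return {key: value for key, value in obs.items() if key in filter_keys}


def extract_observation(obs, key):
    """Return the value stored under one key, dropping the dictionary."""
    return obs[key]


# "extra" is a made-up key, present only to show what filtering removes.
obs = {"rgb": [[0, 0, 0]], "extra": 1}

filtered = filter_observation(obs, ["rgb"])
extracted = extract_observation(obs, "rgb")
print(type(filtered).__name__, type(extracted).__name__)
```

If that reading is right, an extraction step would still be needed after filtering before the observations reach a CnnPolicy.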
