Stable-baselines: Why does env.render() create multiple render screens? | LSTM policy predict with one env [question]

Created on 16 Jan 2019 · 24 Comments · Source: hill-a/stable-baselines

When I run the code example from the docs for CartPole multiprocessing, it renders one window with all envs playing the game. It also renders individual windows with the same envs playing the same games.

import gym
import numpy as np

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.common import set_global_seeds
from stable_baselines import ACKTR

def make_env(env_id, rank, seed=0):
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param rank: (int) index of the subprocess
    :param seed: (int) the initial seed for RNG
    """
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return _init

env_id = "CartPole-v1"
num_cpu = 4  # Number of processes to use
# Create the vectorized environment
env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])

model = ACKTR(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)

obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

System Info
Describe the characteristic of your environment:

  • Vanilla install, followed the docs using pip
  • gpus: 2-gtx-1080ti's
  • Python version 3.6.5
  • Tensorflow version 1.12.0
  • ffmpeg 4.0

Additional context
cartpole

question

All 24 comments

Hey,

Well, this seems to be on OpenAI's side.
In the CartPole render function, there is no check for whether a rendering window or an RGB image was requested.

Normally, when mode='rgb_array' is used, no window is rendered, as defined by the Gym doc:

    def render(self, mode='human'):
        """Renders the environment.
        The set of supported modes varies per environment. (And some
        environments do not support rendering at all.) By convention,
        if mode is:
        - human: render to the current display or terminal and
          return nothing. Usually for human consumption.
        - rgb_array: Return an numpy.ndarray with shape (x, y, 3),
          representing RGB values for an x-by-y pixel image, suitable
          for turning into a video.
        - ansi: Return a string (str) or StringIO.StringIO containing a
          terminal-style text representation. The text can include newlines
          and ANSI escape sequences (e.g. for colors).

So you get a rendering window for each environment due to CartPole, and one tiled one from SubprocVecEnv.
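
To illustrate where the single tiled window comes from, here is a minimal sketch of how per-env rgb_array frames could be combined into one image. `tile_frames` is a hypothetical helper written for this illustration, not the function stable-baselines uses, and it assumes every env returns a frame of the same shape.

```python
import math

import numpy as np


def tile_frames(frames):
    """Arrange same-shaped RGB frames into one roughly square grid image.

    Hypothetical helper for illustration only.
    :param frames: array-like of shape (n_envs, height, width, channels)
    :return: array of shape (rows * height, cols * width, channels)
    """
    frames = np.asarray(frames)
    n, h, w, c = frames.shape
    cols = int(math.ceil(math.sqrt(n)))
    rows = int(math.ceil(n / cols))
    # Pad with black frames so that the grid is completely filled
    padding = np.zeros((rows * cols - n, h, w, c), dtype=frames.dtype)
    grid = np.concatenate([frames, padding], axis=0)
    grid = grid.reshape(rows, cols, h, w, c)
    # (rows, cols, h, w, c) -> (rows, h, cols, w, c) -> (rows * h, cols * w, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(rows * h, cols * w, c)
```

With 4 envs this gives a 2x2 grid. If each env's render call also opens its own window as a side effect, you end up with the 4 individual windows plus the tiled one.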

If you want to avoid this display issue, but keep the SubprocVecEnv, recreate the vectorized environment for the rendering code, but with only one environment:

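from stable_baselines.common.vec_env import DummyVecEnv

...
model.learn(total_timesteps=25000)

# Render with a single, non-subprocess environment
env = DummyVecEnv([make_env(env_id, 0)])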
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

It's a stopgap fix, but it is better than 5 windows.

That works but does it mean that after I've trained a model I have to load all envs into memory just to use one of them for testing?

Except for LSTM policies, the predict() method only needs an observation or a batch of observations (cf. documentation), so you can use as many envs as you want (e.g. only one) for testing.

For LSTM policies, you need to feed the predict method a batch of observations with the same shape as during training, which depends on the number of envs (to test with only one env, a trick is to complete the batch of observations with zeros).

Once I get the model to converge, I'll probably need to pick your brain some more about the all zeros trick

To make it clearer, for LSTM policies, the predict method expects a shape of (n_envs,) + obs_space.shape, so if you want to test with only one env, construct an ndarray of shape (1,) + obs_space.shape and then concatenate it with zeros to create the final ndarray.

Note: the shape may change (not sure if it is n_envs or minibatch_size) but at least you got the idea.
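
The zero-completion trick can be sketched with plain NumPy. `zero_pad_observations` is a hypothetical helper name, and the sketch assumes the batch dimension the model expects is the number of training envs (see the note above on that uncertainty).

```python
import numpy as np


def zero_pad_observations(obs, n_training_envs):
    """Complete a small batch of observations with zeros.

    Hypothetical helper for illustration only.
    :param obs: array of shape (n_test_envs,) + obs_shape
    :param n_training_envs: (int) batch size the recurrent policy was built with
    :return: array of shape (n_training_envs,) + obs_shape
    """
    obs = np.asarray(obs)
    if obs.shape[0] > n_training_envs:
        raise ValueError("More test observations than training envs")
    padded = np.zeros((n_training_envs,) + obs.shape[1:], dtype=obs.dtype)
    # The real observations go first, the remaining rows stay at zero
    padded[: obs.shape[0]] = obs
    return padded
```

Only the first rows of the actions returned by predict are meaningful; the rows that correspond to the zero padding should be discarded.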

Hi @araffin, I followed your comments above but am really struggling to get it to work. I am using an LSTM policy with SubprocVecEnv. My code is below:

env = DummyVecEnv([self.make_env(test_gym, 0)])

# for LSTMPolicies, the predict method expect a shape of (n_envs, obs_space.shape),
# so if you want to test with only one env,
# construct an ndarray of shape (1, obs_space.shape) and then
# concatenate it with zeros to create the final ndarray.
obs = env.reset()

zeroes = np.zeros(shape=(n_envs - 1, env.observation_space.shape[1]))
obs = np.concatenate((obs, zeroes), axis=0)
print(obs.shape)
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

With the above code, although print(obs.shape) gives me: (8, 1, 77), I get the following error when attempting to predict: ValueError: Cannot feed value of shape (1, 1, 77) for Tensor 'input/Ob:0', which has shape '(8, 1, 77)

Any ideas? Did I understand your comments correctly?

Hello,

You can find below a working example:

import gym
import numpy as np

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

def make_env():
   def maker():
       env = gym.make("CartPole-v1")
       return env
   return maker

# Train with 2 envs
n_training_envs = 2
envs = DummyVecEnv([make_env() for _ in range(n_training_envs)])
model = PPO2("MlpLstmPolicy", envs, nminibatches=2)

# Create one env for testing
test_env = DummyVecEnv([make_env() for _ in range(1)])
test_obs = test_env.reset()

# model.predict(test_obs) would throw an error
# because the number of test envs is different from the number of training envs,
# so we need to complete the observation with zeroes
zero_completed_obs = np.zeros((n_training_envs,) + envs.observation_space.shape)
zero_completed_obs[0, :] = test_obs

# IMPORTANT: with recurrent policies, don't forget the state
state = None
action, state = model.predict(zero_completed_obs, state=state)
# The test env is expecting only one action
new_obs, reward, done, info = test_env.step([action[0]])
# Update the obs
zero_completed_obs[0, :] = new_obs

Please look at the documentation on how to use recurrent policies during testing; here you were forgetting the state.
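
The example above performs a single step. A sketch of a full test episode could look like the following. `run_recurrent_episode` is a hypothetical helper, and it assumes `model.predict` accepts a `state` keyword and that `test_env` is a vectorized env wrapping exactly one environment, as in the example.

```python
import numpy as np


def run_recurrent_episode(model, test_env, n_training_envs, max_steps=1000):
    """Run one test episode with a recurrent policy trained on several envs.

    Hypothetical helper for illustration only.
    :return: (float) sum of the rewards collected by the single test env
    """
    obs = test_env.reset()  # shape: (1,) + obs_shape
    padded_obs = np.zeros((n_training_envs,) + obs.shape[1:], dtype=obs.dtype)
    padded_obs[0] = obs[0]
    state = None  # None on the first call, then the state returned by predict
    total_reward = 0.0
    for _ in range(max_steps):
        # Carry the recurrent state from one call to the next
        actions, state = model.predict(padded_obs, state=state)
        # The test env expects a single action: keep the first row only
        obs, rewards, dones, _info = test_env.step([actions[0]])
        total_reward += float(rewards[0])
        if dones[0]:
            break  # stop at the end of the first episode
        padded_obs[0] = obs[0]
    return total_reward
```

With the names from the example, this would be called as `run_recurrent_episode(model, test_env, n_training_envs)`.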

This is the code used for prediction:

n_cpu = 1
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'], include_ta=settings['include_ta'],allow_short=settings['allow_short'], reward=settings['reward'])
env = SubprocVecEnv([lambda: env for _ in range(n_cpu)])

mdl = 'futures_20100101_20180101_5000000_2000_3_return_False'
model = PPO2.load(mdl)

# intialized here
obs = env.reset()
zero_completed_obs = np.zeros((n_cpu,) + env.observation_space.shape)
zero_completed_obs[0, :] = obs

state = None
#   state = model.initial_state   #   get the initial state vector for the reccurent network
#   done = np.zeros(state.shape[0])   #   set all environment to not done

weights, state = model.predict(zero_completed_obs, state)

#   print(weights)  

return weights, settings

I get this error in model.predict:

<class 'ValueError'>
Traceback (most recent call last):
  File "C:\Users\hanna\Anaconda3\lib\site-packages\quantiacsToolbox\quantiacsToolbox.py", line 871, in runts
    position, settings = TSobject.myTradingSystem(*argList)
  File "ppo2_quantiacs_test.py", line 47, in myTradingSystem
    weights, state = model.predict(zero_completed_obs, state)
  File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\base_class.py", line 472, in predict
    actions, _, states, _ = self.step(observation, state, mask, deterministic=deterministic)
  File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\policies.py", line 508, in step
    {self.obs_ph: obs, self.states_ph: state, self.dones_ph: mask})
  File "C:\Users\hanna\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\Users\hanna\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1128, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 675) for Tensor 'input/Ob:0', which has shape '(12, 675)'

Please read my example carefully: you have to use n_training_envs, not n_cpu.

What's the difference between n_training_envs and n_cpu?
Just the name of the variable.

You trained your agent with 12 envs (according to the error) and want to test it with only one.
But here, n_cpu != n_training_envs, so you get an error.

I changed it according to your example:

n_env = 12
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'], include_ta=settings['include_ta'],allow_short=settings['allow_short'], reward=settings['reward'])
env = DummyVecEnv([lambda: env for _ in range(1)])

mdl = 'futures_20100101_20180101_5000000_2000_3_return_False'
#   mdl = 'futures_20100101_20180101_5000000_2000_3_return_False_c7616a5f58b141aa989379427458bbe8'
model = PPO2.load(mdl)

# intialized here
obs = env.reset()
zero_completed_obs = np.zeros((n_env,) + env.observation_space.shape)
zero_completed_obs[0, :] = obs

state = None
#   state = model.initial_state   #   get the initial state vector for the reccurent network
#   done = np.zeros(state.shape[0])   #   set all environment to not done

pos, state = model.predict(zero_completed_obs, state)

Still get:

ValueError: could not broadcast input array from shape (12,45) into shape (45)

I guess that I have to take the first row of the pos matrix? pos[0] ?

OK, you did not show all the code. Sure, your test env is expecting only one action. Please try by yourself before asking a question at each step.

EDIT: I updated the example accordingly

Still have a problem in

pos, state = model.predict(zero_completed_obs, state, done)

ValueError: Error: Unexpected observation shape (12, 5) for Box environment, please use (10,) or (n_env, 10) for the observation shape.

Model was trained with n_env = 12

Where does this 10 come from?

A few things:

  • This is not related to the issue.
  • You did not give the required code to replicate the issue.
  • You did not give the full stack trace (which could be used to help find the origin of the issue).
  • Please use the Markdown highlighting code format (https://help.github.com/en/articles/creating-and-highlighting-code-blocks).

Your issue will not be addressed if you do not follow the format described in the issue template (https://github.com/hill-a/stable-baselines/blob/master/.github/ISSUE_TEMPLATE/issue-template.md)

n_env = 12
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'], include_ta=settings['include_ta'],allow_short=settings['allow_short'], reward=settings['reward'], debug=settings['debug'])
env = SubprocVecEnv([lambda: env for _ in range(1)])

mdl = 'ES_19900102_20180101_5000000_7000_1_return_False_7a686c53e4a34338942a8b4bbe65fa47'
model = PPO2.load(mdl)

# intialized here
obs = env.reset()
zero_completed_obs = np.zeros((n_env,) + env.observation_space.shape)
zero_completed_obs[0, :] = obs

state = None
state = model.initial_state   
done = np.zeros(state.shape[0])   

pos, state = model.predict(zero_completed_obs, state, done)

Traceback (most recent call last):
  File "C:\Users\hanna\Anaconda3\lib\site-packages\quantiacsToolbox\quantiacsToolbox.py", line 871, in runts
    position, settings = TSobject.myTradingSystem(*argList)
  File "ppo2_quantiacs_test.py", line 68, in myTradingSystem
    pos, state = model.predict(zero_completed_obs, state, done)
  File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\base_class.py", line 469, in predict
    vectorized_env = self._is_vectorized_observation(observation, self.observation_space)
  File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\base_class.py", line 399, in _is_vectorized_observation
    .format(", ".join(map(str, observation_space.shape))))
ValueError: Error: Unexpected observation shape (12, 5) for Box environment, please use (10,) or (n_env, 10) for the observation shape.

Please use the Markdown highlighting code format (https://help.github.com/en/articles/creating-and-highlighting-code-blocks)

Reading code as plain text is not pleasant, and formatting it only takes you a few seconds.

Also, you are not using the latest version of stable-baselines. You must follow the format described in the issue template (https://github.com/hill-a/stable-baselines/blob/master/.github/ISSUE_TEMPLATE/issue-template.md); as you will see, it asks you to state which version of stable-baselines you have.

You are loading a model that expects an observation of shape (n_env, 10). The error message is explicit.
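
A small pre-flight check before calling predict can make this kind of mismatch obvious earlier. This is only a sketch: `check_observation_batch` is a hypothetical helper, and it assumes you know the observation space shape and the number of envs the model was trained with.

```python
import numpy as np


def check_observation_batch(obs, obs_space_shape, n_training_envs):
    """Raise a readable error if the batch does not match what the model expects.

    Hypothetical helper for illustration only.
    :param obs: the batch you intend to pass to predict
    :param obs_space_shape: (tuple) shape of the observation space used for training
    :param n_training_envs: (int) number of envs used for training
    :return: the batch as an ndarray, unchanged
    """
    obs = np.asarray(obs)
    expected = (n_training_envs,) + tuple(obs_space_shape)
    if obs.shape != expected:
        raise ValueError(
            "Observation batch has shape {}, but the model expects {}".format(
                obs.shape, expected))
    return obs
```

In the case above, a batch of shape (12, 5) checked against an observation space of shape (10,) and 12 training envs would fail, which points to the test env being configured differently from the training env.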

OK, it was my mistake; I realize it now.
I am now getting NaN in the predictions.
Anyway, I emailed you and Antonin privately.
Even if I do not get NaN, it is not working on new unseen data, and in fact it does not even work when testing on the same training data. I hope that you can help and finish this once and for all.

I am now getting NaN in the predictions.

You might want to have a look at this: https://stable-baselines.readthedocs.io/en/master/guide/checking_nan.html
It will help to find the NaNs in your code, specifically the VecCheckNan wrapper: https://stable-baselines.readthedocs.io/en/master/guide/checking_nan.html#vecchecknan-wrapper
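
If you want a quick manual check in your own test code in addition to the wrapper, something like the following sketch can help locate the first invalid value. `assert_all_finite` is a hypothetical helper, not part of stable-baselines and not a replacement for VecCheckNan.

```python
import numpy as np


def assert_all_finite(**arrays):
    """Raise a ValueError naming the first argument that contains NaN or inf.

    Hypothetical helper for illustration only.
    Usage: assert_all_finite(obs=obs, reward=reward, action=action)
    """
    for name, value in arrays.items():
        value = np.atleast_1d(np.asarray(value, dtype=np.float64))
        invalid = ~np.isfinite(value)
        if invalid.any():
            first_bad = np.argwhere(invalid)[0].tolist()
            raise ValueError(
                "{} contains NaN or inf at index {}".format(name, first_bad))
```

Calling it on the observations before predict and on the actions after predict tells you whether the invalid values come from the environment or from the model.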

Even if I do not get NaN, it is not working on new unseen data, and in fact it does not even work when testing on the same training data.

Reinforcement learning is not a magic bullet; it is in no way guaranteed to work all the time on every problem. For a mathematical reference, see the no free lunch theorem, which states:

Any two optimization algorithms are equivalent when their performance is 
averaged across all possible problems

including random optimization algorithms.

You might want to try some tricks like VecFrameStack, VecNormalize, or a hyperparameter search to help the algorithm optimize the way you would like.

I hope that you can help and finish this once and for all.

If you believe you have found a bug in the code of stable-baselines, and can show it reliably:
We will address it.

If you need tech support or consulting:
We will not help.

We do not have the time, nor the obligation for consulting on stable-baselines. The library is "as is", as described in the MIT licence: https://github.com/hill-a/stable-baselines/blob/master/LICENSE.

I understand that you do not have any obligation to counsel.
I am trying to implement this:
http://www-scf.usc.edu/~zhan527/post/cs599/
with stable-baselines.
In the original article it does work, even on unseen data.
He created his own DDPG agent, and I understand that PPO is supposed to be better.

In the original article it does work, even on unseen data.

Correction: on the given unseen data. It is possible to generate data that will not give a positive result for the algorithm. That is the whole point of adversarial learning.

He created his own DDPG agent, and I understand that PPO is supposed to be better.

How did you get that impression? Both have advantages and disadvantages.

EDIT:
If you are trying to replicate the results of the blog post, why don't you use their hyperparameters with DDPG?

If that fails, then try to find the underlying implementation differences between the blog post's DDPG and stable-baselines's DDPG.

In fact, why use stable-baselines at all? They have a GitHub repo of their solution: https://github.com/vermouth1992/drl-portfolio-management

I know that they have github repository with their code. There are other similar works on github, for example https://github.com/yuriak/RLQuant
or
https://github.com/liangzp/Reinforcement-learning-in-portfolio-management-
I was hoping that stable-baselines would let me test various agents and not be confined to DDPG only.
In addition, stable-baselines has TensorBoard integration.
In any case, at this point I still believe that the problem is with my code and not the agent or hyper parameters.
The original work is actually this:
https://arxiv.org/abs/1808.09940

Locking issue, diverging too much from the original message.
