Stable-baselines: [question] Using the monitor wrapper with already wrapped custom environment

Created on 17 Jul 2020 · 10 Comments · Source: hill-a/stable-baselines

I am trying to wrap my custom environment with the Monitor wrapper to get additional information about the episode rewards. However, since I then wrap the environment again with my own custom wrapper, the initial Monitor wrapping seems to have no effect. Is there a way to use the Monitor wrapper on custom environments? I have already seen issue #470, but the answers there did not help.

import os
import time

from gym import Wrapper, spaces
import numpy as np
from gym.envs.classic_control import PendulumEnv

from stable_baselines.common.env_checker import check_env
from stable_baselines.sac.policies import CnnPolicy
from stable_baselines import A2C
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.bench import Monitor

from skimage import data, color
from skimage.transform import rescale, resize, downscale_local_mean

import tensorflow as tf


class RGBArrayAsObservationWrapper(Wrapper):
    """
    Use env.render(rgb_array) as observation
    rather than the observation environment provides
    """

    def __init__(self, env):
        # TODO this might not work before environment has been reset
        super(RGBArrayAsObservationWrapper, self).__init__(env)
        self.reset()
        dummy_obs = env.render('rgb_array')
        dummy_obs_resized = resize(dummy_obs, (dummy_obs.shape[0] // 10, dummy_obs.shape[1] // 10),
                                   anti_aliasing=True)
        # Update observation space
        # TODO assign correct low and high
        self.observation_space = spaces.Box(low=0, high=255, shape=dummy_obs_resized.shape,
                                            dtype=dummy_obs_resized.dtype)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        obs = self.env.render("rgb_array")
        obs = resize(obs, (obs.shape[0] // 10, obs.shape[1] // 10),
                     anti_aliasing=True)
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        obs = self.env.render("rgb_array")
        obs = resize(obs, (obs.shape[0] // 10, obs.shape[1] // 10),
                     anti_aliasing=True)
        return obs, reward, done, info


# tensorboard --logdir=A2C_IMG_PENDULUM:C:\Users\meric\OneDrive\Masaüstü\TUM\Thesis\Pycharm\pioneer\a2c_pendulum_tensorboard --host localhost

log_dir = "/tmp/gym/{}".format(int(time.time()))
os.makedirs(log_dir, exist_ok=True)

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

TEST_COUNT = 100

pendulum_env = PendulumEnv()
pendulum_env = Monitor(pendulum_env, log_dir, allow_early_resets=True)
pendulum_env = RGBArrayAsObservationWrapper(pendulum_env)
check_env(pendulum_env, warn=True)

model = A2C("CnnPolicy", pendulum_env, verbose=1, tensorboard_log="./a2c_pendulum_tensorboard/")
model.learn(total_timesteps=100_000, log_interval=10)
model.save("a2c_pendulum")

sum_rewards = 0
done = False
obs = pendulum_env.reset()
for i in range(TEST_COUNT):
    while not done:
        action, _states = model.predict(obs)
        obs, rewards, done, info = pendulum_env.step(action)
        sum_rewards += rewards

    # Keep the fresh observation so the next episode does not start from a stale one
    obs = pendulum_env.reset()
    done = False

print(sum_rewards / TEST_COUNT)


C:\Users\meric\Anaconda3\envs\pioneer\python.exe C:/Users/meric/OneDrive/Masaüstü/TUM/Thesis/Pycharm/pioneer/pendulum_image_A2C.py
2020-07-17 19:13:53.638249: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From C:/Users/meric/OneDrive/Masaüstü/TUM/Thesis/Pycharm/pioneer/pendulum_image_A2C.py:58: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From C:/Users/meric/OneDrive/Masaüstü/TUM/Thesis/Pycharm/pioneer/pendulum_image_A2C.py:60: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-07-17 19:13:57.789006: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-17 19:13:57.793476: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-17 19:13:57.827360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
2020-07-17 19:13:57.827697: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-07-17 19:13:57.831805: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-07-17 19:13:57.835594: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-07-17 19:13:57.837423: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-07-17 19:13:57.842333: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-07-17 19:13:57.845671: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-07-17 19:13:57.854817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-17 19:13:57.855150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-17 19:13:58.672437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-17 19:13:58.672654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-07-17 19:13:58.672763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-07-17 19:13:58.673026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3001 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\common\env_checker.py:25: UserWarning: It seems that your observation is an image but the `dtype` of your observation_space is not `np.uint8`. If your observation is not an image, we recommend you to flatten the observation to have only a 1D vector
  warnings.warn("It seems that your observation is an image but the `dtype` "
C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\common\env_checker.py:210: UserWarning: We recommend you to use a symmetric and normalized Box action space (range=[-1, 1]) cf https://stable-baselines.readthedocs.io/en/master/guide/rl_tips.html
  warnings.warn("We recommend you to use a symmetric and normalized Box action space (range=[-1, 1]) "
Wrapping the env in a DummyVecEnv.
2020-07-17 19:14:00.046052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
2020-07-17 19:14:00.046321: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-07-17 19:14:00.046506: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-07-17 19:14:00.046684: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-07-17 19:14:00.046906: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-07-17 19:14:00.047086: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-07-17 19:14:00.047271: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-07-17 19:14:00.047449: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-17 19:14:00.047689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-17 19:14:00.047878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-17 19:14:00.048061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-07-17 19:14:00.048175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-07-17 19:14:00.048348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3001 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\common\policies.py:116: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\common\input.py:25: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\common\tf_layers.py:103: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\common\distributions.py:418: The name tf.random_normal is deprecated. Please use tf.random.normal instead.

WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\a2c\a2c.py:160: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\common\tf_util.py:449: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\common\tf_util.py:449: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\tensorflow_core\python\ops\clip_ops.py:301: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\a2c\a2c.py:184: The name tf.train.RMSPropOptimizer is deprecated. Please use tf.compat.v1.train.RMSPropOptimizer instead.

WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\tensorflow_core\python\training\rmsprop.py:119: calling Ones.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\a2c\a2c.py:194: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\a2c\a2c.py:196: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From C:\Users\meric\Anaconda3\envs\pioneer\lib\site-packages\stable_baselines\common\base_class.py:1169: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

2020-07-17 19:14:01.077903: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-07-17 19:14:01.342446: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-17 19:14:02.338288: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
---------------------------------
| explained_variance | 3.26e-05 |
| fps                | 2        |
| nupdates           | 1        |
| policy_entropy     | 1.42     |
| total_timesteps    | 5        |
| value_loss         | 439      |
---------------------------------

System Info

  • Stable Baselines installation method: source
  • GPU model and configuration: NVIDIA GTX 1050 with CUDA 10.0 and cuDNN 7.6.5
  • Python version: 3.7
  • TensorFlow version: 1.15

Labels: custom gym env, question

Most helpful comment

The Pendulum env has no time limit by default; you need to wrap it with a TimeLimit wrapper or use gym.make('Pendulum-v0') to get episodes.

All 10 comments

I do not see an issue here. Your code should work as expected, i.e. it should do what the custom wrapper does and also create the Monitor file. You can wrap the environment in as many wrappers as you like. Each wrapper simply treats the incoming env (wrapped or not) as an environment and does not care what the inner wrappers do.
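The point about stacking wrappers can be illustrated with a minimal, dependency-free sketch. The classes below (CountingEnv, Wrapper, EpisodeRecorder, ObservationAsString) are hypothetical stand-ins written for this illustration, not gym or stable-baselines classes; EpisodeRecorder only mimics the idea behind Monitor. The inner recorder still sees every reward and the done flag even though the outer wrapper replaces the observation.

```python
class CountingEnv:
    """Toy stand-in for a gym environment (hypothetical, not part of gym)."""

    def __init__(self, episode_length=3):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, 1.0, done, {}


class Wrapper:
    """Minimal stand-in for gym.Wrapper: forwards reset/step to the wrapped env."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)


class EpisodeRecorder(Wrapper):
    """Monitor-like wrapper: stores the return of every finished episode."""

    def __init__(self, env):
        super().__init__(env)
        self.episode_returns = []
        self._running_return = 0.0

    def reset(self):
        self._running_return = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._running_return += reward
        if done:
            # An episode is only recorded once the env reports done=True
            self.episode_returns.append(self._running_return)
        return obs, reward, done, info


class ObservationAsString(Wrapper):
    """Outer wrapper that replaces the observation, similar in spirit to the RGB wrapper."""

    def reset(self):
        return str(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return str(obs), reward, done, info


recorder = EpisodeRecorder(CountingEnv(episode_length=3))
env = ObservationAsString(recorder)

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(0)

print(obs, recorder.episode_returns)
```

Note that the recorder only appends an entry when done becomes True, so an environment that never terminates would leave episode_returns empty regardless of how the wrappers are ordered.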

Well, the log at the end does not contain any information about rewards. TensorBoard also does not show any graphs for mean rewards.

That could be because not a single episode has finished. You should see "average return" or similar in the printout box once at least one full episode has finished, after which the monitor file should contain one line per episode.

I just let the agent train for 100_000 timesteps without my own wrapper, and there was still no output regarding rewards.

With the Pendulum env, A2C and no extra wrappers, you should start seeing "ep_reward_mean" after the first few printouts. If nothing else works, try the example code from the docs: start from the simplest code and add parts to it step by step.

Even with the simplest code I could manage there is no output regarding episode rewards. Here is the code:

import os
import time

from gym import Wrapper, spaces
import numpy as np
from gym.envs.classic_control import PendulumEnv

from stable_baselines.common.env_checker import check_env
from stable_baselines.sac.policies import MlpPolicy
from stable_baselines import A2C
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.bench import Monitor

import tensorflow as tf

# tensorboard --logdir=A2C_DEFAULT_PENDULUM:C:\Users\meric\OneDrive\Masaüstü\TUM\Thesis\Pycharm\pioneer\a2c_pendulum_default_tensorboard --host localhost

log_dir = "/tmp/gym/{}".format(int(time.time()))
os.makedirs(log_dir, exist_ok=True)


TEST_COUNT = 100

pendulum_env = PendulumEnv()
pendulum_env = Monitor(pendulum_env, log_dir, allow_early_resets=True)
check_env(pendulum_env, warn=True)

model = A2C("MlpPolicy", pendulum_env, verbose=1, tensorboard_log="./a2c_pendulum_default_tensorboard/")
model.learn(total_timesteps=100_000)
model.save("a2c_pendulum_default")



The Pendulum env has no time limit by default; you need to wrap it with a TimeLimit wrapper or use gym.make('Pendulum-v0') to get episodes.
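To make the effect of the missing time limit concrete, here is a small dependency-free sketch. EndlessEnv and StepLimit are hypothetical stand-ins written for this illustration: EndlessEnv never returns done=True on its own (like a raw PendulumEnv), and StepLimit is a simplified version of the idea behind gym.wrappers.TimeLimit. The limit of 200 steps is an assumption chosen for the example.

```python
class EndlessEnv:
    """Toy env that never terminates by itself (hypothetical stand-in)."""

    def reset(self):
        return 0.0

    def step(self, action):
        # done is always False, so no episode ever finishes
        return 0.0, -1.0, False, {}


class StepLimit:
    """Simplified sketch of a time-limit wrapper: forces done after N steps."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self._elapsed_steps = 0

    def reset(self):
        self._elapsed_steps = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        if self._elapsed_steps >= self.max_episode_steps:
            done = True
        return obs, reward, done, info


def count_finished_episodes(env, total_steps):
    """Run a fixed number of steps and count how often done=True is reported."""
    finished = 0
    env.reset()
    for _ in range(total_steps):
        _, _, done, _ = env.step(0)
        if done:
            finished += 1
            env.reset()
    return finished


raw_episodes = count_finished_episodes(EndlessEnv(), total_steps=1000)
limited_episodes = count_finished_episodes(
    StepLimit(EndlessEnv(), max_episode_steps=200), total_steps=1000
)
print(raw_episodes, limited_episodes)
```

With real gym, the corresponding wrapper is gym.wrappers.TimeLimit(env, max_episode_steps=...). In a setup like the one in this thread, the time limit would presumably go inside the Monitor wrapper, so that Monitor sees the done=True signal and can write an episode line.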

Thanks @araffin, that solved the problem. Could you also answer two quick follow-up questions? First, I did not see very promising results with PPO2 and A2C on Pendulum-v0 with the default state representation; is that expected? Second, I am trying to use raw pixels as the state representation. I believe @Miffyli told me A2C and PPO2 are good candidates for that, but the agent does not reach good rewards. Is there a way to use more than one image as input, as in the Atari environments? Thanks a lot for the help; I will close the issue once I have answers to these two questions :)

I am also adding the TensorBoard results: https://ibb.co/8DbdZ9c

1) A2C and PPO probably require more samples to learn Pendulum. Check the rl-zoo hyperparameters, where the agent is trained for 2 million steps. Using raw pixels will more than likely take even longer to train.

2) You could use Atari's FrameStack wrapper to stack multiple frames.
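As a rough illustration of what frame stacking does, here is a dependency-free sketch. FrameStacker is a hypothetical helper written for this example, and the string "frames" stand in for image arrays; it is not the stable-baselines implementation. The idea is to keep the last N frames and hand all of them to the agent as one observation, which lets a feed-forward policy infer motion such as the pendulum's angular velocity.

```python
from collections import deque


class FrameStacker:
    """Keeps the most recent num_frames frames (hypothetical helper)."""

    def __init__(self, num_frames):
        # deque with maxlen drops the oldest frame automatically
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # At episode start, fill the whole stack with the first frame
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Push the newest frame; the oldest one falls out of the stack
        self.frames.append(new_frame)
        return list(self.frames)


stacker = FrameStacker(num_frames=4)
stacked_at_reset = stacker.reset("f0")
stacker.step("f1")
stacked_after_two_steps = stacker.step("f2")

print(stacked_at_reset)
print(stacked_after_two_steps)
```

In stable-baselines itself, stacking is usually done on the vectorized env, for example with VecFrameStack from stable_baselines.common.vec_env and its n_stack argument; whether that or the Atari FrameStack wrapper fits better depends on how the environment is built.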

If you want to solve Pendulum-v0, it is better to use an off-policy algorithm (SAC/TD3) with the hyperparameters from the zoo.

Closing this as the original question was answered.
