Stable-baselines: V3.0 implementation design

Created on 23 Nov 2019 · 44Comments · Source: hill-a/stable-baselines

A beta version is now online: https://github.com/DLR-RM/stable-baselines3

Hello,

Before starting the migration to tf2 for stable baselines v3, I would like to discuss some design point we should agree on.

Which tf paradigm should we use?

I would go for pytorch-like "eager mode", wrapping the method using a tf.function to improve the performance (as it is done here).
The define-by-run is usually easier to read and debug (and I can compare it to my internal pytorch version). Wrapping it up with a tf.function should preserve performances.

What is the roadmap?

My idea would be:

Refactor common folder (as done by @Miffyli in #540 )
Implement one on-policy algorithm and one off-policy:
I would go for PPO/TD3 and I can be in charge of that.
This would allow to discuss concrete implementation details.
Implement the rest, in order:
SAC
A2C
DQN
DDPG
HER
TRPO

Implement the recurrent versions?

I'm afraid that the remaining ones (ACKTR, GAIL and ACER) are not the easiest one to implement.
And for GAIL, we can refer to https://github.com/HumanCompatibleAI/imitation by @AdamGleave et al.

Is there other breaking changes we should do? Change in the interface?

Some answers to this questions are linked here: https://github.com/hill-a/stable-baselines/issues/366

There are different things that I would like to change/add.

First, it would be adding the evaluation in the training loop. That is to say, we allow use to pass an eval_env on which the agent will be evaluated every eval_freq for n_eval_episodes. This is a true measure of the agent performance compared to training reward.

I would like to manipulate only VecEnv in the algorithm (and wrap the gym.Env automatically if necessary) this simplify the thing (so we don't have to think about what is the type of the env). Currently, we are using an UnVecEnvWrapper which makes things complicated for DQN for instance.

Should we maintain MPI support? I would favor switching to VecEnv too, this remove a dependency and unify the rest. (and would maybe allow to have an easy way to multiprocess SAC/DDPG or TD3 (cf #324 )). This would mean that we will remove PPO1 too.

Next thing I would like to make default is the Monitor wrapper. This allow to retrieve statistics about the training and would remove the need of a buggy version of total_episode_reward_logger for computing reward (cf #143).

As discussed in an other issue, I would like to unify the learning rate schedule too (would not be too difficult).

I would like to unify also the parameters name (ex: ent_coef vs ent_coeff).

Anyway, I plan to do a PR and we can then discuss on that.

Regarding the transition

As we will be switching to keras interface (at least for most of the layers), this will break previously saved models. I propose to create scripts that allow to convert old models to new SB version rather than try to be backward-compatible.

Pinging @hill-a @erniejunior @AdamGleave @Miffyli

PS: I hope I did not forget any important point

EDIT: the draft repo is here: https://github.com/Stable-Baselines-Team/stable-baselines-tf2 (ppo and td3 included for now)

Source

araffin

👍1

Most helpful comment

Paradigm: I agree on using eager-mode. This should make things much easier. However I am uncertain about the tf.function. I do not have too much experience with TF2, but wouldn't this require structuring in certain way that we can use tf.functions easily (similar to code structure now)? I do not know how much performance boost we can expect from tf.function as main bottlenecks already are environments and storing/passing data around.

MPI: I favor dropping support for this. I do not see the benefit of it at this point, but it has been a source of headaches (e.g. Windows support, importing MPI-dependent algorithms).

Monitor: I do not know about "on by default", but I agree on having some unified structure for tracking episode stats which can then be read in callbacks (see e.g. #563). I would still keep the monitor wrapper which would just print these results to a .csv file like previously.

Roadmap: I would go with the simplest algorithms, e.g. PPO and A2C, and see how things go from there (or would TD3 be easy after PPO?). This should be by default, but very first thing to do would be to gather some benchmark results with current stable-baselines (already in rl-zoo), and then run experiments against these and call it a day once similar performance is reached.

One thing I would add is the support for Tuple/Dict observation/action spaces, as discussed in many issues (e.g. #502). Judging by all the questions this is probably one of the biggest limitations of using stable-baselines in new kind of tasks. This would include some non-backend related modifications as well (e.g. how observations/actions are handled, as they can not be stacked to numpy arrays).

I can work on the model saving/loading and conversion of older models, as well as finish the refactoring of common.

Miffyli on 23 Nov 2019

👍5

All 44 comments

MPI: I favor dropping support for this. I do not see the benefit of it at this point, but it has been a source of headaches (e.g. Windows support, importing MPI-dependent algorithms).

I can work on the model saving/loading and conversion of older models, as well as finish the refactoring of common.

Miffyli on 23 Nov 2019

👍5

Monitor

I would still keep the monitor wrapper which would just print these results to a .csv file like previously.

Yes that's the idea. In fact, if you specify filename=None you don't have to write on the disk.

Roadmap

I would go with the simplest algorithms, e.g. PPO and A2C, and see how things go from there (or would TD3 be easy after PPO?).

PPO and A2C have equivalent complexity and in fact, TD3 would be the easiest as it does not require a probability distribution. And yes that's the all point of having the zoo ;) (we have the hyperparameters and the expected performances)

Tuple/Dict observation/action spaces,

Is the support for Tuple/Dict observation/action spaces,

this is tricky but should be possible with the eager mode (at least for the observation), most of the work would be done by a preprocess_obs method (I think we may discuss that point in #133 as I have an idea on how to do that (for the observation)). I would do that for a V3.1 rather than V3 (first get things working and then add more complexity).
For the action, it is unclear to me as we would need to separate the losses when we compute the log likelihood of taking a particular action (this sounds quite complex).

I can work on the model saving/loading and conversion of older models, as well as finish the refactoring of common.

Perfect =) Note that I would put the migration scripts in a repo in Stable-Baselines Team rather than in the package.

araffin on 23 Nov 2019

👍1

Yes that's the idea. In fact, if you specify filename=None you don't have to write on the disk.

By brain did a derp and totally missed the point this would get rid of the wrapper and move the whole logging inside stable-baselines rather than env side. Sounds good, although I have found Monitor wrapper useful outside stable-baselines as well (not that it is a complicated piece of code)

Tuple/Dict observation/action spaces

I am ok with v3.1 support for this, but we could keep it in mind when doing v3.0 in case some design decisions come up that could influence this. Indeed on backend side this is not a big thing. For action spaces you could assume independence between actions in which case you can multiply probabilities together (not the ideal solution as things probably are not independent, but works for any case).

refactoring of common.

👍 . I just want to add that I won't have too much time to put on this until after Christmas holidays.

Miffyli on 23 Nov 2019

Thanks for taking the lead on this @araffin.

I agree with the plan overall.

MPI: In favour of dropping support, it has caused me so many headaches, and we can get similar performance with other techniques.

Tuple/dict observation/action spaces: This would be useful for me but I agree it should not be our immediate target.

GAIL: If SB moves to TF 2.0, we'll migrate https://github.com/humancompatibleai/imitation/ to TF 2.0 as well. Shouldn't be that hard as the only TensorFlow code we have is some implementations of discriminator and reward networks.

Base classes: While we're making breaking API changes, I'd like to rethink BaseRLModel and BasePolicy. Right now BaseRLModel is responsible for most things: loading, saving, training and even prediction. I think it'd be cleaner if BasePolicy took on responsibility for loading/saving the policy weights, and if BasePolicy could easily be used standalone for prediction.

In particular, I've found the current design really awkward when I want to wrap policies. This has been quite common in my use cases (e.g. adding noise to actions of a policy to test stability, or normalizing observations which I cannot do at the VecEnv level in a multi-agent context since different policies have different parameters). The options here are either to try and extract the BasePolicy from the BaseRLModel (requires some trickery as different attributes in different algorithms, see https://github.com/HumanCompatibleAI/imitation/blob/master/src/imitation/policies/serialize.py#L117), or to wrap the BaseRLModel and implement stubs for all the training methods (see https://github.com/HumanCompatibleAI/adversarial-policies/blob/811bb12efe735ba30e6d6d0343d497d581d165b3/src/aprl/policies/base.py#L12).

AdamGleave on 23 Nov 2019

👍2

Hi Guys, thank you for your contributions to this project. I have been working with its parts on and off throughout a couple of months, so I thought I will share a few thoughts with you.

On choosing the TF style:

wrapping the method using a tf.function to improve the performance

I believe that the portability of TF graphs is a powerful concept which in TF2.0 is enabled through tf.function (and would be compromised through bare eager execution), therefore I would hope to reinforce you in your suggestion for this additional reason. As a matter of fact, graph portability is how I got interested in SB project as I was executing graphs in C++ with this project being an example.

On MPI:

get similar performance with other techniques

I am not fully aware of the history of baselines or which points in PG methods are universally suitable for parallelization, but I would think that MPI is applicable when you cannot fit into one physical node e.g. you require 100 logical cores and above and can tolerate the cost of communication. I would suspect that most people don't do that? So yet again, I would think that indeed dropping an immediate hurdle for prospect gain is a good choice.

On the feasibility of TF2 algorithms implementation:

I actually was playing with porting SAC and DDPG (here), and managed to benchmark former against 2 very different environments successfully (didn't know zoo has hyperparameters available lol). SAC_TF2 seemed to behave just like your implementation. It's definitely not a library-quality, but perhaps still can be helpful as a first draft of an idea.

On generic parts of the algorithms:

That's a hard one when looking at details. Simple things like f.e. MLP creation inside of policies could be shared of course but writing generic stuff without obscuring ideas with many layers of indirection is problematic, to say the least. What I like most about this library is its relative readability which helped me a lot as a learner.

I have worked with just 3 of your implementations, which may not be enough to make a proper judgment but what caught my eye was the PPO's (2) Runner separation which felt to me quite applicable to the other 2 implementations I touched: SAC and DDPG where this wasn't used. I believe that one of the ideas for changes in Python TF frontend was to encourage splitting up things a bit more and Runner seems to fit nicely in that.

On naming:

I would like to unify also the parameters name (ex: ent_coef vs ent_coeff).

Great idea. There were examples that troubled me even bit more than this, where parameters are only seemingly different and I had to perform some mental translation to see they are not. This happens for instance in learning loops that present many flavors of similar things. E.g. I believe that train_frequency is generally the same as rollouts_number, but it took me a minute to realize when going through codebase especially when one is in a nested loop and other is used in one of 2 separate loops.

Hope sth makes sense out of those :)

Antymon on 24 Nov 2019

👍3

Really looking for the v3 version and more than willing to help.
Here are the two main suggestions:

Use Keras style callback class, for example:

class Callback(KerasCallback):
    def _set_env(self, env):
        self.env = env

    def on_episode_begin(self, episode, logs={}):
        """Called at beginning of each episode"""
        pass

    def on_episode_end(self, episode, logs={}):
        """Called at end of each episode"""
        pass

    def on_step_begin(self, step, logs={}):
        """Called at beginning of each step"""
        pass

    def on_step_end(self, step, logs={}):
        """Called at end of each step"""
        pass

    def on_action_begin(self, action, logs={}):
        """Called at beginning of each action"""
        pass

    def on_action_end(self, action, logs={}):
        """Called at end of each action"""
        pass

src: https://github.com/keras-rl/keras-rl/blob/master/rl/callbacks.py

Add a processor class for process action, observations, batch... Something like this:

class Processor(object):
    """Abstract base class for implementing processors.
    A processor acts as a coupling mechanism between an `Agent` and its `Env`. This can
    be necessary if your agent has different requirements with respect to the form of the
    observations, actions, and rewards of the environment. By implementing a custom processor,
    you can effectively translate between the two without having to change the underlaying
    implementation of the agent or environment.
    Do not use this abstract base class directly but instead use one of the concrete implementations
    or write your own.
    """

    def process_step(self, observation, reward, done, info):
        """Processes an entire step by applying the processor to the observation, reward, and info arguments.
        # Arguments
            observation (object): An observation as obtained by the environment.
            reward (float): A reward as obtained by the environment.
            done (boolean): `True` if the environment is in a terminal state, `False` otherwise.
            info (dict): The debug info dictionary as obtained by the environment.
        # Returns
            The tupel (observation, reward, done, reward) with with all elements after being processed.
        """
        observation = self.process_observation(observation)
        reward = self.process_reward(reward)
        info = self.process_info(info)
        return observation, reward, done, info

    def process_observation(self, observation):
        """Processes the observation as obtained from the environment for use in an agent and
        returns it.
        # Arguments
            observation (object): An observation as obtained by the environment
        # Returns
            Observation obtained by the environment processed
        """
        return observation

    def process_reward(self, reward):
        """Processes the reward as obtained from the environment for use in an agent and
        returns it.
        # Arguments
            reward (float): A reward as obtained by the environment
        # Returns
            Reward obtained by the environment processed
        """
        return reward

    def process_info(self, info):
        """Processes the info as obtained from the environment for use in an agent and
        returns it.
        # Arguments
            info (dict): An info as obtained by the environment
        # Returns
            Info obtained by the environment processed
        """
        return info

    def process_action(self, action):
        """Processes an action predicted by an agent but before execution in an environment.
        # Arguments
            action (int): Action given to the environment
        # Returns
            Processed action given to the environment
        """
        return action

    def process_state_batch(self, batch):
        """Processes an entire batch of states and returns it.
        # Arguments
            batch (list): List of states
        # Returns
            Processed list of states
        """
        return batch

    @property
    def metrics(self):
        """The metrics of the processor, which will be reported during training.
        # Returns
            List of `lambda y_true, y_pred: metric` functions.
        """
        return []

    @property
    def metrics_names(self):
        """The human-readable names of the agent's metrics. Must return as many names as there
        are metrics (see also `compile`).
        """
        return []

src: https://github.com/keras-rl/keras-rl/blob/216c3145f3dc4d17877be26ca2185ce7db462bad/rl/core.py#L515

BrikerMan on 30 Nov 2019

Use Keras style callback class, for example:

Callbacks are already implemented (and there will be a collection for v2.10 see #348 ) but as discussed in #571 this is not possible to have per-episode callback.

Add a processor class for process action

This already exists and is called a gym.Wrapper (and its more related to the environment than to the agent). We have the VecEnvWrapper equivalent for multiple env (cf doc)

For those two features, I recommend you to look at our recent tutorial (that covers both callbacks and wrappers): https://github.com/araffin/rl-tutorial-jnrr19

EDIT: for the callback, we may add additional events like "on_training_end" but I'm not sure if this is really needed in our case

araffin on 30 Nov 2019

I know I am a bit late to the party, but I guess that my PR would be totally broken in SB V3. Special with recurrent support in question marks.
I could help port/re-implement the current RNN Implementation, and possible extend on it. I wanted anyway to Implement multi speed RNN, similar to LINK. But RNN would be better suited for V3.1 and not for V3.
I may need a bit of help from you more experienced people for the multi speed RNN implementation, last time when I read the paper, I got a bit confused by Z at t parameter.

XMaster96 on 5 Dec 2019

After thinking more about the first steps (I created a PR for that #580 ), I think it would clearer to start from scratch in another repo (in the stable-baselines team) for designing the new interface (and creating the first algorithms).

Once we have a working version we can start merging the first repo into the original one.
If we don't do that, we will end with something like #580 with a lot of commented code.

araffin on 5 Jan 2020

@araffin

I tend to agree a through-out modification like this would probably be easier done by starting (mostly) from scratch, especially given all the oddball remnants of the original baselines repo which still hang around and cause confusion every now and then.

One nitpicky/tangent-comment though: If most of the codebase will be redone, can we still call this a "fork of OpenAI baselines", since most of the code does not really originate from there anymore? Just something that popped to my head.

Miffyli on 5 Jan 2020

starting (mostly) from scratch,

yes, the idea is to add things (mostly copying) when needed instead of deleting useless things.
We should be able normally to re-use quite a lot of code.

can we still call this a "fork of OpenAI baselines",

We are already more "inspired by OpenAI Baselines" rather than a "fork" of it (we are still using the same structure, tricks and it gives credibility to stable baselines). It does not really matter for me (more a question of branding), but we can change that later.

araffin on 5 Jan 2020

I've started a first draft for td3 here: https://github.com/Stable-Baselines-Team/stable-baselines-tf2
(I've been using both my internal pytorch version and @Antymon code to develop it fast)
It's running and working ~but I've got a big problem for now: it is super slow (less than 10FPS on cpu with a small network and all my 8 cpus are at 100%)~

@Antymon @AdamGleave ~do you have any advice on how to optimize the code? (I've started decorating with @tf.function here and there but it improves only slightly)~
EDIT: I updated to tensorflow 2.1.0, and now it is much faster...

~EDIT: I suspect the fact that I'm moving the array from numpy to tensorflow all the time to be responsible for that...~

araffin on 11 Jan 2020

It is puzzling though. I didn't experience difficulties in that area despite using tf version 2.0, especially not in the eager mode (it is easier to mess up with tf.function). Well, glad you found something that works anyway (and hope that didn't highlight a valid problem). Which gym environment you used for benchmarking?

Antymon on 12 Jan 2020

Which gym environment you used for benchmarking?

I'm using Pendulum-v0, with the right hyperparameters, you can solve it in less than 10k timesteps (still hard to solve for ppo though). But compared to my pytorch (unoptimized) version, it still runs quite slow (only 130FPS vs 200 for pytorch) but I did not run extensive tests, so it may be wrong.

araffin on 12 Jan 2020

I've added PPO (missing some tricks like grad norm clipping and orthogonal init) both for Discrete and Continuous actions.
Some cleaning is missing (also save/load) but the core is here so you can already take a look ;)

I plan also to type everything but I will focus on the callback collection for now.
Feedbacks about the draft tf2 repo are welcomed ;)

araffin on 12 Jan 2020

Some cleaning is missing (also save/load) but the core is here so you can already take a look ;)

Only skimmed it very quickly but looks good, seems clearer than the TF1 code.

AdamGleave on 14 Jan 2020

👍1

I concur with Adam this looks much cleaner than with TF1. I also like the use of buffers.py for doing the collection etc. How is the speed side? Does PPO run slower with this implementation?

Miffyli on 14 Jan 2020

👍1

How is the speed side? Does PPO run slower with this implementation?

It is... It runs at 1100FPS with pytorch and 210FPS with tensorflow 2... I used the exact same code for testing the two (just replace the import) (and the tf2 code is very similar to the pytorch one anyway).
The tensorflow 1 version is also faster than the pytorch version (which is not optimized).

So yes, I would appreciate help to optimize the new version ;).

EDIT: I created a colab notebook so you can also try it yourself

araffin on 18 Jan 2020

I have got good news!
I profiled the scripts with line profiler and added tf.function where needed. Now, I have a 3x speed up for PPO and a 4x speed up for TD3, they both match the PyTorch (unoptimized) version and almost stable-baselines tf1 performance =)

The commits: https://github.com/Stable-Baselines-Team/stable-baselines-tf2/commit/769b03e091067108a01a6778173b6cf90ec375ce and https://github.com/Stable-Baselines-Team/stable-baselines-tf2/commit/2ca1d0700914cee76ae5ebbea25bd6fb15ab2ce8

araffin on 25 Jan 2020

👍2

I have got good news!
I profiled the scripts with line profiler and added tf.function where needed and now I have a 3x speed up for PPO and a 4x speed up for TD3, they both match the PyTorch (unoptimized) version and almost stable-baselines tf1 performance =)

Great news, well done!

AdamGleave on 25 Jan 2020

I have got good news!
I profiled the scripts with line profiler and added tf.function where needed. Now, I have a 3x speed up for PPO and a 4x speed up for TD3, they both match the PyTorch (unoptimized) version and almost stable-baselines tf1 performance =)

The commits: Stable-Baselines-Team/stable-baselines-tf2@769b03e and Stable-Baselines-Team/stable-baselines-tf2@2ca1d07

Well done @araffin ! Can you advise us what made the biggest difference exactly? Would my implementations also benefit from your changes?

Antymon on 26 Jan 2020

Well done @araffin ! Can you advise us what made the biggest difference exactly? Would my implementations also benefit from your changes?

@Antymon for PPO, it was the forward pass and the evaluate_action (that returns the action, the value and the log probability)
for TD3, it was the gradient tape blocks (especially apply_gradient) but looking at your implementations, you have already done the right thing ;)

araffin on 26 Jan 2020

👍1

Would your investigation imply that eager mode TF is considerably slower than Pytorch?

Antymon on 26 Jan 2020

The naive version yes apparently, but i would say it is expected as pytorch was built for that mode first.

araffin on 26 Jan 2020

fyi:
we are having a discussion with the Stable Baselines maintainers about its future, and we need your opinion.

As a big change is required anyway, what framework would you prefer to be the base for V3.0 ?

we have poll on Twitter, ending in one week :
https://mobile.twitter.com/araffin2/status/1223310856471138306

araffin on 1 Feb 2020

Regarding the structure of the project, it would be beneficial to create some form of separation of concerns and responsibilities. At the moment, the algorithms employ every trick in the book, which, while great it comes with the burden of code duplication, rigidness and complexity that turns algorithms into god classes with over 200 loc functions. In reality, most algorithms aren't a lot more than their objective function and the extra tricks can be added through mixins.

PartiallyTyped on 6 Feb 2020

Did you take a look at the tf2 draft?
what you are suggesting is more modularity?

araffin on 6 Feb 2020

How much modularity is a tricky balancing act for RL algorithms. A lot of modularity can make an algorithm very hard to understand unless you already have a good grasp of the framework.

Spinning Up intentionally moved in the opposite direction: everything you need to understand an algorithm being in one file. Although this was a decision based on pedagogy not long-term SWE, I think it does illustrate a tension here.

AdamGleave on 6 Feb 2020

👍1

How much modularity is a tricky balancing act for RL algorithms. A lot of modularity can make an algorithm very hard to understand unless you already have a good grasp of the framework.

I completely agree with @AdamGleave on that point.

Spinning Up

I wanted to take that example to ^^
so yes, for short, Spinning Up is a lot of duplicated code, but you don't have to search through the codebase and it's very good for educational purposes. On the other hand, you can take pyrobolearn where you can define algorithms like DQN in 40 lines of code, but it's a maze to understand what is actually happening.

I see Stable Baselines to be in the middle, we don't aim at full modularity (and want self contained implementations) but we try to avoid code duplication too (for instance in the tf2 draft, the collect_rollout removes a lot of duplication).

araffin on 6 Feb 2020

Did you take a look at the tf2 draft?
what you are suggesting is more modularity?

I find the draft to be good. I am suggesting slightly more modularity over the current implementation and spinning up, but perhaps not as much as ray[rllib].

@AdamGleave is right modularity in RL is indeed a double edged sword.

I think the modularity can be postponed until the algorithms are implemented with eager mode as that will reduce a lot of clutter.

PartiallyTyped on 6 Feb 2020

In the current version of SB, the user specifies the number of environment steps. It would be great if we could also specify the number of episodes.

PartiallyTyped on 25 Feb 2020

In the current version of SB, the user specifies the number of environment steps. It would be great if we could also specify the number of episodes.

This will be the case for SAC and TD3 (only valid when using one environment), it is not possible for ppo because of how the algorithm is defined.

EDIT: it will be only for gradient update

araffin on 25 Feb 2020

👍1

Here is a summary of what will the new version look like. The backend choice is summarized in issue #733 .

Whathever the backend, we plan to rewrite the lib more or less from scratch but keeping the api and most of the useful things that are currently in the common folder.
As overall architecture, we would keep the same architecture (so we don't aim at full modularity) but starting from scratch should allow to unify even more the algorithms and reduce code duplication. Before going any further, I would suggest that v2.10.0 would be the last one with new features (then we should do only bug fix), otherwise, it would take too much time to develop the new and the old version.

Dropped Features

some algorithms are not planned in the new version (for now), due to their complexity vs their use (which makes them hard to maintain): ACKTR and ACER won't be part of it
we will focus on model free RL only, leaving GAIL and other inverse rl/ imitation learning to Adam's repo (then maybe behavior cloning too), again this is to facilitate maintainability
MPI was mostly source of issues and sounds duplicate vs VecEnv, so we will drop its support too

New Features

policy should also implement save/load, so you don't have to re-create a fulll model to use them (can be used standalone) (requested by Adam)
proper evaluation (evaluating on a separate env, to favor best practices) should be included in each RL algorithm (requested by me, fairly easy to implement once the callback collection is merged), I may do a PR for V2.10 too
completely consistent api (no more ent_coef vs ent_coeff)

Delayed Features (so planned but not high-priority)

tuple/dict support for observation is one of the most requested feature (for the action space, I think this is a research topic, so I wouldn't plan for it) #207 #107 #133
RNN/LSTM support, even though it is used fairly often (even if the real benefit is still to be proven), recurrent policies will complexify the code quite a lot, so it is not planned for v3.0

Internal changes

I would like to change the default hyperparams of some algorithms (SAC/TD3) to match the ones from the papers
the problem is that people usually use the default, despite warnings to check the zoo... (I may do that change for v2.10.0)
Every environment will be a VecEnv, currently, we are using a UnVecWrapper which makes things a bit awkward
Whathever the backend, we will switch to a "eager mode" framework, wich would makes things clearer
we would unify rollout collection for both off-policy and on-policy (currently called Runner), this will allow to remove some duplicated code
everything should be typed (or at least, as much as possible)
env will be automatically wrapped into a Monitor wrapper, this would allow to collects stats more easily (the current episode reward logger is broken in lots of cases)

Most of the previous changes are already implemented in the tf2 draft: https://github.com/Stable-Baselines-Team/stable-baselines-tf2

araffin on 8 Mar 2020

👍2

@araffin I could contribute Energy Based Prioritisation (EBP) to HER algorithm.

Nice improvement overall without the increasing computational time. Not difficult to implement and the user could opt to use prioritisation: energy or prioritisation: none, for examlpe.

Let me know if this is something you would be interested in.

RyanRizzo96 on 8 Mar 2020

@RyanRizzo96

The initial version(s) will focus on porting what is currently in stable-baselines (minus the parts that will be dropped) into new backend, and making sure the performance does not change from current implementations. New features like that would be considered later.

Miffyli on 8 Mar 2020

👍2

It will be quite useful to implement __repr__ and __str__ functions and when the models subclass pytorch.nn.Module to override extra_repr and show additional information.

PartiallyTyped on 24 Mar 2020

It would be helpful to support viskit logging. Viskit is developed specifically for RL and plots RL experiments much better than tensorboard. https://github.com/rlworkgroup/viskit

buoyancy99 on 8 Apr 2020

You mean dowel ?
Viskit is for the visualization.

Actually, the logger from Stable-Baselines is quite close to the one from rllab, which is the previous version of garage, also developped by the rl-group.
It could be a feature for the logger, however, it seems that the project was not updated since september 2019 which is not a good sign.

araffin on 8 Apr 2020

@araffin are you referring to dowel or viskit?

viskit is definitely a little stale and I've been meaning to merge @vitchyr's improvements, but low-priority as it seem to see little usage compared to TensorBoard.

dowel is most-definitely actively-developed. One of the nice things about pulling single-purpose software into single-purpose packages is that we don't have to change it very much :).

If you are interested in more typing for RL, take a look at akro. It currently depends on gym.spaces, but we plan on removing that dep and making shapes part of the type signatures in the near future. It has some nice helpers for manipulating data from known spaces, e.g. concatenating a bunch of dict observations.

ryanjulian on 17 Apr 2020

@araffin are you referring to dowel or viskit?

For the logger, I'm referring to dowel. It would be nice to have but in my opinion it adds too many dependencies for now (e.g. tensorboardX should be optional).

If you are interested in more typing for RL, take a look at akro.

For now, gym.Space was enough and I would argue against adding additional dependencies.

araffin on 19 Apr 2020

@araffin I'd love to hear your feedback on dowel. Feel free to hop on over to the Issues page and leave some questions/concerns!

Is the tbX dependency the only thing keeping you from using dowel (note tbX doesn't rely on TF and only depends on numpy, protobuf, and six)? I'm happy to update it to make tbX an optional extra if that's what's blocking you.

ryanjulian on 20 Apr 2020

A beta version is now online: https://github.com/DLR-RM/stable-baselines3

araffin on 8 May 2020

😄1

Is the tbX dependency the only thing keeping you from using dowel (note tbX doesn't rely on TF and only depends on numpy, protobuf, and six)? I'm happy to update it to make tbX an optional extra if that's what's blocking you.

That's nice of you. I would say that for now we have a simple logger implementation that fits our needs, so there is no real need to change it. But maybe if we encounter some limitations and don't have the time to maintain it, we may switch to dowel ;)

araffin on 10 May 2020

closing this issue as any new features/changes should be discussed on the new repo.

araffin on 26 May 2020

Was this page helpful?

0 / 5 - 0 ratings