Stable-baselines: [feature request] Implement goal-parameterized algorithms (HER)

Created on 12 Feb 2019 · 22Comments · Source: hill-a/stable-baselines

I'd like to implement Hindsight Experience Replay (HER). This can be based on a whatever goal-parameterized RL off-policy algorithm.

Goal-parameterized architectures: it requires a variable for the current goal and one for the current outcome. By outcome, I mean anything that is requires to compute the current outcome in the process of targeting the goal, e.g. the RL task is to reach a 3D target (the goal) with a robotic hand. The position of the target is the goal, the position of the hand is the outcome. The reward is a function of the distance between the two. Goal and outcome are usually subparts of the state space.

How Gym handles this: In Gym, there is a class called GoalEnv to deal with such environments.

The variable observation_space is replaced by another class that contains the true observation space observation_space.spaces['observation'], goal space (observation_space.spaces['desired_goal']) and outcome space(observation_space.spaces['achieved_goal']).
The observation returned first by env.step is now a dictionnary obs['observation'], obs['desired_goal'], obs['achieved_goal'].
The environment defines a reward function (compute_reward), that takes as argument the goal and the outcome to return the reward
It also contains a sample_goal function that simply samples a goal uniformly from the goal space.

Stable-baselines does not consider this so far. The replay buffer, BasePolicy, BaseRLModels OffPolicyRLModels only consider observation, and are not made to include a notion of goal or outcome. Two solutions:

Adapt these default classes to allow the representation of goals and outcomes in option.
Concatenate the goal and outcome to the observation everywhere so that the previously mention classes don't see the difference. This requires to keep track of the indices of goals and outcomes in the full obs_goal_outcome vector. Ashley started to do this from what I understood. However, he did not take into account the GoalEnv class of Gym, which I think we should use as it's kind of neat, and it's used for the Robotic Fetch environment, which are kind of the only one generally used so far.

I think the second is more clear as it separates observation from goals and outcomes, but probably it would make the code less easy to follow, and would require more changes than the first option. So let's go for the first as Ashley started.

First thoughts on how it could be done.

we need (as Ashley started to do), a wrapper around the gym environment. GoalEnv are different from usual env because they return a dict in place of the former observation vector. This wrapper would unpack the observation in obs, goal, outcome from the GoalEnv.step. It would return a concatenation of all of those. Ashley considered that the goal was in the observation space, so that the concatenation was twice as long as the observation. This is generally not true. So we would need to keep as attribute the size of the goal and outcome spaces. It would keep the different spaces as attributes, keep the function to sample goals, and the reward function.

A multi-goal replay buffer to implement HER replay. It takes the observation from the buffer and redecompose it in obs, goal and outcome before performing replay.

I think it does not require too much work after what Ashley started to do. It would be a few modifications to integrate the GoalEnv of gym, as it is a standard way to use multi-goal environments. Then correct the assumption he made about the dimension of the goal.

If you're all ok, I will start in that direction and test them on the Fetch environments. In the baselines, their performance is achieved with 19 processes in parallel. They basically average the update of the 19 actors. I'll try first without parallelization.

enhancement

Source

ccolas

👍2

Most helpful comment

@ccolas @hill-a I'm taking over for this one.
I have read the paper again (and some implementations) and started to implement it (mostly from scratch).
There is one difference that was not mentioned that I also spotted: the original paper creates new transitions (by sampling goals) online after each rollout (not after each step) but this should be ok (in the sense we can overcome that limitation by using a custom replay buffer)

Here is my current plan (and current progress):

I created a FlippedBitEnv to iterate and debug quickly (both with continuous and discrete actions)
I plan to ensure that the environment inherits from GoalEnv
I started from scratch but in fact, now, my current plan really ressemble what @hill-a did ^^#
my current implementation requires the user to pass a sampling_strategy and a get_achieved_goal callable. The first argument refers to HER replay strategy, the second is a callable that convert an observation to a goal (so we can use env.compute_reward), I'm not sure if it is needed (still in early stage of development)

To overcome current limitation of stable-baselines (dict obs are not supported), I'll do something similar to @hill-a , using a wrapper over the environment.

Roadmap:

[x] implement dev env FlippedBitEnv
[x] have a first limited version working on FlippedBitEnv , i.e. DQN with FINAL strategy
[x] add support for DDPG and SAC and make it work for the FlippedBitEnv
[x] implement more replay strategies
[x] make them work on Highway env
[x] polish code
[x] documentation
[x] add support for more type of env obs spaces (Discrete, Binary, MultiDiscrete?)
[ ] add full support for all env obs spaces (e.g. where the observation and goal space are not of the same type)

Note: I consider the last point to be not a priority

araffin on 11 Apr 2019

👍2 🎉1

All 22 comments

Some questions to clarify my understanding:

goal and outcome are part of the observation? (Is that what you mean by "Goal and outcome are usually subparts of the state space." ?)
in a GoalEnv, does observation key contains both outcome and goal or do we need to concatenate them? (I think we need to concatenate them but I'm not sure)

So let's go for the first as Ashley started.

You meant second, no?

I think it does not require too much work after what Ashley started to do.

Good news then =)

, I will start in that direction and test them on the Fetch environments

For debugging and for your tests, Fetch env is ok. However, we will need either to adapt an existing env (e.g. Pendulum-v0) or create an artificial one (e.g. Identity env) in order to write unit tests for it (Mujoco requires a license and we don't have one for travis...).

For an open source Goal Env, you can also look into the parking env by @eleurent in https://github.com/eleurent/highway-env

araffin on 13 Feb 2019

I wanted to say: goal and outcome spaces are usually subspaces of the state space (but not always). Example: FetchReach (place gripper at 3D location). The observation is made of the position and velocities of all the joints, while the goal is only a 3D location, the outcome is the 3D position of the hand. In that case the outcome is a subpart of the state.
In GoalEnv, the observation return is a dictionary with three keys: observation, achieved_goal and desired_goal. What I'm doing now is wrapping the gym env to concatenate everything into one observation, and having some variable saying which indexes are what, so that HER can do its substitution. After that, I wrap in a DummyVecEnv.

(Yes I meant the second, as Ashley).

I guess we could modify easily Pendulum or MountainCarContinuous yes.

Last problem I encountered: DDPG considers the return to be the sum of all rewards. It's not absolutely general, sometimes it's only the last reward (in the Fetch env for instance). A solution can be to add a string parameter to the class return_func='last' or 'sum'.

ccolas on 13 Feb 2019

Last problem I encountered: DDPG considers the return to be the sum of all rewards. It's not absolutely general, sometimes it's only the last reward

I thought return meant sum of discounted reward for one episode... In the Fetch env, I think it is a sparse reward (0 almost everywhere except when reaching the goal), so last reward is the same of the sum no?
Does it affect the training of the algorithm (the difference in formulation) or is it only for logging?

araffin on 13 Feb 2019

In Fetch, always -1 and 0 when the goal is touched. It's only for logging, the algorithm uses the transition-based rewards. In that case, saying that the sum is -49 for 50 steps for instance, does not indicate whether the goal was reached in the middle of the episode (in that case the episode is not solved), or at the end (episode solved).

ccolas on 13 Feb 2019

I see... then for GoalEnv, it makes sense to have that feature (showing last return only).

araffin on 13 Feb 2019

The problem is that it's computed outside of the env, in the learn function of the algorithm. That would require to update all algorithm to allow that.

ccolas on 13 Feb 2019

That would require to update all algorithm to allow that.

I would update only algorithms that can be used with HER (and therefore GoalEnv), I'm not sure if it makes sense for other type of env.

araffin on 13 Feb 2019

Is this still being worked on? This feature is something I'm quite interested in.

keshaviyengar on 28 Feb 2019

Update:
I implemented it but it's still bugged somehow. It runs but it does not learn anything on FetchReach. It's on a fork from stable baselines. I can give you access if you want to check it out.

In the meantime I also tried implementing it using sac's spinning up. It learns perfectly on FetchReach but nothing on FetchPush. I'm quite puzzled because I already tried to reproduce results on FetchPush last year using a TD3 base and it also worked perfectly on FetchReach and not on FetchPush.

Either I do something wrong, or there is some special trick in the OpenAI Baselines that I didn't catch. Their version uses 19 worked in parallel, each doing 2 rollouts, computing an update using a batch of 256 and summing the 19 updates (yes they sum, they don't average). I would say it's roughly equivalent to do 38 collection rollout, then to use a 19 times bigger batch size and 19 times bigger learning rate (the sum of 19 updates). I tried this also but it got even worse.

I don't have much time these days so it's on pause right now.

ccolas on 28 Feb 2019

@ccolas don't hesitate to ping me if you need some help to double check some parts ;)

araffin on 28 Feb 2019

Here is my current plan (and current progress):

I created a FlippedBitEnv to iterate and debug quickly (both with continuous and discrete actions)
I plan to ensure that the environment inherits from GoalEnv
I started from scratch but in fact, now, my current plan really ressemble what @hill-a did ^^#
my current implementation requires the user to pass a sampling_strategy and a get_achieved_goal callable. The first argument refers to HER replay strategy, the second is a callable that convert an observation to a goal (so we can use env.compute_reward), I'm not sure if it is needed (still in early stage of development)

To overcome current limitation of stable-baselines (dict obs are not supported), I'll do something similar to @hill-a , using a wrapper over the environment.

Roadmap:

[x] implement dev env FlippedBitEnv
[x] have a first limited version working on FlippedBitEnv , i.e. DQN with FINAL strategy
[x] add support for DDPG and SAC and make it work for the FlippedBitEnv
[x] implement more replay strategies
[x] make them work on Highway env
[x] polish code
[x] documentation
[x] add support for more type of env obs spaces (Discrete, Binary, MultiDiscrete?)
[ ] add full support for all env obs spaces (e.g. where the observation and goal space are not of the same type)

Note: I consider the last point to be not a priority

araffin on 11 Apr 2019

👍2 🎉1

@ccolas
I have a first working draft with DQN (tried with a flipped bit env with n_bits=30 and it worked =) ),
you can have a look at it even if it is not fully polished yet.

In short, I made those choices, which is mostly wrappers:

HER will be just a wrapper around a RL model (which makes sense)
instead of creating a new replay buffer class, I created a wrapper for HER replay buffer, that is were all the magic happens
i have a env wrapper that concatenate dict observation and can do the inverse operation

to sum it up, the implementation now looks pretty simple, it is just a wrapper around a model and an env (I still don't understand why the original baselines were so complicated).
My next step is to make it work with SAC ;)

araffin on 15 Apr 2019

Super cool, thanks !
Keep me updated about SAC. I found it was easy to make it work on FetchReach but impossible to make it work on FetchPush. If you don't have a mujoco license, I can run tests on my side when you're done.

ccolas on 15 Apr 2019

@ccolas In my experiments, I found that SAC is quite hard to tune for problem with deceptive reward (I did not find good hyperparameters yet for MountainCar-v0 for instance), so this can be an issue when working with problem like FetchPush.
However, in my current experiments with HER + DDPG, I managed to get it work on harder problem where HER + SAC is failing (ex: continuous bit flipping env work for HER + DDPG when N_BITS=12).

I think I will open an issue on the original repo.

Also, the original baselines have several tricks (and I don't which one is useful or not) compared to the original HER paper:

they use 19 workers and average the parameters
they normalize the input, goal and obs (this can be done using VecNormalize(norm_reward=False))
they formulate sampling additional goal in a different manner: instead of adding artificial transition, they do it when sampling from the replay buffer

PS: SAC and DDPG are now supported on my dev branch ;) (just missing saving/loading for now)

araffin on 20 Apr 2019

@ccolas interesting discussion on SAC with sparse rewards is happening there ;):
https://github.com/rail-berkeley/softlearning/issues/76

araffin on 21 Apr 2019

Very good work, thanks !
About the additional tricks used in the baseline, I can comment them based on my experience:

19 workers: they do not average the updates, but sum them. It kind of relates to the A2C way of doing things, having workers in parallel, each with their replay buffer, each doing an update. Changing this parameter has a great influence on performance (it's not equivalent to running with 1 worker for a longer time). I suspect it's part of the reason why it works better than other implementations on FetchPush.
I remove don't normalize obs and goals in my experiments and it works just as well. I guess it depends on the values of the obs and goals, in the Fetch environments they are small and centered so it doesn't impact much.
I think it's better to use counterfactual learning at sampling instead of at storing time. This way the size of the replay buffer does not depend on the amount of replay. You also might want to tune replay in particular ways. For example I replay more the tasks that have higher learning progress, and I need to have the control on replay whenever I want to do an update, thus at sampling time.

I'll try to run it on the mujoco envs to see what it gives !
Thanks for the thread on SAC and CMC, I used that environment in a previous project, it'll be interesting !

ccolas on 22 Apr 2019

Changing this parameter has a great influence on performance

Yes, from my experience, it allows a better exploration and changes a lot.

This way the size of the replay buffer does not depend on the amount of replay.

That's true, but I also feel it will be less clear in the implementation, no?

among the tricks I forgot, there is also a l2 loss on the actions. (and I have to check the DDPG implementation, to see what is different from the one in the baselines)

Linking related issue of highway envs: https://github.com/eleurent/highway-env/issues/15

Edit: another additionnal trick: random_eps, they perform pure random exploration a fraction of the time

araffin on 22 Apr 2019

@ccolas The current hyperparameters I'm using for the highway-env (for SAC and DDPG) and that works better than the default ones (close to the default found in openai implementation):

SAC:

n_sampled_goal = 4

model = HER('MlpPolicy', env, SAC, n_sampled_goal=n_sampled_goal, goal_selection_strategy='future',
            verbose=1, buffer_size=int(1e6),
            learning_rate=1e-3,
            gamma=0.95, batch_size=256, policy_kwargs=dict(layers=[256, 256, 256]))

DDPG:

n_actions = env.action_space.shape[0]
noise_std = 0.2
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions))

n_sampled_goal = 4

model = HER('MlpPolicy', env, DDPG, n_sampled_goal=n_sampled_goal, goal_selection_strategy='future',
            verbose=1, buffer_size=int(1e6),
            actor_lr=1e-3, critic_lr=1e-3, action_noise=action_noise,
            gamma=0.95, batch_size=256, policy_kwargs=dict(layers=[256, 256, 256]))

Note: ~I did not change the policy architecture yet ([64, 64] for now instead of [256, 256, 256])~
I also added the success rate to the logs ;)

EDIT: the network architecture seems to have a great impact here... (I updated SAC hyperparams)

araffin on 23 Apr 2019

Also related for the tricks: https://github.com/vitchyr/rlkit/pull/35

araffin on 6 May 2019

Update: working version with HER + DDPG on FetchPush. Still a bug with VecEnvs but should be ready to merge soon (see PR)

araffin on 25 May 2019

I saw that the algorithms can perform well in DDPG+HER and SCE+HER for Fetch Push Environment. How about Pick and Place? I saw the issue mentioned by fisherxue is still in the status of Open.