I'd like to implement Hindsight Experience Replay (HER). It can be built on top of any goal-parameterized off-policy RL algorithm.
Goal-parameterized architectures: they require a variable for the current goal and one for the current outcome. By outcome, I mean anything that is required to compute the current outcome in the process of targeting the goal. For example, the RL task is to reach a 3D target (the goal) with a robotic hand. The position of the target is the goal, the position of the hand is the outcome. The reward is a function of the distance between the two. Goal and outcome are usually subparts of the state space.
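To make the goal/outcome/reward relation concrete, here is a minimal sketch of a sparse distance-based reward. The function name and the 0.05 threshold are assumptions for illustration only; the -1/0 convention is the one used by the Fetch environments as discussed later in this thread.

```python
import math

def sparse_goal_reward(achieved_goal, desired_goal, threshold=0.05):
    """Return 0.0 when the outcome is within `threshold` of the goal, else -1.0.

    `threshold` is a hypothetical tolerance, not a value taken from any env.
    """
    distance = math.dist(achieved_goal, desired_goal)  # Euclidean distance
    return 0.0 if distance <= threshold else -1.0

hand_position = (0.10, 0.20, 0.30)    # the outcome
target_position = (0.10, 0.20, 0.33)  # the goal
print(sparse_goal_reward(hand_position, target_position))
```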
How Gym handles this: In Gym, there is a class called GoalEnv to deal with such environments.
Stable-baselines does not consider this so far. The replay buffer, BasePolicy, BaseRLModel and OffPolicyRLModel only consider observations, and are not made to include a notion of goal or outcome. Two solutions:
I think the second is clearer as it separates observations from goals and outcomes, but it would probably make the code harder to follow, and would require more changes than the first option. So let's go for the first as Ashley started.
First thoughts on how it could be done.
I think it does not require too much work after what Ashley started to do. It would be a few modifications to integrate the GoalEnv of gym, as it is a standard way to use multi-goal environments. Then correct the assumption he made about the dimension of the goal.
If you're all ok, I will start in that direction and test them on the Fetch environments. In the baselines, their performance is achieved with 19 processes in parallel. They basically average the update of the 19 actors. I'll try first without parallelization.
Some questions to clarify my understanding:
So let's go for the first as Ashley started.
You meant second, no?
I think it does not require too much work after what Ashley started to do.
Good news then =)
, I will start in that direction and test them on the Fetch environments
For debugging and for your tests, Fetch env is ok. However, we will need either to adapt an existing env (e.g. Pendulum-v0) or create an artificial one (e.g. Identity env) in order to write unit tests for it (Mujoco requires a license and we don't have one for travis...).
For an open source Goal Env, you can also look into the parking env by @eleurent in https://github.com/eleurent/highway-env
I wanted to say: goal and outcome spaces are usually subspaces of the state space (but not always). Example: FetchReach (place gripper at 3D location). The observation is made of the position and velocities of all the joints, while the goal is only a 3D location, the outcome is the 3D position of the hand. In that case the outcome is a subpart of the state.
In GoalEnv, the returned observation is a dictionary with three keys: observation, achieved_goal and desired_goal. What I'm doing now is wrapping the gym env to concatenate everything into one observation, and keeping a variable that records which indices correspond to which part, so that HER can do its substitution. After that, I wrap it in a DummyVecEnv.
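A rough sketch of that concatenation idea, written without any gym dependency so it stays self-contained. The helper names `flatten_goal_obs` and `substitute_goal` are hypothetical and are not part of gym or stable-baselines; only the three dictionary keys come from GoalEnv.

```python
def flatten_goal_obs(obs_dict):
    """Concatenate a GoalEnv-style dict observation into one flat list.

    Returns the flat observation and a dict mapping each key to the slice
    it occupies, so the goal part can be overwritten later.
    """
    flat, slices, start = [], {}, 0
    for key in ("observation", "achieved_goal", "desired_goal"):
        values = list(obs_dict[key])
        slices[key] = slice(start, start + len(values))
        flat.extend(values)
        start += len(values)
    return flat, slices

def substitute_goal(flat_obs, slices, new_goal):
    """Return a copy of the flat observation with the desired goal replaced."""
    relabeled = list(flat_obs)
    relabeled[slices["desired_goal"]] = list(new_goal)
    return relabeled

obs = {"observation": [1, 2, 3, 4], "achieved_goal": [5, 6], "desired_goal": [7, 8]}
flat, slices = flatten_goal_obs(obs)
# Hindsight substitution: pretend the achieved outcome was the goal.
print(substitute_goal(flat, slices, flat[slices["achieved_goal"]]))
```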
(Yes I meant the second, as Ashley).
I guess we could easily modify Pendulum or MountainCarContinuous, yes.
Last problem I encountered: DDPG considers the return to be the sum of all rewards. That's not fully general; sometimes it's only the last reward.
I thought return meant the sum of discounted rewards for one episode... In the Fetch env, I think it is a sparse reward (0 almost everywhere except when reaching the goal), so the last reward is the same as the sum, no?
Does it affect the training of the algorithm (the difference in formulation) or is it only for logging?
In Fetch, the reward is always -1, and 0 when the goal is touched. It's only for logging; the algorithm uses the transition-based rewards. In that case, saying that the sum is -49 for 50 steps, for instance, does not indicate whether the goal was reached in the middle of the episode (in which case the episode is not solved) or at the end (episode solved).
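A small illustration of why the summed return is ambiguous here, assuming the -1/0 reward convention described above. The `episode_summary` helper is hypothetical and only meant for logging purposes.

```python
def episode_summary(rewards):
    """Report both the summed return and a success flag based on the last reward."""
    return {"return": sum(rewards), "success": rewards[-1] == 0.0}

# Two 50-step episodes with the same summed return of -49.
touched_midway = [-1.0] * 24 + [0.0] + [-1.0] * 25  # goal touched at step 25, then lost
reached_at_end = [-1.0] * 49 + [0.0]                # goal reached on the final step

print(episode_summary(touched_midway))
print(episode_summary(reached_at_end))
```

Both episodes log the same return, but only the second one counts as solved.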
I see... then for GoalEnv, it makes sense to have that feature (showing last return only).
The problem is that it's computed outside of the env, in the learn function of the algorithm. That would require to update all algorithm to allow that.
That would require to update all algorithm to allow that.
I would update only the algorithms that can be used with HER (and therefore GoalEnv); I'm not sure it makes sense for other types of env.
Is this still being worked on? This feature is something I'm quite interested in.
Update:
I implemented it but it's still bugged somehow. It runs but it does not learn anything on FetchReach. It's on a fork from stable baselines. I can give you access if you want to check it out.
In the meantime I also tried implementing it using the SAC implementation from Spinning Up. It learns perfectly on FetchReach but nothing on FetchPush. I'm quite puzzled because I already tried to reproduce results on FetchPush last year using a TD3 base and it also worked perfectly on FetchReach and not on FetchPush.
Either I'm doing something wrong, or there is some special trick in the OpenAI Baselines that I didn't catch. Their version uses 19 workers in parallel, each doing 2 rollouts, computing an update using a batch of 256 and summing the 19 updates (yes they sum, they don't average). I would say it's roughly equivalent to collecting 38 rollouts, then using a 19 times bigger batch size and a 19 times bigger learning rate (the sum of 19 updates). I tried this also but it got even worse.
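A quick numerical check of that equivalence on a toy scalar problem, assuming plain gradient descent (with an adaptive optimizer such as Adam the two schemes are not exactly equivalent). The loss, the data and the learning rate are made up for the sketch; only the 19 workers and the batch size of 256 come from the description above.

```python
import math
import random

N_WORKERS, BATCH_SIZE, LR = 19, 256, 1e-3  # LR is an arbitrary placeholder
rng = random.Random(0)
theta = 0.5

# One batch of scalar targets per worker; toy loss is 0.5 * (theta - x)**2.
batches = [[rng.gauss(0.0, 1.0) for _ in range(BATCH_SIZE)] for _ in range(N_WORKERS)]

def mean_gradient(theta, batch):
    """Gradient of the toy loss averaged over one batch."""
    return sum(theta - x for x in batch) / len(batch)

# (a) Sum the 19 per-worker updates, each computed on its own batch.
theta_summed = theta - LR * sum(mean_gradient(theta, b) for b in batches)

# (b) One update on the pooled batch with a 19 times larger learning rate.
pooled = [x for b in batches for x in b]
theta_pooled = theta - (N_WORKERS * LR) * mean_gradient(theta, pooled)

print(math.isclose(theta_summed, theta_pooled, rel_tol=1e-9))
```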
I don't have much time these days so it's on pause right now.
@ccolas don't hesitate to ping me if you need some help to double check some parts ;)
@ccolas @hill-a I'm taking over for this one.
I have read the paper again (and some implementations) and started to implement it (mostly from scratch).
There is one difference that was not mentioned that I also spotted: the original paper creates new transitions (by sampling goals) online after each rollout (not after each step) but this should be ok (in the sense we can overcome that limitation by using a custom replay buffer)
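To illustrate what creating new transitions after a rollout can look like, here is a minimal sketch of goal relabeling with a 'future'-style strategy. This is not the implementation from the paper, from OpenAI Baselines or from stable-baselines; the function name, the episode layout and the `compute_reward` callable are all assumptions made for the example.

```python
import random

def relabel_future(episode, compute_reward, n_sampled_goal=4, rng=None):
    """Return original plus hindsight transitions for one finished episode.

    `episode` is a list of dicts with keys "obs", "action", "next_obs",
    "achieved_goal" (outcome after the step) and "desired_goal".
    All of these names are hypothetical, chosen for this sketch.
    """
    rng = rng or random
    transitions = []
    for t, step in enumerate(episode):
        base = (step["obs"], step["action"], step["next_obs"])
        achieved = step["achieved_goal"]
        # Original transition with the goal that was actually targeted.
        goal = step["desired_goal"]
        transitions.append(base + (goal, compute_reward(achieved, goal)))
        # 'future' strategy: substitute outcomes reached at step t or later.
        for _ in range(n_sampled_goal):
            future_t = rng.randrange(t, len(episode))
            new_goal = episode[future_t]["achieved_goal"]
            transitions.append(base + (new_goal, compute_reward(achieved, new_goal)))
    return transitions
```

Each step yields one original transition plus `n_sampled_goal` relabeled ones, so the buffer grows by a factor of `1 + n_sampled_goal` per rollout in this variant.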
Here is my current plan (and current progress):
GoalEnv
sampling_strategy and a get_achieved_goal callable. The first argument refers to the HER replay strategy, the second is a callable that converts an observation to a goal (so we can use env.compute_reward). I'm not sure if it is needed (still in an early stage of development).
To overcome the current limitation of stable-baselines (dict obs are not supported), I'll do something similar to @hill-a, using a wrapper over the environment.
Roadmap:
FINAL strategy
Note: I consider the last point not to be a priority.
@ccolas
I have a first working draft with DQN (tried with a flipped bit env with n_bits=30 and it worked =) ),
you can have a look at it even if it is not fully polished yet.
In short, I made these choices, which are mostly wrappers:
To sum it up, the implementation now looks pretty simple: it is just a wrapper around a model and an env (I still don't understand why the original baselines were so complicated).
My next step is to make it work with SAC ;)
Super cool, thanks !
Keep me updated about SAC. I found it was easy to make it work on FetchReach but impossible to make it work on FetchPush. If you don't have a mujoco license, I can run tests on my side when you're done.
@ccolas In my experiments, I found that SAC is quite hard to tune for problems with deceptive rewards (I have not found good hyperparameters yet for MountainCar-v0, for instance), so this can be an issue when working with problems like FetchPush.
However, in my current experiments with HER + DDPG, I managed to get it to work on harder problems where HER + SAC is failing (e.g. the continuous bit flipping env works for HER + DDPG when N_BITS=12).
I think I will open an issue on the original repo.
Also, the original baselines have several tricks (and I don't know which ones are useful or not) compared to the original HER paper:
PS: SAC and DDPG are now supported on my dev branch ;) (just missing saving/loading for now)
@ccolas interesting discussion on SAC with sparse rewards is happening there ;):
https://github.com/rail-berkeley/softlearning/issues/76
Very good work, thanks !
About the additional tricks used in the baseline, I can comment them based on my experience:
I'll try to run it on the mujoco envs to see what it gives !
Thanks for the thread on SAC and CMC, I used that environment in a previous project, it'll be interesting !
Changing this parameter has a great influence on performance
Yes, from my experience, it allows a better exploration and changes a lot.
This way the size of the replay buffer does not depend on the amount of replay.
That's true, but I also feel it will be less clear in the implementation, no?
Among the tricks I forgot, there is also an L2 loss on the actions (and I have to check the DDPG implementation to see what is different from the one in the baselines).
Linking related issue of highway envs: https://github.com/eleurent/highway-env/issues/15
Edit: another additional trick: random_eps. They perform pure random exploration a fraction of the time.
@ccolas The current hyperparameters I'm using for the highway-env (for SAC and DDPG), which work better than the default ones (close to the defaults found in the OpenAI implementation):
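For reference, the random_eps idea can be sketched as below. Only the `random_eps` name comes from the baselines; the `select_action` helper, its arguments and the 0.3 default are illustrative assumptions, not the actual baselines code.

```python
import random

def select_action(policy_action, sample_random_action, random_eps=0.3, rng=None):
    """With probability `random_eps`, ignore the policy and act randomly.

    `sample_random_action` is a hypothetical zero-argument callable, e.g. a
    wrapper around the action space's sampling method.
    """
    rng = rng or random
    if rng.random() < random_eps:
        return sample_random_action()
    return policy_action

rng = random.Random(0)
actions = [select_action("policy", lambda: "random", random_eps=0.3, rng=rng) for _ in range(1000)]
print(actions.count("random") / len(actions))  # fraction of purely random actions
```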
SAC:
from stable_baselines import HER, SAC

# `env` is assumed to be a GoalEnv instance created beforehand
n_sampled_goal = 4
model = HER('MlpPolicy', env, SAC, n_sampled_goal=n_sampled_goal, goal_selection_strategy='future',
            verbose=1, buffer_size=int(1e6),
            learning_rate=1e-3,
            gamma=0.95, batch_size=256, policy_kwargs=dict(layers=[256, 256, 256]))
DDPG:
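import numpy as np
from stable_baselines import HER, DDPG
from stable_baselines.ddpg import NormalActionNoise

# `env` is assumed to be a GoalEnv instance created beforehand
n_actions = env.action_space.shape[0]
noise_std = 0.2
# Gaussian exploration noise added to the actions
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions))
n_sampled_goal = 4
model = HER('MlpPolicy', env, DDPG, n_sampled_goal=n_sampled_goal, goal_selection_strategy='future',
            verbose=1, buffer_size=int(1e6),
            actor_lr=1e-3, critic_lr=1e-3, action_noise=action_noise,
            gamma=0.95, batch_size=256, policy_kwargs=dict(layers=[256, 256, 256]))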
Note: ~I did not change the policy architecture yet ([64, 64] for now instead of [256, 256, 256])~
I also added the success rate to the logs ;)
EDIT: the network architecture seems to have a great impact here... (I updated SAC hyperparams)
Also related for the tricks: https://github.com/vitchyr/rlkit/pull/35
Update: working version with HER + DDPG on FetchPush. Still a bug with VecEnvs but should be ready to merge soon (see PR)
I saw that DDPG+HER and SAC+HER can perform well on the FetchPush environment. How about PickAndPlace? I saw that the issue mentioned by fisherxue is still open.
You can take a look at the trained agent (and hyperparameters) in the zoo ;)
https://github.com/araffin/rl-baselines-zoo/pull/53