Stable-baselines: [question] differences between openai baselines and stable-baselines DDPG

Created on 22 Nov 2018 · 12 Comments · Source: hill-a/stable-baselines

Hi,

I'm trying to use baselines to train an aircraft to fly in simulation. I've been switching between openai baselines and stable-baselines trying to decide which one to go with. One thing I've noticed is both appear to train successfully, but when I reload the agent within stable-baselines its actions appear to be "locked" to a certain output.

I noticed there are a lot of differences (aka fixes) made within stable-baselines... what is your procedure right now for verifying that the algorithms behave the same? I'd like to repeat that procedure within my setup to make sure I'm not somehow breaking things with how I'm calling the two baselines.

Or maybe alternatively, is there a standard environment that you guys normally use for continuous action and state space... I could maybe swap out my environment for that environment and retest.

Full code in case you want to browse it... but not necessary unless you're curious.

https://github.com/jrjbertram/jsbsim_rl

bug question

jrjbertram

Most helpful comment

An update here... created a script to invoke a toy environment (lunar lander) to experiment with that lets me call either openai baselines or stable baselines depending on cmd line switch. Was using it to compare training and behaviors... key finding is that stable baselines does seem to be correctly implemented.

What I was seeing is that even if I break into openai baselines run.py and make some tweaks to allow me to call learn on the model multiple times, down deeper within its code it still recreates the model each time learn is called, so what I was seeing on my original problem I'm pretty sure was that openai baselines every 1e6 steps was basically restarting learning from scratch... and for my original problem / airplane environment, the longer you train the less well it seems to perform... that's probably a separate issue on my end.
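The effect described above can be illustrated with a toy learner. This is only a sketch under stated assumptions, not the openai baselines code: `ToyLearner`, `rebuild_on_learn`, and `train_in_chunks` are hypothetical names, and the "training" is a stand-in update that moves one weight toward a target.

```python
class ToyLearner:
    """Hypothetical learner: `weight` moves a little toward `target` on every step."""

    def __init__(self, rebuild_on_learn):
        self.rebuild_on_learn = rebuild_on_learn
        self.weight = 0.0

    def learn(self, steps, target=1.0, lr=0.01):
        if self.rebuild_on_learn:
            # Mimics a learn() that constructs a fresh model on every call.
            self.weight = 0.0
        for _ in range(steps):
            self.weight += lr * (target - self.weight)
        return self.weight


def train_in_chunks(learner, chunks, steps_per_chunk):
    """Call learn() repeatedly, as a script would between periodic saves."""
    for _ in range(chunks):
        learner.learn(steps_per_chunk)
    return learner.weight


continuing = train_in_chunks(ToyLearner(rebuild_on_learn=False), chunks=5, steps_per_chunk=50)
restarting = train_in_chunks(ToyLearner(rebuild_on_learn=True), chunks=5, steps_per_chunk=50)
single_call = ToyLearner(rebuild_on_learn=True).learn(50)
```

In this toy setup, five chunked calls on the rebuilding learner end up exactly where a single 50-step call does, while the continuing learner keeps its progress across calls.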

I also compared stable baselines training across different numbers of time steps before calling save... tested out 1e5, 1e6, and 1e9... where after completing that many timesteps, a save and restore were called. Any errors in the implementation would I think show up there. I saw consistent performance across the save/restores which makes me think it's working great.

Closing this question.

jrjbertram on 25 Nov 2018

👍2

All 12 comments

Hello,

> I reload the agent within stable-baselines its actions appear to be "locked" to a certain output.

Did you try setting deterministic=False for DDPG? (Normally, Deep Deterministic Policy Gradient (DDPG) should be deterministic during testing.)
Can you also be more precise about what "locked" means?

> what is your procedure right now for verifying that the algorithms behave the same

That is a very good question. The main problem with RL is that even with the same algorithm and the same implementation, you can get different results. So for now, we do basic checks, such as training DDPG on a toy problem and checking that it can solve it.
See https://github.com/hill-a/stable-baselines/blob/master/tests/test_identity.py#L56
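As a rough illustration of that kind of check (a minimal sketch, not the contents of the linked test file), the toy task below rewards an action that matches the observation. `evaluate_on_identity`, `solved_policy`, `locked_policy`, and the tolerance `eps` are hypothetical names chosen for this example.

```python
import random


def evaluate_on_identity(policy, n_steps=1000, eps=0.05, seed=0):
    """Mean reward on a toy task: reward is 1 when the action is within eps of the observation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_steps):
        obs = rng.uniform(-1.0, 1.0)
        total += 1 if abs(policy(obs) - obs) <= eps else 0
    return total / n_steps


def solved_policy(obs):
    # What a successfully trained agent should approximate on this task.
    return obs


def locked_policy(obs):
    # A policy that ignores its input and always emits the same action.
    return 0.0


solved_score = evaluate_on_identity(solved_policy)
locked_score = evaluate_on_identity(locked_policy)
```

A check along these lines would assert that the trained agent's score is close to `solved_score`; a policy stuck on one output scores far lower.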

That's also why I started the RL Baselines Zoo: to check that we get the same performance as the original Baselines. (I don't have a MuJoCo licence, so I cannot reproduce the original paper results for now :/ )

If you want to check results, I would go for MountainCarContinuous-v0, the tuned hyperparams can be found here: https://github.com/araffin/rl-baselines-zoo/blob/master/hyperparams/ddpg.yml#L1

I cannot guarantee that the two implementations are exactly the same (a lot changed during refactoring). However, our implementation does seem to work on different problems; see https://github.com/araffin/learning-to-drive-in-a-day for an example.

araffin on 22 Nov 2018

"locked" here meaning it appears the agent is constantly spitting out a single action regardless of the state that is input to it. A couple of causes come to mind:

1) the load failed due to a bug in my code, so the agent has default weights (e.g. random initial values).

2) training somehow collapsed resulting in a policy that always generates a fixed output. (seems unlikely, and I've never observed that during any RL experiments I've tried)
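One way to confirm the "locked" symptom before chasing either cause is to probe the reloaded policy with clearly different observations and measure how much its actions vary. The sketch below assumes the policy can be wrapped as a plain function from an observation to a list of action values; `action_spread`, `is_locked`, and the two example policies are hypothetical.

```python
def action_spread(policy, observations):
    """Per-dimension (max - min) of the actions produced for the given observations."""
    actions = [policy(obs) for obs in observations]
    n_dims = len(actions[0])
    return [max(a[d] for a in actions) - min(a[d] for a in actions) for d in range(n_dims)]


def is_locked(policy, observations, tol=1e-6):
    """True when every action dimension is (numerically) constant across observations."""
    return all(spread <= tol for spread in action_spread(policy, observations))


def responsive_policy(obs):
    # Stand-in for a policy whose output depends on its input.
    return [obs[0] + obs[1], obs[0] - obs[1]]


def constant_policy(obs):
    # Stand-in for the "locked" behaviour: same action for any input.
    return [0.3, -0.7]


probe_observations = [[-1.0, 0.0], [0.0, 0.5], [1.0, -0.5]]
locked = is_locked(constant_policy, probe_observations)
not_locked = is_locked(responsive_policy, probe_observations)
```

Running the same probe on the agent just before saving and just after loading would help separate cause 1 (a failed load) from cause 2 (a collapsed policy).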

jrjbertram on 23 Nov 2018

Because you said that training works, I would make sure that the observation you give to the agent is the same between training/testing.
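A common way for observations to differ between training and testing is preprocessing that is applied in one place but not the other. The sketch below assumes, purely for illustration, a scalar observation that was normalized during training; `ObsNormalizer` is a hypothetical helper, not part of either library.

```python
import json
import statistics


class ObsNormalizer:
    """Hypothetical scalar normalizer whose statistics must travel with the saved agent."""

    def __init__(self, mean=0.0, std=1.0):
        self.mean = mean
        self.std = std

    @classmethod
    def fit(cls, samples):
        # Fall back to std=1.0 if all samples are identical.
        return cls(statistics.mean(samples), statistics.pstdev(samples) or 1.0)

    def __call__(self, obs):
        return (obs - self.mean) / self.std

    def to_json(self):
        return json.dumps({"mean": self.mean, "std": self.std})

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


# Training time: statistics are fit on observed values (e.g. altitudes).
train_norm = ObsNormalizer.fit([1000.0, 2000.0, 3000.0])
saved = train_norm.to_json()

# Test time: restore the same statistics instead of feeding raw values.
test_norm = ObsNormalizer.from_json(saved)
raw_obs = 2500.0
normalized_train = train_norm(raw_obs)
normalized_test = test_norm(raw_obs)
```

If the test script skipped the restored normalizer, the agent would receive `raw_obs` instead of a value on the scale it was trained on, which is one plausible route to saturated, near-constant actions.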

araffin on 23 Nov 2018

Closing this question.

jrjbertram on 25 Nov 2018

👍2

Ok, good news, thanks =)

araffin on 25 Nov 2018

Ooops... I was only calling save and not load. Adding in the load I'm seeing something I don't understand. On these tensorflow graphs, the save (and now load) are called at 10.0, 20.0, and 30.0 on the x axis. At the same time, I see dramatic changes in the other plots indicating abrupt change in behavior... that smells weird.

I don't believe I was seeing the same thing from just calling saves... here's an image for comparison (unfortunately at a larger scale.)

Any intuition there? Might be related to #56 ?

jrjbertram on 25 Nov 2018

> Adding in the load I'm seeing something I don't understand.

I think I found the bug: the target parameters were not saved. So if you tried to continue training by loading the model afterward, the target network would be randomly initialized instead of restoring its previous state.
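To make the failure mode concrete, here is a minimal sketch of a checkpoint that does or does not include the target parameters. It is not the stable-baselines implementation: `ToyAgent`, `init_params`, and the `include_target` flag are hypothetical, and the "gradient update" is a stand-in. Only the soft target update follows the usual DDPG form.

```python
import json
import random


def init_params(seed, size=3):
    """Random initial parameters, standing in for network weights."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


class ToyAgent:
    """Hypothetical agent with online parameters and a slowly updated target copy."""

    def __init__(self, seed=0):
        self.params = init_params(seed)
        self.target_params = init_params(seed + 1)  # fresh random target

    def train_step(self, tau=0.1):
        self.params = [p + 0.5 for p in self.params]  # stand-in for a gradient update
        # Soft update: target <- (1 - tau) * target + tau * online
        self.target_params = [(1 - tau) * t + tau * p
                              for t, p in zip(self.target_params, self.params)]

    def save(self, include_target):
        state = {"params": self.params}
        if include_target:
            state["target_params"] = self.target_params
        return json.dumps(state)

    @classmethod
    def load(cls, text, seed=123):
        agent = cls(seed)
        state = json.loads(text)
        agent.params = state["params"]
        # Without saved target parameters, the random ones from __init__ remain.
        agent.target_params = state.get("target_params", agent.target_params)
        return agent


agent = ToyAgent()
for _ in range(20):
    agent.train_step()

buggy = ToyAgent.load(agent.save(include_target=False))
fixed = ToyAgent.load(agent.save(include_target=True))
```

In this sketch both reloaded agents have the right online parameters, so acting with them looks fine; only the `buggy` one resumes training against a target that no longer tracks the online parameters, which would be consistent with abrupt changes appearing right after each load.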

I pushed a fix on the "sac" branch, could you try it? (see commit https://github.com/hill-a/stable-baselines/commit/cd831a19593f0cd7b68c06cd31c699e4975d42d9)

EDIT: @jrjbertram that branch also has the SAC algorithm (which usually works better than DDPG), which may interest you ;) It will be integrated into the master branch soon (I'm currently polishing the code and checking performance, but everything seems alright).

araffin on 8 Dec 2018

Thanks for looking at it. I've been swamped with work the past week or two... but I will try to do a run this weekend.

bertram1isu on 8 Dec 2018

@jrjbertram SAC branch is now merged with master

araffin on 13 Dec 2018

@bertram1isu I am interested in the save/load testing. Have you made any progress? Does the restore work correctly now that the target parameters are saved?

LanxinL on 13 Jun 2019

Related: #301

araffin on 13 Jun 2019

I ended up moving on to other projects and haven't returned to the aircraft project yet. It's still on my bucket list.
