I'm using Stable Baselines to train a policy in an environment. My setup is that I train a TRPO policy on the environment for N timesteps, update the environment (the dynamics change slightly), and then rerun the training. I'm worried that the internal state of the Adam optimizer used to train the TRPO policy is not reset when I call learn() again. Can someone confirm whether this would be a problem? Since the distribution of states in the updated environment changes the second time I run learn(), shouldn't the optimizer's internal state be reset, especially since the optimizer relies heavily on momentum? This isn't necessarily a bug, but I'm concerned about the way I'm currently using the learn() function in my work. I'd appreciate any thoughts on this. Thanks! :)
Code example
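from stable_baselines import TRPO
from stable_baselines.common.policies import MlpPolicy

model = TRPO(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
env.update()  # my own method: slightly changes the dynamics
model.learn(total_timesteps=25000)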
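To make the concern concrete, here is a toy sketch of how stale Adam moments can affect the first update after the gradient distribution changes. It is a hand-written scalar Adam, not the optimizer Stable Baselines uses internally; `ScalarAdam`, `first_step_after_change`, and the hyperparameters are all made up for illustration.

```python
import math


class ScalarAdam:
    """Toy Adam for a single scalar parameter (illustration only)."""

    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.reset()

    def reset(self):
        # Zero the moment estimates and the step counter.
        self.m = 0.0
        self.v = 0.0
        self.t = 0

    def step(self, param, grad):
        # Standard Adam update with bias correction.
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def first_step_after_change(reset):
    """Return the change in x on the first step after the gradient flips."""
    opt = ScalarAdam()
    x = 0.0
    # Phase 1 ("old dynamics"): the gradient is always +1, so x moves down.
    for _ in range(50):
        x = opt.step(x, grad=1.0)
    if reset:
        opt.reset()
    # Phase 2 ("new dynamics"): the gradient flips to -1, so x should move up.
    return opt.step(x, grad=-1.0) - x


stale = first_step_after_change(reset=False)
fresh = first_step_after_change(reset=True)
print("first step with stale state:", stale)
print("first step after reset:     ", fresh)
```

In this toy case the first step with stale state still moves in the old direction because the accumulated first moment outweighs the new gradient, while the reset optimizer follows the new gradient immediately. Whether this matters in practice for a small change in dynamics is exactly what I'm unsure about.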
Hello,
I would first try and ask afterward ;)
If you want to reset the state of the optimizer, the best way is to save the model and load it again: the optimizer state is currently not saved (cf. issue #301), so the loaded model starts with a fresh optimizer.
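A minimal sketch of that save-and-reload round trip, assuming the model exposes `save(path)` and a `load(path, env=...)` class method as Stable Baselines models do. The helper name `reload_with_fresh_optimizer` and the default checkpoint path are made up for illustration, and the "fresh optimizer" behaviour relies on the optimizer state not being saved, as noted above.

```python
def reload_with_fresh_optimizer(model, env, path="trpo_checkpoint"):
    """Save `model` to `path` and return a copy loaded from that file.

    Assumption: the saved file holds the model parameters but not the
    optimizer state, so the loaded copy builds its optimizer from scratch.
    """
    model.save(path)
    # Load through the model's own class so this works for TRPO or any
    # other algorithm class with the same save/load interface.
    return type(model).load(path, env=env)


# Hypothetical usage with the snippet from the question:
#   model.learn(total_timesteps=25000)
#   env.update()
#   model = reload_with_fresh_optimizer(model, env)
#   model.learn(total_timesteps=25000)
```

It is still worth comparing this against simply calling learn() twice on the same model, since it may make no measurable difference for a small change in dynamics.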