Stable-baselines: [question] Updating a policy with externally generated experience data

Created on 27 Jan 2019 · 2 comments · Source: hill-a/stable-baselines

Hi,

The use case would be the following:

train a policy in a simulated environment
deploy this policy, use it to predict actions based on collected observations
store observations and actions in a storage layer (DB, ...)
use collected data to update the policy on a regular basis
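
As a rough illustration of the storage step, here is a minimal sketch that logs full transitions to SQLite using only the Python standard library. Everything in it is an assumption made for illustration (the table layout and the helper names `open_store`, `log_transition`, and `load_transitions` are hypothetical, and none of it is part of stable-baselines). Note that it stores the reward, next observation, and done flag alongside each observation and action, since an off-policy update needs the whole transition.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical transition store backed by SQLite."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transitions ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "obs TEXT NOT NULL, action TEXT NOT NULL, reward REAL NOT NULL, "
        "next_obs TEXT NOT NULL, done INTEGER NOT NULL)"
    )
    return conn


def log_transition(conn, obs, action, reward, next_obs, done):
    """Persist one transition; observations and actions are JSON-encoded."""
    conn.execute(
        "INSERT INTO transitions (obs, action, reward, next_obs, done) "
        "VALUES (?, ?, ?, ?, ?)",
        (json.dumps(obs), json.dumps(action), float(reward),
         json.dumps(next_obs), int(done)),
    )
    conn.commit()


def load_transitions(conn, after_id=0):
    """Return transitions with id > after_id, oldest first."""
    rows = conn.execute(
        "SELECT obs, action, reward, next_obs, done FROM transitions "
        "WHERE id > ? ORDER BY id",
        (after_id,),
    ).fetchall()
    return [(json.loads(o), json.loads(a), r, json.loads(n), bool(d))
            for o, a, r, n, d in rows]


# Usage: the deployed policy logs transitions, a periodic job reads them back.
conn = open_store()
log_transition(conn, [0.0, 1.0], 1, 0.5, [0.1, 0.9], False)
log_transition(conn, [0.1, 0.9], 0, 1.0, [0.2, 0.8], True)
stored = load_transitions(conn)
```

The `after_id` argument lets a periodic update job fetch only the transitions logged since its previous run.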

If I’m correct, this only makes sense with off-policy algorithms.

I had a quick look into the code and the simplest way to do it would be to provide a custom Runner instance (where available, checked ACER but not sure other algos use it).

Questions:

where would it be really applicable? I guess there are a few limitations/constraints that prevent from doing this with some particular algos.
do you see a better way than using Runner abstraction?

Thanks!

question


antoine-galataud

Most helpful comment

Hello,

> use collected data to update the policy on a regular basis

You should not only update the policy with the latest data but also sample from previous experience.

> If I’m correct, this only makes sense with off-policy algorithms.

Yes.

> checked ACER but not sure other algos use it

I would rather check SAC, DQN or DDPG. ACER is not super readable.

> where would it be really applicable? I guess there are a few limitations/constraints that prevent from doing this with some particular algos.
> do you see a better way than using Runner abstraction?

Off-policy algorithms are made for exactly that application: learning from samples collected by another policy.

I would rather directly access the replay buffer (as done here) and then use the train_step method to optimize the policy (you have an example here).
I would go for a custom runner only if needed (e.g., having more than one environment at the same time).
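
To make the idea concrete, here is a minimal, library-independent sketch of the pattern: fill a replay buffer with externally collected transitions, then repeatedly sample from it and apply an off-policy update. It is not the stable-baselines API. The `ReplayBuffer` class, the `train_step` function, and the toy three-state data are all assumptions made for illustration, and tabular Q-learning stands in for the update that SAC, DQN, or DDPG would perform.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size buffer of (obs, action, reward, next_obs, done)."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest transitions drop out first

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling over old and new experience alike.
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))

    def __len__(self):
        return len(self.storage)


def train_step(q_table, batch, n_actions, lr=0.5, gamma=0.9):
    """One off-policy (Q-learning) update on a sampled batch."""
    for obs, action, reward, next_obs, done in batch:
        next_values = [q_table.get((next_obs, a), 0.0) for a in range(n_actions)]
        target = reward if done else reward + gamma * max(next_values)
        old = q_table.get((obs, action), 0.0)
        q_table[(obs, action)] = old + lr * (target - old)


random.seed(0)
buffer = ReplayBuffer(capacity=1000)

# Toy "externally collected" data: states 0 -> 1 -> 2 (terminal).
# Action 1 moves right (reward 1.0 on reaching state 2), action 0 stays put.
logged = [
    (0, 1, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
    (0, 0, 0.0, 0, False),
    (1, 0, 0.0, 1, False),
]
for transition in logged:
    buffer.add(*transition)

q_table = {}
for _ in range(200):
    train_step(q_table, buffer.sample(batch_size=4), n_actions=2)

# Greedy action per non-terminal state after training on logged data only.
greedy = {s: max(range(2), key=lambda a: q_table.get((s, a), 0.0)) for s in (0, 1)}
print(greedy)
```

Because the buffer keeps earlier transitions until capacity is reached, each update mixes fresh and older experience rather than training on the latest data alone.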

araffin on 28 Jan 2019


All 2 comments


> I would rather directly access the replay buffer (as done here) and then use the train_step method to optimize the policy (you have an example here)

Thanks for this suggestion!