Stable-baselines: [question] Updating a policy with externally generated experience data

Created on 27 Jan 2019 · 2 Comments · Source: hill-a/stable-baselines

Hi,

The use case would be the following:

  • train a policy in a simulated environment
  • deploy this policy, use it to predict actions based on collected observations
  • store observations and actions in a storage layer (DB, ...)
  • use collected data to update the policy on a regular basis
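A minimal sketch of the storage step, assuming SQLite as the storage layer and JSON-serialisable observations and actions. The list above mentions only observations and actions, but an off-policy update typically also needs the reward, the next observation, and the episode-end flag, so this sketch stores full transitions. The table layout and the helper names (`open_store`, `log_transition`, `load_transitions`) are hypothetical and not part of stable-baselines.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the transition store. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transitions ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "obs TEXT, action TEXT, reward REAL, next_obs TEXT, done INTEGER)"
    )
    return conn


def log_transition(conn, obs, action, reward, next_obs, done):
    """Append one transition collected by the deployed policy."""
    conn.execute(
        "INSERT INTO transitions (obs, action, reward, next_obs, done) "
        "VALUES (?, ?, ?, ?, ?)",
        (json.dumps(obs), json.dumps(action), float(reward),
         json.dumps(next_obs), int(done)),
    )
    conn.commit()


def load_transitions(conn, after_id=0):
    """Return transitions with id > after_id, oldest first."""
    rows = conn.execute(
        "SELECT id, obs, action, reward, next_obs, done "
        "FROM transitions WHERE id > ? ORDER BY id",
        (after_id,),
    ).fetchall()
    return [
        (rid, json.loads(o), json.loads(a), r, json.loads(n), bool(d))
        for rid, o, a, r, n, d in rows
    ]
```

Keeping the last loaded `id` around lets a periodic update job fetch only the transitions collected since the previous run.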

If I'm correct, this only makes sense with off-policy algorithms.

I had a quick look at the code, and the simplest way to do it would be to provide a custom Runner instance (where available; I checked ACER but am not sure other algorithms use it).

Questions:

  • where would this really be applicable? I guess there are a few limitations/constraints that prevent doing this with some particular algorithms.
  • do you see a better way than using the Runner abstraction?

Thanks!

question

Most helpful comment

Hello,

> use collected data to update the policy on a regular basis

You should not only update the policy with the latest data but also sample from previous experience.
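To illustrate the point about mixing new and old experience, here is a minimal sketch of a fixed-size replay buffer that samples uniformly over everything it holds. It is a simplified stand-in written for this thread, not the stable-baselines buffer; the class and its method signatures are assumptions.

```python
import random


class ReplayBuffer:
    """Fixed-size ring buffer of (obs, action, reward, next_obs, done) tuples."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.next_idx = 0  # slot that the next add() will write to

    def __len__(self):
        return len(self.storage)

    def add(self, obs, action, reward, next_obs, done):
        item = (obs, action, reward, next_obs, done)
        if len(self.storage) < self.capacity:
            self.storage.append(item)
        else:
            # Buffer is full: overwrite the oldest transition.
            self.storage[self.next_idx] = item
        self.next_idx = (self.next_idx + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        # Uniform sampling with replacement over old and new transitions alike.
        return [rng.choice(self.storage) for _ in range(batch_size)]
```

Because newly collected transitions are added to the same buffer as older ones, each sampled batch mixes both, which is what the advice above asks for.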

> If I'm correct, this only makes sense with off-policy algorithms.

Yes.

> checked ACER but not sure other algos use it

I would rather look at SAC, DQN or DDPG. The ACER code is not very readable.

> where would it be really applicable? I guess there are a few limitations/constraints that prevent doing this with some particular algos.
> do you see a better way than using Runner abstraction?

Off-policy algorithms are made for exactly that application: learning from samples collected by another policy.

I would rather directly access the replay buffer (as done here) and then use the train_step method to optimize the policy (you have an example here).
I would go for a custom runner only if needed (e.g. when running more than one environment at the same time).
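The "fill the replay buffer, then call the training step" pattern described above can be sketched with a toy off-policy learner. The `TabularQAgent` class below is a hypothetical stand-in (tabular Q-learning), not a stable-baselines model. Its `replay_buffer` attribute and `train_step` method only mirror the wording used in this comment; the real buffer and training-step signatures in SAC, DQN or DDPG differ between algorithms and versions, so check the source before adapting this.

```python
import random
from collections import defaultdict


class TabularQAgent:
    """Hypothetical stand-in for an off-policy model (tabular Q-learning)."""

    def __init__(self, n_actions, gamma=0.99, lr=0.5, seed=0):
        self.n_actions = n_actions
        self.gamma = gamma
        self.lr = lr
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.replay_buffer = []  # (obs, action, reward, next_obs, done) tuples
        self.rng = random.Random(seed)

    def train_step(self, batch_size):
        """One update: sample a batch from the buffer, apply Q-learning."""
        for _ in range(batch_size):
            obs, action, reward, next_obs, done = self.rng.choice(self.replay_buffer)
            bootstrap = 0.0 if done else self.gamma * max(self.q[next_obs])
            target = reward + bootstrap
            self.q[obs][action] += self.lr * (target - self.q[obs][action])

    def predict(self, obs):
        """Greedy action for a (hashable) observation."""
        values = self.q[obs]
        return values.index(max(values))


def update_from_external(agent, transitions, n_updates, batch_size=32):
    """Add externally collected transitions, then run training steps.

    No environment is stepped here: the data came from the deployed policy.
    """
    agent.replay_buffer.extend(transitions)
    for _ in range(n_updates):
        agent.train_step(batch_size)
```

A possible usage, with two logged transitions from state `0`:

```python
agent = TabularQAgent(n_actions=2)
logged = [(0, 1, 1.0, 1, True), (0, 0, 0.0, 1, True)]
update_from_external(agent, logged, n_updates=20, batch_size=8)
```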

> I would rather directly access the replay buffer (as done here) and then use the train_step method to optimize the policy (you have an example here)

Thanks for this suggestion!
