Hi,
The use case would be the following:
If I'm correct, this only makes sense with off-policy algorithms.
I had a quick look into the code and the simplest way to do it would be to provide a custom Runner instance (where available, checked ACER but not sure other algos use it).
Questions:
Thanks!
Hello,
use collected data to update the policy on a regular basis
You should not only update the policy with the latest data but also sample from previous experience.
If I'm correct, this only makes sense with off-policy algorithms.
yes.
checked ACER but not sure other algos use it
I would rather check SAC, DQN or DDPG. ACER is not super readable.
where would it be really applicable? I guess there are a few limitations/constraints that prevent from doing this with some particular algos.
do you see a better way than using Runner abstraction?
Off-policy algorithms are made for that application, for using samples collected using another policy.
I would rather directly access the replay buffer (as done here) and then use the train_step method to optimize the policy (you have an example here)
I would go for a custom runner only if needed (e.g. having more than one environment at the same time).
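To make the replay-buffer suggestion concrete, here is a rough sketch in plain Python. It is not the library's API: `ReplayBuffer`, `train_step` and `update_from_external_data` are hypothetical stand-ins, and the real buffer and update method of SAC/DQN/DDPG may have different names and signatures. It only shows the pattern: push externally collected transitions into the buffer, then run updates on batches sampled from all stored experience, not just the latest data.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical stand-in for a library replay buffer (fixed-size FIFO)."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling over everything stored, old and new alike.
        k = min(batch_size, len(self.storage))
        return random.sample(list(self.storage), k)

    def __len__(self):
        return len(self.storage)


def train_step(batch):
    """Placeholder for the algorithm's gradient update (assumption).

    It only returns the mean reward of the batch so the sketch stays runnable.
    """
    return sum(transition[2] for transition in batch) / len(batch)


def update_from_external_data(buffer, transitions, n_updates, batch_size):
    """Store transitions collected by another policy, then run a few updates."""
    for transition in transitions:
        buffer.add(*transition)
    return [train_step(buffer.sample(batch_size)) for _ in range(n_updates)]


random.seed(0)
buffer = ReplayBuffer(capacity=1000)

# Two rounds of data, e.g. collected at different times by a deployed policy.
old_data = [(i, 0, 0.0, i + 1, False) for i in range(50)]
new_data = [(i, 1, 1.0, i + 1, False) for i in range(50)]

update_from_external_data(buffer, old_data, n_updates=1, batch_size=8)
stats = update_from_external_data(buffer, new_data, n_updates=4, batch_size=8)
```

After the second call the buffer holds both rounds of data, so each sampled batch can mix old and new transitions. With a real model you would replace `train_step` with the algorithm's own update and reuse its existing buffer instead of creating a new one.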