Hi!
I'm quite confused right now about how training for DQN/DDPG/TD3/etc. (off-policy algorithms with a replay buffer) is done. If I understand correctly, data collection and training are run in parallel: multiple workers collect data and put it into a replay buffer, the trainer samples data from the replay buffer and trains on it, and both happen simultaneously.
But how can the ratio of timesteps collected to training steps performed be controlled? For example, in the SpinUp sequential algorithm for DDPG we can see that training happens after some number of timesteps has been collected and is then done for some number of iterations, and the docs state that in the code the ratio of environment steps to gradient steps is fixed to one. How can this ratio be controlled in RLlib?
My ultimate question is: _how can I control how many times the policy is trained on each timestep from the replay buffer in off-policy training?_ I can imagine that without some supervision this depends on the number of workers, how quick the data transfer between workers and the trainer is (i.e. how fast new batches of data can be fetched), how fast one gradient step/update is (e.g. whether a GPU is used), etc. Depending on these factors, the policy may or may not see new data more often than replayed data. Is it possible to control that explicitly, and how? For concreteness, below is a sketch of the kind of sequential loop I have in mind.
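This is plain pseudocode in the spirit of the SpinUp DDPG loop, not RLlib code; `env` and `agent` are placeholders for a gym environment and a DDPG-style agent:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO replay buffer, for illustration only."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))

def sequential_off_policy_loop(env, agent, total_steps=100_000,
                               update_every=50, gradient_steps=50,
                               batch_size=256):
    # update_every : gradient_steps is the explicit env-steps-to-gradient-steps
    # ratio (here 50:50, i.e. 1:1, as in the SpinUp DDPG pseudocode).
    buffer = ReplayBuffer()
    obs = env.reset()
    for step in range(1, total_steps + 1):
        action = agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        buffer.add((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
        if step % update_every == 0:
            for _ in range(gradient_steps):
                agent.update(buffer.sample(batch_size))  # one gradient step
```

In RLlib there is no such explicit outer loop that I can see, which is why I'm asking how the equivalent ratio is enforced.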
Thanks for your great work!
Piotr.
Hi,
From my understanding there are two ways to control the number of times a sample will, on average, be used to update the policy. You can either set the "training_intensity" parameter, or alternatively play with "train_batch_size" and "rollout_fragment_length", since the default value of the training intensity is train_batch_size / rollout_fragment_length.
You can find the description of each parameter here.
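For example, a minimal sketch assuming the Ray 1.x-style Tune API and the DDPG config keys above (the environment name and the numeric values are just placeholders):

```python
import ray
from ray import tune

ray.init()

tune.run(
    "DDPG",
    config={
        "env": "Pendulum-v0",
        "num_workers": 2,
        # Each store op adds this many timesteps to the replay buffer.
        "rollout_fragment_length": 1,
        # Each train op optimizes over a replayed batch of this size.
        "train_batch_size": 256,
        # If left at None, the native ratio train_batch_size /
        # rollout_fragment_length (256 here) is used instead.
        "training_intensity": 128,
    },
    stop={"timesteps_total": 100_000},
)
```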
Cheers
@raphaelavalos so with the default params of DDPG, we practically won't use any replayed experience?
Here are the default parameters:
```python
# If set, this will fix the ratio of sampled to replayed timesteps.
# Otherwise, replay will proceed at the native ratio determined by
# (train_batch_size / rollout_fragment_length).
"training_intensity": None,
[...]
# Update the replay buffer with this many samples at once. Note that this
# setting applies per-worker if num_workers > 1.
"rollout_fragment_length": 1,
# Size of a batch sampled from the replay buffer for training. Note that
# if async_updates is set, then each worker returns gradients for a
# batch of this size.
"train_batch_size": 256,
```
As far as I understand, DDPG will use the native ratio of (train_batch_size / rollout_fragment_length) = 256 / 1 = 256, i.e. 256 sampled to 1 replayed timesteps, so practically only new experience. Moreover, this formula doesn't make sense to me. Logic tells me we will be connection-bound (the latency of each send of a sampled batch is constant), so if we collect smaller batches of data, we need to do more sends and we will wait longer to collect the whole training batch. Therefore, to me, the other way around would make more sense: if we need to wait for data, let's use more data from the replay buffer we already have access to and train while waiting. Right?
Please help me understand this. Some explanation in the docs would be helpful too (I can try to add it).
@ericl is the author of this change, #8396. Thanks for it, I think it's much needed. However, could you explain to me how exactly it works? You might want to look at the previous comments.
The reason it's set to 1/256 is that this maximizes sample efficiency. It does not necessarily speed things up. If your goal is time efficiency and you have a fast env, it makes sense to maximize sample throughput instead (or maybe even use on-policy algs like PPO).
And yes, you can either play with this param or with the batch sizes. All the ratio does is train more batches per sampled batch... you can check out the steps sampled / steps trained metrics to see the effect.
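For example, something along these lines (a sketch; it assumes the num_steps_sampled / num_steps_trained entries under result["info"], which may be named differently in other versions):

```python
import ray
from ray.rllib.agents.ddpg import DDPGTrainer

ray.init()

# Inspect the sampled-vs-trained timestep counters over a few iterations.
trainer = DDPGTrainer(config={"env": "Pendulum-v0", "training_intensity": 128})
for i in range(5):
    result = trainer.train()
    sampled = result["info"]["num_steps_sampled"]
    trained = result["info"]["num_steps_trained"]
    print(f"iter {i}: sampled={sampled}, trained={trained}, "
          f"trained/sampled={trained / max(sampled, 1):.1f}")
```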
@ericl is it 1/256? From the equation, it's 256 / 1, so it will use pretty much only new samples, not replayed ones. That would impair sample efficiency (we don't replay experience). This is how I understand it from the docstring.
However, I read your PR and I'm still not sure what exactly is going on. I understand that it controls how many times each op is called, and that we have three ops: [store_op, replay_op, update_op]. But what does each op do?
DDPG is using replay though...
The ops respectively store a newly sampled batch into the buffer, replay a batch, and do an SGD update. The ratio controls how frequently each op is called, which controls the training intensity.
"rollout_fragment_length": 1, -> each store op stores 1 sample into the replay buffer
"train_batch_size": 256, -> each replay/train op optimizes over batch of 256 samples
I think I understand now. But still, the docstring could do a better job 😄 I think it's the other way around than described now, i.e. "If set, this will fix the ratio of timesteps replayed from the buffer and learned on, to timesteps sampled from the environment and stored in the replay buffer. Otherwise, replay will proceed at the native ratio determined by (train_batch_size / rollout_fragment_length)."
Do you agree it better describes what is happening (and hence that I understand correctly)? If so, I'll create a PR.
That does sound much better, I guess it was backwards in the current doc. It would be great if you could make a PR.