Ray: [rllib] Policy losses that depend on trajectories of experience

Created on 19 Aug 2020 · 5 comments · Source: ray-project/ray

I want to implement a Policy loss function that requires a somewhat long trajectory of observations and actions from the same episode (e.g. a length-10 trajectory of observations and actions).

My understanding is that the standard aggregators of experience are the SampleBatch and MultiAgentBatch objects, which don't seem to explicitly provide access to arbitrary-length trajectories of samples, at least not in any of the examples. However, the SampleBatch.split_by_episode, SampleBatch.rows and MultiAgentBatch.timeslices methods look relevant.
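
For readers skimming the thread, here is a minimal sketch of row-wise access to a SampleBatch; the toy data and shapes below are made up purely for illustration:

```python
# Minimal sketch (illustrative only): per-timestep access to a SampleBatch.
import numpy as np
from ray.rllib.policy.sample_batch import SampleBatch

toy = SampleBatch({
    "obs": np.zeros((4, 3), dtype=np.float32),  # 4 timesteps, obs_size = 3
    "actions": np.array([0, 1, 0, 1]),
    "eps_id": np.array([7, 7, 8, 8]),           # two episodes in one batch
})

for row in toy.rows():  # yields one dict per timestep
    print(row["eps_id"], row["actions"])
```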

I am currently thinking of implementing a variant of TrainOneStep that, instead of calling do_minibatch_sgd, calls a different function that extracts windows from the SampleBatches.
Any advice is appreciated.

question

All 5 comments

The sample batches do include the episode / unroll ids (eps_id), so it is possible to extract contiguous unroll segments from the batch. If you further use "batch_mode: complete_episodes", then you are guaranteed to see entire episodes in the batch, and can use .split_by_episode() to extract those.
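
A minimal sketch of that workflow (the toy batch is made up; only the eps_id column and split_by_episode() come from the comment above):

```python
# Sketch: recover contiguous per-episode segments from a sampled batch.
import numpy as np
from ray.rllib.policy.sample_batch import SampleBatch

train_batch = SampleBatch({
    "obs": np.arange(6, dtype=np.float32).reshape(6, 1),
    "actions": np.zeros(6, dtype=np.int64),
    "eps_id": np.array([0, 0, 0, 1, 1, 1]),  # two contiguous episodes
})

# One SampleBatch per episode, each holding contiguous timesteps.
for episode in train_batch.split_by_episode():
    print(episode["eps_id"][0], episode.count)
```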

Note that for DQN-like algorithms, you'll also want to set "replay_sequence_length: N"; otherwise the training batch will, by default, contain random samples without any contiguous sequences.
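
A hedged config sketch of those two settings; the trainer class, env, and sequence length below are placeholders, not taken from the thread:

```python
# Sketch only: DQN config with whole-episode sampling and sequence replay.
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()
trainer = DQNTrainer(
    env="CartPole-v0",
    config={
        "batch_mode": "complete_episodes",  # sampler returns whole episodes
        "replay_sequence_length": 10,       # replay contiguous length-10 sequences
    },
)
```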

Hope this helps!

Thanks @ericl ! Some quick details for context:

  • I'm currently using torch PPO with the default execution_plan
  • My goal is for the data that arrives inside the loss_fn to contain a set of windows (contiguous sequences) from different episodes, so that I can construct an observation Tensor of shape (N_windows, window_size, obs_size) inside the loss function.

I interpreted your comment to mean that I could set that parameter and use split_by_episode inside the loss_fn (or right before it, inside TorchPolicy.learn_on_batch, because what arrives inside the loss_fn is a UsageTrackingDict).

Even if I set batch_mode: "complete_episodes", which seems to mainly affect the sampler, I don't think that guarantees that the minibatch arriving inside the loss function will contain a set of contiguous windows. I think this is because of the way do_minibatch_sgd creates the minibatches. As far as I can tell, the minibatches function creates a generator that slices up the SampleBatch, so the slices no longer contain what I want. This would suggest that I need to implement a different iterator in place of the TrainOneStep iterator...?
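
To illustrate the concern, a hedged example of how fixed-size slices from the minibatches helper ignore episode boundaries (the toy batch and the eps_id inspection are illustrative additions, not from the thread):

```python
# Illustration: 128-row slices of a flat batch can mix several episodes.
import numpy as np
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.utils.sgd import minibatches

# Toy flat batch: two length-100 episodes laid back to back.
flat = SampleBatch({
    "obs": np.zeros((200, 4), dtype=np.float32),
    "eps_id": np.repeat([0, 1], 100),
})

for mb in minibatches(flat, 128):
    print(np.unique(mb["eps_id"]))  # typically more than one episode id per slice
```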

I found that if I disable the shuffling inside the minibatches function and use the default minibatch size of 128, I ended up with two contiguous trajectories in the postprocessed_batch after using split_by_episode, one of length 40 and another of length 88. All trajectories from my env are length 100, so I'm not totally sure what's going on (I would've expected a trajectory of length 100 and a trajectory of length 28).

So my current belief is that I'll need to implement a different iterator than TrainOneStep that uses a different minibatching scheme (and plug this into a different execution_plan). Does that seem reasonable, or is there an easier way to do this? Further, I'm not sure how I can get those Tensors of shape (N_windows, window_size, obs_size) even if I do get a set of same-length windows in the minibatch, because the UsageTrackingDict cannot be used to perform split_by_episode.

OK, I think I've figured out a reasonable solution that still lets me use TrainOneStep and circumvents the minibatching:

  • Use batch_mode: "complete_episodes"
  • Use num_sgd_iter: 0
  • Use sgd_minibatch_size: 0
  • Use the create_windows function shown in this gist (sketched below), which operates on the UsageTrackingDict once it arrives inside the policy's loss_fn
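
The gist itself isn't reproduced here; a rough, hypothetical sketch of what a create_windows helper along these lines might look like (the window_size parameter and the non-overlapping slicing policy are assumptions):

```python
# Hypothetical sketch only -- not the gist's actual code.
import torch

def create_windows(train_batch, window_size):
    """Group a flat (whole-episode) batch into fixed-size observation windows."""
    obs = torch.as_tensor(train_batch["obs"])        # (T, obs_size)
    eps_id = torch.as_tensor(train_batch["eps_id"])  # (T,)
    windows = []
    for ep in torch.unique(eps_id):
        ep_obs = obs[eps_id == ep]                   # contiguous rows of one episode
        for start in range(0, ep_obs.shape[0] - window_size + 1, window_size):
            windows.append(ep_obs[start:start + window_size])
    return torch.stack(windows)                      # (N_windows, window_size, obs_size)
```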

This has the obvious downside of resulting in batch GD instead of SGD on each new round of data (in my specific case).

I think another way to prevent shuffling would be to allow manually overriding the pre-slice shuffling in sgd.minibatches, perhaps with another config parameter. Currently, pre-slice shuffling is determined by checking whether state_in_0 is a batch key, which is logic decided purely by the model rather than by what the loss function might require independently of the model.

Yep, I was going to suggest something similar (disabling minibatching). If you wanted to get it working with minibatching as well, I don't see any better alternative than rewriting TrainOneStep to (1) split up the train batch by episodes, and then (2) create minibatches made up of one or more contiguous episodes.

This might not end up too complicated (probably not much more code than you have in that snippet), since most of the heavy lifting is done by split_by_episode() and then SampleBatch.concat() to create the minibatches.
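
A hedged sketch of step (2), assuming the train batch holds complete episodes; SampleBatch.concat_samples is used here in place of pairwise concat():

```python
# Sketch: build minibatches out of one or more whole, contiguous episodes.
from ray.rllib.policy.sample_batch import SampleBatch

def episode_minibatches(train_batch, target_size):
    """Yield minibatches that never split an episode across slices."""
    current, count = [], 0
    for ep in train_batch.split_by_episode():
        current.append(ep)
        count += ep.count
        if count >= target_size:
            yield SampleBatch.concat_samples(current)
            current, count = [], 0
    if current:
        yield SampleBatch.concat_samples(current)
```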

@ericl Thanks again! I found a different way to get the result of your (2) above: just hard-code the pre-slice shuffling to never occur, and ensure sgd_minibatch_size is a multiple of the episode length. This works reasonably well for me because my episodes are all length 100. That'll do for now, but maybe I'll return to your suggestion later.
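
For reference, the config side of that workaround might look like this (the exact minibatch size is a placeholder; disabling the pre-slice shuffle still requires the hard-coded edit described above):

```python
# Sketch: values assume length-100 episodes.
config = {
    "batch_mode": "complete_episodes",
    "sgd_minibatch_size": 200,  # a multiple of the episode length (100)
}
```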
