Stable-baselines: How does PPO with LSTM handle the case where episodes may have different lengths?

Created on 22 Mar 2020 · 14 Comments · Source: hill-a/stable-baselines

Hi,

May I ask you a question about PPO with LSTM? When the environments emit done signals, episodes can terminate before reaching the time limit, so episodes may have different lengths. How does PPO with LSTM handle situations like that? I can conjecture an approach: the agent runs a fixed number of steps n in each environment and uses masks in the final loss to filter out the transitions after done signals. However, I found this approach to be unstable, and it is hard to choose the right number of fixed steps. What would you do in that situation?

BTW, I've checked your implementation but have not found anything that resolves my confusion. Please let me know what you think, thanks.

duplicate question

All 14 comments

See related discussions, e.g. #278 and #646 .

Short version: There is no proper backprop over time (hidden states act like they were observations).

If there are no more questions related to this, you can close this issue.

Hi, @Miffyli

I've read these discussions you referred to. If I understand you right, you mean we don't propagate gradients over time as we generally do in NLP. This, however, does not seem to be consistent with this code, which only stores states at the beginning of each run.

I've also read this pull request, which implemented what you said and demonstrated that treating RNN states as a part of input yields good results. But I'm still not fully convinced: If we treat each time step separately, how can the RNN learn useful information that influences subsequent decisions?

@xlnwel

I have been terribly mistaken. Now that I look at it, it indeed seems to do backprop through time over the gathered rollout (n_steps). I have no idea what I was thinking when I wrote those comments. Thanks for pointing this out!

Hi, @Miffyli

Well, then do you know how PPO handles episodes with different lengths? For example, when we run n environments in parallel for n_steps steps, but some environments terminate earlier than the others. How should we deal with such situations?

The rollouts can cross episode boundaries, in which case mask variables are used to tell the network to reset the hidden state to zeros.
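To make the mask mechanism concrete, here is a minimal sketch. It assumes a toy scalar tanh RNN instead of a real LSTM, and the names `masked_rnn_rollout`, `w_in` and `w_rec` are made up for illustration. It also assumes that `masks[t]` is true when the observation at step `t` is the first one of a new episode.

```python
import math


def masked_rnn_rollout(observations, masks, h0, w_in=0.5, w_rec=0.9):
    """Run a toy scalar tanh RNN over one rollout.

    masks[t] == True means step t starts a new episode, so the hidden
    state is zeroed before it is used at that step.
    """
    h = h0
    hidden_states = []
    for x, m in zip(observations, masks):
        h = h * (1.0 - float(m))             # reset to zero at an episode start
        h = math.tanh(w_in * x + w_rec * h)  # ordinary recurrent update
        hidden_states.append(h)
    return hidden_states


# The rollout crosses an episode boundary at step 2.
states = masked_rnn_rollout([1.0, 1.0, 1.0, 1.0], [False, False, True, False], h0=0.0)
```

With identical inputs, the state right after the reset matches the state at the very first step, which is what "reset the hidden state to zeros" amounts to.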

Hi, @Miffyli

I can make this strategy work when the agent interacts with the environment, but how can it be applied at training time, when the LSTM requires batches of fixed-length sequences?

I am not sure what the confusion is here. During training, the code collects n_steps transitions from every environment, regardless of whether episodes ended, and trains on those (fixed-length) trajectories. If an episode ended inside one of the trajectories, it just means the mask is true at that step, and this resets the hidden state during the training process.
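A rough sketch of that collection loop, assuming a hypothetical `ToyEnv` whose episodes end after a random number of steps and which auto-resets like a vectorized environment would. `collect_rollout` is an illustrative name, not the library's runner.

```python
import random


class ToyEnv:
    """Hypothetical environment with random episode lengths (2 to 6 steps)."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.t = 0
        self.length = self.rng.randint(2, 6)
        return 0.0  # the first observation of every episode is 0.0

    def step(self):
        self.t += 1
        done = self.t >= self.length
        # Auto-reset: on done, return the first observation of the next episode.
        obs = self.reset() if done else float(self.t)
        return obs, done


def collect_rollout(envs, n_steps):
    """Collect exactly n_steps per environment, crossing episode boundaries."""
    obs = [env.reset() for env in envs]
    dones = [False] * len(envs)
    batch_obs, batch_masks = [], []
    for _ in range(n_steps):
        batch_obs.append(list(obs))
        batch_masks.append(list(dones))  # mask[t]: did the previous step end an episode?
        results = [env.step() for env in envs]
        obs = [o for o, _ in results]
        dones = [d for _, d in results]
    return batch_obs, batch_masks


batch_obs, batch_masks = collect_rollout([ToyEnv(seed) for seed in range(3)], n_steps=20)
```

Every environment contributes a sequence of the same length, and the masks record where a new episode starts inside it.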

(PS: I still have to triple-check this to make sure I do not talk from my butt this time)

Hi, @Miffyli

If we want to reset the hidden state at training time, do we need to manually define an LSTM ourselves? As far as I know, none of the off-the-shelf LSTMs supports resetting the hidden state of a single sequence at training time.

The simplest way to do that is to provide done=True from the environment, but naturally this causes other things (episode ends, etc etc). If that does not work for you, then you have to modify the code a bit. You do not have to create a new LSTM, but you have to modify the code to provide mask=True when you want the hidden state to be reset. You could, for example, return this information in the info dict of the environment here, build a separate list of "masks", return that to PPO2 training code and use those instead of episode-dones to mask hidden states. By default, dones are renamed to masks here.
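One way the suggestion above could look, purely as a sketch: the info key `"reset_hidden"` and the helper `build_masks` are hypothetical and not part of stable-baselines. This version also keeps resetting on real episode ends, which is an assumption; drop the `dones` term if only the custom signal should count.

```python
def build_masks(dones, infos, key="reset_hidden"):
    """Build per-environment hidden-state masks for one step.

    A mask is True if the episode ended, or if the environment asked for a
    reset through a (hypothetical) flag in its info dict.
    """
    return [bool(done) or bool(info.get(key, False))
            for done, info in zip(dones, infos)]


# The first environment requests a reset without ending its episode.
masks = build_masks([False, True, False], [{"reset_hidden": True}, {}, {}])
```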

Hi, @Miffyli

Do you mean to set mask=True after the episode is finished, i.e. when done=True? If that's what you mean, what should we do when the episode is finished? For example, if we set nsteps=100 but some trajectory finishes after 50 steps, should we reset the corresponding environment? Currently, my solution is not to reset the environment even after it emits the done signal, and to apply a mask to the loss so that the part after the done signal does not contribute to the gradients. However, I found this method to be unstable at times. Furthermore, I've also experimented with two different training processes. The first is to run N environments in parallel and train PPO after all environments are done. The second is to train PPO immediately after nsteps, which is much smaller than the length of an episode. I was expecting the second one to perform better, but the experiments suggested that the first was much more stable, which confuses me. The environment I used is BipedalWalker from gym, whose episode length can vary from tens of steps to 1600 steps.

Yes, that is what I meant. This is done by default already (when an episode terminates, the next one starts with a fresh hidden state); the catch is that you have to end the episode whenever you want to reset hidden states. _However_, it is not the cleanest solution, as ideally you would only reset hidden states at the beginning of episodes (when the game really starts from zero).
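For reference, the loss-masking approach described in this comment could be sketched as follows. The helpers `valid_until_done` and `masked_mean` are hypothetical names, and the per-step losses are plain floats rather than tensors.

```python
def valid_until_done(dones):
    """valid[t] is True up to and including the first done, False afterwards."""
    valid, alive = [], True
    for done in dones:
        valid.append(alive)
        if done:
            alive = False
    return valid


def masked_mean(per_step_losses, valid):
    """Average the losses over valid steps only."""
    total = sum(loss for loss, v in zip(per_step_losses, valid) if v)
    count = sum(1 for v in valid if v)
    return total / max(count, 1)  # avoid division by zero on an empty mask


# The episode ends at step 1, so steps 2 and 3 are ignored by the loss.
valid = valid_until_done([False, True, False, False])
loss = masked_mean([1.0, 3.0, 100.0, 100.0], valid)
```

Note that with this scheme the number of contributing steps varies between batches, which may be one source of the instability mentioned above.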

Sounds like this is going to the "what is the right way of training LSTMs with RL" territory, and I do not think there is any clear answer to this (e.g. R2D2 paper presents some approaches in the case of DQN). Your best bet is to try out different approaches and see what works.

Hi @Miffyli

I've read R2D2, which experiments with stored initial hidden states and a burn-in method. But those are for off-policy methods, where a batch of nsteps-long sequences does not necessarily come from the most recent policy. In that case, we can easily make sure all sequences are of length nsteps. For on-policy methods such as PPO, we will encounter episodes of different lengths, which is exactly what I'm confused about. Maybe I have to do more experiments to figure it out.

Ah, I think I see where the confusion is: In the current implementation, backpropagation is _not_ over whole episodes. It is only done over sequences of length n_steps, which _can_ cross an episode boundary, in which case the gradient is killed at that point.

In case the rollout/trajectory begins in the middle of an episode, the initial hidden state is set to whatever the hidden state was at that point (carried over from the end of the previous rollout), but no backpropagation is done further into the past. This does work with a long enough n_steps, but I am not sure how well it would compare to training with full episodes.
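A small sketch of why a mask kills the gradient, again assuming a toy scalar tanh RNN rather than the actual LSTM. The function below is hypothetical; it tracks the derivative of the hidden state with respect to the rollout's initial state by hand, using the chain rule.

```python
import math


def final_state_and_sensitivity(observations, masks, h_init, w_in=0.5, w_rec=0.9):
    """Return the final hidden state and d(final state) / d(h_init).

    masks[t] == True zeroes the hidden state before step t, which also
    zeroes every gradient path leading back past that step.
    """
    h = h_init
    dh_dinit = 1.0
    for x, m in zip(observations, masks):
        keep = 0.0 if m else 1.0
        h = math.tanh(w_in * x + w_rec * keep * h)
        # Chain rule: d tanh(u)/du = 1 - tanh(u)^2, and du/dh_prev = w_rec * keep.
        dh_dinit = (1.0 - h * h) * w_rec * keep * dh_dinit
    return h, dh_dinit


# No boundary: the final state still depends on the carried-over state.
_, grad_no_boundary = final_state_and_sensitivity([1.0, 1.0, 1.0], [False, False, False], 0.3)
# Boundary at step 1: nothing before it can influence the final state.
_, grad_with_boundary = final_state_and_sensitivity([1.0, 1.0, 1.0], [False, True, False], 0.3)
```

In the second call the sensitivity is exactly zero, so whatever state was carried into the rollout has no effect on anything after the episode boundary.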

Hi, @Miffyli

I see that the official implementation uses a customized LSTM, which uses masks to stop the gradient when an episode is done. Everything makes sense now.

Anyway, thanks for the discussion.
