Stable-baselines: Trying to understand how the LSTM policy works

Created on 17 Apr 2019 · 6 Comments · Source: hill-a/stable-baselines

Dear @erniejunior,

I have been trying to trace how the LSTM policy works (with ACER), and it is rather confusing. My understanding is that n_steps equals the LSTM sequence length, so each batch (n_env * n_steps) is fed into the LSTM policy for train_step. However, in _Runner.run, self.model.step takes only a single observation of shape (1, obs_dim) rather than (n_steps, obs_dim) when generating the predicted action.

So my two questions are:
1) Can you explain briefly how the LSTM policy works when it is trained on a sequence of observations but predicts from only one observation?
2) It seems that the training batches do not slide across the sequence. For example, with data {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} and 5 timesteps, it is trained on the batches {0, 1, 2, 3, 4} and {5, 6, 7, 8, 9} rather than {0, 1, 2, 3, 4} followed by {1, 2, 3, 4, 5}. Is that correct?
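
To make the two batching schemes in question 2 concrete, here is a minimal sketch in plain Python. It is not taken from stable-baselines; the helper names `rollout_chunks` and `sliding_windows` are hypothetical and only illustrate the difference between consecutive, non-overlapping chunks and overlapping windows.

```python
def rollout_chunks(data, n_steps):
    """Split data into consecutive, non-overlapping chunks of length n_steps."""
    return [data[i:i + n_steps]
            for i in range(0, len(data) - n_steps + 1, n_steps)]


def sliding_windows(data, n_steps):
    """Build overlapping windows of length n_steps, advancing one element at a time."""
    return [data[i:i + n_steps] for i in range(len(data) - n_steps + 1)]


data = list(range(10))

chunks = rollout_chunks(data, 5)    # the scheme the question believes is used
windows = sliding_windows(data, 5)  # the alternative the question describes

print(chunks)
print(windows[:2])
```

With ten data points and 5 timesteps, the first scheme yields two batches while the second yields six overlapping ones.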

documentation question

All 6 comments

Hello,

I have been trying to trace how the LSTM policy works (with ACER), and it is rather confusing.

I think this is a good question, and some documentation is needed on that. To be honest, I have not had the time to dive into the obscure mechanics of the LSTM in the codebase, but I would recommend looking at PPO2 or A2C instead, because the ACER code is very hard to read.

And please tell us your findings, that would be valuable for the community ;)

Related: #158

I have only ever looked at PPO2 too. I will try to get back to you in a few days, when I have some more time!

Hello,

Is there any update on this? I have the same questions as @Caisho. The way the LSTM policy is used doesn't make sense to me.

Admittedly, that part of the code could be clearer, but this is how I have understood it:

No unrolling/backprop-through-time is used here. Each step is handled separately, and the hidden state is just one of the inputs. This makes learning harder but the implementation easier, since hidden states can be treated like any other input. The "right way" of doing recurrent policies with RL agents is still an area of ongoing research (see e.g. R2D2). For prediction, we just feed in the observations and the hidden states from previous calls.

Note that this is based on the observation that states are stored as numpy arrays during training: they are fed in alongside the observations and are not updated during training steps.

Late edit: Disregard the above. The code seems to run backprop through time over the gathered rollout, i.e. over n_steps. The previously known hidden state is used as the initial point. Only these initial states are stored in numpy arrays.
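
As a rough illustration of why single-observation prediction and n_steps training are compatible, here is a minimal sketch in plain Python. It assumes a toy tanh recurrent cell in place of a real LSTM, and the names `cell`, `unroll`, and `step` as well as the weights are hypothetical, not stable-baselines code. It shows that calling the cell once per observation while carrying the hidden state forward gives the same outputs as unrolling over the whole rollout from the stored initial state.

```python
import math


def cell(x, h, w_x=0.5, w_h=0.8):
    """Toy recurrent cell (stand-in for an LSTM): new state from input and old state."""
    return math.tanh(w_x * x + w_h * h)


def unroll(obs_seq, h0):
    """Training-style pass: process a whole rollout starting from the stored initial state."""
    h = h0
    outputs = []
    for x in obs_seq:
        h = cell(x, h)
        outputs.append(h)
    return outputs, h


def step(x, h):
    """Prediction-style call: one observation plus the state from the previous call."""
    h = cell(x, h)
    return h, h


rollout = [0.1, -0.3, 0.7, 0.2, -0.5]  # n_steps = 5 made-up observations
h0 = 0.0                               # initial hidden state kept for this rollout

# Training view: one pass over the full sequence.
train_outputs, final_state = unroll(rollout, h0)

# Prediction view: one observation per call, state carried between calls.
h = h0
step_outputs = []
for x in rollout:
    out, h = step(x, h)
    step_outputs.append(out)

print(train_outputs == step_outputs)  # True: both views compute the same values
print(final_state == h)               # True: the carried state matches the unrolled one
```

In this sketch, the final state of one rollout would serve as the stored initial state of the next one, which matches the description above that only the initial states need to be kept.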

Thank you @Miffyli
