I'm having a hard time conceptualizing the difference between stateful and stateless LSTMs in Keras. My understanding is that at the end of each batch, the "state of the network is reset" in the stateless case, whereas for the stateful case, the state of the network is preserved for each batch, and must then be manually reset at the end of each epoch.
My questions are as follows:

1. In the stateless case, how is the network learning if the state isn't preserved in between batches?
2. When would one use the stateless vs. the stateful mode of an LSTM?
1) An LSTM predicts based on the activations of its memory cell from the previous timestep. At a batch boundary, do you copy those activations over, or do you reset them to all zeros? That is what distinguishes a stateful from a stateless LSTM. So "state" refers to the neuron activations (the hidden and cell states), not to the parameters, which are kept and updated in either case.
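A minimal sketch of that distinction, assuming the tf.keras functional API (the layer size and shapes are made up): `return_state=True` exposes the hidden and cell activations, and `initial_state` feeds them back in, which is exactly what stateful mode does for you between batches.

```python
import tensorflow as tf

units, timesteps, features = 16, 5, 4
lstm = tf.keras.layers.LSTM(units, return_state=True)

x1 = tf.random.normal((2, timesteps, features))  # "batch n-1"
x2 = tf.random.normal((2, timesteps, features))  # "batch n"

# Stateless behaviour: each call starts from all-zero activations.
out1, h, c = lstm(x1)
out2_reset, _, _ = lstm(x2)  # h and c implicitly start at zero again

# Stateful behaviour: copy the previous batch's activations forward.
out2_carried, _, _ = lstm(x2, initial_state=[h, c])

# The weights are identical in all three calls; only the starting
# activations (the "state") differ.
```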
2) It depends on whether you want predictions in batch n to depend on the state from batch n-1. For example, in language modelling, successive batches are successive chunks of text, so it makes sense to carry the state over. But if you know that successive batches are unrelated to each other, it makes more sense to reset the state.
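In Keras this is just a constructor flag. A rough sketch, assuming the TF 2.x tf.keras API and dummy shapes/data: stateful mode requires a fixed batch size, unshuffled batches, and a manual reset (here at the end of each epoch, as described in the question).

```python
import numpy as np
import tensorflow as tf

batch_size, timesteps, features = 32, 10, 8
x = np.random.rand(128, timesteps, features).astype("float32")  # dummy data
y = np.random.rand(128, 1).astype("float32")

model = tf.keras.Sequential([
    # stateful=True: activations at the end of batch n seed batch n+1,
    # so the batch size must be fixed up front.
    tf.keras.layers.LSTM(64, stateful=True,
                         batch_input_shape=(batch_size, timesteps, features)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

for epoch in range(3):
    # shuffle=False keeps successive batches in their original order,
    # so the carried-over state actually lines up with the data.
    model.fit(x, y, batch_size=batch_size, epochs=1, shuffle=False)
    model.reset_states()  # manual reset; drop stateful=True for stateless mode
```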
@vgoklani: I sometimes find it useful to picture an LSTM as a Markov chain. Learning corresponds to finding the transition probabilities (the hidden weights in the LSTM) between states. And, as @larspars said, it is the current state that determines the output of your LSTM. So here is my understanding.

"Stateless" means resetting the LSTM to an "initial state" for every new batch, and "stateful" means you continue from where you left off. In both cases the LSTM is learning, because the transition probabilities (the weights) are updated either way.

Given that, a stateless LSTM should be used when the instances in different batches are independent, e.g., when modelling sentence-level patterns where each instance is a sentence: the state should be reset to "sentence beginning" for every new instance. A stateful LSTM is more useful when there is continuity between the ith instances across batches, e.g., when modelling document-level patterns (without resetting at sentence boundaries). In that case the ith instance of each batch should hold the consecutive sentences of the ith document, as sketched below.
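To make that layout concrete, here is a sketch with a hypothetical helper (numpy only, token ids standing in for encoded sentences) that arranges batches so that sample i of batch k continues sample i of batch k-1:

```python
import numpy as np

def make_stateful_batches(docs, timesteps):
    """Hypothetical helper: position i of every batch is the next
    consecutive chunk of document i."""
    n_chunks = min(len(d) // timesteps for d in docs)  # truncate to whole chunks
    batches = []
    for k in range(n_chunks):  # batch k = the k-th chunk of each document
        batch = np.stack([d[k * timesteps:(k + 1) * timesteps] for d in docs])
        batches.append(batch)  # shape: (num_docs, timesteps)
    return batches

# Three "documents" of token ids, with different lengths.
docs = [np.arange(100), np.arange(200, 290), np.arange(500, 620)]
for batch in make_stateful_batches(docs, timesteps=10):
    print(batch[:, 0])  # row i always continues where it left off in doc i
```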