Stateful LSTMs seem to be confusing everybody. I don't recommend stateful unless you know what it is and have a good reason to use it.
Imagine you have a 2-layer neural network and you only train the last layer. It might learn something, it might not. That is basically what stateful is doing between batches. Cell t+1 will do its best to do something with state t, but state t will be random and untrained. It might learn something, it might not.
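For concreteness, a minimal sketch (tf.keras, with made-up layer sizes and random data) of what that carried state means: the final state left behind by one train_on_batch call becomes the initial state of the next, but no gradient ever flows back across that boundary, and reset_states() is the only thing that clears it.

import numpy as np
import tensorflow as tf

batch_size, timesteps, features = 4, 10, 3
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(8, stateful=True,
                         batch_input_shape=(batch_size, timesteps, features)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x1 = np.random.rand(batch_size, timesteps, features).astype("float32")
x2 = np.random.rand(batch_size, timesteps, features).astype("float32")
y = np.random.rand(batch_size, 1).astype("float32")

model.train_on_batch(x1, y)  # leaves the LSTM holding "state t"
model.train_on_batch(x2, y)  # starts from "state t"; no gradient crosses the batch boundary
model.reset_states()         # without this, state keeps carrying over indefinitely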
@fchollet I'm thinking something like a disclaimer on the stateful LSTM example. It is a proof-of-concept showing that you can have sequence lengths of 1, but if you can pass actual sequences in a batch you will have a better model. I'm seeing a lot of people try to build models with sequence lengths of 1, which is simply a bad idea. I like having the example but we need to be clear that it is not the preferred way to do things.
Also, the stateful example is kinda odd. It has a batch size of 25 and feeds all of the examples in order. That means the hidden state at 20 is used to make the prediction at step 45. The hidden states are randomly initialized and untrained. I don't think most people understand that part and end up with some weird models.
I've mostly just been recommending that people don't use stateful and pass actual sequences in each batch.
Cheers
You are correct that sequences of size 1 are a bad idea since they imply a total absence of backprop through time, which obviously will lead to a bad model. Please send a PR to modify the example to show best practices instead (stateful + sequences of reasonable size to allow for truncated backprop through time).
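A hedged sketch of that recommendation (the window length, layer size and toy sine data below are assumptions, not the repository example): split the long series into consecutive windows of reasonable length, keep them in time order, and reset the state between epochs so truncated BPTT happens within each window.

import numpy as np
import tensorflow as tf

window, batch_size, features = 50, 1, 1                        # assumed sub-sequence length
series = np.sin(np.linspace(0, 100, 5000)).astype("float32")   # toy long series

# consecutive, non-overlapping windows in time order; each target is the next value
starts = range(0, len(series) - window - 1, window)
xs = np.stack([series[i:i + window] for i in starts])[..., None]   # (samples, window, 1)
ys = np.array([series[i + window] for i in starts])[:, None]       # (samples, 1)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, stateful=True,
                         batch_input_shape=(batch_size, window, features)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

for epoch in range(5):
    model.fit(xs, ys, batch_size=batch_size, epochs=1, shuffle=False)  # order matters
    model.reset_states()  # do not let state leak across epoch boundaries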
I tried my example with sequence length 1 and the outcome was horrible. You need to use the original sequence length, rather than shortening it or setting it to 1.
Why anybody would want to have a stateful model during training is beyond me, so with this part I can agree. But during testing, when you want to let the model predict some output on some data, stateful makes a lot more sense. For example, the model might be part of a larger system that works on video frames. It might be required to perform some action instantly after each frame, instead of waiting for a sufficiently long sequence of video frames before feeding them to the network. It would be really nice if you could train the network stateless with a time-depth of X (say 16), and then use those weights on a stateful network with a time-depth of 1 during prediction. In my experience, however, this does not work in Keras.
IMO it would be extremely useful for BPTT to work with stateful models (unfolded and back-propagated). For models like sequence-to-sequence models it is a lot more natural to take into account (and train with) longer context than just the current sequence when predicting the target sequence. This mechanism would let models use global (or longer) contexts rather than just local ones. I am not sure how it can be achieved yet and will dig into the code for more detail.
ahrnbom, Using model.get_weights() and model.set_weights(), the weights of a normal LSTM can be transferred to a stateful LSTM, assuming the architectures are otherwise identical.
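A hedged sketch of that transfer (the layer size and 6-feature input below are assumptions): build the same architecture twice, once stateless for training and once stateful with batch size 1 for streaming prediction, then copy the weights across.

import tensorflow as tf

def build(stateful):
    if stateful:
        inp = tf.keras.layers.Input(batch_shape=(1, None, 6))  # fixed batch of 1 for streaming
    else:
        inp = tf.keras.layers.Input(shape=(None, 6))
    x = tf.keras.layers.LSTM(32, stateful=stateful)(inp)
    out = tf.keras.layers.Dense(1)(x)
    return tf.keras.Model(inp, out)

trained = build(stateful=False)               # assume this one was trained statelessly
streaming = build(stateful=True)              # same architecture, stateful, batch size 1
streaming.set_weights(trained.get_weights())  # weight shapes are identical, so this just works

# frame-by-frame prediction; the LSTM state carries over between calls
# frame = ...                                 # shape (1, 1, 6)
# prediction = streaming.predict(frame)
# streaming.reset_states()                    # call when a new, unrelated sequence starts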
Maybe I'm just not seeing the obvious, but could someone explain to me why it is not a good idea to use sequences of size=1 together with a stateful LSTM and why they imply an absence of BPTT?
@bstriner Can a stateful LSTM be trained using fit_generator?
We know that in a stateful LSTM the state passes between batches, so training on each batch depends on the preceding batch. Considering the importance of ordering, can we use fit_generator with use_multiprocessing=True? Is there anything I should take into account when writing my own batch generator so that the batch order is preserved? I should mention that I have a time-series regression type problem.
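For reference, a hedged sketch of such an order-preserving generator (reusing the same toy sine series and assumed window length as the earlier sketch): because batch k+1 continues from batch k's state, the generator yields windows strictly in time order, and workers/use_multiprocessing are kept off so the queue cannot reorder them.

import numpy as np
import tensorflow as tf

window = 50                                                    # assumed sub-sequence length
series = np.sin(np.linspace(0, 100, 5000)).astype("float32")   # toy regression series

def consecutive_batches(series, window):
    # yield (x, y) windows of one long series forever, strictly in time order
    n_batches = (len(series) - window - 1) // window
    while True:
        for k in range(n_batches):
            i = k * window
            x = series[i:i + window].reshape(1, window, 1)     # batch size 1
            y = series[i + window].reshape(1, 1)
            yield x, y

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, stateful=True, batch_input_shape=(1, window, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

steps = (len(series) - window - 1) // window
model.fit_generator(consecutive_batches(series, window),
                    steps_per_epoch=steps, epochs=2,
                    shuffle=False, workers=1, use_multiprocessing=False)
model.reset_states()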
@Elch123 Have you tested that? It apparently doesn't work on my end.
Basically I have a Sequential model trained stateless, and the first layer of the model is a stacked LSTM. I load the trained model, edit the config with 'stateful'=True, set batch_input_shape, and build a new model with the same weights and the modified config, but the testing result doesn't change.
import keras
import tensorflow as tf

# load the stateless model that was trained earlier
old_network = keras.models.load_model(args.model_path, custom_objects=None, compile=False)
config = old_network.get_config()
# flip the first (LSTM) layer to stateful and fix the batch size to 1
config['layers'][0]['config']['stateful'] = True
config['layers'][0]['config']['batch_input_shape'] = (1, None, 6)
weights = old_network.get_weights()
# rebuild the model from the edited config and copy the trained weights over
network = tf.keras.models.Sequential.from_config(config)
network.set_weights(weights)
I trained my model on data sequences with the same number of timesteps, which are independent of each other. Now I want to do real-time prediction, where the input has only one timestep, and I want each prediction to be based on the previous state.
Does anyone have an idea how to do that?
Update:
It seems that both stateful and stateless models update the state in the same way during model.predict(). Is that expected behaviour?
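One hedged way to check that (untrained toy models, assumed 6-feature single-frame input): a non-stateful model starts every sample from a zero state, so it returns the same output for the same input on every predict() call, while a stateful one generally does not until reset_states() is called.

import numpy as np
import tensorflow as tf

frame = np.random.rand(1, 1, 6).astype("float32")

stateless = tf.keras.Sequential([
    tf.keras.layers.LSTM(8, input_shape=(None, 6)),
    tf.keras.layers.Dense(1),
])
stateful = tf.keras.Sequential([
    tf.keras.layers.LSTM(8, stateful=True, batch_input_shape=(1, None, 6)),
    tf.keras.layers.Dense(1),
])
stateful.set_weights(stateless.get_weights())              # identical weight shapes

print(stateless.predict(frame), stateless.predict(frame))  # same output twice: state reset per call
print(stateful.predict(frame), stateful.predict(frame))    # outputs differ: state carried over
stateful.reset_states()                                    # back to the zero state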
@bstriner: regarding your comment on the questionable suitability of stateful LSTMs, suppose I have a long time series with some yearly and monthly seasonality patterns. Given the long dependencies between batches, doesn't a stateful LSTM make more sense in this case? Could you please explain further your statement: "state t will be random and untrained. It might learn something it might not."
How could state t be random?
Thank you for your help.
I think he means the gradient cannot really backpropagate between batches. The stateful setting enables us to initialize the hidden states of the next batch with the hidden states of the last batch, but this is still somewhat "random" because you have no reason to believe that the hidden states of the last batch have been well trained. Moreover, you cannot change the initial hidden states when training on the next batch. I think that's why @bstriner said it is random and untrained.
> The stateful setting enables us to initialize the hidden states of the next batch with the hidden states of the last batch, but this is still somewhat "random" because you have no reason to believe that the hidden states of the last batch have been well trained.

No longer true after a few epochs.
Yes, that's why I said "somewhat"; it was just my guess at the intention of @bstriner's original comment.
@VertexC I am trying to do something similar and I experienced the same behaviour in keras. Did you find a proper way to do it?