Keras: Stateful predictions of RNN

Created on 22 Apr 2016  ·  18 Comments  ·  Source: keras-team/keras

Hi, I have read the Keras docs about the "stateful" flag and several issues. However, I still don't understand how stateful prediction works.

Note that the methods predict, fit, train_on_batch, predict_classes, etc. will all update the states of the stateful layers in a model. This allows you to do not only stateful training, but also stateful prediction.

For instance, the test data has shape (2000, 10, 3), where 2000 is the number of samples, 10 is the sequence length, and 3 is the input dimension. The test labels have shape (2000, 2), where 2 is the label dimension.

When I use predict_classes, I label one batch of (1, 10, 3) sequences at a time. However, the parameters of the network are fixed when predicting, so what does the state mean? How can the network pass the state on to the next batch?

The second question. Suppose there are several sequences A, B, C, D, ...
I want to give sequence B a label based on the label of A. Should I put sequence A in the first batch, and sequence B in the same position in the second batch? Is there any other method?

The third question. Why does Keras always split long sequences into shorter ones? If I want to give every timestep a label, I should use TimeDistributedDense. But is it possible to process the whole long sequence at once?

Thank you in advance!


All 18 comments

@fchollet @EderSantana I'm sorry to bother you, but can you help me with my questions?

I'll try to help:

what does the state mean?How can the network pass the state on to next batch?

when you first create your model, it guesses randomly. after it's trained, it ideally guesses more accurately. the 'state' is the training (the weights that have been set via training). This means you can instantiate a model, train it, and use that instance of the model to perform predictions on properly shaped inputs.

The second question. Suppose there are several sequences A, B, C, D, ...
I want to give sequence B a label based on the label of A. Should I put sequence A in the first
batch, and sequence B in the same position in the second batch? Is there any other method?

If you want A and B to be related as a sequence of data in the prediction of B's label, I think they need to be input together as one sequence. Ideally the order of the data points in the training set doesn't matter at all, and you can shuffle them freely.

The third question. Why does Keras always split long sequences into shorter ones? If I want to give every timestep a label, I should use TimeDistributedDense. But is it possible to process the whole long sequence at once?

This sounds potentially like a programming issue — can you post a gist?

@bhtucker Thank you so much for your help!
I still have something to ask about your reply.

the 'state' is the training
Do you mean that there is no 'state' when predicting? Does the network predict every sequence separately? The Keras documentation says:

Note that the methods predict, fit, train_on_batch, predict_classes, etc. will all update the states of the stateful layers in a model. This allows you to do not only stateful training, but also stateful prediction.
But I don't understand how stateful prediction works.

I understand your reply about my second question, thank you!

The third question: actually, I want to solve some prediction problems, like the one in the Keras documentation that predicts the 11th timestep given the first 10.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# X is our input data, of shape (32, 21, 16)
# we will feed it to our model in sequences of length 10
X = np.random.random((32, 21, 16))  # random stand-in data for illustration

model = Sequential()
model.add(LSTM(32, batch_input_shape=(32, 10, 16), stateful=True))
model.add(Dense(16, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# we train the network to predict the 11th timestep given the first 10:
model.train_on_batch(X[:, :10, :], np.reshape(X[:, 10, :], (32, 16)))
# the state of the network has changed. We can feed the follow-up sequences:
model.train_on_batch(X[:, 10:20, :], np.reshape(X[:, 20, :], (32, 16)))

In the first train_on_batch, the network learns to predict the 11th timestep; in the second train_on_batch, the 21st timestep. So the network doesn't need to predict the 2nd, 3rd, 4th, ... timesteps?

Looking forward to your reply.

Here's an example that does character-by-character prediction / text generation via sampling: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py

In particular maybe this part:

            preds = model.predict(x, verbose=0)[0]   # probability distribution over the next character
            next_index = sample(preds, diversity)    # sample an index from that distribution
            next_char = indices_char[next_index]     # map the index back to a character

            generated += next_char

@bhtucker Thank you !

when you first create your model, it guesses randomly. after it's trained, it ideally guesses more accurately. the 'state' is the training (the weights that have been set via training).

You are confusing the 'state' and the 'weights'. Let me try to explain...

On each timestep, each node in an RNN receives some input and produces two outputs. One output goes to the next network layer; the other is remembered until the next timestep and is then combined with the following input. These 'internal outputs' let it learn relationships in the data that span several timesteps. This internal 'state' is not trainable; just as with the output that is fed into the next network layer, it is the weights in the formulas used to calculate it that are trained.

The best way of understanding statefulness is to imagine what would happen if each batch contained sequences of only one timestep. You would want the network to remember its internal state between the first batch and the second batch, rather than starting over at each step with its internal state set to all zeros. Otherwise the network would be unable to learn any time-based relationships in the input data.

A non-stateful RNN cannot learn any long-running relationships in the input data that are longer than the training sequence length. That is where statefulness comes in useful. You can feed a stateful RNN the first ten steps in one batch, then the next ten steps in the second batch, and you can be sure that the internal state of the network after the first ten predictions have been calculated is used to prime the network for the second set of ten timesteps. Thus the network can learn relationships in the input data spanning up to twenty steps even though the sequence length of each batch is only ten steps. If you add a third set of ten timesteps that follow on from the second set, the network can learn relationships spanning up to thirty steps.

If you had data for each day of the year your first batch could consist of the data for Jan 1st, Feb 1st, Mar 1st, ... Dec 1st. The second batch would then be Jan 2nd, Feb 2nd, ... Dec 2nd. Each element of each batch must follow on from the corresponding element in the previous batch. The batch size in this example is 12 and the sequence length is 1.
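A rough sketch of that calendar setup (the layer sizes, array name, and the 28-day simplification are illustrative, not from the thread):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

batch_size, time_steps, n_features = 12, 1, 4  # 12 parallel month-streams, 1 day per batch

model = Sequential()
model.add(LSTM(8, batch_input_shape=(batch_size, time_steps, n_features), stateful=True))
model.add(Dense(n_features))
model.compile(optimizer='rmsprop', loss='mse')

# days[d, m] holds the features for day d+1 of month m+1 (random stand-in values here,
# truncated to 28 days per month to keep the shapes simple).
days = np.random.random((28, 12, n_features))

for epoch in range(5):
    for day in range(days.shape[0] - 1):
        x = days[day].reshape(batch_size, time_steps, n_features)  # Jan d, Feb d, ..., Dec d
        y = days[day + 1]                                          # the following day of every month
        model.train_on_batch(x, y)
    model.reset_states()  # start the next pass over the year with a clean state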

@jpeg729 Thanks for the correction — I missed that the question was about the more precise sense of 'state' and not the more general model-weights-as-state.

It looks like in the RNN literature the state at t is typically h_sub_t, so I'll call this recurrent state h. I also have a few questions about h:

  • Can you set the dimensionality of h? Or is it always some function of other dimensionality parameters?
  • In a classification task using RNNs, like the imdb sentiment analysis example, is the RNN's output to the next layer simply the value of h at the end of that sequence?
  • Some RNN papers (like http://arxiv.org/pdf/1409.2329.pdf) refer to having 600+ unit RNN layers. I understand that h is passed along and modified by each unit, but do these units all share values for their other weights (for GRUs, w and u)? Is there a way of creating a 600 unit RNN layer in Keras, or am I misunderstanding something?

Appreciate your help!

  1. As far as I understand it, each GRU node or LSTM node stores one value. The dimensionality of h doesn't seem to be a useful question.
  2. No. At time t the RNN's input is (input_sub_t, h_sub_t_minus_one) and its output is output_sub_t. h_sub_t is stored internally while the network awaits the next input set.
  3. Each LSTM/GRU node stores its own value which it uses on the next timestep. The LSTM/GRU nodes in a layer do not share weights or state; each node has its own weights and its own state. model.add(LSTM(600)) would do the trick, as in the sketch below.
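For instance, a minimal sketch of such a layer (the input shape here is made up for illustration):

from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
# 600 units: each unit keeps its own weights and its own piece of state,
# and the layer's output at each step is a vector of 600 values.
model.add(LSTM(600, input_shape=(50, 128)))  # 50 timesteps of 128 features (illustrative)
model.summary()  # output shape: (None, 600)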

@jpeg729 Thank you for your help, but I think my problem is still not solved. Can you explain the sentence from the Keras documentation below? Thank you!

Note that the methods predict, fit, train_on_batch, predict_classes, etc. will all update the states of the stateful layers in a model. This allows you to do not only stateful training, but also stateful prediction.

It means that just as training can be done by feeding the model one sequence step at a time, so you can also predict by just giving the network one sequence step at a time. If the model is stateful, then it can "remember" some aspects of the previous steps in the sequence.

This is rather useful when using the model for real-time predictions of a series.
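A minimal sketch of that kind of real-time use, assuming a stateful model fed one timestep per call (the layer sizes and variable names are illustrative):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

n_features = 3

model = Sequential()
model.add(LSTM(16, batch_input_shape=(1, 1, n_features), stateful=True))
model.add(Dense(n_features))
model.compile(optimizer='rmsprop', loss='mse')
# ... train the model here (statefully, as in the earlier snippet) ...

stream = np.random.random((100, n_features))  # stand-in for values arriving one at a time

model.reset_states()                        # clean state before a new series
for step in stream:
    x = step.reshape(1, 1, n_features)
    pred = model.predict(x, batch_size=1)   # predicting also advances the internal state
# call model.reset_states() again before an unrelated series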

Hi @jpeg729 ,
I have a question regarding batch_shape in the first layer of a stateful LSTM network.
As far as I know, batch_shape specifies (batch_size, time_step, input_dim) for an LSTM network. If we go stateful, we can set batch_size to 1, but what about time_step? Can we set time_step to 1 as well?
Let's say I'm building a character-level text classifier with a stateful LSTM;
in this case should the batch_shape be (1, 1, 26)? (where 26 is the number of characters)
Thank you!

If you do stateful you can still use batch_size > 1, and time_steps > 1.

The simplest case is of course, batch_size == 1 and time_steps == 1, and in that case you feed the model one character of a single sentence at a time.

If you use time_steps == 10, for example, then you must feed the sentence to the model 10 characters at a time: the first call must provide the first 10 characters of the sentence, the second call the next 10 characters, and so on.

You can use batch_size > 1, in which case you feed the model several sentences at once. The first call provides the first time_steps characters of each sentence, the second call provides the next time_steps characters and so on.
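For example, a rough sketch of that feeding order for two one-hot encoded sentences (the sizes, layer widths, and random stand-in data are illustrative):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

batch_size, time_steps, n_chars = 2, 10, 26

model = Sequential()
model.add(LSTM(32, batch_input_shape=(batch_size, time_steps, n_chars), stateful=True))
model.add(Dense(n_chars, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# Two sentences of equal length, one-hot encoded (random characters as a stand-in).
sentences = np.eye(n_chars)[np.random.randint(0, n_chars, size=(batch_size, 50))]

model.reset_states()  # fresh state before starting this pair of sentences
for start in range(0, sentences.shape[1], time_steps):
    chunk = sentences[:, start:start + time_steps, :]  # the next time_steps characters of every sentence
    preds = model.predict_on_batch(chunk)              # the state carries over to the next chunk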

I hope this is helpful.

Hi @jpeg729 ,
Thank you for your answer! This is very helpful. So my understanding is that batch_size won't affect the performance of the LSTM; the only difference is that with a large batch size we would see some training speed-up on the GPU?
Regarding time_steps == 1: if we pass one character at a time, it seems like we are not utilizing the advantage of the LSTM, because we don't do BPTT across batches in training? I assume the performance with time_steps == 1 will be worse than with n time steps? Please correct me if I'm wrong.
What would be your suggestion on batch_shape configurations?

Thank you!

I think a smallish batch_size would provide some speedup even on CPU since modern CPUs have some ops for vector/matrix calculations.

I think you are right about losing BPTT when using time_steps == 1. However, you will still get some useful training since the hidden state is preserved.

I have tried batch_size between 10 and 20, with time_steps >= 50. Those settings seem to do useful work at reasonable speed.

Hi @jpeg729, sorry, I'm still confused about the batch size.
Let's say we have two sentences, one is
A B C D and the other is E F G H, and the time step of the LSTM is 2, so I have to divide each length-4 sequence into two length-2 subsequences. After cutting, I get
A B, C D, E F, G H, and my training batch looks like (imagine it has already been converted to one-hot encodings or labels)
[A B, C D, E F, G H]
Here is the part that confuses me: should I pass this [A B, C D, E F, G H] to the fit() function with batch_size = 2 and reset states every 2 batches (using a callback),
or should I create two chunks:
[A B, E F]
and
[C D, G H], and pass those small batches one after another, resetting the state every 2 small batches?
Does that make sense?
If I want to put all the training data in one training batch, how should I order the samples within the batch?
Thank you!

@jpeg729 When predicting with a stateful LSTM, does predict() set the state variables? If so, to predict a value, should we first run predict over the past N samples so that the state produced by those predictions is in place? Thanks

@lmxhappy Do you have an answer to your problem yet?
