First of all, I know that there are already issues open regarding this topic, but their solutions don't solve my problem, and I'll explain why.
The problem is to predict the next n_post steps of a sequence given the previous n_pre steps, with n_post < n_pre. I've built a toy example using a simple sine wave to illustrate it. The many-to-one forecast (n_pre=50, n_post=1) works perfectly:
model = Sequential()
model.add(LSTM(input_dim=1, output_dim=hidden_neurons, return_sequences=False))
model.add(Dense(1))
model.add(Activation('linear'))
model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])
Also, the many-to-many forecast with (n_pre=50, n_post=50) gives a near-perfect fit:
model = Sequential()
model.add(LSTM(input_dim=1, output_dim=hidden_neurons, return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.add(Activation('linear'))
model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])
But now assume we have data that looks like this:
dataX or input: (nb_samples, nb_timesteps, nb_features) -> (1000, 50, 1)
dataY or output: (nb_samples, nb_timesteps, nb_features) -> (1000, 10, 1)
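For reference, here is a minimal sketch (the windowing scheme below is my own illustration, not taken from the original post) of how arrays with those shapes could be built from a sine wave:

import numpy as np

n_pre, n_post = 50, 10
t = np.linspace(0, 100, 2000)
signal = np.sin(t)

# Slide a window over the signal: the first n_pre points form the input,
# the following n_post points form the target.
dataX, dataY = [], []
for i in range(len(signal) - n_pre - n_post):
    dataX.append(signal[i:i + n_pre])
    dataY.append(signal[i + n_pre:i + n_pre + n_post])

dataX = np.array(dataX)[..., np.newaxis]  # (nb_samples, 50, 1)
dataY = np.array(dataY)[..., np.newaxis]  # (nb_samples, 10, 1)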
The solution given in #2403 is to build the model like this:
model = Sequential()
model.add(LSTM(input_dim=1, output_dim=hidden_neurons, return_sequences=False))
model.add(RepeatVector(10))
model.add(TimeDistributed(Dense(1)))
model.add(Activation('linear'))
model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])
Well, it compiles and trains, but the prediction is really bad:
My explanation for this is: the network has only one piece of information (no return_sequences) at the end of the LSTM layer, repeats it output_dim times, and then tries to fit. The best guess it can give is the average of all the points to predict, since it doesn't know whether the sine wave is currently going up or down; it loses this information with return_sequences=False!
So, my final question is: how can I keep this information and let the LSTM layer return part of its sequence? I don't want to fit it to n_pre=50 time steps but only to 10, because in my problem the points are of course not as nicely correlated as in the sine wave. Currently I just feed in 50 points and then crop the output (after training) to 10, but the model still tries to fit all 50, which distorts the result.
Any help would be greatly appreciated!
I think you need to do something like this:
model = Sequential()
model.add(LSTM(input_dim=1, output_dim=hidden_neurons, return_sequences=False))
model.add(RepeatVector(10))
model.add(LSTM(output_dim=hidden_neurons, return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.add(Activation('linear'))
model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])
otherwise you are just repeating the output of the last Dense layer and getting a constant value.
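For completeness, a sketch of how that model could be trained and used on data with the shapes above (hidden_neurons, batch size, and epoch count are placeholders, not values from this thread; the argument names follow the old Keras 1.x API used here):

from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense, Activation

hidden_neurons = 64  # placeholder value

model = Sequential()
model.add(LSTM(input_dim=1, output_dim=hidden_neurons, return_sequences=False))
model.add(RepeatVector(10))
model.add(LSTM(output_dim=hidden_neurons, return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.add(Activation('linear'))
model.compile(loss='mean_squared_error', optimizer='rmsprop')

# dataX: (nb_samples, 50, 1), dataY: (nb_samples, 10, 1)
model.fit(dataX, dataY, batch_size=32, nb_epoch=100, validation_split=0.1)

# The model emits 10 timesteps directly, so no cropping of the output is needed.
prediction = model.predict(dataX[:1])  # shape (1, 10, 1)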
Thank you very much. I tried your suggestion and the predictions now look like this:
The number of epochs and hidden neurons is the same as in the other test cases, but the prediction for 10 steps is worse than for 50. Is there a (simple) explanation for why it gets worse with more layers? Or does it just need to train longer because it has more parameters to adjust?
I would say that the modeling assumptions of the two approaches are different. In the latter model, it is assumed that the model sees the complete input sequence (the first 50 steps), somehow creates a summary, and uses this summary to generate a new signal (the last 10 steps).
Your initial model, on the other hand, estimated the last 50 steps while reading the input signal; no summarisation of the original signal was used.
That's a perfect and clear answer, thank you very much.
Hi, I have been studying how to use the many-to-many LSTM model to predict time series data, and now I have the same problem that you once had. Could you share your demo .py files for predicting a simple sine wave? I want to learn from your code and replace your data with mine, just to have a try. It would be very nice of you if you could do me this favor! Thanks!
my email : [email protected]
thank you !
Here you go!
test_sine.txt
Hi there!
It seems that in new versions of Keras the input_dim and output_dim arguments have been replaced with the input_shape argument. Could you edit these parts of the code to match the new version:
model.add(LSTM(input_dim=1, output_dim=hidden_neurons, return_sequences=False))
model.add(LSTM(output_dim=hidden_neurons, return_sequences=True))
I also have another question: what is the reason for using model.add(Activation('linear'))?
Thanks in advance!
Hi @bestazad ,
You can obtain the same result using input_dim or input_shape; to my knowledge both of these "alternatives" have been supported for quite some time.
The reason why model.add(Activation('linear')) is used is most likely that this is only a tentative example; other activation functions could probably give similar results here.
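For what it's worth, in Keras 2.x the suggested model might look roughly like this (a sketch only, not tested against any particular release; hidden_neurons is a placeholder and the 50-step input length comes from the earlier example):

from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense, Activation

hidden_neurons = 64  # placeholder

model = Sequential()
# The number of units is now the first positional argument, and
# input_shape=(timesteps, features) replaces the old input_dim/output_dim pair.
model.add(LSTM(hidden_neurons, input_shape=(50, 1), return_sequences=False))
model.add(RepeatVector(10))
model.add(LSTM(hidden_neurons, return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.add(Activation('linear'))
model.compile(loss='mean_squared_error', optimizer='rmsprop')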
How would you train the model on inputs of variable length?
Hi @gustavz
Two options/suggestions:
Padding looks easier but I would guess that this method also decreases the usefulness of the model.
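A rough sketch of the padding option (my own illustration, assuming inputs are padded to a common length and a Masking layer is used so the LSTM ignores the padded steps):

import numpy as np
from keras.models import Sequential
from keras.layers import Masking, LSTM, RepeatVector, TimeDistributed, Dense
from keras.preprocessing.sequence import pad_sequences

max_len = 50         # pad every input sequence to this length
hidden_neurons = 64  # placeholder

# sequences: a list of arrays of shape (length_i, 1) with varying length_i
# padded = pad_sequences(sequences, maxlen=max_len, dtype='float32', padding='pre', value=0.0)

model = Sequential()
# Use a mask value that does not occur in the real data.
model.add(Masking(mask_value=0.0, input_shape=(max_len, 1)))
model.add(LSTM(hidden_neurons, return_sequences=False))
model.add(RepeatVector(10))
model.add(LSTM(hidden_neurons, return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(loss='mean_squared_error', optimizer='rmsprop')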
If you (or anybody else) could help me with a good explanation of what RepeatVector() does here, I would be happy. The best reference I have found is https://stackoverflow.com/questions/51749404/how-to-connect-lstm-layers-in-keras-repeatvector-or-return-sequence-true , however, that is for an encoder/decoder network and I'm not sure whether the same applies to an LSTM network. E.g., does RepeatVector() repeat the original input (from the very first layer), or does it work on the inputs/outputs between hidden layers?
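Not an authoritative answer, but as far as I understand it, RepeatVector() only sees the 2D output of the layer directly before it (here the summary vector produced by the first LSTM, not the original input) and repeats that vector along a new time axis. A small standalone sketch to illustrate:

import numpy as np
from keras.models import Sequential
from keras.layers import RepeatVector

model = Sequential()
model.add(RepeatVector(3, input_shape=(4,)))  # repeat a length-4 vector 3 times

x = np.array([[1.0, 2.0, 3.0, 4.0]])  # shape (1, 4)
print(model.predict(x))
# [[[1. 2. 3. 4.]
#   [1. 2. 3. 4.]
#   [1. 2. 3. 4.]]]  -> shape (1, 3, 4): the same vector at every timestep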
What is the difference between:
model = Sequential()
model.add(LSTM(input_dim=1, output_dim=hidden_neurons, return_sequences=False))
model.add(RepeatVector(10))
model.add(LSTM(output_dim=hidden_neurons, return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.add(Activation('linear'))
and
model = Sequential()
model.add(LSTM(input_dim=1, output_dim=hidden_neurons, return_sequences=True))
model.add(LSTM(output_dim=hidden_neurons, return_sequences=False))
model.add(Dense(10))
Maybe best explained with this image:
Thanks for this; which is which? I've added some numbers to your image to better reference the variants. I assume that the code that contains RepeatVector() is represented by variant 4 and that the code that does not contain RepeatVector() is represented by variant 5. Is this correct?
Thanks! :-)
Option 1 is an Encoder-Decoder, Option 2 is a Vanilla LSTM
Option 1 is part 4 of the image?
The code that does not contain RepeatVector() is a many-to-one architecture (variant 3). To get a many-to-many architecture you have to modify the code that does not contain RepeatVector() so that return_sequences=True is set in both LSTM layers, not only the first one.
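If I read that correctly, the no-RepeatVector variant rewritten as many-to-many would look roughly like this (a sketch using the old argument names from this thread; hidden_neurons is a placeholder):

from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense, Activation

hidden_neurons = 64  # placeholder

model = Sequential()
model.add(LSTM(input_dim=1, output_dim=hidden_neurons, return_sequences=True))
model.add(LSTM(output_dim=hidden_neurons, return_sequences=True))  # both layers return sequences
model.add(TimeDistributed(Dense(1)))
model.add(Activation('linear'))
# Note: the output now has as many timesteps as the input (50 here), which is
# why the RepeatVector variant is used when n_post differs from n_pre.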