Keras: Implementation Questions On Using RNN For Time Series Forecasting

Created on 13 Feb 2017 · 7 comments · Source: keras-team/keras

I am having trouble wrapping my head around certain aspects of the Keras implementation of RNNs. Here is a description of my problem:

Each observation has 200 features and a varying number of time steps, from 1 to 730 days.
I have labels for each day for each of these observations, so it's like a sequence-to-sequence time series regression problem. The data is daily customer activity and the label is customer spend (regression).

I want to make predictions for the next X days, i.e. for an arbitrary-length sequence into the future.

Here are my questions:

Preparing training data:
I know the input dimension to an RNN is (nb_samples, timesteps, input_dim), which in my case will be (# of customers, something between 1 and 730, 200).

  1. Should I zero-pad all of my sequences so that they all have a time dimension of 730?
  2. Any thoughts on left padding vs. right padding for this example?
  3. I should pad my labels in exactly the same way I pad my input, correct?
  4. If I do the padding, I understand I should have a masking layer at the input of the network. Is that the only mask layer I need? Are there any disadvantages to zero padding + masking vs. using batch size = 1 as described below?
  5. I understand that an alternative approach would be to use a batch size of 1 timestep, set stateful=True, and reset the state at the end of each variable-length sequence. Is there any reason this is better than zero padding? Can someone confirm my understanding that you would reset the state at the end of each sequence? For example, if customer #1 has 300 time steps and customer #2 has 400 time steps, I would reset the state of the model after 300 batches and then after 400 batches, assuming each batch = 1 time step.

Model Architecture:

  1. Would turning on stateful offer any advantages in this situation? It doesn't look like I need it if I just use approach #1 above, and it would just add unnecessary complexity to my problem.
  2. Just to confirm: if I zero-pad all my sequences so that each sample contains the same number of time steps, I can still let Keras shuffle my data, since it will only shuffle the first dimension (the customers) and not the time dimension within each customer, right?
  3. Since I have labels available at every time step, I will set return_sequences=True.

Making Predictions:

  1. If I want to generate predictions for arbitrary-length sequences, what is the best way to accomplish this in Keras? Let's say I want to make a prediction for the next time step t+1: I feed in the features for time step t. But what about carrying the hidden state forward and making a prediction for time step t+2? Does my model need to predict the features for the next time step in addition to the label, so that I can feed those features back into the model to predict the following time step?
  2. Is it important to have stateful=True if I am using the approach where I predict the features for the next time step?

All 7 comments

If your issue is an implementation question, please ask your question on StackOverflow or join the Keras Slack channel and ask there instead of filing a GitHub issue.

  1. Not necessarily; you only need to pad up to the max sequence length of your current mini-batch.
  2. Here (contrary to an encoder-decoder architecture), add the zeros after the sequence, so that you remove the uncertainty about a variable number of zeros between the initial state of the recurrent layers and the start of the sequence.
  3. Yes.
  4. The mask you need is on the output (i.e., give zero weight to the corresponding time steps of the loss function) so you don't take into account losses computed on unavailable data. See the training sketch after this list.
  5. You don't need stateful for such a small problem (it can be used as an approximation for memory-limited problems). You don't need to set the mini-batch size to 1; you can group the samples by sequence length.
  6. See 5.
  7. Yes, but see 1 for a possible optimization.
  8. Yes.
  9. These are the best simple approaches, though not bias-free. The main usage of stateful is in a greedy decoder: you get a prediction for a single time step, take the argmax, then feed it as input at the next time step. Or do a beam search, but then you probably won't be able to take advantage of stateful.
  10. Train the model with stateful=False, then create the same model with stateful=True, transfer the weights, and do greedy predictions.
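
A minimal sketch of the padding + masking setup from answers 1–5, assuming a tf.keras 2.x-style API; the layer size (64 LSTM units) and training arguments are illustrative assumptions, not from the thread:

```python
from tensorflow.keras import layers, models

n_features = 200   # features per day (from the question)
max_len = 730      # or pad only to the max length of the current mini-batch

model = models.Sequential([
    layers.Input(shape=(max_len, n_features)),
    # Skip timesteps whose feature vector is all zeros; the mask
    # propagates downstream, so masked steps are excluded from the loss.
    layers.Masking(mask_value=0.0),
    layers.LSTM(64, return_sequences=True),   # a prediction at every day
    layers.TimeDistributed(layers.Dense(1)),  # daily spend (regression)
])
model.compile(optimizer='adam', loss='mse')

# Right-pad (post-pad) inputs and labels identically:
# X: (n_customers, max_len, n_features), y: (n_customers, max_len, 1)
# model.fit(X, y, batch_size=32, epochs=10)
```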

@unrealwill something got bungled in your numbering. Can you reformat? I'd actually like to share your answer with everyone I'm working with.

@unrealwill when you say do greedy predictions: what if I don't have any features for the next time step? How do I make predictions then? Does my model also have to output predictions for the features at the next timestep? This is not a char-rnn problem where the output can be fed directly back in as input at the next time step...

I've reformatted my answer, thanks.

If you don't have the features for the next time step, you should probably build a model for them.
The generic way is char-rnn style, but with a mixture-of-Gaussians output (VAE style), which you can usually start approximating with a single Gaussian. Or you can hypothesize that the features are independent of the current sequence and sample them directly when you need some. A sketch of the single-Gaussian, feed-the-features-back approach follows below.

Alternatively, you can train multiple models (one for each x), each of which learns to predict the label (probably the cumulative spend between t and t+x) at t+x directly.
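
A hypothetical sketch combining answers 9–10 with the single-Gaussian feature model: retrain the network with a head that also predicts the next day's feature means, transfer the weights into a stateful copy, and roll it forward one day at a time. It assumes a tf.keras 2.x-style API; `trained_model`, the `history`/`horizon` placeholders, and the combined 1 + n_features head are illustrative assumptions:

```python
import numpy as np
from tensorflow.keras import layers, models

n_features = 200
history = np.zeros((300, n_features))  # placeholder: the customer's observed days
horizon = 30                           # placeholder: number of future days to forecast

# Stateful copy: fixed batch size of 1 and one timestep per call,
# so the LSTM state carries over between predict() calls.
stateful_model = models.Sequential([
    layers.Input(shape=(1, n_features), batch_size=1),
    layers.LSTM(64, return_sequences=True, stateful=True),
    # Single head: [spend prediction, next-day feature means]
    layers.TimeDistributed(layers.Dense(1 + n_features)),
])
# The architectures must match layer-for-layer for the transfer to work:
# stateful_model.set_weights(trained_model.get_weights())

# Warm up the state on the observed history, one day per call...
stateful_model.reset_states()
for x_t in history:
    out = stateful_model.predict(x_t.reshape(1, 1, n_features), verbose=0)

# ...then roll forward greedily, feeding predicted features back in.
forecasts = []
x_next = out[0, 0, 1:]  # predicted features for the first unseen day
for _ in range(horizon):
    out = stateful_model.predict(x_next.reshape(1, 1, n_features), verbose=0)
    forecasts.append(out[0, 0, 0])  # predicted spend for this future day
    x_next = out[0, 0, 1:]          # predicted features for the day after
```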

@unrealwill so just to confirm, I should NOT mask my input values (which you suggested I should right-pad), correct? What is the practical difference between masking the input vs. masking the output? If I mask the input, my understanding is that the mask will propagate all the way downstream...? What am I missing?

I used to mask the output with sample weights in temporal mode before masks got implemented in Keras, and I never changed my habit. There are various ways to implement masks (masks are quite tricky). I checked the Keras code regarding masks yesterday (see https://github.com/fchollet/keras/issues/5392), and masks do propagate downstream to be applied at the output level, so you can use a Keras mask on the input values and it will be propagated to the output, where the loss at the masked positions will be ignored.
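
For reference, a minimal sketch of that older habit: masking the output with per-timestep sample weights instead of (or in addition to) an input Masking layer. It assumes a recent tf.keras, which accepts a (batch, timesteps) sample_weight array directly, and that the true length of each padded sequence is known:

```python
import numpy as np

def temporal_weights(lengths, max_len):
    """1.0 at real timesteps, 0.0 at padded ones; shape (n, max_len)."""
    return (np.arange(max_len)[None, :]
            < np.asarray(lengths)[:, None]).astype('float32')

# lengths[i] = true number of observed days for customer i (assumed known);
# X, y, max_len, and model are as in the training sketch above. Padded
# positions then contribute zero to the loss:
# model.fit(X, y, sample_weight=temporal_weights(lengths, max_len),
#           batch_size=32, epochs=10)
```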

@unrealwill I have time series data and the same problem with the sequence length. I train with a smaller part of the data, of shape (10000, 20, 1), while the whole data has shape (20000, 20, 1). Should I add 10,000 rows of zeros at the top of my training data? My aim was to train with less data but use the whole data for prediction.
