I am training an LSTM model for multiple time-series regression, but my losses are always either very high or NaN. I have tried several optimizers, such as RMSprop, Adam, and SGD. Here's the script:
```
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.optimizers import SGD

sgd = SGD(lr=0.0008, decay=1e-6, momentum=0.9, nesterov=True)
model = Sequential()
model.add(LSTM(64, input_shape=(MAX_TIMESTEPS, MAX_FEATURES)))
model.add(Dropout(0.3))
model.add(Dense(1))
model.compile(loss='mean_absolute_error', optimizer=sgd)

print('Training the LSTM model')
batch_size = 32
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=20, validation_split=0.2)
mae = model.evaluate(X_test, Y_test, batch_size=batch_size)
```
`X_train` is of shape `(N_SAMPLES_TRAIN, MAX_TIMESTEPS, MAX_FEATURES)` and `Y_train` is of shape `(N_SAMPLES,)`.
I should add that the Y values I am trying to predict are very high. Any idea where I might be going wrong?
Have you tried rescaling the values before training?
When working with very long sequences (e.g. a large MAX_TIMESTEPS), it's pretty common to hit underflow/overflow issues during the early stages of training, where the error is large and gets larger (or, inversely, small and gets smaller) with each timestep. This causes NaNs, INFs, and the like. It stems from the random initial state of the model at the start of training, and from the loss accruing over more timesteps as sequences get longer.
This can be mitigated in a few ways; one of them is clipping, as sketched below.
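One common form of clipping acts on the gradients themselves. Keras optimizers accept `clipnorm` and `clipvalue` keyword arguments for this; here is a minimal sketch reusing the SGD settings from the question (the thresholds are illustrative assumptions to tune, not recommended values):

```
from keras.optimizers import SGD

# clipnorm rescales each gradient so its L2 norm never exceeds 1.0;
# clipvalue would instead clamp each gradient element to a fixed range.
# Either keeps the large early errors on long sequences from producing
# exploding updates and NaN losses.
sgd = SGD(lr=0.0008, decay=1e-6, momentum=0.9, nesterov=True,
          clipnorm=1.0)
model.compile(loss='mean_absolute_error', optimizer=sgd)
```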
@patyork Thanks a lot for the suggestions. I am trying to implement the 3rd suggestion, but it raises a Theano error. How should I modify it to work correctly?
```
def root_mean_squared_error(y_true, y_pred):
    return K.clip(K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1)),
                  min_value=MIN_VALUE, max_value=MAX_VALUE)
```
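Without the traceback it's hard to pin down the exact Theano error, but note that clipping the final scalar loss zeroes its gradient whenever a bound is active, which stalls training. A sketch that clips the per-sample error instead, passing the bounds positionally (with `MIN_VALUE` and `MAX_VALUE` defined as in the snippet above):

```
from keras import backend as K

def root_mean_squared_error(y_true, y_pred):
    # Clip the raw error, not the final loss: outside the clip range the
    # gradient of K.clip is zero, so clipping the loss itself would stop
    # all learning once a bound is hit. Bounds are passed positionally to
    # match K.clip(x, min_value, max_value) across backends.
    err = K.clip(y_pred - y_true, MIN_VALUE, MAX_VALUE)
    return K.sqrt(K.mean(K.square(err), axis=-1))
```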
@Bo604 I did try scaling the values, but it still results in large losses on the test set. Maybe I was doing it wrong; I used scikit-learn's MinMaxScaler to scale the features to [0, 1].
This is a support question; you are more likely to get an answer on the mailing list (https://groups.google.com/forum/#!forum/keras-users) or Stack Overflow. I recommend closing this issue to reduce the noise for the devs.
@Bo604 what does rescaling the values mean?
@naisanza I meant applying something like this to each of the input and target variables before training:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
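As a concrete sketch (the reshapes assume the shapes posted in the question; fit the scalers on the training split only):

```
from sklearn.preprocessing import RobustScaler

# Scalers expect 2-D input, so flatten the timestep axis, fit on the
# training data, then restore the (samples, timesteps, features) shape.
x_scaler = RobustScaler()
X_train = x_scaler.fit_transform(
    X_train.reshape(-1, MAX_FEATURES)).reshape(X_train.shape)
X_test = x_scaler.transform(
    X_test.reshape(-1, MAX_FEATURES)).reshape(X_test.shape)

# Scale the (very large) targets as well; predictions then come back in
# the scaled space and must be inverse-transformed before computing MAE.
y_scaler = RobustScaler()
Y_train = y_scaler.fit_transform(Y_train.reshape(-1, 1)).ravel()
Y_pred = y_scaler.inverse_transform(model.predict(X_test)).ravel()
```

Training on scaled targets usually brings the loss into a sane range; just remember to undo the target scaling before comparing predictions against the raw Y values.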