I am training an LSTM model for multiple time-series regression, but my losses are always either very high or NaN. I have tried several optimizers, such as RMSprop, Adam, and SGD. Here's the script:
```
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.optimizers import SGD

sgd = SGD(lr=0.0008, decay=1e-6, momentum=0.9, nesterov=True)
model = Sequential()
model.add(LSTM(64, input_shape=(MAX_TIMESTEPS, MAX_FEATURES)))
model.add(Dropout(0.3))
model.add(Dense(1))
model.compile(loss='mean_absolute_error', optimizer=sgd)

print('Training the LSTM model')
batch_size = 32
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=20, validation_split=0.2)
mae = model.evaluate(X_test, Y_test, batch_size=batch_size)
```
`X_train` is of shape `(N_SAMPLES_TRAIN, MAX_TIMESTEPS, MAX_FEATURES)` and `Y_train` is of shape `(N_SAMPLES,)`.
I should add that the Y values I am trying to predict are very high. Any idea where I might be going wrong?
Have you tried rescaling the values before training?
When working with very long sequences (e.g. a large MAX_TIMESTEPS), it's pretty common to hit underflow/overflow issues during the early stages of training, where the error is large and gets larger (or, inversely, small and gets smaller) with each timestep. This causes NaNs, INFs, and the like. It stems from the random initial state of the model at the start of training, and from the loss accruing over more timesteps as sequences get longer.
This can be mitigated in a few ways; one of them is clipping, as sketched below.
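One common form of clipping acts on the gradients themselves. Keras optimizers accept `clipnorm` and `clipvalue` keyword arguments for this; here is a minimal sketch reusing the SGD settings from the question (the thresholds are illustrative assumptions to tune, not recommended values):

```
from keras.optimizers import SGD

# clipnorm rescales each gradient so its L2 norm never exceeds 1.0;
# clipvalue would instead clamp each gradient element to a fixed range.
# Either keeps the large early errors on long sequences from producing
# exploding updates and NaN losses.
sgd = SGD(lr=0.0008, decay=1e-6, momentum=0.9, nesterov=True,
          clipnorm=1.0)
model.compile(loss='mean_absolute_error', optimizer=sgd)
```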
@patyork Thanks a lot for the suggestions. I am trying to implement the 3rd suggestion, but it raises a Theano error. How should I modify it to work correctly?
```
def root_mean_squared_error(y_true, y_pred):
    return K.clip(K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1)),
                  min_value=MIN_VALUE, max_value=MAX_VALUE)
```
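Without the traceback it's hard to pin down the exact Theano error, but note that clipping the final scalar loss zeroes its gradient whenever a bound is active, which stalls training. A sketch that clips the per-sample error instead, passing the bounds positionally (with `MIN_VALUE` and `MAX_VALUE` defined as in the snippet above):

```
from keras import backend as K

def root_mean_squared_error(y_true, y_pred):
    # Clip the raw error, not the final loss: outside the clip range the
    # gradient of K.clip is zero, so clipping the loss itself would stop
    # all learning once a bound is hit. Bounds are passed positionally to
    # match K.clip(x, min_value, max_value) across backends.
    err = K.clip(y_pred - y_true, MIN_VALUE, MAX_VALUE)
    return K.sqrt(K.mean(K.square(err), axis=-1))
```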
@Bo604 I did try scaling the values, but it still results in large losses on the test set. Maybe I was doing it wrong; I used scikit-learn's MinMaxScaler to scale the features to [0, 1].
This is a support question; you are more likely to get an answer on the mailing list (https://groups.google.com/forum/#!forum/keras-users) or Stack Overflow. I recommend closing this issue to reduce the noise for the devs.
@Bo604 what does rescaling the values mean?
@naisanza I meant applying something like this to each of the input and target variables before training:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
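As a concrete sketch (the reshapes assume the shapes posted in the question; fit the scalers on the training split only):

```
from sklearn.preprocessing import RobustScaler

# Scalers expect 2-D input, so flatten the timestep axis, fit on the
# training data, then restore the (samples, timesteps, features) shape.
x_scaler = RobustScaler()
X_train = x_scaler.fit_transform(
    X_train.reshape(-1, MAX_FEATURES)).reshape(X_train.shape)
X_test = x_scaler.transform(
    X_test.reshape(-1, MAX_FEATURES)).reshape(X_test.shape)

# Scale the (very large) targets as well; predictions then come back in
# the scaled space and must be inverse-transformed before computing MAE.
y_scaler = RobustScaler()
Y_train = y_scaler.fit_transform(Y_train.reshape(-1, 1)).ravel()
Y_pred = y_scaler.inverse_transform(model.predict(X_test)).ravel()
```

Training on scaled targets usually brings the loss into a sane range; just remember to undo the target scaling before comparing predictions against the raw Y values.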