I am mapping sequences of vectors to corresponding sequences of vectors. The input vector elements are in the range (-1, 1) and the output vector elements are in the range (0, 1). I've chosen sigmoid activations and binary cross-entropy because my outputs are interpretable as probabilities. The model (which I'm keeping as simple as I can until this seems to be working) is:
from keras.models import Sequential
from keras.layers.recurrent import LSTM

model = Sequential()
# 50-dim input vectors, two 512-unit LSTM layers, 20-dim output vectors
model.add(LSTM(input_dim=50, output_dim=512, activation='sigmoid', truncate_gradient=-1, return_sequences=True))
model.add(LSTM(input_dim=512, output_dim=512, activation='sigmoid', truncate_gradient=-1, return_sequences=True))
model.add(LSTM(input_dim=512, output_dim=20, activation='sigmoid', truncate_gradient=-1, return_sequences=True))
model.compile(loss='binary_crossentropy', optimizer='adam')
Note that I don't expect this model to predict anything meaningful from a limited set of 10,000 points. I just want to make sure I can fit those 10,000 points before trying anything more exotic.
I have Theano set up to run on the GPU, and that all works. To speed things up, increasing the batch size seems to be the preferred strategy. However, this sends the accuracy score after an epoch close to zero: with a batch size of 25, accuracy is around 0.7 after the first epoch, whereas with 100 it is around 0.004.
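For reference, the fit call looks roughly like this (a minimal sketch; X_train/y_train are placeholder names for my 10,000-point training arrays, and the epoch count is arbitrary):

# batch_size is the knob in question; accuracy is reported per epoch
model.fit(X_train, y_train, batch_size=100, nb_epoch=10, show_accuracy=True)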
Any suggestions?
Thanks!
The key parameter here is the size of your training set, which you are not providing.
Larger batch sizes will indeed speed things up, especially on GPU (they are required to fully exploit the GPU speedup; with small batch sizes most of the processing time is spent just moving batch data on and off the GPU).
Larger batch sizes also mean that you are doing fewer gradient updates per epoch, i.e. you will need to train for more epochs than with smaller batch sizes.
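As a rough back-of-the-envelope check (assuming the 10,000 points mentioned above are the full training set):

n_samples = 10000
for batch_size in (25, 100):
    # 25 -> 400 gradient updates per epoch, 100 -> 100 updates per epoch
    print(batch_size, n_samples // batch_size)

With a batch size of 100 you take only a quarter as many gradient steps per epoch, which is consistent with seeing much lower accuracy after the first epoch.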
It is completely expected to get different results (at the same number of epochs) with different batch sizes. The final accuracy after convergence is reached should not differ significantly.
The key parameter here is the size of your training set, which you are not providing.
Actually, I did provide it: "a limited set of 10,000 points." I thought that was quite clear.