Keras: Use batch size in validation for limited GPU memory

Created on 10 Apr 2017 · 13 comments · Source: keras-team/keras

Using Keras 2.0.0, Tensorflow 1.0.1.

I think this is more of a feature request, but it would be great if there were a way to specify/use a batch size in model.fit during validation. Currently, validation appears to feed the entire validation set into the neural network at once, which in my case makes my GPU run out of memory. Training works fine because a batch size of 32 is small enough. A solution would be to chunk the validation data into batches of 32 and validate each batch individually, which would let validation run on large datasets on GPUs with insufficient memory.

Most helpful comment

But why would validation require more memory than training?

All 13 comments

In model.fit, the same batch size, specified via the batch_size keyword argument, is used for both training and validation.
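For reference, a minimal sketch of what that looks like (the toy model and random data below are made up purely for illustration):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Toy data and model, just to illustrate the fit() call.
x = np.random.rand(1000, 20)
y = np.random.randint(2, size=(1000, 1))

model = Sequential([Dense(64, activation='relu', input_shape=(20,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

# 10% of the data is held out for validation; both training and the
# end-of-epoch validation pass run in batches of 32.
model.fit(x, y, batch_size=32, epochs=2, validation_split=0.1)
```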

@fchollet That's strange... My GPU trains the network fine, but when I enable 10% validation on the data, it crashes (due to lack of memory) when it tries to validate at the end of the epoch. Changing the validation split to 0 allows it to run fine.

Note: I have quite a large model architecture, ~12 layers (6 of which are 128-unit LSTM layers), on a Tesla K80.

Any possible suggestions on what might be wrong? Somehow validation requires more memory than training.

Then you should simply decrease the batch size or decrease the network size.

But why would validation require more memory than training?

@farizrahman4u Yes, that's what I found to be strange. I tried this on two different GPUs, so it shouldn't be a fault with the specific hardware I'm using. If I manually call predict on the data one sample at a time (equivalent to batch size = 1), I'm able to do it without running out of memory. Only Keras validation causes this problem.

There are various possible reasons. If you want to investigate this, snapshot the TF graph before and after validation (using a custom callback, for instance).
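As a rough illustration of that suggestion, here is a minimal callback sketch, assuming TF 1.x with the standalone Keras package. It counts the ops in the default graph at the start and end of each epoch (which brackets both the training batches and the end-of-epoch validation pass), so unexpected graph growth shows up in the printed difference:

```python
import tensorflow as tf
from keras.callbacks import Callback

class GraphSizeLogger(Callback):
    """Logs how much the default TF graph grows over each epoch."""

    def on_epoch_begin(self, epoch, logs=None):
        # Op count before the training/validation passes of this epoch.
        self._ops_before = len(tf.get_default_graph().get_operations())

    def on_epoch_end(self, epoch, logs=None):
        ops_after = len(tf.get_default_graph().get_operations())
        print('epoch %d: graph grew by %d ops'
              % (epoch, ops_after - self._ops_before))

# Usage: model.fit(x, y, batch_size=32, validation_split=0.1,
#                  callbacks=[GraphSizeLogger()])
```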


Would there be a way to specify a different batch size for model.fit during validation?
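One possible workaround sketch (not a built-in Keras option): skip validation inside model.fit and run model.evaluate with its own, smaller batch size from a callback at the end of each epoch. The callback name, toy model, and data below are made up for illustration:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import Callback

class ValidationWithBatchSize(Callback):
    """Evaluates on held-out data after each epoch with its own batch size."""

    def __init__(self, x_val, y_val, batch_size):
        super(ValidationWithBatchSize, self).__init__()
        self.x_val, self.y_val, self.batch_size = x_val, y_val, batch_size

    def on_epoch_end(self, epoch, logs=None):
        loss = self.model.evaluate(self.x_val, self.y_val,
                                   batch_size=self.batch_size, verbose=0)
        print('\nval_loss (batch_size=%d): %.4f' % (self.batch_size, loss))

# Toy data/model for illustration.
x, y = np.random.rand(1000, 20), np.random.randint(2, size=(1000, 1))
x_val, y_val = np.random.rand(100, 20), np.random.randint(2, size=(100, 1))

model = Sequential([Dense(64, activation='relu', input_shape=(20,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Train with batch size 32, but validate in batches of 8.
model.fit(x, y, batch_size=32, epochs=2,
          callbacks=[ValidationWithBatchSize(x_val, y_val, batch_size=8)])
```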

I am having this same issue. My model works just fine without validation, but it runs out of GPU memory when validating regardless of batch size.

It does work with validation_split around 0.01 or lower, but it crashes at some point when increasing the validation split.

Could anybody figure out what is going on? I don't get why changing the validation_split would use up more memory if the validation is done with the same batch size.

@fchollet I think there should be an option for different batch sizes for training and validation. In many cases we can't change the training batch_size, as it affects the results. But the validation batch size can be changed. Even Caffe supports that.

@calclavia Do you use the TensorBoard callback?
I encountered the same problem; it turned out it was (probably) caused by the batch_size argument of TensorBoard being too high.

@rposer I disabled TensorBoard and stopped running out of memory in the validation step. Setting the batch_size of TensorBoard to the batch size used in training also solved the issue. Thanks!
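For anyone hitting the same thing, a minimal sketch, assuming a Keras 2.x TensorBoard callback that accepts a batch_size argument (used when computing activation/weight histograms): keeping it equal to the training batch size avoids pushing the whole validation set through the graph at once. The toy model and data are made up for illustration:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import TensorBoard

x = np.random.rand(1000, 20)
y = np.random.randint(2, size=(1000, 1))

model = Sequential([Dense(64, activation='relu', input_shape=(20,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

batch_size = 32
# Match the callback's batch_size to the training batch size so histogram
# computation does not feed the whole validation set through at once.
tb = TensorBoard(log_dir='./logs', histogram_freq=1, batch_size=batch_size)

model.fit(x, y, batch_size=batch_size, epochs=2,
          validation_split=0.1, callbacks=[tb])
```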

@waissbluth @fchollet @divamgupta @calclavia
I am facing the same issue with DeepLabv3+ (Xception backbone). Only a very small validation split works. Training goes fine, but it runs into an OOM error just when validation starts.
Has this been resolved?

With validation data: https://pasteboard.co/IllA27U.png
Without validation data: https://pasteboard.co/IllDeR3.png
Code: https://gist.github.com/PowerOfCreation/b0940d377180d0131f51a538152f788b

Tested on single and multiple GPUs (Tesla K-20, K-80, V-100); same behavior on all systems. Strangely, the memory usage differs on each system: the K-20 requires 8 GB of RAM, the K-80 10 GB, and the V-100 36 GB. RAM usage roughly doubles when validation data is turned on.

Measurements were taken every 10 seconds.

Tensorflow: 1.10
Keras: 2.1.6-tf

