Not sure if this is a great idea since I solved this problem for myself by buying a GPU with a lot of memory, but...
One commonly suggested way to get models running on smaller GPUs is to reduce the batch size. However, since the batch size matters for the optimization procedure as well as for the actual execution, it may be useful to decouple the batches that gradients are computed on from the batches used for optimization steps, i.e. run several smaller physical batches and accumulate the gradients across them before taking a gradient step.
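For example, accumulating gradients over four physical batches of 8 samples and then applying a single update approximates one optimizer step on a batch of 32.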
+1
I think this is possible and could be implemented without too much fuss (though I don't have the setup right now). Having a virtual_batch_size parameter in the optimizers might also be a good API for this idea.
I don't think this would require any changes to optimizer APIs, this would just require gradients to be accumulated across multiple batches before passing them to the optimizer. I'm happy to do the implementation if there's some agreement that this is worthwhile.
+1 Something like this would be very useful.
@kuza55 Any pointers on how to get started? I am currently trying to implement this.
You would want to change the fit function to:
Allocate some space for the accumulated gradients, based on the size of the model's trainable weights.
Instead of running an op that computes gradients, passes them to the optimizer, and applies the resulting update to the weights, run an op that adds the gradients to your accumulation buffer.
On the final step of each virtual batch, run the accumulated gradients through the optimizer, apply the resulting update to the weights, and reset the buffer.
You may also need to change the model-building code to build these accumulate/apply ops, since I think the graph ops get built before the fit function runs. A rough sketch of the ops is below.
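Here is a rough, untested sketch of those ops in TF1-style graph code; loss, optimizer, and accum_steps are placeholders for whatever the surrounding training code already defines:

import tensorflow as tf

tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)

# 1. Space for the accumulated gradients, one buffer per trainable variable.
accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]

# 2. Run this op on every physical batch to add its gradients to the buffer.
accumulate_op = tf.group(*[a.assign_add(g) for a, g in zip(accum, grads)])

# 3. Run these on the last physical batch of each virtual batch:
#    apply the averaged gradients, then zero the buffer.
apply_op = optimizer.apply_gradients(
    [(a / float(accum_steps), v) for a, v in zip(accum, tvars)])
reset_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accum])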
This would be a nice feature.
Why not do it as another optimizer, like the Adam optimizer or momentum? It is essentially momentum with a special update frequency, and this would disrupt the Keras codebase less.
I don't like the idea of a single separate optimizer, since this is totally orthogonal to which optimizer a person wants to use, and having to duplicate all the optimizers seems worse than making changes to the Keras codebase.
If it's possible to make a wrapper, or make some changes to the base class, that would be a fine solution.
It's probably possible to do most of what's desired with a wrapper, using a conditional graph op that counts how many rows it has accumulated before doing the update, though it would lose any incomplete virtual batch at the end of training. A wrapper could also come with a callback that you add to the training callbacks to flush the remaining update when training finishes; see the sketch below.
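A rough, untested sketch of that conditional-update idea (not any particular library's implementation); optimizer, loss, and niter are assumed to be defined elsewhere:

import tensorflow as tf

def make_accumulating_train_op(optimizer, loss, niter):
    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)
    accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
    counter = tf.Variable(0, trainable=False, dtype=tf.int32)

    # Always add this physical batch's gradients to the buffer, then bump the counter.
    accum_ops = [a.assign_add(g) for a, g in zip(accum, grads)]
    with tf.control_dependencies(accum_ops):
        new_count = counter.assign_add(1)

    def apply_and_reset():
        # Apply the averaged gradients, then zero the buffer.
        apply_op = optimizer.apply_gradients(
            [(a / float(niter), v) for a, v in zip(accum, tvars)])
        with tf.control_dependencies([apply_op]):
            reset_ops = [a.assign(tf.zeros_like(a)) for a in accum]
        with tf.control_dependencies(reset_ops):
            return tf.constant(True)

    def skip():
        return tf.constant(False)

    # Only take a real optimizer step every niter physical batches.
    return tf.cond(tf.equal(new_count % niter, 0), apply_and_reset, skip)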
That actually seems pretty reasonable.
This should be orthogonal to the optimizer, so that new optimizers can use it as well.
This is useful not only for people with small GPUs; it's a really useful feature because it lets you use deeper models and larger batches.
I did this as a separate optimizer, but it would be better to have it as a feature for arbitrary optimizers.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
We still need a solution for this.
Agree with that
+1 This would be a great addition.
I would definitely welcome this feature.
Was this ever implemented? I could really use this feature!
import tensorflow as tf

# Compute per-sample gradients, spreading the samples across the available GPUs.
gradient_batch = []
for index in range(batch_size):
    with tf.device("/device:GPU:{}".format(index % gpu_count)):
        output = feed_forward(...)
        loss = get_loss(output)
        gradient_batch.append(tf.gradients(loss, tf.trainable_variables()))

# Average the per-sample gradients and apply a single update on the CPU.
with tf.device("/cpu:0"):
    update = tf.train.AdamOptimizer().apply_gradients(
        zip(
            [tf.reduce_mean(input_tensor=tf.stack(part_batch), axis=0)
             for part_batch in zip(*gradient_batch)],
            tf.trainable_variables()
        )
    )
I'm looking for this awesome feature, too. No progress?
@erikchwang where would one have to add your code snippet when starting from a typical Keras fit_generator() call?
This can be achieved easily by just wrapping any tensorflow optimizer with AccumGradOptimizer
This code runs and was supposed to do the job, but it performs much worse than real mini-batches. Is there something I need to take care of? @ppwwyyxx
from keras.models import Sequential
from keras.layers import Activation, Flatten, Dense, BatchNormalization, Conv2D, InputLayer, MaxPooling2D
import tensorflow as tf
from keras import optimizers
from tensorpack.tfutils.optimizer import AccumGradOptimizer

test_model = Sequential([
    InputLayer(input_shape=(128, 128, 3)),
    Conv2D(50, (5, 5), padding='valid'),
    Activation('relu'),
    BatchNormalization(),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(50, (5, 5), padding='valid'),
    Activation('relu'),
    BatchNormalization(),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    BatchNormalization(),
    Dense(30, activation="relu"),
    BatchNormalization(),
    Dense(2, activation='softmax')
])

tfopti = tf.train.AdamOptimizer()
tfopti = AccumGradOptimizer(tfopti, niter=16)
keras_opti = optimizers.TFOptimizer(tfopti)

test_model.compile(
    optimizer=keras_opti,
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
The code does what it says; no one can guarantee whether it will work better or worse.
What is the actual batch size you are setting? The code simulates a larger minibatch for updates, but not for batchnorm. Small batches for batchnorm can totally ruin performance because the batchnorm statistics aren't stable enough.
@tstandley thanks for your hint. The actual batch_size was 1. Removing the batchnorm layers did not help.
batch_size = 8 -> Optimization works; the problem gets solved almost perfectly.
batch_size = 1, niter = 8 -> Optimization gets stuck at random-guess accuracy.
It seems that, at least for this experiment, AccumGradOptimizer does not act as a virtual substitute for a bigger batch size.
Yes, that makes sense. Batchnorm will not work with a batch of only one sample. You can try batch_size=4 and niter=2. That should work better and still save nearly half of the memory.
The tensorpack solution does not work with tf >=2.0, unfortunately
/usr/local/lib/python3.6/dist-packages/tensorpack/tfutils/optimizer.py in apply_gradients(self, grads_and_vars, global_step, name)
190 slots_and_vars = [(s, gv[1]) for s, gv in zip(slots, grads_and_vars)]
191
--> 192 with tf.variable_scope(self._name), tf.device('/cpu:0'):
193 counter = tf.Variable(
194 0, name="counter", trainable=False, dtype=tf.int32)
AttributeError: module 'tensorflow' has no attribute 'variable_scope'
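On TF >= 2.0 you can instead do the accumulation yourself in a custom training loop rather than through a wrapped graph-mode optimizer. A minimal, untested sketch, assuming model, loss_fn, dataset, and accum_steps are already defined:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
accum = [tf.Variable(tf.zeros_like(v), trainable=False)
         for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Accumulate this physical batch's gradients.
    for a, g in zip(accum, grads):
        a.assign_add(g)
    # Every accum_steps batches, apply the averaged update and reset the buffer.
    if (step + 1) % accum_steps == 0:
        optimizer.apply_gradients(
            [(a / accum_steps, v) for a, v in zip(accum, model.trainable_variables)])
        for a in accum:
            a.assign(tf.zeros_like(a))

Note that this still does not address the batchnorm issue discussed above: the statistics are still computed per physical batch.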