Not sure if this is a great idea since I solved this problem for myself by buying a GPU with a lot of memory, but...
One commonly suggested way to get models running on smaller GPUs is to reduce the batch size. However, since the batch size matters for the optimization procedure as well as for the actual execution, it may be useful to decouple the batches that gradients are computed on from the batches used for optimization steps, i.e. run several smaller physical batches and accumulate the gradients across them before taking a gradient step.
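For example, accumulating gradients over four physical batches of 8 samples and then applying a single update approximates one optimizer step on a batch of 32.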
+1
I think this is possible and could be implemented without too much fuss (though I don't have the setup right now). Having a virtual_batch_size parameter in the optimizers might also be a good API for this idea.
I don't think this would require any changes to optimizer APIs, this would just require gradients to be accumulated across multiple batches before passing them to the optimizer. I'm happy to do the implementation if there's some agreement that this is worthwhile.
+1 Something like this would be very useful.
@kuza55 Any pointers on how to get started? I am currently trying to implement this.
You would want to change the fit function to:
Allocate some space for the accumulated gradients, based on the size of the model's trainable weights.
Instead of running an op that computes gradients, passes them to the optimizer, and applies the resulting update to the weights, run an op that adds the gradients to your accumulation buffer.
On the final step of each virtual batch, run the accumulated gradients through the optimizer, apply the resulting update to the weights, and reset the buffer.
You may also need to change the model-building code to build these accumulate/apply ops, since I think the graph ops get built before the fit function runs. A rough sketch of the ops is below.
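Here is a rough, untested sketch of those ops in TF1-style graph code; loss, optimizer, and accum_steps are placeholders for whatever the surrounding training code already defines:

import tensorflow as tf

tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)

# 1. Space for the accumulated gradients, one buffer per trainable variable.
accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]

# 2. Run this op on every physical batch to add its gradients to the buffer.
accumulate_op = tf.group(*[a.assign_add(g) for a, g in zip(accum, grads)])

# 3. Run these on the last physical batch of each virtual batch:
#    apply the averaged gradients, then zero the buffer.
apply_op = optimizer.apply_gradients(
    [(a / float(accum_steps), v) for a, v in zip(accum, tvars)])
reset_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accum])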
This would be a nice feature.
Why not do it as another optimizer, like the Adam optimizer or momentum? It is essentially momentum with a special update frequency, and this would disrupt the Keras codebase less.
I don't like the idea of a single separate optimizer, since this is totally orthogonal to which optimizer a person wants to use, and having to duplicate all the optimizers seems worse than making changes to the Keras codebase.
If it's possible to make a wrapper, or make some changes to the base class, that would be a fine solution.
It's probably possible to do most of what's desired with a wrapper, using a conditional graph op that counts how many rows it has accumulated before doing the update, though it would lose any incomplete virtual batch at the end of training. A wrapper could also come with a callback that you add to the training callbacks to flush the remaining update when training finishes; see the sketch below.
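A rough, untested sketch of that conditional-update idea (not any particular library's implementation); optimizer, loss, and niter are assumed to be defined elsewhere:

import tensorflow as tf

def make_accumulating_train_op(optimizer, loss, niter):
    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)
    accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
    counter = tf.Variable(0, trainable=False, dtype=tf.int32)

    # Always add this physical batch's gradients to the buffer, then bump the counter.
    accum_ops = [a.assign_add(g) for a, g in zip(accum, grads)]
    with tf.control_dependencies(accum_ops):
        new_count = counter.assign_add(1)

    def apply_and_reset():
        # Apply the averaged gradients, then zero the buffer.
        apply_op = optimizer.apply_gradients(
            [(a / float(niter), v) for a, v in zip(accum, tvars)])
        with tf.control_dependencies([apply_op]):
            reset_ops = [a.assign(tf.zeros_like(a)) for a in accum]
        with tf.control_dependencies(reset_ops):
            return tf.constant(True)

    def skip():
        return tf.constant(False)

    # Only take a real optimizer step every niter physical batches.
    return tf.cond(tf.equal(new_count % niter, 0), apply_and_reset, skip)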
That actually seems pretty reasonable.
This should be orthogonal to the optimizer, so that new optimizers can use it as well.
This is useful not only for people with small GPUs; it's a really useful feature because it lets you use deeper models and larger batches.
I did this as a separate optimizer, but it would be better to have it as a feature for arbitrary optimizers.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
We still need a solution for this.
Agree with that
+1 This would be a great addition.
I would definitely welcome this feature.
Was this ever implemented? I could really use this feature!
import tensorflow as tf

# Compute per-sample gradients, spreading the samples across the available GPUs.
gradient_batch = []
for index in range(batch_size):
    with tf.device("/device:GPU:{}".format(index % gpu_count)):
        output = feed_forward(...)
        loss = get_loss(output)
        gradient_batch.append(tf.gradients(loss, tf.trainable_variables()))

# Average the per-sample gradients and apply a single update on the CPU.
with tf.device("/cpu:0"):
    update = tf.train.AdamOptimizer().apply_gradients(
        zip(
            [tf.reduce_mean(input_tensor=tf.stack(part_batch), axis=0)
             for part_batch in zip(*gradient_batch)],
            tf.trainable_variables()
        )
    )
I'm looking for this awesome feature, too. No progress?
@erikchwang where would one have to add your code snippet when starting from a typical Keras fit_generator() call?
This can be achieved easily by just wrapping any tensorflow optimizer with AccumGradOptimizer
This code runs and was supposed to do the job, but it performs much worse than real mini-batches. Is there something I need to take care of? @ppwwyyxx
from keras.models import Sequential
from keras.layers import Activation, Flatten, Dense, BatchNormalization, Conv2D, InputLayer, MaxPooling2D
import tensorflow as tf
from keras import optimizers
from tensorpack.tfutils.optimizer import AccumGradOptimizer

test_model = Sequential([
    InputLayer(input_shape=(128, 128, 3)),
    Conv2D(50, (5, 5), padding='valid'),
    Activation('relu'),
    BatchNormalization(),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(50, (5, 5), padding='valid'),
    Activation('relu'),
    BatchNormalization(),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    BatchNormalization(),
    Dense(30, activation="relu"),
    BatchNormalization(),
    Dense(2, activation='softmax')
])

tfopti = tf.train.AdamOptimizer()
tfopti = AccumGradOptimizer(tfopti, niter=16)
keras_opti = optimizers.TFOptimizer(tfopti)

test_model.compile(
    optimizer=keras_opti,
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
The code does what it says; no one can guarantee whether it will work better or worse.
What is the actual batch size you are setting? The code simulates a larger minibatch for updates, but not for batchnorm. Small batches for batchnorm can totally ruin performance because the batchnorm statistics aren't stable enough.
@tstandley thanks for your hint. The actual batch_size was 1. Removing the batchnorm layers did not help.
batch_size = 8 -> Optimization works; the problem gets solved almost perfectly.
batch_size = 1, niter = 8 -> Optimization gets stuck at random-guess accuracy.
It seems that, at least for this experiment, AccumGradOptimizer does not act as a virtual substitute for a bigger batch size.
Yes, that makes sense. Batchnorm will not work with a batch of only one sample. You can try batch_size=4 and niter=2. That should work better and still save nearly half of the memory.
The tensorpack solution does not work with tf >=2.0, unfortunately
/usr/local/lib/python3.6/dist-packages/tensorpack/tfutils/optimizer.py in apply_gradients(self, grads_and_vars, global_step, name)
190 slots_and_vars = [(s, gv[1]) for s, gv in zip(slots, grads_and_vars)]
191
--> 192 with tf.variable_scope(self._name), tf.device('/cpu:0'):
193 counter = tf.Variable(
194 0, name="counter", trainable=False, dtype=tf.int32)
AttributeError: module 'tensorflow' has no attribute 'variable_scope'
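On TF >= 2.0 you can instead do the accumulation yourself in a custom training loop rather than through a wrapped graph-mode optimizer. A minimal, untested sketch, assuming model, loss_fn, dataset, and accum_steps are already defined:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
accum = [tf.Variable(tf.zeros_like(v), trainable=False)
         for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Accumulate this physical batch's gradients.
    for a, g in zip(accum, grads):
        a.assign_add(g)
    # Every accum_steps batches, apply the averaged update and reset the buffer.
    if (step + 1) % accum_steps == 0:
        optimizer.apply_gradients(
            [(a / accum_steps, v) for a, v in zip(accum, model.trainable_variables)])
        for a in accum:
            a.assign(tf.zeros_like(a))

Note that this still does not address the batchnorm issue discussed above: the statistics are still computed per physical batch.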