From Advances in Optimizing Recurrent Networks: "The cutoff threshold for gradient clipping is set based on the average norm of the gradient over one pass on the data". I would therefore like to compute the average norm of the gradient to find a fitting gradient clipping value for my LSTM. How can this be done in Keras?
A good starting point seems to be get_gradients() in optimizers.py, but I can't see how I can pass the loss to this function. I've tried to get inspiration from how the training loss is passed to get_gradients() when calling get_updates() as part of Sequential.compile() in models.py, but it is not clear to me how I can extract the gradients in a similar way.
@PiranjaF did you ever find a nice way of doing this?
No, but you might be able to do it based on the code in the examples folder on visualizing the filters, where the gradient is computed explicitly. Please tell me if you get something working.
In Theano, you can use the op grad_clip: you insert it in the forward pass and it causes the gradient flowing back through that node to be clipped:
http://deeplearning.net/software/theano/library/gradient.html?highlight=clip#theano.gradient.grad_clip
Maybe it can help you build a layer with it.
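A minimal sketch of what grad_clip does (not tied to Keras; the variable names are just for illustration): the forward value is unchanged, but the gradient flowing back through the node is clipped to the given bounds.

import theano
import theano.tensor as T

x = T.scalar('x')
# forward value of z is still x**2, but the gradient that flows back
# through the grad_clip node is clipped to [-1, 1]
z = theano.gradient.grad_clip(x, -1.0, 1.0) ** 2
z_ref = x ** 2
f = theano.function([x], [T.grad(z, x), T.grad(z_ref, x)])
print(f(2.0))  # clipped gradient 1.0 vs. unclipped gradient 4.0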
I'm also wondering if there's a nice way of finding a sensible threshold for gradient clipping without resorting to grid search or other hyperparameter optimization strategies.
I think when you create the optimizer instance you can specify the clipping value or clipping norm, like the following:
opt = rmsprop(clipvalue=100)
model.compile(loss=neg_log_normal_mixture_likelihood, optimizer=opt)
It is documented in optimizers.py:
class Optimizer(object):
    '''Abstract optimizer base class.
    Note: this is the parent class of all optimizers, not an actual optimizer
    that can be used for training models.
    All Keras optimizers support the following keyword arguments:
        clipnorm: float >= 0. Gradients will be clipped
            when their L2 norm exceeds this value.
        clipvalue: float >= 0. Gradients will be clipped
            when their absolute value exceeds this value.
    '''
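For example, a minimal sketch of clipping by norm rather than by value (assuming an already-built model; the threshold of 1.0 is only a placeholder until you have measured the average gradient norm):

from keras.optimizers import RMSprop

# rescale the gradients whenever their L2 norm exceeds 1.0
opt = RMSprop(lr=0.001, clipnorm=1.0)
model.compile(loss='mse', optimizer=opt)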
@yunzhou, yeah, that's how it's set, but _how do you find out a sensible value to set?_
I am also struggling with this. Can someone please tell me what some "standard" ranges are that I can put into my random search for clipnorm and clipvalue? I have observed increased stability in my NN after adding clipnorm, but I can't just keep trying random values; I need a range to search over.
As proposed in https://arxiv.org/pdf/1212.0901, I've built this little code snippet (maybe it helps). It is not beautiful but does its job:
import numpy as np
from keras import backend as K

def average_gradient_norm(model, data):
    # just checking if the model was already compiled
    if not hasattr(model, "train_function"):
        raise RuntimeError("You must compile your model before using it.")

    weights = model.trainable_weights  # weight tensors
    get_gradients = model.optimizer.get_gradients(model.total_loss, weights)  # gradient tensors

    input_tensors = [
        # input data
        model.inputs[0],
        # how much to weight each sample by
        model.sample_weights[0],
        # labels
        model.targets[0],
        # train or test mode
        K.learning_phase()
    ]

    grad_fct = K.function(inputs=input_tensors, outputs=get_gradients)

    steps = 0
    total_norm = 0
    while steps < data.steps_per_epoch:
        X, y = next(data)
        # weight every sample equally (recomputed per batch, since the
        # last batch may be smaller)
        s_w = np.ones(X.shape[0])
        gradients = grad_fct([X, s_w, y, 0])
        total_norm += np.sqrt(np.sum([np.sum(np.square(g)) for g in gradients]))
        steps += 1

    return total_norm / float(steps)
It takes a compiled Keras model and a data generator (as you would use with fit_generator) as input and computes the average gradient norm over the entire dataset. This norm can then be used as the clipping threshold, e.g. with the Adam optimizer:
thresh = average_gradient_norm(model, data)
optimizer = Adam(clipnorm=thresh)
To obtain the gradients, I have used @ebanner's implementation from https://github.com/fchollet/keras/issues/2226.
For what it's worth: there's also a cheaper way to get some insight into the gradient norm by peeking inside the optimizer's weights. RMSprop already tracks a running average of the squared gradients, which can easily be accessed with model.optimizer.weights. To compute the norm:
norm = math.sqrt(sum(numpy.sum(K.get_value(w)) for w in model.optimizer.weights))
This doesn't require compiling and running another function. Of course it gives the norm of the running average of squared gradients (as controlled by rho) rather than the norm of the current gradient, but it can be good enough to set a sensible clipnorm value. With some adjustments this might also work with optimizers other than RMSprop.
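A small sketch of this idea as a callback (assuming an old-style Keras RMSprop whose optimizer.weights are exactly the squared-gradient accumulators; the class name is just for illustration):

import math
import numpy as np
from keras import backend as K
from keras.callbacks import Callback

class GradientNormEstimate(Callback):
    # prints an estimate of the gradient norm at the end of every epoch,
    # read off RMSprop's running average of squared gradients
    def on_epoch_end(self, epoch, logs=None):
        accumulators = [K.get_value(w) for w in self.model.optimizer.weights]
        norm = math.sqrt(sum(np.sum(a) for a in accumulators))
        print("Epoch %d: estimated gradient norm %.4f" % (epoch, norm))

You would then pass callbacks=[GradientNormEstimate()] to model.fit and watch the estimate settle before choosing a threshold.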