From Advances in Optimizing Recurrent Networks: "The cutoff threshold for gradient clipping is set based on the average norm of the gradient over one pass on the data". I would therefore like to compute the average norm of the gradient to find a fitting gradient clipping value for my LSTM. How can this be done in Keras?
A good starting point seems to be get_gradients() in optimizers.py, but I can't see how I can pass the loss to this function. I've tried to get inspiration from how the training loss is passed to get_gradients() when calling get_updates() as part of Sequential.compile() in models.py, but it is not clear to me how I can extract the gradients in a similar way.
@PiranjaF did you ever find a nice way of doing this?
No, but you might be able to do it based on the code in the examples folder on visualizing the filters, where the gradient is computed explicitly. Please tell me if you get something working.
In Theano, you can use the op grad_clip: you insert it in the forward pass and it causes the gradient flowing back through that node to be clipped:
http://deeplearning.net/software/theano/library/gradient.html?highlight=clip#theano.gradient.grad_clip
Maybe it can help you build a layer with it.
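A minimal sketch of what grad_clip does (not tied to Keras; the variable names are just for illustration): the forward value is unchanged, but the gradient flowing back through the node is clipped to the given bounds.

import theano
import theano.tensor as T

x = T.scalar('x')
# forward value of z is still x**2, but the gradient that flows back
# through the grad_clip node is clipped to [-1, 1]
z = theano.gradient.grad_clip(x, -1.0, 1.0) ** 2
z_ref = x ** 2
f = theano.function([x], [T.grad(z, x), T.grad(z_ref, x)])
print(f(2.0))  # clipped gradient 1.0 vs. unclipped gradient 4.0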
I'm also wondering if there's a nice way of finding a sensible threshold for gradient clipping without resorting to grid search or other hyperparameter optimization strategies.
I think when you create the optimizer instance you can specify the clipping value or clipping norm, like the following:
opt = rmsprop(clipvalue=100)
model.compile(loss=neg_log_normal_mixture_likelihood, optimizer=opt)
It is documented in optimizers.py:
class Optimizer(object):
    '''Abstract optimizer base class.
    Note: this is the parent class of all optimizers, not an actual optimizer
    that can be used for training models.
    All Keras optimizers support the following keyword arguments:
        clipnorm: float >= 0. Gradients will be clipped
            when their L2 norm exceeds this value.
        clipvalue: float >= 0. Gradients will be clipped
            when their absolute value exceeds this value.
    '''
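For example, a minimal sketch of clipping by norm rather than by value (assuming an already-built model; the threshold of 1.0 is only a placeholder until you have measured the average gradient norm):

from keras.optimizers import RMSprop

# rescale the gradients whenever their L2 norm exceeds 1.0
opt = RMSprop(lr=0.001, clipnorm=1.0)
model.compile(loss='mse', optimizer=opt)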
@yunzhou, yeah, that's how it's set, but _how do you find out a sensible value to set?_
I am also struggling with this. Can someone please tell me what some "standard" ranges are that I can put into my random search for clipnorm and clipvalue? I have observed increased stability in my NN after adding clipnorm, but I can't just keep trying random values; I need a range to search over.
As proposed in https://arxiv.org/pdf/1212.0901, I've built this little code snippet (maybe it helps). It is not beautiful but does its job:
import numpy as np
from keras import backend as K

def average_gradient_norm(model, data):
    # just checking if the model was already compiled
    if not hasattr(model, "train_function"):
        raise RuntimeError("You must compile your model before using it.")

    weights = model.trainable_weights  # weight tensors
    get_gradients = model.optimizer.get_gradients(model.total_loss, weights)  # gradient tensors

    input_tensors = [
        # input data
        model.inputs[0],
        # how much to weight each sample by
        model.sample_weights[0],
        # labels
        model.targets[0],
        # train or test mode
        K.learning_phase()
    ]

    grad_fct = K.function(inputs=input_tensors, outputs=get_gradients)

    steps = 0
    total_norm = 0
    while steps < data.steps_per_epoch:
        X, y = next(data)
        # weight every sample equally (recomputed per batch, since the
        # last batch may be smaller)
        s_w = np.ones(X.shape[0])
        gradients = grad_fct([X, s_w, y, 0])
        total_norm += np.sqrt(np.sum([np.sum(np.square(g)) for g in gradients]))
        steps += 1

    return total_norm / float(steps)
It takes a compiled Keras model and a data generator (as you would use with fit_generator) as input and computes the average gradient norm over the entire dataset. This norm can then be used as the clipping threshold, e.g. with the Adam optimizer:
thresh = average_gradient_norm(model, data)
optimizer = Adam(clipnorm=thresh)
To obtain the gradients, I have used @ebanner's implementation from https://github.com/fchollet/keras/issues/2226.
For what it's worth: there's also a cheaper way to get some insight into the gradient norm by peeking inside the optimizer's weights. RMSprop already tracks a running average of the squared gradients, which can easily be accessed with model.optimizer.weights. To compute the norm:
norm = math.sqrt(sum(numpy.sum(K.get_value(w)) for w in model.optimizer.weights))
This doesn't require compiling and running another function. Of course it gives the norm of the running average of squared gradients (as controlled by rho) rather than the norm of the current gradient, but it can be good enough to set a sensible clipnorm value. With some adjustments this might also work with optimizers other than RMSprop.
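A small sketch of this idea as a callback (assuming an old-style Keras RMSprop whose optimizer.weights are exactly the squared-gradient accumulators; the class name is just for illustration):

import math
import numpy as np
from keras import backend as K
from keras.callbacks import Callback

class GradientNormEstimate(Callback):
    # prints an estimate of the gradient norm at the end of every epoch,
    # read off RMSprop's running average of squared gradients
    def on_epoch_end(self, epoch, logs=None):
        accumulators = [K.get_value(w) for w in self.model.optimizer.weights]
        norm = math.sqrt(sum(np.sum(a) for a in accumulators))
        print("Epoch %d: estimated gradient norm %.4f" % (epoch, norm))

You would then pass callbacks=[GradientNormEstimate()] to model.fit and watch the estimate settle before choosing a threshold.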