Here is my guess. Is it right?
trainer.allreduce_grads()
with autograd.record():
    logits = model(input)
    loss = criterion(logits, target)
loss.backward()
grads = [i.grad(ctx) for i in model.params.values()]
gluon.utils.clip_global_norm(grads, args.grad_clip)
trainer.update(args.batch_size)
That's correct, though you may want to switch model.params to model.collect_params() so that the parameters of all child Blocks are included too.
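For reference, a minimal sketch of that pattern, assuming a single context ctx and the same placeholder model, criterion, trainer and args objects from the snippet above (it uses trainer.step rather than the allreduce_grads/update pair, which is the simpler form when no manual reduction is needed):

from mxnet import autograd, gluon

# forward and backward pass
with autograd.record():
    logits = model(input)
    loss = criterion(logits, target)
loss.backward()

# collect_params() also walks child Blocks, unlike .params
grads = [p.grad(ctx) for p in model.collect_params().values()]
gluon.utils.clip_global_norm(grads, args.grad_clip)

# apply the (clipped) gradients
trainer.step(args.batch_size)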
thanks
I compared the speed of gradient clipping between PyTorch (0.3.1) and MXNet Gluon (1.2.0) on a Titan X. PyTorch is nearly 10 times faster than MXNet Gluon.
mxnet:
grads = [i.grad(ctx) for i in model.collect_params().values() if i._grad is not None]
gluon.utils.clip_global_norm(grads, args.grad_clip)
pytorch:
nn.utils.clip_grad_norm(model.parameters(), args.grad_clip)
Is there something wrong with my implementation?
@yukang2017 would you be able to open a separate issue for the performance question? That way we can keep API usage separate from performance, and it can be tagged correctly for better community help. Thanks!
With regards to how to perform gradient clipping with Gluon, another option is to specify gradient clipping as part of the Optimizer given to the Trainer. This will simply clip by value, though, potentially changing the direction of the gradient tensor and affecting different gradient tensors in different ways. So overall, gluon.utils.clip_global_norm is the best option for maintaining the direction and relative magnitudes of the gradient tensors. Still, this is an example of how value-based gradient clipping can be done:
mxnet.gluon.Trainer(net.collect_params(), optimizer='sgd',
                    optimizer_params={'learning_rate': 0.1, 'clip_gradient': 5},
                    kvstore='device')  # for GPU
When elements of the gradient tensor are larger than clip_gradient they are set to clip_gradient, and when they are smaller than -clip_gradient they are set to -clip_gradient.
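As an illustration (not the Trainer's internal code), this element-wise behaviour matches nd.clip applied to each gradient array:

from mxnet import nd

# illustrative only: what clip_gradient=5 does to a single gradient array
grad = nd.array([-12.0, -3.0, 0.5, 7.0])
clipped = nd.clip(grad, a_min=-5.0, a_max=5.0)
print(clipped.asnumpy())  # [-5.  -3.   0.5  5. ]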
@thomelane I will open a new issue. Thank you!
I tested clip_gradient in mxnet.gluon.Trainer and torch.optim. The time cost is similar.
@sandeep-krishnamurthy Please close this issue. Thx
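For anyone re-running that kind of timing, here is a minimal sketch of how the MXNet side could be measured (the array shapes are made up, and nd.waitall() is there because MXNet dispatches work asynchronously, so the clock has to wait for the clipping to actually finish):

import time
from mxnet import nd, gluon

# hypothetical gradient arrays, just to time clip_global_norm itself
grads = [nd.random.normal(shape=(1024, 1024)) for _ in range(20)]
nd.waitall()  # make sure the setup work has finished before timing

start = time.time()
gluon.utils.clip_global_norm(grads, 0.25)
nd.waitall()  # wait for the asynchronous clipping ops to complete
print('clip_global_norm: %.4f s' % (time.time() - start))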