Incubator-mxnet: 【Question】Are there any examples for gradients clipping in gluon?

Created on 30 Jun 2018 · 6 comments · Source: apache/incubator-mxnet

This is what I guess. Is it right?

    with autograd.record():
        logits = model(input)
        loss = criterion(logits, target)
    loss.backward()
    trainer.allreduce_grads()

    grads = [i.grad(ctx) for i in model.params.values()]
    gluon.utils.clip_global_norm(grads, args.grad_clip)
    trainer.update(args.batch_size)
Gluon Modeling


All 6 comments

That's correct, though you may want to switch model.params to model.collect_params() to include the parameters of all child blocks too.
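For reference, here is a minimal sketch of one training step with that change applied. It assumes model, criterion, trainer, data, target, ctx, args.grad_clip, and args.batch_size are already defined, and that the Trainer was created with update_on_kvstore=False so that allreduce_grads()/update() are allowed:

    from mxnet import autograd, gluon

    # Forward and backward pass for one batch.
    with autograd.record():
        logits = model(data)
        loss = criterion(logits, target)
    loss.backward()

    # Average gradients across devices first (effectively a no-op on a single GPU).
    trainer.allreduce_grads()

    # collect_params() also picks up the parameters of child blocks;
    # clip_global_norm rescales the arrays in place so their global
    # L2 norm does not exceed args.grad_clip.
    grads = [p.grad(ctx) for p in model.collect_params().values()]
    gluon.utils.clip_global_norm(grads, args.grad_clip)

    # Apply the already-reduced and clipped gradients.
    trainer.update(args.batch_size)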

thanks

I compared the speed of gradient clipping between PyTorch (0.3.1) and MXNet Gluon (1.2.0) on a Titan X. PyTorch is nearly 10 times faster than MXNet Gluon.

mxnet:

    grads = [i.grad(ctx) for i in model.collect_params().values() if i._grad is not None]
    gluon.utils.clip_global_norm(grads, args.grad_clip)

pytorch:

    nn.utils.clip_grad_norm(model.parameters(), args.grad_clip)

Is there something wrong with my implementation?
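One thing worth checking before comparing the numbers: MXNet executes NDArray operations asynchronously, so the wall-clock time measured around clip_global_norm can also include work queued earlier (for example the backward pass). A hedged timing sketch that synchronizes with mx.nd.waitall(), reusing model, ctx, and args from the snippet above:

    import time
    import mxnet as mx
    from mxnet import gluon

    mx.nd.waitall()  # make sure backward() has actually finished
    start = time.time()
    grads = [i.grad(ctx) for i in model.collect_params().values() if i._grad is not None]
    gluon.utils.clip_global_norm(grads, args.grad_clip)
    mx.nd.waitall()  # force the clipping work itself to complete
    print('clip time: %.6f s' % (time.time() - start))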

@yukang2017 would you be able to open a separate issue for the performance issue? That way we can keep API usage separate from performance, and it can be tagged correctly for better community help. Thanks!

With regards to how to perform gradient clipping with Gluon, another option is to specify gradient clipping as part of the Optimizer given to the Trainer. Note that this clips by value, which can change the direction of the gradient tensor and affect different gradient tensors in different ways, so gluon.utils.clip_global_norm is still the best choice for preserving the direction and relative magnitudes of the gradient tensors. Still, here is an example of how value-based gradient clipping can be done (a small sketch contrasting the two behaviours follows the list below):

    trainer = mxnet.gluon.Trainer(net.collect_params(), optimizer='sgd',
                                  optimizer_params={'learning_rate': 0.1, 'clip_gradient': 5},
                                  kvstore='device')  # 'device' kvstore for GPU training

When elements of the gradient tensor are:

  • less than -5, they will be set to -5,
  • greater than +5, they will be set to +5.
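To make that difference concrete, here is a small standalone sketch (toy numbers, not from this thread) comparing element-wise value clipping, which is what clip_gradient does inside the optimizer, with gluon.utils.clip_global_norm:

    from mxnet import nd, gluon

    grad = nd.array([6.0, -8.0, 1.0])

    # Value clipping: each element is clamped to [-5, 5] independently,
    # which changes the direction of the gradient vector.
    clipped_by_value = nd.clip(grad, -5, 5)    # -> [ 5. -5.  1.]

    # Global-norm clipping: all gradients are rescaled by the same factor
    # max_norm / total_norm, so the direction is preserved.
    grads = [grad.copy()]
    gluon.utils.clip_global_norm(grads, 5.0)   # total norm ~10.05 -> scale ~0.5

    print(clipped_by_value)
    print(grads[0])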

@thomelane I will open a new issue. Thank you!
I tested 'clip_gradient' in mxnet.gluon.Trainer and torch.optim. The time cost is similar.

@sandeep-krishnamurthy Please close this issue. Thx
