Here is my guess. Is it right?
trainer.allreduce_grads()
with autograd.record():
    logits = model(input)
    loss = criterion(logits, target)
loss.backward()
grads = [i.grad(ctx) for i in model.params.values()]
gluon.utils.clip_global_norm(grads, args.grad_clip)
trainer.update(args.batch_size)
That's correct, though you may want to switch model.params to model.collect_params() so that the parameters of all child Blocks are included too.
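For reference, a minimal sketch of that pattern, assuming a single context ctx and the same placeholder model, criterion, trainer and args objects from the snippet above (it uses trainer.step rather than the allreduce_grads/update pair, which is the simpler form when no manual reduction is needed):

from mxnet import autograd, gluon

# forward and backward pass
with autograd.record():
    logits = model(input)
    loss = criterion(logits, target)
loss.backward()

# collect_params() also walks child Blocks, unlike .params
grads = [p.grad(ctx) for p in model.collect_params().values()]
gluon.utils.clip_global_norm(grads, args.grad_clip)

# apply the (clipped) gradients
trainer.step(args.batch_size)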
thanks
I compared the speed of gradient clipping between PyTorch (0.3.1) and MXNet Gluon (1.2.0) on a Titan X. PyTorch is nearly 10 times faster than MXNet Gluon.
mxnet:
grads = [i.grad(ctx) for i in model.collect_params().values() if i._grad is not None]
gluon.utils.clip_global_norm(grads, args.grad_clip)
pytorch:
nn.utils.clip_grad_norm(model.parameters(), args.grad_clip)
Is there something wrong with my implementation?
@yukang2017 would you be able to open a separate issue for the performance question? That way we can keep API usage separate from performance, and it can be tagged correctly for better community help. Thanks!
With regards to how to perform gradient clipping with Gluon, another option is to specify gradient clipping as part of the Optimizer given to the Trainer. This will simply clip by value, though, potentially changing the direction of the gradient tensor and affecting different gradient tensors in different ways. So overall, gluon.utils.clip_global_norm is the best option for maintaining the direction and relative magnitudes of the gradient tensors. Still, this is an example of how value-based gradient clipping can be done:
mxnet.gluon.Trainer(net.collect_params(), optimizer='sgd',
                    optimizer_params={'learning_rate': 0.1, 'clip_gradient': 5},
                    kvstore='device')  # for GPU
When elements of the gradient tensor are larger than clip_gradient they are set to clip_gradient, and when they are smaller than -clip_gradient they are set to -clip_gradient.
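As an illustration (not the Trainer's internal code), this element-wise behaviour matches nd.clip applied to each gradient array:

from mxnet import nd

# illustrative only: what clip_gradient=5 does to a single gradient array
grad = nd.array([-12.0, -3.0, 0.5, 7.0])
clipped = nd.clip(grad, a_min=-5.0, a_max=5.0)
print(clipped.asnumpy())  # [-5.  -3.   0.5  5. ]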
@thomelane I will open a new issue. Thank you!
I tested clip_gradient in mxnet.gluon.Trainer and torch.optim. The time cost is similar.
@sandeep-krishnamurthy Please close this issue. Thx
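For anyone re-running that kind of timing, here is a minimal sketch of how the MXNet side could be measured (the array shapes are made up, and nd.waitall() is there because MXNet dispatches work asynchronously, so the clock has to wait for the clipping to actually finish):

import time
from mxnet import nd, gluon

# hypothetical gradient arrays, just to time clip_global_norm itself
grads = [nd.random.normal(shape=(1024, 1024)) for _ in range(20)]
nd.waitall()  # make sure the setup work has finished before timing

start = time.time()
gluon.utils.clip_global_norm(grads, 0.25)
nd.waitall()  # wait for the asynchronous clipping ops to complete
print('clip_global_norm: %.4f s' % (time.time() - start))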