I think as_in_context should copy the attached gradient automatically as well. Minimal reproduction:
import mxnet as mx

ctx = mx.cpu()
a = mx.nd.ones((10, 10), ctx=ctx)
a.attach_grad()                 # gradient buffer is allocated on cpu(0)
print(a.context)
print(a.grad.context)

ctx = mx.gpu()
a = a.as_in_context(ctx)        # data is copied to gpu(0), the gradient is not
print(a.context)
print(a.grad.context)           # a.grad is now None -> AttributeError
cpu(0)
cpu(0)
gpu(0)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-deab22190603> in <module>()
      5 ctx = mx.gpu()
      6 a = a.as_in_context(ctx)
----> 7 print(a.context, a.grad.context)

AttributeError: 'NoneType' object has no attribute 'context'
as_in_context returns a new, copied NDArray object, but it does not copy the gradient.
If you need to copy the gradient, calling a.grad.as_in_context(ctx) works.
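For concreteness, a minimal sketch of that manual workaround (assumes a GPU is available; the variable names mirror the repro above):

import mxnet as mx

a = mx.nd.ones((10, 10), ctx=mx.cpu())
a.attach_grad()                      # gradient buffer lives on cpu(0)

ctx = mx.gpu()
old_grad = a.grad                    # keep a handle to the cpu(0) gradient
a = a.as_in_context(ctx)             # data is copied, the gradient is dropped
a.attach_grad()                      # allocate a fresh gradient buffer on gpu(0)
old_grad.copyto(a.grad)              # cross-device copy of the old gradient values

print(a.context, a.grad.context)     # gpu(0) gpu(0)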
I think it is not necessary to copy the gradient automatically: the gradient usually isn't needed, so copying it automatically could be wasted work.
I disagree; I think it would make more sense to copy the gradient along with the tensor. It doesn't make sense for them to live on different contexts.
Also, a variable that had a gradient attached no longer has one after being moved, which seems like a clear bug to me.
You are right. Let me think about a lazy method to return the gradient.
Would it be good to define the function as def as_in_context(ctx, ignore_grad=True)?
Sometimes we do not need to copy the gradient, and I do not know how to copy the gradient lazily.
If we are worried about breaking APIs, I think we could indeed start with as_in_context(ctx, copy_grad=False) to avoid a breaking API change, but for MXNet 2.0 I would suggest moving to copy_grad=True as the default, which makes more sense to me.
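For what it's worth, the proposed flag could be prototyped outside the NDArray API today. A hedged sketch (as_in_context_with_grad is a hypothetical helper name, not an existing MXNet function):

import mxnet as mx

def as_in_context_with_grad(arr, ctx, copy_grad=False):
    # Move `arr` to `ctx`; optionally re-attach and copy its gradient as well.
    moved = arr.as_in_context(ctx)
    # as_in_context returns `arr` itself when the context already matches,
    # in which case there is nothing to re-attach or copy.
    if copy_grad and arr.grad is not None and moved is not arr:
        moved.attach_grad()              # fresh gradient buffer on the target context
        arr.grad.copyto(moved.grad)      # cross-device copy of the gradient values
    return moved

# Usage: behaves like as_in_context unless you opt in to copying the gradient.
a = mx.nd.ones((10, 10), ctx=mx.cpu())
a.attach_grad()
b = as_in_context_with_grad(a, mx.gpu(), copy_grad=True)
print(b.context, b.grad.context)         # gpu(0) gpu(0)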
I'm worried copy_grad=True will hurt performance. Assuming it takes 1 s to copy the data of a large tensor, it will take an extra 1 s to copy the gradient.
In what case would you not want to copy gradients that have been attached?
Most as_in_context calls happen on data without a gradient attached; I only discovered the bug in a niche use case that requires data with a gradient attached.
If the data has no gradient attached, the runtime would be the same, since no gradient would need moving.
I can see @szha's argument on that one, though I remain on the fence. However, that's a niche use case and I'm happy to err on the side of simplicity. To add a use case to the conversation: when you want to differentiate the loss with respect to the input (DeepDream-style visualization, for example) and visualize the gradient, simply copying the NDArray back to the CPU does not bring the gradient along.
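A hedged sketch of that use case (the network is replaced by a trivial objective so the snippet stays self-contained; in practice the loss would come from a trained model):

import mxnet as mx
from mxnet import autograd

ctx = mx.gpu()
x = mx.nd.random.uniform(shape=(1, 3, 224, 224), ctx=ctx)   # stand-in for an input image
x.attach_grad()

with autograd.record():
    loss = (x * x).sum()          # placeholder for net(x) plus the real objective
loss.backward()                   # populates x.grad on gpu(0)

x_cpu = x.as_in_context(mx.cpu())          # copies the data only
grad_cpu = x.grad.as_in_context(mx.cpu())  # the gradient has to be copied explicitly
print(x_cpu.context, grad_cpu.context)     # cpu(0) cpu(0)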