Relay's automatic differentiation is still missing primal gradients. It would be interesting to integrate with the tensor-level AD at some point, but for the time being we should focus on adding primal gradients. I will open a PR adding a basic set, but we should work towards complete coverage of Relay operators. Help from those with expertise in the less straightforward gradient computations would be appreciated.
The gradients should be implemented in C++ and come with tests; see below for the complete list.
Does this mean we need to write all gradient ops in TOPI (conv2d_grad etc.)? That would be a major undertaking.
To ease the work of implementing so many gradient expressions, I think we can take advantage of this PR https://github.com/dmlc/tvm/pull/2498 for simple operators and attach appropriate schedules. For complicated operators such as convolution, we will probably need to implement the gradient expressions manually.
We think that a portion of the above operations may indeed be handled by #2498. We will test tensor-level AD for compatibility with the listed operations and publish the results. Meanwhile, we are working on integrating AD with Relay. We plan to provide a layer similar in spirit to our NNVM draft https://github.com/sgrechanik-h/tvm/blob/87d6f319f74360b9dfd0578b68214d1309b208fe/nnvm/src/top/tensor/gradient.cc .
@jroesch given how many of these are simple elementwise ops (log, etc.) or reductions (broadcast, etc.), would it be possible for you (or someone familiar with how you want this work done) to first implement one of them as a template? It should show the desired code location (alongside the op or in a separate file?), the primal grad registration, and direct + gradient checking in unit tests, so that others can efficiently reuse it for similar work.
@ajtulloch yes, there are a few basic ones committed to the repo, I will try to open a PR with multiple examples from level 1 this week. I've been busy prototyping other Relay features for training and execution which I hope to RFC in the coming weeks.
@reminisce @grwlf I think it would be great if we could get default behavior for Relay, and if the generated gradient's performance isn't sufficient we can hand implement them. @tqchen what do you think about this approach?
@jroesch, dear all. We made a quick check of AD-Relay compatibility. For every Relay operation from the above list, we (a) look at its FTVMCompute attribute, (b) determine which TOPI function corresponds to it, and (c) compare the gradients of this function calculated by AD with their numerical estimates. The results are in the table below.
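Step (c) above, comparing a gradient against a numerical estimate, can be sketched for a few of the Level 1 elementwise ops. This is a minimal, pure-Python illustration and not the check that was actually run: the helper names `numerical_grad`, `CASES`, and `check_all` are hypothetical, and plain `math` functions stand in for the TOPI computations.

```python
import math

def numerical_grad(f, x, eps=1e-5):
    """Central-difference estimate of df/dx at a scalar point x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical table: op name -> (function, analytic derivative).
CASES = {
    "log": (math.log, lambda x: 1.0 / x),
    "exp": (math.exp, math.exp),
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "tanh": (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
}

def check_all(points=(0.3, 1.0, 2.5), rtol=1e-6):
    """Return {op name: bool}, True when analytic and numerical gradients agree."""
    results = {}
    for name, (f, df) in CASES.items():
        results[name] = all(
            math.isclose(df(x), numerical_grad(f, x), rel_tol=rtol)
            for x in points
        )
    return results

if __name__ == "__main__":
    for name, ok in check_all().items():
        print(name, "ok" if ok else "MISMATCH")
```

The sample points are kept positive so that `log` stays inside its domain, which is the same concern as the note on negative values in the table.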
Additional notes:
PS: We are thinking about writing a TVM Python codegen to pretty-print TVM IR code. Is anybody already working on it?
Legend:
|Status | Name | Comment
|:-----:|:-----|:-------
|Level 1||
||tvm.relay.log|Currently we do not assert on negative values which may be incorrect
||tvm.relay.sqrt|Missing by accident, easy to fix
||tvm.relay.exp||
||tvm.relay.sigmoid||
||tvm.relay.add||
||tvm.relay.subtract||
||tvm.relay.multiply||
||tvm.relay.divide||
||tvm.relay.mod|:1234: Integer gradients
||tvm.relay.tanh||
||tvm.relay.concatenate||
||tvm.relay.expand_dims||
||tvm.relay.softmax||
||tvm.relay.log_softmax||
||tvm.relay.relu||
||tvm.relay.dropout|:computer: Missing FTVMCompute attribute
||tvm.relay.batch_norm|:computer: Missing FTVMCompute attribute
||tvm.relay.bias_add||
|Level 2||
||tvm.relay.conv2d||
||tvm.relay.conv2d_transpose||
||tvm.relay.dense||
||tvm.relay.max_pool||
||tvm.relay.avg_pool||
||tvm.relay.global_max_pool||
||tvm.relay.global_avg_pool||
||tvm.relay.upsampling||
||tvm.relay.flatten||
||tvm.relay.pad||
||tvm.relay.lrn|Blocked by missing pow intrinsic
||tvm.relay.l2_normalize|Blocked by missing sqrt intrinsic
||tvm.relay.conv2d_winograd_without_weight_transform|Missing TOPI implementation
||tvm.relay.conv2d_winograd_weight_transform||
|Level 3||
||tvm.relay.leaky_relu||
||tvm.relay.prelu||
||tvm.relay.reshape||
||tvm.relay.reshape_like||
||tvm.relay.copy_identity||
||tvm.relay.transpose||
||tvm.relay.squeeze||
||tvm.relay.floor|:1234: Integer gradients
||tvm.relay.ceil|:1234: Integer gradients
||tvm.relay.trunc|:1234: Integer gradients
||tvm.relay.clip|Missing Not operation
||tvm.relay.round|:1234: Integer gradients
||tvm.relay.abs||
||tvm.relay.negative||
||tvm.relay.take||
||tvm.relay.zeros||
||tvm.relay.zeros_like||
||tvm.relay.ones||
||tvm.relay.ones_like||
||tvm.relay.full||
||tvm.relay.full_like||
||tvm.relay.cast|Currently, differentiate returns zeros for non-float32 inputs
||tvm.relay.split||
|Level 4||
||tvm.relay.right_shift|:1234: Integer gradients
||tvm.relay.left_shift|:1234: Integer gradients
||tvm.relay.equal|:1234: Integer gradients
||tvm.relay.not_equal|:1234: Integer gradients
||tvm.relay.greater|:1234: Integer gradients
||tvm.relay.greater_equal|:1234: Integer gradients
||tvm.relay.less|:1234: Integer gradients
||tvm.relay.less_equal|:1234: Integer gradients
||tvm.relay.maximum||
||tvm.relay.minimum||
||tvm.relay.power|Missing by accident, should be easy to fix.
||tvm.relay.where|:snake: Missing Python API
||tvm.relay.argmax||
||tvm.relay.argmin||
||tvm.relay.sum||
||tvm.relay.max||
||tvm.relay.min||
||tvm.relay.mean||
||tvm.relay.prod||
||tvm.relay.strided_slice||
||tvm.relay.broadcast_to||
|Level 5||
||tvm.relay.resize|Blocked by missing floor intrinsic
||tvm.relay.multibox_prior||
||tvm.relay.multibox_transform_loc||
||tvm.relay.nms||
|Level 10||
||tvm.relay.broadcast_to_like||
||tvm.relay.collapse_sum_like|:snake: Missing Python API
||tvm.relay.slice_like||
||tvm.relay.layout_transform|:snake: Missing Python API
||tvm.relay.device_copy|:computer: Missing FTVMCompute attribute
||tvm.relay.on_device|:computer: Missing FTVMCompute attribute
While it is great to have tensor expression gradient support, I recommend that, at this moment, we provide the primal gradients in the form of Relay operators.
The main reason is that a relay -> relay transformation makes it easier to do follow-up analysis and transformations in Relay; it also makes sure that each op can easily generate different variants (winograd, spatial pack for conv2d).
This does not eliminate the value of expression-level gradients though, as they could be a nice complement when a user defines a custom op, and a topic of research in the long run if integrated properly with Relay.
Expressing gradients in Relay would be a good design test. My thoughts regarding this design choice are as follows:
dense and softmax are possible candidates for this approach. I am working on adding gradient definitions for some level 1/2 operators, see https://github.com/dmlc/tvm/pull/2633 for details.
I'm interested in helping contribute gradient implementations, but I'm finding it a bit difficult to understand what orientation the original op arguments are in, and what role collapse_sum_like plays (its documentation, "Return a scalar value array with the same shape and type as the input array." is identical to broadcast_to_like, and I'm not really understanding the mathematical operation it's performing).
As an example, by trial and error I arrived at the following for nn.dense:

    @register_gradient("nn.dense")
    def dense_grad(orig, grad):
        data, weight = orig.args
        return [collapse_sum_like(transpose(transpose(weight) * grad), data),
                collapse_sum_like(transpose(grad * transpose(data)), weight)]

I'm verifying this by checking gradient values numerically against a toy TensorFlow model with a dense layer that I converted. I would not have expected to need the outer transpose here, but without it, collapse_sum_like seemed to be broadcasting a sum on the wrong axis.
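For reference, the textbook form of the dense gradient can be checked in NumPy without any converted model. This is only a sketch of the math, not a Relay registration. It assumes the usual nn.dense layout, with data of shape (batch, in_dim), weight of shape (out_dim, in_dim), and output `data @ weight.T`. All function names here are hypothetical.

```python
import numpy as np

def dense(x, w):
    # Assumed layout: x is (batch, in_dim), w is (out_dim, in_dim).
    return x @ w.T

def dense_grad(x, w, g):
    # g is the upstream gradient with the output's shape, (batch, out_dim).
    # Returns (d loss / d x, d loss / d w), shaped like x and w.
    return g @ w, g.T @ x

def numerical_grad(loss, a, eps=1e-6):
    """Central-difference gradient of the scalar loss() w.r.t. array `a`.

    `a` is perturbed in place one entry at a time and restored afterwards.
    """
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        orig = a[idx]
        a[idx] = orig + eps
        hi = loss()
        a[idx] = orig - eps
        lo = loss()
        a[idx] = orig
        grad[idx] = (hi - lo) / (2.0 * eps)
    return grad

def check_dense_grad(seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((3, 4))
    w = rng.standard_normal((5, 4))
    g = rng.standard_normal((3, 5))
    # Scalar loss whose gradient w.r.t. the output is exactly g.
    loss = lambda: float(np.sum(dense(x, w) * g))
    dx, dw = dense_grad(x, w, g)
    return (np.allclose(dx, numerical_grad(loss, x), atol=1e-6)
            and np.allclose(dw, numerical_grad(loss, w), atol=1e-6))

if __name__ == "__main__":
    print("dense gradient matches:", check_dense_grad())
```

Both results are matrix products that already have the shapes of `data` and `weight`, so in this formulation no further reduction is needed to match shapes.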
Would it be possible to provide a more detailed tutorial about how to translate a known mathematical form of a gradient to a relay implementation, to make it easier for the community to contribute some of these implementations?
@altanh could you maybe further improve upon the docs when you open your PR, and try to address some of @SWu's comments?
Altan has revived the work in the past few weeks and we have been working on a library for using Relay for training. He will hopefully follow up on this thread with more details.
@SWu this is an issue that I've run into as well. I believe the specific documentation issue you ran into is indeed a copy-paste error, which we should fix. Overall though, the documentation is lacking as @jroesch said, and we (who implement more grads) should definitely update it with better descriptions as we work through them.
For collapse_sum_like, I dug into the TOPI code, and it looks like the general idea is to match up tensor dimensions (starting from the last dimension of both) and reduce them (using sum) until it matches the target shape. If two dims don't match, then the input dim is reduced and squeezed, and we continue trying to match. If they are equal, then do nothing. If the output dimension is 1, then we reduce down to 1.
For example, if A.shape() = (4,5,3) and B.shape() = (5, 1), then collapse_sum_like(A, B) will reduce the 3rd dim of A to 1 (i.e. keepdims=True), not reduce the 2nd dimension, and then reduce and squeeze (i.e. keepdims=False) the 1st dimension. It's unclear to me how this will work for 'mismatched' shapes like (4,4,4) and (3,2), since the input will just be completely squeezed (and from what I can tell, there's no error check for this, so maybe this is correct behavior that I don't understand).
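The (4,5,3) to (5,1) example can be reproduced with a small NumPy sketch. It models only my reading of the behavior for broadcast-compatible shapes and is not the TOPI implementation. The name `collapse_sum_like_np` is hypothetical, and raising an error on mismatched shapes is a choice made for this sketch, not something TOPI is known to do.

```python
import numpy as np

def collapse_sum_like_np(data, target_shape):
    """Sum `data` down to `target_shape`, matching dims from the last one.

    Assumes `target_shape` could be broadcast to `data.shape`.
    """
    data = np.asarray(data)
    target_shape = tuple(target_shape)
    extra = data.ndim - len(target_shape)
    if extra < 0:
        raise ValueError("target has more dimensions than data")
    # Leading dims that the target lacks: reduce and squeeze (keepdims=False).
    if extra > 0:
        data = data.sum(axis=tuple(range(extra)))
    # Dims where the target is 1 but data is larger: reduce with keepdims=True.
    axes = tuple(i for i, (d, t) in enumerate(zip(data.shape, target_shape))
                 if t == 1 and d != 1)
    if axes:
        data = data.sum(axis=axes, keepdims=True)
    # Sketch-only choice: reject shapes that are not broadcast-compatible.
    if data.shape != target_shape:
        raise ValueError(f"cannot collapse to {target_shape}")
    return data

if __name__ == "__main__":
    a = np.ones((4, 5, 3))
    out = collapse_sum_like_np(a, (5, 1))
    print(out.shape)   # (5, 1)
    print(out[0, 0])   # 12.0, since each entry sums 4 * 3 ones
```

Read this way, the operation is the adjoint of broadcasting: if an input of shape (5, 1) was broadcast to (4, 5, 3) in the forward pass, summing the incoming gradient over the broadcast axes gives a gradient with the input's shape.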
We also need to think about the best way to verify correctness of these implementations, since currently the numerical tests in TVM are somewhat arbitrary. Your approach seems solid for ensuring correct behavior with respect to existing frameworks. This problem is more general than just for gradients though, and I think we should have a TVM-wide discussion.
As for your last point, I think this would be a good idea. I'll try to type up a tutorial of sorts walking through my implementation of softmax once I'm done with my current work. I don't want to write too much more here (and maybe this is already too much), but hopefully this helped. I'll make a more comprehensive post once the PR is ready.
Closing for now due to inactivity; let us open a new thread for new gradient TODOs.