Tvm: [RELAY] Add primal gradients for Relay operators.

Created on 4 Feb 2019 · 14 Comments · Source: apache/tvm

Relay's automatic differentiation is still missing primal gradients. It would be interesting to integrate with the tensor-level AD at some point, but for the time being we should focus on adding primal gradients. I will open a PR adding the basic set, but we should work towards complete coverage of the Relay operators. Help from those with expertise in the less straightforward gradient computations would be appreciated.

The gradients should be implemented in C++ and come with tests; see below for the complete list.

Level 1

[x] tvm.relay.log
[x] tvm.relay.sqrt
[x] tvm.relay.exp
[x] tvm.relay.sigmoid
[x] tvm.relay.add
[x] tvm.relay.subtract
[x] tvm.relay.multiply
[x] tvm.relay.divide
[ ] tvm.relay.mod
[ ] tvm.relay.tanh
[ ] tvm.relay.concatenate
[ ] tvm.relay.expand_dims
[ ] tvm.relay.nn.softmax
[ ] tvm.relay.nn.log_softmax
[x] tvm.relay.nn.relu
[ ] tvm.relay.nn.dropout
[ ] tvm.relay.nn.batch_norm
[ ] tvm.relay.nn.bias_add

Level 2

[ ] tvm.relay.nn.conv2d
[ ] tvm.relay.nn.conv2d_transpose
[ ] tvm.relay.nn.dense
[ ] tvm.relay.nn.max_pool2d
[ ] tvm.relay.nn.avg_pool2d
[ ] tvm.relay.nn.global_max_pool2d
[ ] tvm.relay.nn.global_avg_pool2d
[ ] tvm.relay.nn.upsampling
[ ] tvm.relay.nn.batch_flatten
[ ] tvm.relay.nn.pad
[ ] tvm.relay.nn.lrn
[ ] tvm.relay.nn.l2_normalize
[ ] tvm.relay.nn.contrib_conv2d_winograd_without_weight_transform
[ ] tvm.relay.nn.contrib_conv2d_winograd_weight_transform

Level 3

[ ] tvm.relay.nn.leaky_relu
[ ] tvm.relay.nn.prelu
[ ] tvm.relay.reshape
[ ] tvm.relay.reshape_like
[ ] tvm.relay.copy
[ ] tvm.relay.transpose
[ ] tvm.relay.squeeze
[ ] tvm.relay.floor
[ ] tvm.relay.ceil
[ ] tvm.relay.trunc
[ ] tvm.relay.clip
[ ] tvm.relay.round
[ ] tvm.relay.abs
[ ] tvm.relay.negative
[ ] tvm.relay.take
[ ] tvm.relay.zeros
[ ] tvm.relay.zeros_like
[ ] tvm.relay.ones
[ ] tvm.relay.ones_like
[ ] tvm.relay.full
[ ] tvm.relay.full_like
[ ] tvm.relay.cast
[ ] tvm.relay.split

Level 4

[ ] tvm.relay.right_shift
[ ] tvm.relay.left_shift
[ ] tvm.relay.equal
[ ] tvm.relay.not_equal
[ ] tvm.relay.greater
[ ] tvm.relay.greater_equal
[ ] tvm.relay.less
[ ] tvm.relay.less_equal
[ ] tvm.relay.maximum
[ ] tvm.relay.minimum
[ ] tvm.relay.power
[ ] tvm.relay.where
[ ] tvm.relay.argmax
[ ] tvm.relay.argmin
[ ] tvm.relay.sum
[ ] tvm.relay.max
[ ] tvm.relay.min
[ ] tvm.relay.mean
[ ] tvm.relay.prod
[ ] tvm.relay.strided_slice
[ ] tvm.relay.broadcast_to

Level 5

[ ] tvm.relay.image.resize
[ ] tvm.relay.vision.multibox_prior
[ ] tvm.relay.vision.multibox_transform_loc
[ ] tvm.relay.vision.nms

Level 10

[ ] tvm.relay.broadcast_to_like
[ ] tvm.relay.collapse_sum_like
[ ] tvm.relay.slice_like
[ ] tvm.relay.layout_transform
[ ] tvm.relay.device_copy
[ ] tvm.relay.annotation.on_device

help wanted inactive

jroesch

All 14 comments

Does this mean we need to write all gradient ops in TOPI (conv2d_grad etc.)? That would be a major undertaking.

masahi on 5 Feb 2019

To ease the work of implementing so many gradient expressions, I think we can take advantage of this PR https://github.com/dmlc/tvm/pull/2498 for simple operators and attach appropriate schedules. For complicated operators such as convolution, we will probably need to implement the gradient expressions manually.

reminisce on 5 Feb 2019

We think that a portion of the above operations may indeed be handled by #2498. We will test tensor-level AD for compatibility with the listed operations and publish the results. Meanwhile, we are working on integrating AD with Relay. We plan to provide a layer similar in spirit to our NNVM draft https://github.com/sgrechanik-h/tvm/blob/87d6f319f74360b9dfd0578b68214d1309b208fe/nnvm/src/top/tensor/gradient.cc .

grwlf on 5 Feb 2019

@jroesch given how many of these are simple elementwise ops (log, etc.) or reductions (broadcast, etc.), would it be possible for you (or someone familiar with how you want this work done) to first implement one of them as a template (i.e. showing the desired code location (alongside or in a separate file?), primal grad registration, direct and gradient checking in unit tests, etc.), so that others can efficiently use it as a template for similar work?

ajtulloch on 5 Feb 2019

@ajtulloch yes, there are a few basic ones committed to the repo, I will try to open a PR with multiple examples from level 1 this week. I've been busy prototyping other Relay features for training and execution which I hope to RFC in the coming weeks.

@reminisce @grwlf I think it would be great if we could get default behavior for Relay, and if the generated gradient's performance isn't sufficient we can hand implement them. @tqchen what do you think about this approach?

jroesch on 6 Feb 2019

@jroesch, dear all. We made a quick check of AD-Relay compatibility. For every Relay operation from the above list, we (a) look at its FTVMCompute attribute, (b) determine which TOPI function corresponds to it, and (c) compare the gradients of this function calculated by AD with their numerical estimates. The results are in the table below.

Additional notes:

The numerical check in this test may need adjustments; we saw rare random failures due to precision problems.
Some functions run different implementations depending on the parameters passed. We attempted to include the most common cases, but some combinations may be missing.
Checking the performance of all operations would require additional effort, so we have not done it yet.
For cases with the 'Integer gradients' comment, we need to clarify the gradient semantics for such operations. One possible solution is to just return zeros, but we think that may be incorrect for some tasks.
To reproduce, apply #2498 to the 427bdcc26 commit of TVM and use the following test.
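
The comparison in step (c) can be illustrated with a minimal NumPy sketch. This is not the test referred to above; it is a stand-alone illustration that assumes a central-difference estimate and hand-written analytic gradients for a few level 1 ops. The names `numerical_grad`, `ANALYTIC`, and `check_all` are hypothetical and exist only in this sketch.

```python
import numpy as np


def numerical_grad(f, x, eps=1e-4):
    """Central-difference estimate of d(sum(f(x)))/dx, one element at a time."""
    x = np.asarray(x, dtype=np.float64)
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        x_pos = x.copy()
        x_neg = x.copy()
        x_pos[idx] += eps
        x_neg[idx] -= eps
        grad[idx] = (np.sum(f(x_pos)) - np.sum(f(x_neg))) / (2 * eps)
    return grad


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Hypothetical table for this sketch: op name -> (forward, analytic derivative).
ANALYTIC = {
    "log": (np.log, lambda x: 1.0 / x),
    "sqrt": (np.sqrt, lambda x: 0.5 / np.sqrt(x)),
    "exp": (np.exp, np.exp),
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "tanh": (np.tanh, lambda x: 1.0 - np.tanh(x) ** 2),
}


def check_all(shape=(3, 4), seed=0, rtol=1e-5, atol=1e-7):
    """Compare each analytic gradient with its numerical estimate."""
    rng = np.random.default_rng(seed)
    # Positive inputs only, so that log and sqrt are well defined.
    x = rng.uniform(0.5, 2.0, size=shape)
    return {
        name: bool(np.allclose(df(x), numerical_grad(f, x), rtol=rtol, atol=atol))
        for name, (f, df) in ANALYTIC.items()
    }
```

The sketch works in float64 with inputs kept away from zero. With float32 inputs or values near a singularity the tolerances would likely need loosening, which is consistent with the precision-related failures mentioned in the notes.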

PS: We are thinking about writing a TVM Python codegen to pretty-print TVM IR code. Is anybody working on it?

Legend:

green: Supported, numerical check passed
yellow: Missing by accident/easy to add
orange: Need to think first
red: Need to debug
grey: Unable to check

| Status | Name | Comment |
|:------:|:-----|:--------|
|Level 1||
| orange |tvm.relay.log|Currently we do not assert on negative values which may be incorrect
| yellow |tvm.relay.sqrt|Missing by accident, easy to fix
| green |tvm.relay.exp||
| green |tvm.relay.sigmoid||
| green |tvm.relay.add||
| green |tvm.relay.subtract||
| green |tvm.relay.multiply||
| green |tvm.relay.divide||
| orange |tvm.relay.mod|:1234: Integer gradients
| green |tvm.relay.tanh||
| green |tvm.relay.concatenate||
| green |tvm.relay.expand_dims||
| green |tvm.relay.softmax||
| green |tvm.relay.log_softmax||
| green |tvm.relay.relu||
| grey |tvm.relay.dropout|:computer: Missing FTVMCompute attribute
| grey |tvm.relay.batch_norm|:computer: Missing FTVMCompute attribute
| green |tvm.relay.bias_add||
|Level 2||
| green |tvm.relay.conv2d||
| green |tvm.relay.conv2d_transpose||
| green |tvm.relay.dense||
| green |tvm.relay.max_pool||
| green |tvm.relay.avg_pool||
| green |tvm.relay.global_max_pool||
| green |tvm.relay.global_avg_pool||
| green |tvm.relay.upsampling||
| green |tvm.relay.flatten||
| green |tvm.relay.pad||
| yellow |tvm.relay.lrn|Blocked by missing pow intrinsic
| yellow |tvm.relay.l2_normalize|Blocked by missing sqrt intrinsic
| grey |tvm.relay.conv2d_winograd_without_weight_transform|Missing TOPI implementation
| green |tvm.relay.conv2d_winograd_weight_transform||
|Level 3||
| green |tvm.relay.leaky_relu||
| green |tvm.relay.prelu||
| green |tvm.relay.reshape||
| green |tvm.relay.reshape_like||
| green |tvm.relay.copy_identity||
| green |tvm.relay.transpose||
| green |tvm.relay.squeeze||
| orange |tvm.relay.floor|:1234: Integer gradients
| orange |tvm.relay.ceil|:1234: Integer gradients
| orange |tvm.relay.trunc|:1234: Integer gradients
| red |tvm.relay.clip|Missing Not operation
| orange |tvm.relay.round|:1234: Integer gradients
| green |tvm.relay.abs||
| green |tvm.relay.negative||
| green |tvm.relay.take||
| green |tvm.relay.zeros||
| green |tvm.relay.zeros_like||
| green |tvm.relay.ones||
| green |tvm.relay.ones_like||
| green |tvm.relay.full||
| green |tvm.relay.full_like||
| grey |tvm.relay.cast|Currently, differentiate returns zeros for non-float32 inputs
| green |tvm.relay.split||
|Level 4||
| orange |tvm.relay.right_shift|:1234: Integer gradients
| orange |tvm.relay.left_shift|:1234: Integer gradients
| orange |tvm.relay.equal|:1234: Integer gradients
| orange |tvm.relay.not_equal|:1234: Integer gradients
| orange |tvm.relay.greater|:1234: Integer gradients
| orange |tvm.relay.greater_equal|:1234: Integer gradients
| orange |tvm.relay.less|:1234: Integer gradients
| orange |tvm.relay.less_equal|:1234: Integer gradients
| green |tvm.relay.maximum||
| green |tvm.relay.minimum||
| yellow |tvm.relay.power|Missing by accident, should be easy to fix.
| grey |tvm.relay.where|:snake: Missing Python API
| red |tvm.relay.argmax|
| red |tvm.relay.argmin|
| green |tvm.relay.sum||
| green |tvm.relay.max||
| green |tvm.relay.min||
| green |tvm.relay.mean||
| green |tvm.relay.prod||
| green |tvm.relay.strided_slice||
| green |tvm.relay.broadcast_to||
|Level 5||
| orange |tvm.relay.resize|Blocked by missing floor intrinsic
| red |tvm.relay.multibox_prior|
| red |tvm.relay.multibox_transform_loc|
| red |tvm.relay.nms|
|Level 10||
| green |tvm.relay.broadcast_to_like||
| grey |tvm.relay.collapse_sum_like|:snake: Missing Python API
| green |tvm.relay.slice_like||
| grey |tvm.relay.layout_transform|:snake: Missing Python API
| grey |tvm.relay.device_copy|:computer: Missing FTVMCompute attribute
| grey |tvm.relay.on_device|:computer: Missing FTVMCompute attribute

grwlf on 8 Feb 2019

While it is great to have tensor expression gradient support, I recommend that we provide the primal gradients in the form of Relay operators at this moment.

The main reason is that a Relay-to-Relay transformation makes it easier to do follow-up analysis and transformations in Relay. It also ensures that each op can easily generate different variants (winograd, spatial pack for conv2d).

This does not eliminate the value of expression-level gradients, though. They could be a nice complement when a user defines a custom op, and, in the long run, a topic of research if integrated properly with Relay.

tqchen on 8 Feb 2019

Expressing gradients in Relay would be a good design test. My thoughts regarding this design choice are as follows:

I am not sure that all listed operations have gradients that can currently be expressed in the Relay language. Ideally we should move towards creating a list of basic operations which form a closed set, in the sense that their gradients are expressible in terms of the operations themselves.
Expressing gradients in Relay may foster Relay's C++ API.
As an option, one may implement operations in Relay (in addition to providing the FTVMCompute attribute). This way the operation would become subject to Relay's differentiation engine. dense and softmax are possible candidates for this approach.

grwlf on 11 Feb 2019

I've updated the tensor expression AD PR with a Relay integration, here. The commit itself is here.

sgrechanik-h on 15 Feb 2019

I am working on adding gradient definitions for some level 1/2 operators; see https://github.com/dmlc/tvm/pull/2633 for details.

ZihengJiang on 20 Feb 2019

I'm interested in helping contribute gradient implementations, but I'm finding it a bit difficult to understand what orientation the original op arguments are in, and what role collapse_sum_like plays (its documentation, "Return a scalar value array with the same shape and type as the input array." is identical to broadcast_to_like, and I'm not really understanding the mathematical operation it's performing).

As an example, by trial and error I arrived at the following for nn.dense:

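from tvm.relay import collapse_sum_like, transpose
from tvm.relay.op import register_gradient

@register_gradient("nn.dense")
def dense_grad(orig, grad):
    # orig is the original nn.dense call; grad is the gradient w.r.t. its output
    data, weight = orig.args
    return [collapse_sum_like(transpose(transpose(weight) * grad), data),
            collapse_sum_like(transpose(grad * transpose(data)), weight)]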

I'm verifying this by checking gradient values numerically from a toy tensorflow model with a dense layer that I converted. I would not have expected to need the outer transpose here, but without it it seems like collapse_sum_like was broadcasting a sum on the wrong axis.
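
As a point of comparison for this kind of numerical verification, the closed-form gradients of a dense layer can be written in plain NumPy. This is a reference sketch, not a Relay implementation. It assumes that nn.dense computes `data @ weight.T` with `data` of shape (batch, in_features) and `weight` of shape (out_features, in_features); the function names are hypothetical.

```python
import numpy as np


def dense(data, weight):
    # Assumed convention: data is (batch, in_features),
    # weight is (out_features, in_features), output is (batch, out_features).
    return data @ weight.T


def dense_grad_np(data, weight, grad):
    """Gradients of sum(dense(data, weight) * grad) w.r.t. data and weight.

    `grad` is the incoming gradient and has the shape of the output,
    (batch, out_features).
    """
    d_data = grad @ weight    # (batch, in_features), same shape as data
    d_weight = grad.T @ data  # (out_features, in_features), same shape as weight
    return d_data, d_weight
```

Both results use matrix products and already have the shapes of `data` and `weight`, so under these assumptions no further summation over broadcast axes is needed. Values from a converted model can be compared against these two arrays entry by entry.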

Would it be possible to provide a more detailed tutorial about how to translate a known mathematical form of a gradient to a relay implementation, to make it easier for the community to contribute some of these implementations?

SWu on 30 May 2019

@altanh could you maybe further improve upon the docs when you open your PR and try to address some of @SWu's comments.

Altan has revived the work in the past few weeks and we have been working on a library for using Relay for training, he will hopefully follow up on this thread with more details.

jroesch on 31 May 2019

@SWu this is an issue that I've run into as well. I believe the specific documentation issue you ran into is indeed a copy-paste error, which we should fix. Overall though, the documentation is lacking as @jroesch said, and we (who implement more grads) should definitely update it with better descriptions as we work through them.

For collapse_sum_like, I dug into the TOPI code, and it looks like the general idea is to match up tensor dimensions (starting from the last dimension of both) and reduce them (using sum) until it matches the target shape. If two dims don't match, then the input dim is reduced and squeezed, and we continue trying to match. If they are equal, then do nothing. If the output dimension is 1, then we reduce down to 1.

For example, if A.shape() = (4,5,3) and B.shape() = (5, 1), then collapse_sum_like(A, B) will reduce the 3rd dim of A to 1 (i.e. keepdims=True), not reduce the 2nd dimension, and then reduce and squeeze (i.e. keepdims=False) the 1st dimension. It's unclear to me how this will work for 'mismatched' shapes like (4,4,4) and (3,2), since the input will just be completely squeezed (and from what I can tell, there's no error check for this, so maybe this is correct behavior that I don't understand).
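
The behavior described above can be modeled in NumPy as follows. This is a sketch of the semantics as I understand them from the description, not the TOPI implementation. The name `collapse_sum_like_np` is hypothetical, it takes the target shape rather than a target tensor, and unlike the behavior noted above it raises an error for shapes that cannot be matched.

```python
import numpy as np


def collapse_sum_like_np(data, target_shape):
    """Sum `data` down to `target_shape`, undoing a broadcast (NumPy model)."""
    data = np.asarray(data)
    extra = data.ndim - len(target_shape)
    if extra < 0:
        raise ValueError("target has more dimensions than the input")
    # Leading dimensions that the target lacks are reduced and squeezed.
    if extra > 0:
        data = data.sum(axis=tuple(range(extra)))
    # Remaining dimensions are aligned from the last one: equal sizes are kept,
    # a target size of 1 is reduced with keepdims=True.
    for axis, (d, t) in enumerate(zip(data.shape, target_shape)):
        if d == t:
            continue
        if t == 1:
            data = data.sum(axis=axis, keepdims=True)
        else:
            raise ValueError(f"cannot collapse size {d} to {t} on axis {axis}")
    return data


A = np.ones((4, 5, 3))
collapsed = collapse_sum_like_np(A, (5, 1))
print(collapsed.shape)  # (5, 1)
```

For the (4,5,3) and (5,1) example, the first dimension is summed and squeezed, the second is kept, and the third is summed to size 1, so each output entry is the sum of 4 × 3 input entries.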

We also need to think about the best way to verify correctness of these implementations, since currently the numerical tests in TVM are somewhat arbitrary. Your approach seems solid for ensuring correct behavior with respect to existing frameworks. This problem is more general than just for gradients though, and I think we should have a TVM-wide discussion.

As for your last point, I think this would be a good idea. I'll try to type up a tutorial of sorts walking through my implementation of softmax once I'm done with my current work. I don't want to write too much more here (and maybe this is already too much), but hopefully this helped. I'll make a more comprehensive post once the PR is ready.

altanh on 1 Jun 2019

Closing for now due to inactive status; let us open a new thread for new gradient TODOs.

tqchen on 3 Sep 2020
