Addons: 2nd order gradients for activations

Created on 17 Feb 2020  ·  13 Comments  ·  Source: tensorflow/addons

Describe the feature and the current behavior/state.
The activation functions in tf-addons currently lack 2nd order gradients, which makes it impossible to use them for training GANs that rely on gradient penalties (WGAN-GP, StyleGAN 1/2, etc.).
I suggest adding 2nd order gradients for these functions.

Relevant information

  • Are you willing to contribute it (yes/no):
    No
  • Are you willing to maintain it going forward? (yes/no):
    No
  • Is there a relevant academic paper? (if so, where):
    Different for every activation function
  • Is there already an implementation in another framework? (if so, where):
    Unknown
  • Was it part of tf.contrib? (if so, where):
    No

Which API type would this fall under (layer, metric, optimizer, etc.)
activations
Who will benefit with this feature?
Anyone doing research and/or training GANs using activation functions in tf-addons
Any other info.

bug custom-ops help wanted

All 13 comments

As for me, this is interesting, the math is understandable, and the coding is feasible. However, I would need guidance on how to integrate it into TFA seamlessly. So unless the maintainers or someone much more experienced wants to take this on, I would be happy to try.

Thanks @veqtor for bringing this up! From my understanding higher order gradients should be automatically differentiated if we have our setup correct:
https://www.tensorflow.org/tutorials/customization/autodiff#higher-order_gradients

If I run:

import tensorflow as tf
import tensorflow_addons as tfa

x = tf.Variable(1.0) 

with tf.GradientTape() as t:
  with tf.GradientTape() as t2:
    y = tfa.activations.gelu(x)
  # Compute the gradient inside the 't' context manager
  # which means the gradient computation is differentiable as well.
  dy_dx = t2.gradient(y, x)
  print(dy_dx)

d2y_dx2 = t.gradient(dy_dx, x)
print(d2y_dx2)

I get the correct first derivative, but the second-order gradient fails with:
LookupError: gradient registry has no entry for: Addons>GeluGrad

@failure-to-thrive It would be great if you want to look into this! I haven't investigated fully, but it seems to be related to registering a gradient for ops such as Addons>GeluGrad in the gradient registry. Hand-calculating 2nd order grads shouldn't be required except for some test cases (IIUC).

From my understanding higher order gradients should be automatically differentiated if we have our setup correct:

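That's true if the activation function is expressed with TensorFlow ops. However, TFA activations (most? all?) are implemented in C++. Every TFA C++ activation has its own *Grad counterpart, and that *Grad op has no gradient registered for it, which is why differentiating a second time fails.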

You're right. This may be helpful while working on this:
https://github.com/tensorflow/tensorflow/blob/r2.1/tensorflow/python/ops/custom_gradient.py#L146-L168
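
One way to prototype a second-order gradient from Python, before touching any C++ kernel, is to nest `tf.custom_gradient`. The sketch below is only an illustration in eager mode and is not how TFA or any branch mentioned in this thread implements it. It uses Mish, y = x * tanh(softplus(x)), as the example; `mish_with_second_order` is a hypothetical name, and both derivative formulas are hand-derived assumptions that should be checked against autodiff.

```python
import tensorflow as tf


@tf.custom_gradient
def mish_with_second_order(x):
    """Mish with hand-written first- and second-order gradients (illustrative)."""
    y = x * tf.math.tanh(tf.math.softplus(x))

    def first_order(dy):
        @tf.custom_gradient
        def dmish(x_inner):
            s = tf.math.sigmoid(x_inner)
            t = tf.math.tanh(tf.math.softplus(x_inner))
            # Hand-derived first derivative: t + x * (1 - t^2) * sigmoid(x).
            grad = t + x_inner * (1.0 - t * t) * s

            def second_order(ddy):
                # Hand-derived second derivative:
                # (1 - t^2) * s * (2 + x * (1 - s - 2 * t * s)).
                hess = (1.0 - t * t) * s * (2.0 + x_inner * (1.0 - s - 2.0 * t * s))
                return ddy * hess

            return grad, second_order

        # The multiplication by dy is an ordinary op, so TensorFlow
        # differentiates it automatically.
        return dy * dmish(x)

    return y, first_order


x = tf.constant([-2.0, -1.0, 0.0, 1.0, 2.0])
with tf.GradientTape() as gg:
    gg.watch(x)
    with tf.GradientTape() as g:
        g.watch(x)
        y = mish_with_second_order(x)
    # Computed inside gg so that the first-order gradient is itself recorded.
    dy_dx = g.gradient(y, x)
d2y_dx2 = gg.gradient(dy_dx, x)
print(d2y_dx2.numpy())
```

For a C++ op the same idea would presumably mean giving the *Grad op its own gradient, but that part is not sketched here.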

@veqtor Do you want to participate as a beta-tester?

@failure-to-thrive I sure would, but I don't know if I can build TFA, including the CUDA dependencies, etc.

This complicates things. I would have to find out how to build a .whl for your OS, perhaps the same way the maintainers build packages for PyPI.
@seanpmorgan what do you think about it?

Perhaps writing tests for the 2nd order grads is better?
It should be quite easy to verify that the autodiff (pure TensorFlow ops) version gives the same result as the CUDA implementation.
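
Such a check could be sketched roughly as follows. This is an assumption about what a test helper might look like, not existing TFA test infrastructure; `second_order_grad`, `assert_second_order_matches` and `mish_reference` are hypothetical names. The self-check at the end compares two pure-TensorFlow formulations so that it runs without a TFA build.

```python
import numpy as np
import tensorflow as tf


def second_order_grad(fn, x):
    """Return the elementwise second derivative of fn at x via nested tapes."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape() as inner:
            inner.watch(x)
            y = fn(x)
        # Taken inside the outer tape so it can be differentiated again.
        dy_dx = inner.gradient(y, x)
    return outer.gradient(dy_dx, x)


def assert_second_order_matches(fn, reference_fn, x, rtol=1e-4, atol=1e-6):
    """Fail if fn and reference_fn disagree on their second derivatives."""
    np.testing.assert_allclose(
        second_order_grad(fn, x).numpy(),
        second_order_grad(reference_fn, x).numpy(),
        rtol=rtol,
        atol=atol,
    )


def mish_reference(x):
    # Built only from TensorFlow ops, so autodiff covers every order.
    return x * tf.math.tanh(tf.math.softplus(x))


def mish_alternative(x):
    # Same function with softplus written out; stands in for the op under test.
    return x * tf.math.tanh(tf.math.log1p(tf.math.exp(x)))


points = [-2.0, -1.0, 0.0, 1.0, 2.0]
assert_second_order_matches(mish_alternative, mish_reference, points)
```

With a TFA build that supports it, the op under test would be passed in place of `mish_alternative`, for example `tfa.activations.mish`.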

Of course, unit tests are the first line of defense against bugs. :beetle: :beetle: :beetle: But what if some of them sneak through anyway? :beetle: Pushing changes through the main TFA repo is not a good idea. Although, it is too early to think about that.
OK. Could you please suggest what to implement first, along with the math papers and some test values to test against?

I can try to build TFA for my platform.

Maybe start with Mish:

definition:
y = x * tanh(softplus(x))

https://www.wolframalpha.com/input/?i=x+*+tanh%28log%281+%2B+exp%28x%29%29%29

First order derivative:
https://www.wolframalpha.com/input/?i=derivative+of+x+*+tanh%28log%281+%2B+exp%28x%29%29%29

Second order:
https://www.wolframalpha.com/input/?i=second+derivative+of+x+*+tanh%28log%281+%2B+exp%28x%29%29%29
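
For test values, the derivatives can also be written out by hand. The sketch below is an illustration, not code from TFA. With t = tanh(softplus(x)) and s = sigmoid(x), the chain rule gives y' = t + x(1 - t^2)s and y'' = (1 - t^2) s (2 + x(1 - s - 2ts)). The function names are hypothetical, and only the standard library is used so the numbers can be checked without building anything.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _tanh_softplus(x):
    # softplus(x) = log(1 + exp(x)); fine for moderate x, overflows for very large x.
    return math.tanh(math.log1p(math.exp(x)))


def mish(x):
    return x * _tanh_softplus(x)


def mish_first_derivative(x):
    t, s = _tanh_softplus(x), _sigmoid(x)
    return t + x * (1.0 - t * t) * s


def mish_second_derivative(x):
    t, s = _tanh_softplus(x), _sigmoid(x)
    return (1.0 - t * t) * s * (2.0 + x * (1.0 - s - 2.0 * t * s))


def central_difference(f, x, h=1e-5):
    # Numerical derivative, used only to cross-check the closed forms.
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    closed_form = mish_second_derivative(x)
    numeric = central_difference(mish_first_derivative, x)
    print(f"x={x:+.1f}  closed form={closed_form:.8f}  finite difference={numeric:.8f}")
```

At x = 0 the closed form is easy to check by hand: tanh(ln 2) = 0.6, so 1 - t^2 = 0.64, s = 0.5, and y'' = 0.64 * 0.5 * 2 = 0.64.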

I looked a bit at the code for mish grads and the 2nd order derivatives, maybe it can help:
https://gist.github.com/veqtor/794434261abcbb51d67678d5a73caa1d

So here it is! https://github.com/failure-to-thrive/addons/tree/2nd-order-gradients-for-activations
Clone the repo and check out that branch. The rest is the same.
I was unable to find any unit-test infrastructure for testing 2nd order derivatives, so here is a small test program:

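import tensorflow as tf
from tensorflow_addons.activations import mish

x = tf.Variable([-2.0, -1.0, 0.0, 1.0, 2.0])


def _mish_py(x):
    # Reference Mish built only from TensorFlow ops, so autodiff handles every order.
    return x * tf.math.tanh(tf.math.softplus(x))


# Second derivative of the pure-TensorFlow reference.
with tf.GradientTape() as gg:
    with tf.GradientTape() as g:
        y = _mish_py(x)
    # Computed inside gg so that the first derivative is itself differentiable.
    dy_dx = g.gradient(y, x)
d2y_dx2 = gg.gradient(dy_dx, x)
print("_mish_py", d2y_dx2.numpy())

# Second derivative of the TFA custom op (needs the branch linked above).
with tf.GradientTape() as gg:
    with tf.GradientTape() as g:
        y = mish(x)
    dy_dx = g.gradient(y, x)
d2y_dx2 = gg.gradient(dy_dx, x)
print("mish    ", d2y_dx2.numpy())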

The output is almost identical:

_mish_py [ 0.03502709  0.3497057   0.64        0.18468581 -0.05772461]
mish     [ 0.03502715  0.34970567  0.64        0.18468583 -0.05772461]

LGTM, although I haven't tried it beyond small experiments.
