Addons: 2nd order gradients for activations

Created on 17 Feb 2020 · 13 Comments · Source: tensorflow/addons

Describe the feature and the current behavior/state.
Currently, the activation functions in tf-addons are missing 2nd order gradients. This makes it impossible to use them for training GANs that need various forms of gradient penalties (WGAN-GP, StyleGAN 1/2, etc.).
I suggest adding 2nd order gradients for these functions.
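
To illustrate why a gradient penalty needs second derivatives, here is a minimal pure-Python sketch. The one-parameter critic f(x) = tanh(w * x), the function names, and the numbers are all assumptions chosen for illustration; none of them come from tf-addons. The penalty is built from df/dx, so its gradient with respect to the weight w contains the activation's second derivative.

```python
import math


def act_d1(u):
    # First derivative of tanh.
    return 1.0 - math.tanh(u) ** 2


def act_d2(u):
    # Second derivative of tanh.
    t = math.tanh(u)
    return -2.0 * t * (1.0 - t * t)


def input_grad(w, x):
    # df/dx for the toy critic f(x) = tanh(w * x).
    return w * act_d1(w * x)


def penalty(w, x):
    # WGAN-GP style penalty in one dimension: (|df/dx| - 1)^2.
    return (abs(input_grad(w, x)) - 1.0) ** 2


def penalty_grad_w(w, x):
    # d(penalty)/dw by the chain rule; note the act_d2 term.
    g = input_grad(w, x)
    dg_dw = act_d1(w * x) + w * x * act_d2(w * x)
    return 2.0 * (abs(g) - 1.0) * math.copysign(1.0, g) * dg_dw


w, x, eps = 0.7, 1.3, 1e-6
analytic = penalty_grad_w(w, x)
# Central finite difference as an independent check.
numeric = (penalty(w + eps, x) - penalty(w - eps, x)) / (2.0 * eps)
print(analytic, numeric)
```

If an activation cannot supply its second derivative, the `act_d2` term is unavailable and the penalty cannot be backpropagated to the weights.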

Relevant information

Are you willing to contribute it (yes/no):
No
Are you willing to maintain it going forward? (yes/no):
No
Is there a relevant academic paper? (if so, where):
different for every activation function
Is there already an implementation in another framework? (if so, where):
Unknown
Was it part of tf.contrib? (if so, where):
No

Which API type would this fall under (layer, metric, optimizer, etc.)
activations
Who will benefit with this feature?
Anyone doing research and/or training GANs using activation functions in tf-addons
Any other info.

bug custom-ops help wanted

Source

veqtor

Most helpful comment

So here it is! https://github.com/failure-to-thrive/addons/tree/2nd-order-gradients-for-activations
Clone and checkout that branch. The rest is the same.
I was unable to find any unit-test infrastructure for testing 2nd order derivatives, so here is a small test program:

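import tensorflow as tf
from tensorflow_addons.activations import mish

x = tf.Variable([-2.0, -1.0, 0.0, 1.0, 2.0])


def _mish_py(x):
    # Reference Mish built from plain TensorFlow ops, so autodiff covers every order.
    return x * tf.math.tanh(tf.math.softplus(x))


# Second derivative of the pure-Python reference via nested tapes.
with tf.GradientTape() as gg:
    with tf.GradientTape() as g:
        y = _mish_py(x)
    # Taken inside gg, so the gradient computation itself is recorded.
    dy_dx = g.gradient(y, x)
d2y_dx2 = gg.gradient(dy_dx, x)
print("_mish_py", d2y_dx2.numpy())


# Same computation with the TFA custom op (needs the build from the branch above).
with tf.GradientTape() as gg:
    with tf.GradientTape() as g:
        y = mish(x)
    dy_dx = g.gradient(y, x)
d2y_dx2 = gg.gradient(dy_dx, x)
print("mish    ", d2y_dx2.numpy())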

The output is almost identical:

_mish_py [ 0.03502709  0.3497057   0.64        0.18468581 -0.05772461]
mish     [ 0.03502715  0.34970567  0.64        0.18468583 -0.05772461]

failure-to-thrive on 25 Feb 2020

👍2

All 13 comments

As for me, I find this interesting: the math is understandable and the coding is feasible. However, I would need guidance on how to integrate it into TFA seamlessly. So unless the maintainers or anyone much more experienced wants to take this on, I would be happy to try.

failure-to-thrive on 17 Feb 2020

Thanks @veqtor for bringing this up! From my understanding higher order gradients should be automatically differentiated if we have our setup correct:
https://www.tensorflow.org/tutorials/customization/autodiff#higher-order_gradients

If I run:

import tensorflow as tf
import tensorflow_addons as tfa

x = tf.Variable(1.0) 

with tf.GradientTape() as t:
  with tf.GradientTape() as t2:
    y = tfa.activations.gelu(x)
  # Compute the gradient inside the 't' context manager
  # which means the gradient computation is differentiable as well.
  dy_dx = t2.gradient(y, x)
  print(dy_dx)

d2y_dx2 = t.gradient(dy_dx, x)
print(d2y_dx2)

I get the correct first derivative, but the second order fails with:
LookupError: gradient registry has no entry for: Addons>GeluGrad

@failure-to-thrive It would be great if you want to look into this! I haven't fully looked into this, but it seems to be related to properly registering in the gradient registry. Hand calculating 2nd order grads shouldn't be required except for some test cases (IIUC)

seanpmorgan on 18 Feb 2020

> From my understanding higher order gradients should be automatically differentiated if we have our setup correct:

That's true if the activation function is expressed with TensorFlow ops. However, TFA activations (most? all?) are implemented in C++. Every TFA C++ activation has its own *Grad counterpart, and that op has no gradient of its own.
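
A minimal sketch of what supplying that missing gradient could look like, using nested `tf.custom_gradient` in Python rather than the C++ registry. Everything here is an assumption for illustration: `mish_opaque`, `mish_grad`, `mish_d1` and `mish_d2` are hypothetical names, and the closed-form derivatives were derived by hand. The outer function stands in for the forward kernel and the inner one for a `*Grad` kernel that must declare its own gradients with respect to both of its inputs. In this pure-Python stand-in TensorFlow could differentiate the first-order function by itself; the inner decorator only mimics what an opaque kernel would need.

```python
import tensorflow as tf


def mish_d1(x):
    # Hand-derived first derivative of x * tanh(softplus(x)).
    t = tf.math.tanh(tf.math.softplus(x))
    s = tf.math.sigmoid(x)
    return t + x * s * (1.0 - t * t)


def mish_d2(x):
    # Hand-derived second derivative of x * tanh(softplus(x)).
    t = tf.math.tanh(tf.math.softplus(x))
    s = tf.math.sigmoid(x)
    return (1.0 - t * t) * (2.0 * s + x * s * (1.0 - s) - 2.0 * x * s * s * t)


@tf.custom_gradient
def mish_opaque(x):
    # Stand-in for the forward kernel.
    y = x * tf.math.tanh(tf.math.softplus(x))

    def first_order(dy):
        @tf.custom_gradient
        def mish_grad(dy_in, x_in):
            # Stand-in for the *Grad kernel: dy * f'(x).
            out = dy_in * mish_d1(x_in)

            def second_order(ddy):
                # Gradients of mish_grad w.r.t. (dy_in, x_in).
                return ddy * mish_d1(x_in), ddy * dy_in * mish_d2(x_in)

            return out, second_order

        return mish_grad(dy, x)

    return y, first_order


def second_derivative(fn, x):
    # Nested tapes: the inner gradient is recorded by the outer tape.
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape() as inner:
            inner.watch(x)
            y = fn(x)
        dy_dx = inner.gradient(y, x)
    return outer.gradient(dy_dx, x)


x = tf.constant([-2.0, -1.0, 0.0, 1.0, 2.0])
d2_custom = second_derivative(mish_opaque, x)
d2_auto = second_derivative(lambda v: v * tf.math.tanh(tf.math.softplus(v)), x)
print(d2_custom.numpy())
print(d2_auto.numpy())
```

The two printed vectors should agree up to float32 rounding, which is the same kind of check a unit test for a real kernel could perform.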

failure-to-thrive on 18 Feb 2020

You're right. This may be helpful while working on this:
https://github.com/tensorflow/tensorflow/blob/r2.1/tensorflow/python/ops/custom_gradient.py#L146-L168

seanpmorgan on 18 Feb 2020

@veqtor Do you want to participate as a beta-tester?

failure-to-thrive on 18 Feb 2020

@failure-to-thrive Sure would, but I don't know if I can build TFA including the CUDA deps, etc.

veqtor on 19 Feb 2020

This complicates things. I have to find out how to build a .whl for your OS, perhaps the same way the maintainers build packages for PyPI.
@seanpmorgan what do you think about it?

failure-to-thrive on 19 Feb 2020

Perhaps writing tests for the 2nd order grads is better?
It should be quite easy to verify that the autodiff version (built from plain TensorFlow ops) gives the same result as the CUDA implementation.

veqtor on 19 Feb 2020

Of course, unit tests are the first line of defense against bugs. 🐞 🐞 🐞 But what if some of them sneak through anyway? 🐞 Pushing changes through the main TFA repo is not a good idea. Although, it is too early to think about that.
OK. Could you please suggest what to implement first, along with the math papers and some test values to test against?

failure-to-thrive on 20 Feb 2020

I can try to build TFA for my platform.

Maybe start with Mish:

definition:
y = x * tanh(softplus(x))

https://www.wolframalpha.com/input/?i=x+*+tanh%28log%281+%2B+exp%28x%29%29%29

First order derivative:
https://www.wolframalpha.com/input/?i=derivative+of+x+*+tanh%28log%281+%2B+exp%28x%29%29%29

Second order:
https://www.wolframalpha.com/input/?i=second+derivative+of+x+*+tanh%28log%281+%2B+exp%28x%29%29%29
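
For reference, the derivatives can also be written out by hand with the product and chain rules. Writing s = sigmoid(x) and t = tanh(softplus(x)), this gives y' = t + x*s*(1 - t^2) and y'' = (1 - t^2) * (2*s + x*s*(1 - s) - 2*x*s^2*t). The sketch below is a pure-Python check of those formulas against a finite difference; the function names are made up for illustration, and the formulas are a hand derivation that should be compared with the WolframAlpha results above rather than taken as authoritative.

```python
import math


def _parts(x):
    # s = sigmoid(x), t = tanh(softplus(x)).
    s = 1.0 / (1.0 + math.exp(-x))
    t = math.tanh(math.log1p(math.exp(x)))
    return s, t


def mish(x):
    _, t = _parts(x)
    return x * t


def mish_d1(x):
    # y' = t + x * s * (1 - t^2)
    s, t = _parts(x)
    return t + x * s * (1.0 - t * t)


def mish_d2(x):
    # y'' = (1 - t^2) * (2s + x s (1 - s) - 2 x s^2 t)
    s, t = _parts(x)
    return (1.0 - t * t) * (2.0 * s + x * s * (1.0 - s) - 2.0 * x * s * s * t)


def finite_diff(f, x, h=1e-5):
    # Central difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2.0 * h)


points = [-2.0, -1.0, 0.0, 1.0, 2.0]
second = [mish_d2(p) for p in points]
second_fd = [finite_diff(mish_d1, p) for p in points]
print(second)
print(second_fd)
```

At x = 0 the closed form is easy to evaluate by hand: s = 0.5 and t = tanh(ln 2) = 0.6, so y'' = 0.64 * 1 = 0.64.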

veqtor on 20 Feb 2020

I looked a bit at the code for mish grads and the 2nd order derivatives, maybe it can help:
https://gist.github.com/veqtor/794434261abcbb51d67678d5a73caa1d

veqtor on 22 Feb 2020

LGTM, though I haven't tried it beyond small experiments.

veqtor on 5 Mar 2020
