Addons: Feature request: add parametric ELU (PELU) activation function

Created on 20 Feb 2017 · 16 Comments · Source: tensorflow/addons

Proposal

The exponential linear unit (ELU) is already in TensorFlow as tf.nn.elu, which is great. The new parametric version (PELU) shows very promising experimental results, so I wonder whether it could be added to TensorFlow too, to encourage more widespread experimentation with it by the deep learning community. One problem is that it's stateful (its parameters need a tf.Variable), so it's not clear to me where in TensorFlow it fits.

Implementation

Here's an implementation of the PELU that I've been using lately (I'm assuming batch_size is the first dimension in x):

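    import tensorflow as tf  # TensorFlow 1.x API (tf.variable_scope, tf.get_variable)

    def pelu(x):
      """Parametric Exponential Linear Unit (https://arxiv.org/abs/1605.09332v1)."""
      # Every variable created in this scope is initialized to 1.0.
      with tf.variable_scope(x.op.name + '_activation', initializer=tf.constant_initializer(1.0)):
        # One alpha/beta pair per unit: all dimensions except the batch dimension.
        shape = x.get_shape().as_list()[1:]
        alpha = tf.get_variable('alpha', shape)
        beta = tf.get_variable('beta', shape)
        # x >= 0: linear with slope alpha / beta.
        positive = tf.nn.relu(x) * alpha / (beta + 1e-9)
        # x < 0: alpha * (exp(x / beta) - 1); -relu(-x) equals min(x, 0).
        negative = alpha * (tf.exp((-tf.nn.relu(-x)) / (beta + 1e-9)) - 1)
        return negative + positive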
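
For reference, the piecewise function that the code above computes can be written out for a single scalar input. This is a minimal plain-Python sketch rather than TensorFlow code; the name `pelu_scalar` and its default arguments are assumptions made for illustration.

```python
import math

def pelu_scalar(x, alpha=1.0, beta=1.0):
    """PELU for one scalar: (alpha / beta) * x if x >= 0, else alpha * (exp(x / beta) - 1)."""
    if x >= 0:
        return (alpha / beta) * x
    return alpha * (math.exp(x / beta) - 1.0)

# With alpha = beta = 1 the function reduces to the ordinary ELU.
print(pelu_scalar(2.0))                       # positive branch, slope 1
print(pelu_scalar(-1.0))                      # negative branch, exp(-1) - 1
print(pelu_scalar(2.0, alpha=2.0, beta=0.5))  # positive branch, slope alpha / beta = 4
```

Unlike the TensorFlow version, this sketch omits the `1e-9` term in the denominator, so it assumes `beta` is non-zero.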

Reference

https://arxiv.org/abs/1605.09332v1

Feature Request activations layers

carlthome

👍6

All 16 comments

@fchollet: Is this a welcome contribution?

poxvoculi on 21 Feb 2017

Anything that has a state (e.g. this function) should not be added as a function in tf.nn (which should only contain pure functions), but rather as a layer (tf.layers). This is precisely what layers are meant to cover. This is a PELU layer, not an activation function.

However I would be wary of adding new core layers for each and every paper out there. At best this would be a candidate for contrib.
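
The distinction between a pure function and a stateful layer can be sketched in plain Python, without any TensorFlow API. The class name `PELUUnit` is hypothetical, and the manual attribute assignment only stands in for what an optimizer would do during training.

```python
import math

def elu(x, alpha=1.0):
    """Pure function: the output depends only on the arguments."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

class PELUUnit:
    """Stateful object: owns alpha and beta, which a training loop would update."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha
        self.beta = beta

    def __call__(self, x):
        if x >= 0:
            return (self.alpha / self.beta) * x
        return self.alpha * (math.exp(x / self.beta) - 1.0)

unit = PELUUnit()
before = unit(-1.0)
unit.alpha = 2.0     # stand-in for an optimizer update
after = unit(-1.0)   # same input, different output because the state changed
print(before, after)
```

Calling `elu(-1.0)` twice always gives the same value, while `unit(-1.0)` changes once its parameters change, which is why the parametric version fits a layer abstraction better than `tf.nn`.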

fchollet on 22 Feb 2017

> Anything that has a state (e.g. this function) should not be added as a function in tf.nn (which should only contain pure functions), but rather as a layer (tf.layers). This is precisely what layers are meant to cover. This is a PELU layer, not an activation function.

Yeah, figured tf.nn should be without side effects. Good to hear! :+1:

> However I would be wary of adding new core layers for each and every paper out there. At best this would be a candidate for contrib.

Which contrib package would be suited for new activation functions? tf.contrib.layers is still out-of-sync with tf.layers (both have similar purpose but different origins and APIs) so I expect it might be deprecated or heavily refactored soon? It would be nice if tf.contrib.layers were an extension for tf.layers, but I suppose it's hard to juggle the backwards compatibility.

carlthome on 22 Feb 2017

A little warning: the PELU implementation above might cause NaN loss when it becomes too steep (inf gradient). :koala:
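
A rough illustration of how the implementation above can blow up, in plain Python. The helper names are hypothetical; `EPS` mirrors the `1e-9` term from the implementation. The sketch assumes that nothing keeps `beta` away from zero or from negative values.

```python
import math

EPS = 1e-9  # same stabilizer as in the implementation above

def positive_slope(alpha, beta):
    """Slope of the x >= 0 branch, alpha / (beta + EPS)."""
    return alpha / (beta + EPS)

def negative_branch(x, alpha, beta):
    """Value of the x < 0 branch, alpha * (exp(x / (beta + EPS)) - 1)."""
    return alpha * (math.exp(x / (beta + EPS)) - 1.0)

for beta in (1.0, 1e-3, 1e-6):
    print(beta, positive_slope(1.0, beta))  # the slope grows as beta approaches 0

# If beta turns negative, x / beta is positive for x < 0 and the exponential explodes.
try:
    negative_branch(-100.0, 1.0, -0.1)
except OverflowError:
    print("exp overflowed")  # plain Python raises; float32 tensors would give inf instead
```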

carlthome on 12 Apr 2017

@carlthome, where are you implementing the non-negativity constraint for a and b? Just in the loss function?

rmkemker on 15 Jun 2017

Derp. Good catch, @rmkemker. That's why I've been getting NaNs sometimes...

I guess just penalizing the loss could work:

    # Add a positive penalty whenever the smallest alpha or beta goes negative.
    loss -= tf.minimum(tf.reduce_min(alpha), 0.0)
    loss -= tf.minimum(tf.reduce_min(beta), 0.0)

but how about just wrapping the parameters in tf.abs and hoping it doesn't matter that two points on the loss surface map to the same ELU shape?
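
Both ideas can be checked on plain numbers. This is a sketch with hypothetical helper names, not TensorFlow code: a penalty term should be zero for valid parameters and grow as a parameter becomes more negative, and an absolute-value wrapper maps a raw value and its negation to the same effective parameter.

```python
def negativity_penalty(values):
    """Zero when every value is non-negative, otherwise the size of the most negative value."""
    return max(-min(values), 0.0)

def reparameterized(raw):
    """Effective parameter when the raw variable is wrapped in an absolute value."""
    return abs(raw)

print(negativity_penalty([0.5, 1.2]))    # no negative entries, so no penalty
print(negativity_penalty([0.5, -0.2]))   # the penalty grows with the violation
print(reparameterized(-0.7) == reparameterized(0.7))  # both raw values give the same shape
```

Neither approach keeps the parameters away from zero, so the slope `alpha / beta` can still become very large.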

carlthome on 16 Jun 2017

That is exactly what I did, but I am still getting the NaN values after it trains for a while. I even tried weighting the penalties a bit higher with the same result. My code trains till completion with ReLU activation, so it must be the PELU somehow.

rmkemker on 16 Jun 2017

SELU feels like a big improvement over ReLU (without needing extra parameters like PELU) so I don't believe PELU is relevant anymore. As far as I'm concerned this issue could be closed.
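
For comparison, SELU is an ELU with two fixed constants and no learned parameters. A scalar sketch in plain Python, using the constants published with SELU (the same values `tf.nn.selu` is based on, as far as I know):

```python
import math

# Fixed constants; nothing here is learned during training.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    """Scaled ELU: scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise."""
    if x > 0:
        return SELU_SCALE * x
    return SELU_SCALE * SELU_ALPHA * (math.exp(x) - 1.0)

print(selu(1.0))    # equals SELU_SCALE
print(selu(-50.0))  # saturates near -SELU_SCALE * SELU_ALPHA
```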

carlthome on 7 Jul 2017

I will give it a shot. As an FYI, I was able to stabilize training for PELU in some instances by lowering the learning rate. It is still pretty finicky. Hopefully, SELU is more stable during training!

rmkemker on 7 Jul 2017

Is this a correct implementation of the PELU? The paper contains lines like:

```
Unlike Maxout, our PELU adds only 2L parameters, where
L is the number of layers, which makes our activation as
computationally demanding as the original ELU function
```

and

```
It is interesting to note that PELU only adds 112 additional
parameters, a negligible increase of 0.006% over the total
number of parameters.
```

(regarding ResNet-112). Still, your implementation introduces one parameter pair for each neuron, which is many more than the number mentioned in the paper. I suggest the following change, from:

```python
    shape = x.get_shape().as_list()[1:]
    alpha = tf.get_variable('alpha', shape)
    beta = tf.get_variable('beta', shape)
```

to:

```python
    alpha = tf.get_variable('alpha', 1)
    beta = tf.get_variable('beta', 1)
```

Please correct me if I've misunderstood the paper.
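
The difference in parameter count is easy to quantify. A small sketch, assuming a hypothetical feature map of shape 32 × 32 × 16 (the helper name `pelu_param_count` is also made up for illustration):

```python
import math

def pelu_param_count(activation_shape, per_unit):
    """Number of PELU parameters (alpha and beta) for one layer.

    activation_shape excludes the batch dimension, as in x.get_shape().as_list()[1:].
    """
    units = math.prod(activation_shape) if per_unit else 1
    return 2 * units

shape = [32, 32, 16]  # hypothetical feature map: height, width, channels
print(pelu_param_count(shape, per_unit=True))   # one alpha/beta pair per activation
print(pelu_param_count(shape, per_unit=False))  # one alpha/beta pair for the whole layer
```

With one pair per layer, a network with L PELU layers adds 2L parameters in total, which matches the count quoted from the paper.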

kacper1095 on 11 Sep 2018

👍2

Thanks for correcting this! 😃

carlthome on 11 Sep 2018

No problem! The authors also mention constraining alpha and beta to stay positive: "For preserving parameter positivity after the updates, we constrain them to always be greater than 0.1." The code can be changed from:

    alpha = tf.get_variable('alpha', 1)
    beta = tf.get_variable('beta', 1)

to:

    alpha = tf.get_variable('alpha', 1, constraint=lambda t: tf.maximum(t, 0.1))
    beta = tf.get_variable('beta', 1, constraint=lambda t: tf.maximum(t, 0.1))

but I don't know the most elegant way to apply such a constraint in older TensorFlow versions (such as 1.2.1). Would something like this work?

    alpha = tf.maximum(tf.get_variable('alpha', 1), 0.1)
    beta = tf.maximum(tf.get_variable('beta', 1), 0.1)
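
The two variants above are not quite equivalent. As I understand it, a `constraint` is a projection applied to the variable after each optimizer update, while wrapping the variable in `tf.maximum` puts the clamp into the forward pass. A scalar sketch of the difference, assuming plain SGD; the function names and the learning rate are hypothetical:

```python
def projected_step(param, grad, lr=0.1, floor=0.1):
    """One gradient step on the variable itself, then projection onto [floor, inf)."""
    return max(param - lr * grad, floor)

def clamped_forward(raw, floor=0.1):
    """Clamp inside the forward pass: returns (effective value, d effective / d raw)."""
    if raw > floor:
        return raw, 1.0
    return floor, 0.0  # at or below the floor, no gradient reaches the raw variable

# Projection: the stored variable never leaves the feasible range.
print(projected_step(0.15, grad=1.0))  # the update alone would give 0.05

# Clamp in the graph: the raw variable may sit below the floor with zero gradient.
value, local_grad = clamped_forward(0.05)
print(value, local_grad)
```

So the wrapped version should keep the effective value at or above 0.1, but once the raw variable drops below the floor it may stop receiving gradient and stay there.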

kacper1095 on 11 Sep 2018

Hello,
Is anyone working on this?
TIA

SSaishruthi on 23 May 2019

Thank you for suggesting this feature request! Though parametric ELU would not necessarily fit in tensorflow/tensorflow, I'm sure it would be a welcome contribution to activations or layers in Addons.

@seanpmorgan for awareness; am going to transfer this issue to tensorflow/addons.

dynamicwebpaige on 27 Oct 2019

My only hesitation would be whether or not PELU is still relevant now that SELU is part of TF-Core. Does anyone in the community still see this as useful, and if so, could you point to an architecture or experiment that shows value? Thanks!

seanpmorgan on 28 Oct 2019

👍1

Seeing as the original author of this issue no longer sees value and there are several alternatives I'm going to close this issue. Happy to have someone re-open it if there is evidence for its usefulness.

seanpmorgan on 4 Nov 2019
