Addons: Feature request: add parametric ELU (PELU) activation function

Created on 20 Feb 2017 · 16 Comments · Source: tensorflow/addons

Proposal

The exponential linear unit (ELU) is already in TensorFlow as tf.nn.elu, which is great. The new parametric version (called PELU) shows very promising experimental results, so I wonder if it could be added to TensorFlow too, to encourage more widespread experimentation with it by the deep learning community. One problem is that it's stateful (it needs trainable parameters, e.g. tf.Variable), so it's not clear to me where in TensorFlow it fits.

Implementation

Here's an implementation of the PELU that I've been using lately (I'm assuming batch_size is the first dimension in x):

def pelu(x):
  """Parametric Exponential Linear Unit (https://arxiv.org/abs/1605.09332v1)."""
  with tf.variable_scope(x.op.name + '_activation', initializer=tf.constant_initializer(1.0)):
    shape = x.get_shape().as_list()[1:]
    alpha = tf.get_variable('alpha', shape)
    beta = tf.get_variable('beta', shape)
    positive = tf.nn.relu(x) * alpha / (beta + 1e-9)
    negative = alpha * (tf.exp((-tf.nn.relu(-x)) / (beta + 1e-9)) - 1)
    return negative + positive
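
For reference, the function this snippet computes can be written out per element. The following is a minimal plain-Python sketch, not the TensorFlow implementation above; `pelu_scalar` and its explicit `a`, `b` arguments are names chosen here for illustration, and it assumes `a > 0` and `b > 0`.

```python
import math

def pelu_scalar(x, a=1.0, b=1.0):
    """PELU for one float: (a / b) * x if x >= 0, else a * (exp(x / b) - 1).

    Illustrative helper (not a TensorFlow API); assumes a > 0 and b > 0.
    """
    if x >= 0:
        return (a / b) * x
    return a * (math.exp(x / b) - 1.0)

# With a = b = 1 this reduces to the ordinary ELU.
print(pelu_scalar(2.0))                  # positive side: identity
print(pelu_scalar(-1.0))                 # negative side: exp(-1) - 1
print(pelu_scalar(2.0, a=1.0, b=0.01))   # slope a / b grows as b shrinks
```

The last call shows how the positive branch gets steeper as `b` approaches zero, which is worth keeping in mind when the parameters are trained without a lower bound.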

Reference

https://arxiv.org/abs/1605.09332v1

Feature Request activations layers

All 16 comments

@fchollet: Is this a welcome contribution?

Anything that has a state (e.g. this function) should not be added as a function in tf.nn (which should only contain pure functions), but rather as a layer (tf.layers). This is precisely what layers are meant to cover. This is a PELU layer, not an activation function.

However I would be wary of adding new core layers for each and every paper out there. At best this would be a candidate for contrib.

> Anything that has a state (e.g. this function) should not be added as a function in tf.nn (which should only contain pure functions), but rather as a layer (tf.layers). This is precisely what layers are meant to cover. This is a PELU layer, not an activation function.

Yeah, figured tf.nn should be without side effects. Good to hear! :+1:

> However I would be wary of adding new core layers for each and every paper out there. At best this would be a candidate for contrib.

Which contrib package would be suited for new activation functions? tf.contrib.layers is still out-of-sync with tf.layers (both have similar purpose but different origins and APIs) so I expect it might be deprecated or heavily refactored soon? It would be nice if tf.contrib.layers were an extension for tf.layers, but I suppose it's hard to juggle the backwards compatibility.

A little warning: the PELU implementation above might cause NaN loss when it becomes too steep (inf gradient). :koala:

@carlthome, where are you implementing the non-negativity constraint for a and b? Just in the loss function?

Derp. Good catch, @rmkemker. That's why I've been getting NaNs sometimes...

I guess just penalizing the loss could work

loss += tf.minimum(tf.reduce_min(alpha), 0)
loss += tf.minimum(tf.reduce_min(beta), 0)

but how about just wrapping the parameters in tf.abs and hoping it doesn't matter that two points on the loss surface map to the same ELU shape?
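
One thing to check in the penalty above: `tf.minimum(tf.reduce_min(alpha), 0)` is negative whenever a parameter is negative, so adding it lowers the loss instead of raising it. Here is a minimal plain-Python sketch of the two ideas discussed here, with the sign flipped so that violations cost something. `positivity_penalty` and `reparametrize_abs` are hypothetical helper names, and lists of floats stand in for the `alpha` and `beta` tensors.

```python
def positivity_penalty(params):
    """Hinge penalty: 0 when every entry is >= 0, else the size of the worst violation."""
    return max(0.0, -min(params))

def reparametrize_abs(params):
    """Alternative: use |p| in the forward pass, so p and -p give the same ELU shape."""
    return [abs(p) for p in params]

alpha = [0.8, -0.3, 1.2]
print(positivity_penalty(alpha))   # 0.3, added to the loss
print(positivity_penalty([0.5]))   # 0.0, no violation
print(reparametrize_abs(alpha))    # [0.8, 0.3, 1.2]
```

Neither variant keeps the parameters away from zero, so `alpha / beta` can still become very large; the penalty only discourages negative values.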

That is exactly what I did, but I am still getting the NaN values after it trains for a while. I even tried weighting the penalties a bit higher with the same result. My code trains till completion with ReLU activation, so it must be the PELU somehow.

SELU feels like a big improvement over ReLU (without needing extra parameters like PELU) so I don't believe PELU is relevant anymore. As far as I'm concerned this issue could be closed.

I will give it a shot. As an FYI, I was able to stabilize training for PELU in some instances by lowering the learning rate. It is still pretty finicky. Hopefully, SELU is more stable during training!

Is this a good implementation of PELU? The paper contains lines like:

```
Unlike Maxout, our PELU adds only 2L parameters, where
L is the number of layers, which makes our activation as
computationally demanding as the original ELU function
```

and

```
It is interesting to note that PELU only adds 112 additional
parameters, a negligible increase of 0.006% over the total
number of parameters.
```

(regarding ResNet-112). Still, your implementation introduces one parameter for each neuron, which is far more than the number mentioned in the paper. I suggest changing

```python
    shape = x.get_shape().as_list()[1:]
    alpha = tf.get_variable('alpha', shape)
    beta = tf.get_variable('beta', shape)
```

to

```python
    alpha = tf.get_variable('alpha', 1)
    beta = tf.get_variable('beta', 1)
```

Please correct me if I didn't understand the paper correctly.
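
To make the difference concrete, here is a small plain-Python count of the two parameterizations. The layer shapes are made up for illustration (they are not from the paper), and `pelu_param_count` is a hypothetical helper.

```python
from math import prod

def pelu_param_count(activation_shapes, per_neuron):
    """Count alpha/beta parameters over all PELU activations.

    activation_shapes: shape of each activation's input, excluding the batch dimension.
    per_neuron=True  -> alpha and beta have the full input shape (original snippet).
    per_neuron=False -> one scalar alpha and one scalar beta per activation (2L in total).
    """
    if per_neuron:
        return sum(2 * prod(shape) for shape in activation_shapes)
    return 2 * len(activation_shapes)

# Hypothetical network: three activations on 32x32x16 feature maps.
shapes = [(32, 32, 16)] * 3
print(pelu_param_count(shapes, per_neuron=True))   # 98304
print(pelu_param_count(shapes, per_neuron=False))  # 6
```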

Thanks for correcting this! 😃

No problem! The authors also mention constraining alpha and beta so that they always stay positive: "For preserving parameter positivity after the updates, we constrain them to always be greater than 0.1." The code can be changed from:

    alpha = tf.get_variable('alpha', 1)
    beta = tf.get_variable('beta', 1)

to

    alpha = tf.get_variable('alpha', 1, constraint=lambda t: tf.maximum(t, 0.1))
    beta = tf.get_variable('beta', 1, constraint=lambda t: tf.maximum(t, 0.1))

but I don't know the most elegant way to apply such a constraint in older TensorFlow versions (such as 1.2.1). Would something like this work?

    alpha = tf.maximum(tf.get_variable('alpha', 1), 0.1)
    beta = tf.maximum(tf.get_variable('beta', 1), 0.1)
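
On the question above: clamping inside the forward pass and projecting after the update are not quite the same thing. With `tf.maximum(variable, 0.1)` the stored variable can still drift below 0.1, and once it is there the maximum passes no gradient back to it. A `constraint`, by contrast, is applied to the variable itself after each optimizer step. Here is a minimal plain-Python sketch of the projection idea, assuming plain SGD; `sgd_step_with_floor` is an illustrative name, not a TensorFlow function.

```python
def sgd_step_with_floor(value, grad, lr=0.1, floor=0.1):
    """One SGD update followed by projection onto [floor, inf)."""
    updated = value - lr * grad
    return max(updated, floor)

beta = 0.15
for grad in [1.0, 1.0, -0.5]:
    beta = sgd_step_with_floor(beta, grad)
    print(beta)  # never drops below the floor of 0.1
```

Because the stored value itself is clamped, a later gradient pointing upward (the third step here) can move the parameter away from the floor again.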

Hello,
Is anyone working on this?
TIA

Thank you for suggesting this feature request! Though parametric ELU would not necessarily fit in tensorflow/tensorflow, I'm sure it would be a welcome contribution to activations or layers in Addons.

@seanpmorgan for awareness; am going to transfer this issue to tensorflow/addons.

My only hesitation would be whether or not PELU is still relevant now that SELU is part of TF-Core. Does anyone in the community still see this as useful and, if so, could you point to an architecture or experiment that shows its value? Thanks!

Seeing as the original author of this issue no longer sees value and there are several alternatives I'm going to close this issue. Happy to have someone re-open it if there is evidence for its usefulness.
