Keras: Layer Normalization -- easy to do in Keras?

Created on 27 Sep 2016 · 11 Comments · Source: keras-team/keras

https://github.com/ryankiros/layer-norm
https://arxiv.org/abs/1607.06450

Is it easy to implement Layer Normalization in Keras, as suggested in paper/code above?
Anyone aware of any examples how to do so in Keras?

stale

Most helpful comment

@JulesGM I tried your LayerNorm1D layer but got NaNs for the loss. Could you post an example of how to use it? Thanks!

All 11 comments

As far as I remember, @rylanchiu was working on an implementation (https://github.com/fchollet/keras/issues/3519); otherwise I can probably get an example out this weekend.

@abishekk92: Thanks, an example would be fantastic!

As mentioned under https://github.com/fchollet/keras/issues/3519, would BatchNormalization with mode=1 do this? It looks like it when going through the code, but I'm not 100% sure (I'm still very new to Keras's inner workings). Can you confirm? A sketch of what I mean is below.
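
For reference, a minimal sketch of what that would look like, assuming a Keras 1.x install where BatchNormalization still accepts the mode argument (it was removed in Keras 2); mode=1 was documented as sample-wise normalization on 2D inputs:

# Sketch only: assumes Keras 1.x, where BatchNormalization(mode=1)
# performs sample-wise normalization (the mode argument is gone in Keras 2).
from keras.models import Sequential
from keras.layers import Dense
from keras.layers.normalization import BatchNormalization

model = Sequential()
model.add(Dense(64, input_dim=100))
model.add(BatchNormalization(mode=1))  # normalize each sample over its 64 features
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')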

Can BatchNormalization be used to do the same as MinMaxScaler or MaxAbsScaler from sklearn.preprocessing?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@JulesGM I tried your LayerNorm1D layer but got NaNs for the loss. Could you post an example of how to use it? Thanks!

@cpury

Hey, I think the input you fed to that layer may contain positions along the last dimension that are all zeros. The layer divides by the standard deviation to normalize, and the standard deviation involves a square root whose derivative at zero is infinite, so you get unstable gradients during back-propagation.

Most layers initialize their biases to zero, so this is hard to avoid even if you stack layers such as convolutions in front: you get normal results in the first step and NaN in the second.
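
A quick way to see this (an illustrative check, assuming the TensorFlow 1.x backend of that era) is to ask the backend for the gradient of the standard deviation at an all-zero input:

import numpy as np
import keras.backend as K

x = K.variable(np.zeros((1, 4)))         # one sample whose features are all zero
std = K.std(x, axis=-1, keepdims=True)   # the standard deviation is exactly 0
grad = K.gradients(K.sum(std), [x])[0]   # d std / d x contains a 1 / (2 * sqrt(var)) term
print(K.eval(grad))                      # inf from the sqrt gradient times 0 -> nan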

I think keras-layer-normalization would be a more stable implementation.
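
If I remember that package correctly, it is a drop-in Keras layer, used roughly like this (the package name and import path are from memory, so treat them as assumptions):

# Assumes the third-party keras-layer-normalization package
# (pip install keras-layer-normalization); import path from memory.
from keras.models import Sequential
from keras.layers import Dense
from keras_layer_normalization import LayerNormalization

model = Sequential()
model.add(Dense(64, input_dim=100))
model.add(LayerNormalization())  # normalizes each sample over the last axis
model.add(Dense(10, activation='softmax'))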

See https://github.com/Lsdefine/attention-is-all-you-need-keras/blob/master/transformer.py#L14 for a good implementation

@JulesGM

It is certainly a good implementation for transformers; however, we can easily trigger a NaN loss in two training steps with the following code:

import keras
from keras.datasets import mnist
from keras.layers import Layer, Conv2D, MaxPool2D, Flatten, Dense
from keras.initializers import Ones, Zeros
import keras.backend as K
import numpy as np


class LayerNormalization(Layer):
    def __init__(self, eps=1e-6, **kwargs):
        self.eps = eps
        super(LayerNormalization, self).__init__(**kwargs)

    def build(self, input_shape):
        # Learnable scale and shift, one value per feature on the last axis.
        self.gamma = self.add_weight(name='gamma', shape=input_shape[-1:],
                                     initializer=Ones(), trainable=True)
        self.beta = self.add_weight(name='beta', shape=input_shape[-1:],
                                    initializer=Zeros(), trainable=True)
        super(LayerNormalization, self).build(input_shape)

    def call(self, x):
        # Normalize every position over the last axis. Note that eps is added
        # to the standard deviation, not to the variance, so the square root
        # inside K.std is still differentiated at zero.
        mean = K.mean(x, axis=-1, keepdims=True)
        std = K.std(x, axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

    def compute_output_shape(self, input_shape):
        return input_shape


(x_train, y_train), _ = mnist.load_data()

x_train = np.expand_dims(x_train.astype(K.floatx()) / 255, axis=-1)
y_train = np.expand_dims(y_train, axis=-1)

model = keras.models.Sequential()
model.add(Conv2D(input_shape=(28, 28, 1), kernel_size=3, filters=32))
model.add(MaxPool2D(pool_size=2))
model.add(LayerNormalization())  # at the all-black corners every channel is 0, so std is 0
model.add(Flatten())
model.add(Dense(units=10, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, steps_per_epoch=2)

That is because the images in the MNIST dataset all have black borders, while in transformers the positional embedding ensures that there are no all-zero inputs.
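
An illustrative check of that claim (not part of the original snippet): count how many training images have an all-zero 3x3 patch in the top-left corner; at any such position a zero-bias convolution outputs exactly zero in every filter, so the layer normalization above sees a zero standard deviation there.

import numpy as np
from keras.datasets import mnist

(x_train, _), _ = mnist.load_data()

# Fraction of training images whose top-left 3x3 corner is completely black.
corner_black = np.all(x_train[:, :3, :3] == 0, axis=(1, 2))
print(corner_black.mean())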

yeah I meant for text

In TensorFlow 2.0, this will likely be available as tf.keras.layers.LayerNormalization. Currently, if you build from the r2.0 branch, you can get it as tf.keras.layers.experimental.LayerNormalization.

This exists now in TensorFlow: tf.keras.layers.LayerNormalization
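
For completeness, a minimal sketch of the built-in layer in the same MNIST setup as above, assuming TensorFlow 2.x. As far as I can tell from the source, the built-in layer adds epsilon to the variance inside the square root, which avoids the zero-variance gradient problem demonstrated earlier.

import tensorflow as tf

# Same architecture as the NaN example above, but with the built-in layer.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPool2D(pool_size=2),
    tf.keras.layers.LayerNormalization(axis=-1, epsilon=1e-6),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')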

yeah I bet, now that every platform and their uncle has a transformer implementation
