Keras: Layer Normalization -- easy to do in Keras?

Created on 27 Sep 2016 · 11 Comments · Source: keras-team/keras

https://github.com/ryankiros/layer-norm
https://arxiv.org/abs/1607.06450

Is it easy to implement Layer Normalization in Keras, as suggested in paper/code above?
Anyone aware of any examples how to do so in Keras?

stale

Most helpful comment

@JulesGM I tried your LayerNorm1D layer but got NaNs for the loss. Could you post an example of how to use it? Thanks!

All 11 comments

As far as I remember, @rylanchiu was working on an implementation (https://github.com/fchollet/keras/issues/3519); otherwise I can probably get an example out this weekend.

@abishekk92: Thanks, an example would be fantastic!

As mentioned under https://github.com/fchollet/keras/issues/3519, would BatchNormalization with mode=1 do this? It looks like it when going through the code, but I'm not 100% sure (I'm still very new to Keras's inner workings). Can you confirm? A sketch of what I mean is below.
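
For reference, a minimal sketch of what that would look like, assuming a Keras 1.x install where BatchNormalization still accepts the mode argument (it was removed in Keras 2); mode=1 was documented as sample-wise normalization on 2D inputs:

# Sketch only: assumes Keras 1.x, where BatchNormalization(mode=1)
# performs sample-wise normalization (the mode argument is gone in Keras 2).
from keras.models import Sequential
from keras.layers import Dense
from keras.layers.normalization import BatchNormalization

model = Sequential()
model.add(Dense(64, input_dim=100))
model.add(BatchNormalization(mode=1))  # normalize each sample over its 64 features
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')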

Can BatchNormalization be used to do the same as MinMaxScaler or MaxAbsScaler from sklearn.preprocessing?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@JulesGM I tried your LayerNorm1D layer but got NaNs for the loss. Could you post an example of how to use it? Thanks!

@cpury

Hey, I think the input you fed to that layer may contain positions along the last dimension that are all zeros. The layer divides by the standard deviation to normalize, and the standard deviation involves a square root whose derivative at zero is infinite, so you get unstable gradients during back-propagation.

Most layers initialize their biases to zero, so this is hard to avoid even if you stack layers such as convolutions in front: you get normal results in the first step and NaN in the second.
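
A quick way to see this (an illustrative check, assuming the TensorFlow 1.x backend of that era) is to ask the backend for the gradient of the standard deviation at an all-zero input:

import numpy as np
import keras.backend as K

x = K.variable(np.zeros((1, 4)))         # one sample whose features are all zero
std = K.std(x, axis=-1, keepdims=True)   # the standard deviation is exactly 0
grad = K.gradients(K.sum(std), [x])[0]   # d std / d x contains a 1 / (2 * sqrt(var)) term
print(K.eval(grad))                      # inf from the sqrt gradient times 0 -> nan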

I think keras-layer-normalization would be a more stable implementation.
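
If I remember that package correctly, it is a drop-in Keras layer, used roughly like this (the package name and import path are from memory, so treat them as assumptions):

# Assumes the third-party keras-layer-normalization package
# (pip install keras-layer-normalization); import path from memory.
from keras.models import Sequential
from keras.layers import Dense
from keras_layer_normalization import LayerNormalization

model = Sequential()
model.add(Dense(64, input_dim=100))
model.add(LayerNormalization())  # normalizes each sample over the last axis
model.add(Dense(10, activation='softmax'))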

See https://github.com/Lsdefine/attention-is-all-you-need-keras/blob/master/transformer.py#L14 for a good implementation

@JulesGM

It is certainly a good implementation for transformers; however, we can easily trigger a NaN loss in two training steps with the following code:

import keras
from keras.datasets import mnist
from keras.layers import Layer, Conv2D, MaxPool2D, Flatten, Dense
from keras.initializers import Ones, Zeros
import keras.backend as K
import numpy as np


class LayerNormalization(Layer):
    def __init__(self, eps=1e-6, **kwargs):
        self.eps = eps
        super(LayerNormalization, self).__init__(**kwargs)

    def build(self, input_shape):
        # Learnable scale and shift, one value per feature on the last axis.
        self.gamma = self.add_weight(name='gamma', shape=input_shape[-1:],
                                     initializer=Ones(), trainable=True)
        self.beta = self.add_weight(name='beta', shape=input_shape[-1:],
                                    initializer=Zeros(), trainable=True)
        super(LayerNormalization, self).build(input_shape)

    def call(self, x):
        # Normalize every position over the last axis. Note that eps is added
        # to the standard deviation, not to the variance, so the square root
        # inside K.std is still differentiated at zero.
        mean = K.mean(x, axis=-1, keepdims=True)
        std = K.std(x, axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

    def compute_output_shape(self, input_shape):
        return input_shape


(x_train, y_train), _ = mnist.load_data()

x_train = np.expand_dims(x_train.astype(K.floatx()) / 255, axis=-1)
y_train = np.expand_dims(y_train, axis=-1)

model = keras.models.Sequential()
model.add(Conv2D(input_shape=(28, 28, 1), kernel_size=3, filters=32))
model.add(MaxPool2D(pool_size=2))
model.add(LayerNormalization())  # at the all-black corners every channel is 0, so std is 0
model.add(Flatten())
model.add(Dense(units=10, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, steps_per_epoch=2)

That is because the images in the MNIST dataset all have black borders, while in transformers the positional embedding ensures that there are no all-zero inputs.
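
An illustrative check of that claim (not part of the original snippet): count how many training images have an all-zero 3x3 patch in the top-left corner; at any such position a zero-bias convolution outputs exactly zero in every filter, so the layer normalization above sees a zero standard deviation there.

import numpy as np
from keras.datasets import mnist

(x_train, _), _ = mnist.load_data()

# Fraction of training images whose top-left 3x3 corner is completely black.
corner_black = np.all(x_train[:, :3, :3] == 0, axis=(1, 2))
print(corner_black.mean())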

yeah I meant for text

In TensorFlow 2.0, this will likely be available as tf.keras.layers.LayerNormalization. Currently, if you build from the r2.0 branch, you can get it as tf.keras.layers.experimental.LayerNormalization.

This exists now in TensorFlow: tf.keras.layers.LayerNormalization
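
For completeness, a minimal sketch of the built-in layer in the same MNIST setup as above, assuming TensorFlow 2.x. As far as I can tell from the source, the built-in layer adds epsilon to the variance inside the square root, which avoids the zero-variance gradient problem demonstrated earlier.

import tensorflow as tf

# Same architecture as the NaN example above, but with the built-in layer.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPool2D(pool_size=2),
    tf.keras.layers.LayerNormalization(axis=-1, epsilon=1e-6),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')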

yeah I bet, now that every platform and their uncle has a transformer implementation
