https://github.com/ryankiros/layer-norm
https://arxiv.org/abs/1607.06450
Is it easy to implement Layer Normalization in Keras, as suggested in paper/code above?
Anyone aware of any examples how to do so in Keras?
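For context, the operation the paper describes is simple: for each sample, normalize across its features using that sample's own mean and standard deviation, then apply a learned gain and bias. A minimal NumPy sketch of the formula (the function name and the eps value here are illustrative, not taken from the paper's code):

import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    # Statistics are computed per sample, over the feature (last) axis.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gain * (x - mean) / (std + eps) + bias

# Example: a batch of 2 samples with 4 features each.
x = np.random.randn(2, 4)
print(layer_norm(x, gain=np.ones(4), bias=np.zeros(4)))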
As far as I remember, @rylanchiu was working on an implementation (https://github.com/fchollet/keras/issues/3519); otherwise I can probably get an example out this weekend.
@abishekk92: Thanks, an example would be fantastic!
As mentioned under https://github.com/fchollet/keras/issues/3519, would BatchNormalization with mode=1 do this? It looks like it when going through the code, but I'm not 100% sure (I'm still very new to Keras internals). Can you confirm?
Can BatchNormalization be used to do the same as MinMaxScaler or MaxAbsScaler from sklearn.preprocessing?
@JulesGM I tried your LayerNorm1D layer but got NaNs for loss. Could you post an example of how to use? Thanks!
@cpury
Hey, I think the input data you fed to the layer probably contains columns (along the last dimension) that are all zeros. The layer divides by the standard deviation to do the normalization, the standard deviation involves a square root, and the derivative of the square root at zero is infinite, so you get unstable gradients in the back-propagation phase (a quick demonstration is sketched after this comment).
Most layers initialize their biases to zero, so this is hard to avoid even if you have stacked layers like convolutions in front: you'll get normal results in the first step and NaN in the second.
I think keras-layer-normalization would be a more stable implementation.
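A quick way to reproduce the failure mode described above (a small sketch, assuming a TF1-style Keras backend where K.gradients is usable):

import numpy as np
import keras.backend as K

# One row that is entirely zero, like the black border pixels of an MNIST image.
x = K.variable(np.zeros((1, 4)))
std = K.std(x, axis=-1, keepdims=True)    # forward pass: the std is exactly 0
grad = K.gradients(K.sum(std), [x])[0]    # backward pass goes through sqrt(variance)
print(K.eval(std))   # [[0.]]
print(K.eval(grad))  # NaNs, because d sqrt(v)/dv is infinite at v == 0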
See https://github.com/Lsdefine/attention-is-all-you-need-keras/blob/master/transformer.py#L14 for a good implementation
@JulesGM
It is certainly a good implementation for transformers; however, we can easily trigger a NaN loss in two steps with the following code:
import keras
from keras.datasets import mnist
from keras.layers import Layer, Conv2D, MaxPool2D, Flatten, Dense
from keras.initializers import Ones, Zeros
import keras.backend as K
import numpy as np

class LayerNormalization(Layer):
    def __init__(self, eps=1e-6, **kwargs):
        self.eps = eps
        super(LayerNormalization, self).__init__(**kwargs)

    def build(self, input_shape):
        self.gamma = self.add_weight(name='gamma', shape=input_shape[-1:],
                                     initializer=Ones(), trainable=True)
        self.beta = self.add_weight(name='beta', shape=input_shape[-1:],
                                    initializer=Zeros(), trainable=True)
        super(LayerNormalization, self).build(input_shape)

    def call(self, x):
        # Normalize each position over the channel (last) axis; eps is added to
        # the std *after* the square root, which is where the trouble starts.
        mean = K.mean(x, axis=-1, keepdims=True)
        std = K.std(x, axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

    def compute_output_shape(self, input_shape):
        return input_shape

(x_train, y_train), _ = mnist.load_data()
x_train = np.expand_dims(x_train.astype(K.floatx()) / 255, axis=-1)
y_train = np.expand_dims(y_train, axis=-1)

model = keras.models.Sequential()
model.add(Conv2D(input_shape=(28, 28, 1), kernel_size=3, filters=32))
model.add(MaxPool2D(pool_size=2))
model.add(LayerNormalization())
model.add(Flatten())
model.add(Dense(units=10, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, steps_per_epoch=2)
That is because the images in the MNIST dataset all have black (all-zero) borders, whereas in transformers the positional embedding ensures there are no all-zero inputs.
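One way to make the layer robust to that, regardless of the data, is to keep the argument of the square root away from zero: compute the variance and put eps inside the sqrt, instead of adding eps to the std afterwards. A sketch of such a call(), as a drop-in replacement for the method in the class above (my own variant; I believe this is essentially what keras-layer-normalization and the built-in TF layer do):

    def call(self, x):
        mean = K.mean(x, axis=-1, keepdims=True)
        variance = K.mean(K.square(x - mean), axis=-1, keepdims=True)
        # eps inside the square root keeps the gradient finite even when a
        # whole row of inputs is zero.
        return self.gamma * (x - mean) / K.sqrt(variance + self.eps) + self.beta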
yeah I meant for text
In TensorFlow 2.0, this will likely be present as tf.keras.layers.LayerNormalization. Currently if you build from the r2.0 branch, you can get it with tf.keras.layers.experimental.LayerNormalization.
This exists now in TensorFlow: tf.keras.layers.LayerNormalization
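For anyone finding this now, a minimal usage sketch of the built-in layer in TF 2.x, mirroring the MNIST model above (parameter values are just examples):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPool2D(2),
    # The built-in layer adds epsilon to the variance inside the square root,
    # so all-zero rows no longer blow up the gradients.
    tf.keras.layers.LayerNormalization(axis=-1, epsilon=1e-6),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')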
yeah I bet, now that every platform and their uncle has a transformer implementation