Keras: Average of non-zero word embeddings

Created on 28 Jan 2016 · 20 comments · Source: keras-team/keras

I'd like to implement a layer that receives some word embeddings from an Embedding layer, which received a padded list of word indexes, and averages all non-zero word vectors to produce one output vector. The input dimensions would be something like (None, 5, 200) and the corresponding output dimensions should then be (None, 1, 200).

I've implemented this so far:

import theano.tensor as T
from keras.layers.core import Layer  # Layer base class in the old-style Keras API used here

class NonZeroAverage(Layer):

    @property
    def output_shape(self):
        shape = list(self.input_shape)
        assert len(shape) == 3  # only valid for 3D tensors
        shape[1] = 1
        return tuple(shape)

    def get_output(self, train=False):
        x = self.get_input(train)
        shape = list(self.input_shape)
        sums = x.sum(axis=-1)
        counts = T.neq(x, 0).sum(axis=-1)
        avg_non_zeros = sums / counts
        reshaped = avg_non_zeros.reshape((shape[0], 1, shape[2])).astype('float32') # convolution requires float32
        return reshaped

I'm not sure this is correct, though. I'm getting the following error, since I don't know how to reshape back to the None (batch) dimension.

theano.tensor.var.AsTensorError: ('Cannot convert (None, 1, 200) to TensorType', <type 'tuple'>)

Can you help me?
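(For reference, the AsTensorError above comes from passing the Python None in self.input_shape to reshape. A minimal sketch of one workaround, which is an assumption on my part and not part of the original question: use Theano's symbolic batch size x.shape[0] when building the reshape tuple.)

# hypothetical fix inside get_output: x.shape[0] is the symbolic batch size,
# so no Python None ends up in the reshape tuple
reshaped = avg_non_zeros.reshape((x.shape[0], 1, shape[2])).astype('float32')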

All 20 comments

Well, this is a problem with Theano, not Keras.
If I understand correctly, you want to average all word embeddings of a sequence, excluding the zero padding.
In your code,

sums = x.sum(axis=-1)
counts = T.neq(x, 0).sum(axis=-1)

the axes are not specified correctly.
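To see the axis issue concretely, here is a small NumPy illustration (mine, not from the thread): with an input of shape (batch, steps, dim), axis=-1 collapses the embedding dimension, while the sum you want is over the sequence axis, axis=1.

import numpy as np

# toy batch: 1 sample, 3 timesteps, 2-dim embeddings; the last timestep is padding
x = np.array([[[1.0, 2.0],
               [3.0, 4.0],
               [0.0, 0.0]]])

print(x.sum(axis=-1).shape)  # (1, 3) -> sums over the embedding dim (not what we want)
print(x.sum(axis=1).shape)   # (1, 2) -> sums over the sequence, one vector per sample

# count the non-padding timesteps, then divide
nonzero_steps = (np.not_equal(x, 0).sum(axis=2) != 0).sum(axis=1)  # array([2])
print(x.sum(axis=1) / nonzero_steps[:, None])                      # [[2. 3.]]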

You can try the following code.

class NonZeroAverage(Layer):
    @property
    def output_shape(self):
        shape = list(self.input_shape)
        assert len(shape) == 3  # only valid for 3D tensors
        return tuple([shape[0], shape[2]])   # your output is just a 2D tensor, no need to include shape[1], which is equal to 1

    def get_output(self, train=False):
        x = self.get_input(train)
        shape = list(self.input_shape)
        sums = x.sum(axis=1)                # (batch, dim): sum over the sequence axis
        c = T.neq(x, 0).sum(axis=2)         # (batch, steps): non-zero entries per timestep
        count = T.neq(c, 0).sum(axis=1)     # (batch,): number of non-padding timesteps
        t = [count] * shape[2]
        stacked = T.stack(*t).transpose()   # (batch, dim): counts repeated for each dimension
        ave = sums / stacked
        return ave

The final output is a 2D tensor of shape (batch_size, dim).
If you REALLY want a 3D tensor of shape (batch_size, 1, dim), you can add a Reshape layer after the NonZeroAverage layer.

Thanks, this was very helpful!
For completeness: I had to convert the output to float32 with ave.astype('float32') to play nicely with the filter weights, and I added Reshape((1, embedding_size)) after the NonZeroAverage layer.

Hi @sebastianruder - would you mind posting your code? I have a similar use case.

@sergeyf, sure.
The layer ended up looking like this:

class NonZeroAverage(Layer):
    """
    Layer that averages over non-zero word embeddings to produce an average vector for e.g. an entity or an aspect.
    Not fully implemented yet.
    """
    @property
    def output_shape(self):
        shape = list(self.input_shape)
        assert len(shape) == 3 # only valid for 3D tensors
        return tuple([shape[0], shape[2]])

    def get_output(self, train=False):
        x = self.get_input(train)
        shape = list(self.input_shape)
        sums = x.sum(axis=1)
        c = T.neq(x,0).sum(axis=2)
        count = T.neq(c,0).sum(axis=1)
        t = [count] * shape[2]
        stacked = T.stack(*t).transpose()
        ave = sums / stacked
        return ave.astype('float32')

I added the layer to the graph followed by a Reshape in order to obtain a 3D tensor of shape (batch_size, 1, category_embedding_size).

graph.add_node(NonZeroAverage(), name='non_zero_average', input='category_embedding')
graph.add_node(Reshape((1, category_embedding_size)), name='category_vector', input='non_zero_average')

Thanks! I ended up customizing the Lambda layer to accept a mask and not emit one, so you can attach other layers (e.g. Dense) afterwards.

@sergeyf would you mind sharing your Lambda layer?

I am also looking at averaging the embedding vectors of a sequence so the output can be fed into a dense layer.

@ArdalanM See here for the custom layers: https://gist.github.com/sergeyf/a95de7d089668b41decad343ee30b89e

To use these layers, you can do something like the following:

main_input = Input(shape=(input_length,),dtype='int32')
m = Embedding(output_dim=dense_dim,
              input_dim=input_dim,
              input_length=input_length,
              mask_zero=True)(main_input)
m = MaskEatingLambda(lambda_mask_sum, output_shape=(dense_dim,))(m)
# insert whatever other layers you want here
model = Model(input=main_input, output=m)

For anyone who stumbles onto this post looking to deal with Embeddings, zeros, and masks, the following works in both Theano and TF.

My solution to this problem is as follows:

(1) Make a custom ZeroMaskedEntries layer that (a) zeros out all of the masked-out embedding rows and (b) swallows the mask so it doesn't pass on.

(2) Use a lambda function called mask_aware_mean that knows to ignore all-zero rows when taking the mean.

This is a little bit silly (inefficient) because I first get rid of the mask and then reconstruct it, but it avoids the whole MaskEatingLambda business. You can also use ZeroMaskedEntries in other places, and easily modify it to pass the mask on if need be.

Here is ZeroMaskedEntries:

import keras.backend as K
from keras.engine.topology import Layer

class ZeroMaskedEntries(Layer):
    """
    This layer is called after an Embedding layer.
    It zeros out all of the masked-out embeddings.
    It also swallows the mask without passing it on.
    You can change this to default pass-on behavior as follows:

    def compute_mask(self, x, mask=None):
        if not self.mask_zero:
            return None
        else:
            return K.not_equal(x, 0)
    """

    def __init__(self, **kwargs):
        self.supports_masking = True  # the attribute Keras checks is supports_masking
        super(ZeroMaskedEntries, self).__init__(**kwargs)

    def build(self, input_shape):
        self.output_dim = input_shape[1]
        self.repeat_dim = input_shape[2]

    def call(self, x, mask=None):
        # broadcast the (batch, timesteps) mask to (batch, timesteps, dim) and
        # multiply, zeroing out the embedding rows of masked timesteps
        mask = K.cast(mask, 'float32')
        mask = K.repeat(mask, self.repeat_dim)        # (batch, dim, timesteps)
        mask = K.permute_dimensions(mask, (0, 2, 1))  # (batch, timesteps, dim)
        return x * mask

    def compute_mask(self, input_shape, input_mask=None):
        return None

Below is a way to take the mean of what comes out of ZeroMaskedEntries. It does the silly business mentioned above of reconstructing the mask, but the computational hit is minor in my experience.

def mask_aware_mean(x):
    # recreate the masks - all zero rows have been masked
    mask = K.not_equal(K.sum(K.abs(x), axis=2, keepdims=True), 0)

    # number of rows that are not all zeros
    n = K.sum(K.cast(mask, 'float32'), axis=1, keepdims=False)

    # compute mask-aware mean of x
    x_mean = K.sum(x, axis=1, keepdims=False) / n

    return x_mean

def mask_aware_mean_output_shape(input_shape):
    shape = list(input_shape)
    assert len(shape) == 3 
    return (shape[0], shape[2])

And here is a test to make sure it all works:

import numpy as np
from keras.layers import Input, Embedding, Lambda
from keras.models import Model

output_dim = 2
input_dim = 25
input_length = 4
main_input = Input(shape=(input_length,), dtype='int32')
embed = Embedding(output_dim=output_dim, input_dim=input_dim, input_length=input_length, mask_zero=True)(main_input)
embed_zeroed = ZeroMaskedEntries()(embed)
lambda_mean = Lambda(mask_aware_mean, mask_aware_mean_output_shape)(embed_zeroed)

model = Model(input=main_input, output=lambda_mean)
model.compile(optimizer='rmsprop', loss='mse')

# test
test_input = np.array([[0, 0, 2, 0], [0, 0, 0, 1], [0, 0, 2, 1]])
test_output = model.predict(test_input)
print('Mean is working?', np.all(np.isclose(test_output[0:2,:].mean(0),test_output[2,:])))

@sergeyf Thank you, this is helpful!
The only caveat was that in case all entries in the sequence are masked, this generates NaNs.
The only thing I could think of was adding a dummy entry when I knew all entries would be masked. Is there a better solution?

@yotam-happy the mean of a zero-length vector is undefined, but for a quick and dirty fix you can do something like this (untested):

    n = K.sum(K.cast(mask, 'float32'), axis=1, keepdims=False)
    n = K.maximum(n, 1.0)

So now the mean of a zero-length vector will be zero. Not sure if this behavior will necessarily make sense downstream, so take care!
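Putting the two snippets together, here is a complete version of mask_aware_mean with that clamp; the name mask_aware_mean_safe is just my label for this sketch, and fully-masked samples come out as zero vectors.

import keras.backend as K

def mask_aware_mean_safe(x):
    # recreate the mask: timesteps whose embedding row is all zeros were masked out
    mask = K.not_equal(K.sum(K.abs(x), axis=2, keepdims=True), 0)

    # number of rows that are not all zeros, clamped to at least 1 so that a
    # fully-masked sample yields a zero vector instead of NaN
    n = K.sum(K.cast(mask, 'float32'), axis=1, keepdims=False)
    n = K.maximum(n, 1.0)

    return K.sum(x, axis=1, keepdims=False) / n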

Would this approach work downstream (i.e. after an LSTM layer with return_sequences=True)?

It seems that there's some related work:
https://github.com/fchollet/keras/issues/2728
https://github.com/fchollet/keras/pull/3678

@sergeyf What are your thoughts on which implementation should be preferred?

@PiranjaF The whole point here is to take the mean over a sequence while respecting the mask, so there is no time series left to feed into an LSTM. What are you trying to do?

I'm using multiple LSTMs, each with return_sequences=True, to capture patterns across time. But instead of using a final LSTM with return_sequences=False, I need to pool the output for various reasons.

So the means would be taken after the LSTM's, utilizing the mask that the LSTM operates on. Of course, after this mean-pooling the mask would disappear.

I think it should be no problem to just replace the simple Embedding layer in my example with whatever it is you're doing, as long as return_sequences=True:

main_input = Input(shape=(input_length,), dtype='int32')
embed = INSERT YOUR CRAZY LSTM HERE
embed_zeroed = ZeroMaskedEntries()(embed)
lambda_mean = Lambda(mask_aware_mean)(embed_zeroed)

Give it a try?
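For concreteness, a hypothetical version of that sketch with an actual LSTM (the layer sizes and optimizer are my assumptions), reusing the ZeroMaskedEntries, mask_aware_mean and mask_aware_mean_output_shape defined above:

from keras.layers import Input, Embedding, LSTM, Lambda
from keras.models import Model

input_length = 10
main_input = Input(shape=(input_length,), dtype='int32')
embed = Embedding(input_dim=50000, output_dim=100, input_length=input_length,
                  mask_zero=True)(main_input)
# return_sequences=True keeps one output per timestep, so there is still a
# sequence to pool over; the LSTM propagates the mask from the Embedding layer
seq = LSTM(64, return_sequences=True)(embed)
seq_zeroed = ZeroMaskedEntries()(seq)
mean = Lambda(mask_aware_mean, mask_aware_mean_output_shape)(seq_zeroed)

model = Model(input=main_input, output=mean)
model.compile(optimizer='rmsprop', loss='mse')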

@sergeyf I've tried implementing your code with a Dense layer afterwards:

main_input = Input(shape=(input_length,), dtype='int32')
embed = Embedding(input_dim=50000, output_dim=100,  mask_zero=True)(main_input)

embed_zeroed = ZeroMaskedEntries()(embed)
lambda_mean = Lambda(mask_aware_mean)(embed_zeroed)
hidden = Dense(50, activation='relu')(lambda_mean)

and I keep running into an AssertionError dealing with the size of the input to the Dense layer when I try to compile:

Traceback (most recent call last):
  File "/home/Experiments/Averaging_Network.py", line 110, in <module>
    hidden = Dense(50, activation='relu')(lambda_mean)
  File "/usr/local/lib/python3.4/dist-packages/keras/engine/topology.py", line 487, in __call__
    self.build(input_shapes[0])
  File "/usr/local/lib/python3.4/dist-packages/keras/layers/core.py", line 689, in build
    assert len(input_shape) == 2
AssertionError

Without the Dense layer, everything compiles fine and model.predict returns a (N, 100) matrix where N is the batch size. Do you have any advice?

@jbarnesspain A good question here - I'll have to update my example. Lambda wants an output shape, so you can do this:

def mask_aware_mean_output_shape(input_shape):
    shape = list(input_shape)
    assert len(shape) == 3 
    return (shape[0], shape[2])

main_input = Input(shape=(input_length,), dtype='int32')
embed = Embedding(input_dim=50000, output_dim=100,  mask_zero=True)(main_input)

embed_zeroed = ZeroMaskedEntries()(embed)
lambda_mean = Lambda(mask_aware_mean, mask_aware_mean_output_shape)(embed_zeroed)
hidden = Dense(50, activation='relu')(lambda_mean)

@sergeyf Thank you for sharing. Could you explain what the function lambda_mask_sum does?

Here is my code snippet; the point is to average the two embedded inputs and flatten them before the fully-connected layer. How can I use your example? Please help:
model = Sequential()
e = Embedding(len(word_index)+1, 100, weights=[embedding_matrix], input_length=4, trainable=True)
e1 = Embedding(len(word_index)+1, 100, weights=[embedding_matrix1], input_length=4, trainable=True)

Average code here!!!!

model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.summary()

@sergeyf
I use your MaskEatingLambda layer and add Dense layers after it. The network trains fine, but when I try to save the best model with callbacks in Keras, it throws an error at 'base_config = super(Lambda, self).get_config()' in your file maskeatinglambda.py: TypeError: super(type, obj): obj must be an instance or subtype of type.

How can I solve this? I can't find a way. Thanks.

It seems the following code may cause problems if n is zero:

embed_zeroed = ZeroMaskedEntries()(embed)
lambda_mean = Lambda(mask_aware_mean, mask_aware_mean_output_shape)(embed_zeroed)

where n is as defined in:

# number of rows that are not all zeros
n = K.sum(K.cast(mask, 'float32'), axis=1, keepdims=False)
# compute mask-aware mean of x
x_mean = K.sum(x, axis=1, keepdims=False) / n

I was coding a DAN (Deep Averaging Network) when I ran into this problem. The DAN model first randomly drops some words and then averages the embeddings; this can sometimes cause n = 0, which leads to loss: nan and accuracy: nan.

In case anyone runs into similar problems, I suggest the following fix in the code:

def mask_aware_mean(x):
    # recreate the masks - all zero rows have been masked
    mask = K.not_equal(K.sum(K.abs(x), axis=2, keepdims=True), 0)

    # number of rows that are not all zeros
    n = K.sum(K.cast(mask, 'float32'), axis=1, keepdims=False)
    # a Python `if` on K.equal(n, 0) cannot branch per sample at graph-build time,
    # so clamp n instead: fully-masked samples then yield a zero vector (their sum is zero)
    n = K.maximum(n, 1.0)

    # compute mask-aware mean of x
    x_mean = K.sum(x, axis=1, keepdims=False) / n
    return x_mean