Keras: TimeDistributed layer does not correctly pass on mask

Created on 21 Dec 2016 · 9 Comments · Source: keras-team/keras

A well-known use case for the TimeDistributed layer is building a hierarchical LSTM. For instance, one can first run an LSTM over the words in a sentence and then an LSTM over the sentences. Similarly, this can be done with an LSTM over the characters in words and then over the words, and so on.

This Keras example shows this functionality nicely. However, it is not obvious that the mask is not passed on correctly when using the TimeDistributed layer. This is critical because sentences often do not have the same number of words (nor words the same number of characters).

To illustrate the issue, I've modified the MNIST Hierarchical RNN example by removing the right half of every image and adding a masking layer (see below). Now add if mask is None: raise ValueError() to https://github.com/fchollet/keras/blob/master/keras/layers/recurrent.py#L198 and you'll see that the mask is not passed on.

This happens without any warning whatsoever, leaving the user unaware of the behavior. How can we modify the TimeDistributed wrapper so that it correctly passes the mask on to the lower level?

from __future__ import print_function

from keras.datasets import mnist
from keras.models import Sequential, Model
from keras.layers import Input, Dense, TimeDistributed, Masking
from keras.layers import LSTM
from keras.utils import np_utils

# Training parameters.
batch_size = 32
nb_classes = 10
nb_epochs = 5

# Embedding dimensions.
row_hidden = 128
col_hidden = 128

# The data, shuffled and split between train and test sets.
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Reshapes data to 4D for Hierarchical RNN.
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# ADDED: Remove pixel values for right half of the image
# This is similar to the use case of running an LSTM over
# multiple sentences, where each sentence has some masking.
X_train[:,:,14:] = 0
X_test[:,:,14:] = 0

# Converts class vectors to binary class matrices.
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

row, col, pixel = X_train.shape[1:]

# 4D input.
x = Input(shape=(row, col, pixel))
#x = Input(batch_shape=(batch_size, row, col, pixel))

# ADDED: Masking layer to take into account that right
# half of image is removed.
x_masked = TimeDistributed(Masking())(x)

# Encodes a row of pixels using TimeDistributed Wrapper.
encoded_rows = TimeDistributed(LSTM(output_dim=row_hidden))(x_masked)

# Encodes columns of encoded rows.
encoded_columns = LSTM(col_hidden)(encoded_rows)

# Final predictions and model.
prediction = Dense(nb_classes, activation='softmax')(encoded_columns)
model = Model(input=x, output=prediction)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Training.
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epochs,
          verbose=1, validation_data=(X_test, Y_test))

# Evaluation.
scores = model.evaluate(X_test, Y_test, verbose=0)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])
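
For reference, here is a minimal plain-Python sketch (no Keras required) of the mask that a Masking layer is expected to produce for one row of the truncated images above: a timestep is kept when at least one of its features differs from the mask value. The helper name `compute_mask_for_row` and the toy 6-pixel row are illustrative assumptions, not part of Keras or of the script above.

```python
# Sketch of Masking semantics for a single (timesteps, features) sequence.
# `compute_mask_for_row` is a hypothetical helper, not a Keras function.

def compute_mask_for_row(sequence, mask_value=0.0):
    """Return one boolean per timestep: True if any feature != mask_value."""
    return [any(value != mask_value for value in step) for step in sequence]

# A toy image row: 6 pixels with 1 feature each, right half set to zero.
row = [[0.3], [0.0], [0.9], [0.0], [0.0], [0.0]]
row_mask = compute_mask_for_row(row)
print(row_mask)  # [True, False, True, False, False, False]
```

Note that the second pixel is masked as well: with a mask value of 0, pixels that are genuinely black in the left half are indistinguishable from the removed right half. The inner LSTM would need to receive this per-row mask for the masking to have any effect.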

tensorflow

bfelbo

👍14

All 9 comments

+1 -- a fix for this would be great, particularly for text modeling.

At the moment, TimeDistributed(Embedding(..., mask_zero=True)) drops the mask.

However, it seems like you can get around it with something like:

avg_emb = Sequential()
avg_emb.add(Embedding(..., mask_zero=True))
avg_emb.add(GlobalAveragePooling1D())

x = Input(...)
emb = TimeDistributed(avg_emb)(x)
model = Model(input=x, output=emb)
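
As a point of comparison, the quantity a mask-aware average over embeddings would compute can be sketched in plain Python. This assumes padding uses token id 0 (as with mask_zero=True); the helper `masked_average` and the toy vectors are hypothetical. Whether the stacked model above actually applies the mask inside the pooling layer depends on the Keras version and is not confirmed in this thread.

```python
# Sketch: average embedding vectors while ignoring padded positions.
# `masked_average` is a hypothetical helper, not a Keras function.

def masked_average(vectors, token_ids, pad_id=0):
    """Average only the vectors whose token id is not the padding id."""
    kept = [vec for vec, tok in zip(vectors, token_ids) if tok != pad_id]
    if not kept:
        # Fully padded sequence: fall back to a zero vector.
        return [0.0] * len(vectors[0])
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

token_ids = [5, 7, 0, 0]                      # last two positions are padding
vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]

masked = masked_average(vectors, token_ids)
unmasked = [sum(col) / len(vectors) for col in zip(*vectors)]
print(masked)    # [2.0, 3.0]
print(unmasked)  # [5.5, 6.0]
```

If the two results differ on padded input, as they do here, the padding positions are leaking into the average.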

bkj on 28 Dec 2016

👍1

This just bit me. I wonder if it's worth making a pull request that does something like what @pifelbo suggests, i.e. adding if mask is None: raise ValueError(), or else issues a warning telling the user that masking isn't yet supported inside a TimeDistributed layer.

JohnHBrock on 26 Jan 2017

I think this issue has been solved in Keras version 2. I've used this code:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.layers.wrappers import TimeDistributed
from keras import backend as K
from keras.layers.core import Masking

train_data = np.random.rand(100, 20, 30, 50)
train_label = np.random.randint(2, size=(100, 20, 1))
test_data = np.random.rand(10, 20, 30, 50)

model = Sequential()
model.add(TimeDistributed(Masking(mask_value=0., input_shape=(30, 50)), input_shape=(20, 30, 50)))
model.add(TimeDistributed(LSTM(128, input_shape=(30, 50))))
model.add(Dense(1))

model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['accuracy'])
model.fit(train_data, train_label, batch_size=10, epochs=2, validation_split=0.2)

print(model.predict_classes(test_data).shape)

I tested it by adding if mask is None: raise ValueError() at https://github.com/fchollet/keras/blob/master/keras/layers/recurrent.py#L207 and no error was raised.

amirveyseh on 14 Apr 2017

@amirveyseh the problem happens when you use an Embedding layer

erickrf on 29 Jun 2017

👍1

I've added this method to the TimeDistributed class. When I use TimeDistributed with Embedding and set mask_zero=True, the mask is passed on to the subsequent LSTM layer:

def compute_mask(self, inputs, mask=None):
    return self.layer.compute_mask(inputs, mask)

Example model:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, TimeDistributed

# Random token ids in [1, 800); id 0 is reserved for padding.
x_train = np.random.randint(low=1, high=800, size=(100, 40, 30))
y_train = np.eye(3)[np.random.randint(3, size=100)]  # random one-hot labels
x_test = np.random.randint(low=1, high=800, size=(10, 40, 30))
y_test = np.eye(3)[np.random.randint(3, size=10)]  # random one-hot labels

# Zero out the last 10 tokens of every inner sequence to simulate padding.
x_train[:, :, 20:] = 0
x_test[:, :, 20:] = 0

model = Sequential()
model.add(TimeDistributed(Embedding(800, 20, input_length=30, mask_zero=True), input_shape=(40, 30)))
# Layer names avoid spaces, which some backends reject.
model.add(TimeDistributed(LSTM(50, input_shape=(30, 20), name="first_lstm")))
model.add(LSTM(60, input_shape=(40, 50), name="second_lstm"))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='RMSprop', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_split=0.2)
prediction = model.predict_classes(x_test)
print(prediction)

However, I'm not sure whether this is a proper solution to the problem.

amirveyseh on 14 Aug 2017

@bkj How does that manage to avoid the issue?

sanchom on 26 Sep 2017

Is this still an issue?

Harshini-Gadige on 2 Nov 2018

@bfelbo I've encountered the same issue. Did you find a workaround, or is it still not possible to implement a two-level hierarchical model?

hyoceansun on 7 Jul 2019

I solved this by simply creating a new layer that inherits from TimeDistributed and passes on the mask. (Note: for legacy reasons, I had to stay on Keras 2.2.4. I don't know whether this has been fixed since; I hope so.) This allows you to have Masking layers before this one.

class MaskedTimeDistributed(TimeDistributed):
    def __init__(self, layer, **kwargs):
        super(MaskedTimeDistributed, self).__init__(layer, **kwargs)
        # Set after the parent constructors so they cannot reset it.
        self.supports_masking = True

    def compute_mask(self, inputs, mask=None):
        # Pass the incoming mask through unchanged.
        return mask
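
One thing worth checking with a pass-through mask is its shape. If the incoming mask has one entry per inner timestep (for example, per word in each sentence), a recurrent layer placed after the wrapper would presumably expect one entry per outer timestep (per sentence). The plain-Python sketch below shows one possible reduction under that assumption: an outer step is kept if any of its inner steps is unmasked. The helper `reduce_to_outer_mask` is hypothetical and is not what the class above or Keras itself is confirmed to do.

```python
# Sketch: collapse a per-inner-step mask into a per-outer-step mask.
# `reduce_to_outer_mask` is a hypothetical helper, not a Keras function.

def reduce_to_outer_mask(inner_mask):
    """inner_mask: list of (inner_steps,) boolean lists, one per outer step."""
    return [any(step_mask) for step_mask in inner_mask]

# Three "sentences" of three "words" each; the last sentence is all padding.
inner_mask = [
    [True, True, False],
    [True, False, False],
    [False, False, False],
]
outer_mask = reduce_to_outer_mask(inner_mask)
print(outer_mask)  # [True, True, False]
```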