Keras: Masking is not working properly!

Created on 6 Apr 2016 · 14 comments · Source: keras-team/keras

@fchollet @wxs @EderSantana @amitbeka
I was trying to run a simple example that counts how many 1's appear in a sequence of 0's and 1's. I made the examples of a fixed length and everything worked fine. Then I wanted to use Masking to allow different lengths. But as a sanity check, I kept the length fixed as before and simply added a masking layer. So far the lengths are the same for both cases, and only a masking layer is added. Using the same random seed, I get different accuracies after the same number of epochs.

I looked into the output of the masking layer using a mask value of 12345, to avoid masking the 0's. It looked correct, since the mask was all 1's (nothing was masked out).

from keras import backend as K

# Function returning the mask computed by the first (Masking) layer
get_the_mask_output = K.function([model.layers[0].input],
                                 [model.layers[0].get_output_mask(train=None)])

layer_output = get_the_mask_output([X[:, :, :]])[0]

Shouldn't this case produce the same accuracy with and without masking, after the same number of epochs and with the same random seed? Just commenting out the masking layer takes me back to the original values, so something is clearly happening with Masking. I tried to look into the source code, but nothing stood out!

Below is the full code.

import numpy as np
np.random.seed(1337)  # for reproducibility

# Create dataset of lists of 1's and zeros
nb_of_samples = 5500
sequence_len = 10

# Create the sequences
X = np.zeros((nb_of_samples, sequence_len, 1))
for row_idx in range(nb_of_samples):
    X[row_idx, :, 0] = np.around(np.random.rand(sequence_len))

# Create the targets for each sequence: the number of 1's is the sum of the list
t = np.sum(X, axis=1)

# --------- Import Keras -----------
# Just to make sure
np.random.seed(1337)

from keras.models import Sequential
from keras.layers.core import Dense, Activation, TimeDistributedDense, Masking
from keras.layers.recurrent import LSTM, SimpleRNN, GRU
from keras.optimizers import rmsprop, sgd, Adagrad

# --------- Create the Model -----------
# Just to make sure
np.random.seed(1337)

in_out_neurons = 1
hidden_neurons = 10

model = Sequential()

# ------ Uncomment This -------
# model.add(Masking(mask_value=12345, input_shape=(X.shape[1], X.shape[2])))

model.add(SimpleRNN(hidden_neurons, input_dim=in_out_neurons, return_sequences=False, activation='linear'))
model.add(Dense(in_out_neurons, input_dim=hidden_neurons))
model.add(Activation("linear"))

_rmsprop = rmsprop(lr=0.0005, rho=0.9, epsilon=1e-06)
model.compile(loss="mean_squared_error", optimizer=_rmsprop)
model.fit(X, t, batch_size=256, nb_epoch=10, validation_split=0.1, show_accuracy=True)

I'm using the latest Theano dev version with cuDNN, running on a Titan X.


All 14 comments

I think this is just something fishy going on with random seeding. Does np.random.seed necessarily reset Theano's internal RNG seed?

If I create two separate models and force one of them to have exactly the same weights as the other before training, I get the same results. Here's a sample script:

import keras

import numpy as np
np.random.seed(1337)  # for reproducibility

# Create dataset of lists of 1's and zeros
nb_of_samples = 5500
sequence_len = 10

# Create the sequences
X = np.zeros((nb_of_samples, sequence_len, 1))
for row_idx in range(nb_of_samples):
    X[row_idx, :, 0] = np.around(np.random.rand(sequence_len))

# Create the targets for each sequence: the number of 1's is the sum of the list
t = np.sum(X, axis=1)

# --------- Import Keras -----------
# Just to make sure
np.random.seed(1337)

from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributedDense, Masking, LSTM, SimpleRNN, GRU
from keras.optimizers import rmsprop, sgd, Adagrad

# --------- Create the Model (1) -----------
# Just to make sure
np.random.seed(1337)

in_out_neurons = 1
hidden_neurons = 10

model1 = Sequential()

model1.add(Masking(mask_value=12345, input_shape=(X.shape[1], X.shape[2])))

model1.add(SimpleRNN(hidden_neurons, input_dim=in_out_neurons, return_sequences=False, activation='linear'))
model1.add(Dense(in_out_neurons))
model1.add(Activation("linear"))

_rmsprop = rmsprop(lr=0.0005, rho=0.9, epsilon=1e-06)
model1.compile(loss="mean_squared_error", optimizer=_rmsprop)

# --------- Create the Model (2) -----------
# Just to make sure
np.random.seed(1337)

model2 = Sequential()

model2.add(SimpleRNN(hidden_neurons, input_dim=in_out_neurons, return_sequences=False, activation='linear'))
model2.add(Dense(in_out_neurons))
model2.add(Activation("linear"))

_rmsprop = rmsprop(lr=0.0005, rho=0.9, epsilon=1e-06)
model2.compile(loss="mean_squared_error", optimizer=_rmsprop)

# --------- Ensure that they are identically initialized --------
# model1 has an extra Masking layer (with no weights) at index 0, hence the i+1 offset
for i in range(len(model2.layers)):
    model2.layers[i].set_weights(model1.layers[i + 1].get_weights())

model1.fit(X, t, batch_size=256, nb_epoch=10, validation_split=0.1, show_accuracy=True, shuffle=False)
model2.fit(X, t, batch_size=256, nb_epoch=10, validation_split=0.1, show_accuracy=True, shuffle=False)

If np.random.seed and the Theano RNG (I assume RandomStreams) are independent, then running the same example without masking many times shouldn't give me the same results, because of Theano's random initializations. However, I do get the same results, which is very strange!

I tried to set the Theano RNG to a fixed seed as follows:

from theano.tensor.shared_randomstreams import RandomStreams
from theano.sandbox.rng_mrg import MRG_RandomStreams

MRG_RandomStreams(seed=1337)
RandomStreams(seed=1337)

but I still get different results unless I do what you did. What is strange is that the Masking layer has no random initialization.

The other thing I can think of is that maybe my use of Masking is wrong. In input_shape I put (time_steps, feat_dim), which are the 2nd and 3rd dimensions of my training data. Can you confirm @wxs?

I believe that is correct for the input_shape to Masking. I did not write the Masking layer however, though I did write the general mask system. I think @amitbeka wrote Masking.

I agree that it seems fishy, given that there's nothing random in Masking that should alter the behaviour of the random streams. However, this makes it less likely that it's an issue with Masking per se, since again, the behaviour is _exactly_ what we'd expect once we set the same weights.

I agree that, since we can replicate the results by making sure the initializations are identical, this is less likely to be an issue. I'll leave the topic open in case anyone wants to comment or knows what is happening. Otherwise, feel free to close it if you see no need to keep it open.

Thanks!

@wxs Sorry to extend this post, but since it is open and my question is related, I think I can ask it here. I'm trying to build a system that completes sentences. My inputs are of variable length, and this can be handled by masking without issues. However, my outputs are also sentence chunks of variable length, and I don't want to truncate them. My time-series inputs are words encoded as one-hot vectors at each time step, and the output is encoded the same way and returned at each time step. How can I handle the variable-length output (i.e. a different number of output time steps)? If I pad the output as well, the loss values are crazy numbers. What do you suggest? Is it correct to pad the output?

You do this with sample_weight: set the weight to 0 for the output timesteps that should be masked (and make sure you pass sample_weight_mode="temporal" when you call compile).
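A rough sketch of what that can look like, assuming padded targets; the names X_padded, Y_padded, output_lengths and model are hypothetical placeholders, not from the original post:

import numpy as np

# Hypothetical shapes: Y_padded is (nb_samples, max_len, vocabulary_size),
# X_padded is the zero-padded input, and output_lengths[i] is the true
# number of timesteps for sample i.
nb_samples, max_len = Y_padded.shape[0], Y_padded.shape[1]

# One weight per (sample, timestep): 1 for real steps, 0 for padded steps
sample_weight = np.zeros((nb_samples, max_len))
for i, length in enumerate(output_lengths):
    sample_weight[i, :length] = 1.0

model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              sample_weight_mode='temporal')
model.fit(X_padded, Y_padded, sample_weight=sample_weight,
          batch_size=128, nb_epoch=10)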

Thank you. This saved me so much time!

One note: is it possible to have the Embedding layer take the mask output from a Masking layer used as the first layer? Embedding only uses the mask_zero flag. But what if zero is important to me and I want to mask with something other than zero?

instead of

    def compute_mask(self, x, mask=None):
        if not self.mask_zero:
            return None
        else:
            return K.not_equal(x, 0)

something like

    def compute_mask(self, x, mask=None):
        # desired: reuse the mask coming in from a preceding Masking layer
        if mask is not None:
            return mask
        return None

I had at one point advocated for a mask_value parameter instead, but we decided mask_zero was simpler.

The Embedding layer does not respect an input mask, although it would be trivial to make this change in its compute_mask method. You could just subclass Embedding and override that method. You might be able to get that accepted as a PR as well; I think there's a decent case for it. It would look something like this:

    def compute_mask(self, x, mask=None):
        if not self.mask_zero:
            return mask
        else:
            if mask is None:
                return K.not_equal(x, 0)
            else:
                raise Exception("It wouldn't be too hard to combine the mask_zero with the input mask, but isn't supported here")

Perhaps the simplest option, though: just don't make zero important to you. Increment all your values by one or something, along the lines of the sketch below.
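A minimal sketch of that idea (assuming integer-encoded sequences; the name sequences is a hypothetical placeholder):

import numpy as np

# Hypothetical: sequences is a list of lists of integer token ids starting at 0.
# Shift everything up by one so that index 0 is free for padding/masking.
shifted = [[token + 1 for token in seq] for seq in sequences]

# Pad with zeros and rely on Embedding(..., mask_zero=True) downstream.
# Remember to use input_dim = vocabulary_size + 1 in the Embedding layer.
max_len = max(len(seq) for seq in shifted)
X = np.zeros((len(shifted), max_len), dtype='int32')
for i, seq in enumerate(shifted):
    X[i, :len(seq)] = seq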

I found an interesting behaviour with masking. Suppose I have the following code, which is meant to predict at each time step.

from keras.models import Sequential
from keras.layers import Masking, LSTM, TimeDistributedDense, Activation
from keras.optimizers import SGD

model = Sequential()
model.add(Masking(mask_value=0., input_shape=(26, vocabulary_size)))
model.add(LSTM(512, input_dim=vocabulary_size, return_sequences=True))
model.add(LSTM(512, return_sequences=True))
model.add(TimeDistributedDense(vocabulary_size, activation='relu'))
model.add(TimeDistributedDense(vocabulary_size))
model.add(Activation('softmax'))
sgd = SGD(lr=0.001, decay=1e-6, momentum=0.9, nesterov=True, clipvalue=10)
model.compile(loss='categorical_crossentropy', optimizer=sgd, sample_weight_mode="temporal")

Now suppose I want to test by predicting a single time step at a time, for example feeding the input at time step zero as a single tensor of shape (1, 1, N) and getting an output.

out = model.predict(Train_Data[:1,:1,:])

Then I use this output to generate the next time step's output, and so on. However, the network raises an error:

GpuReshape: cannot reshape input of shape (1, 512) to shape (0, 26, 512).
Apply node that caused the error: GpuReshape{3}(GpuElemwise{Add}[(0, 0)].0, TensorConstant{[ -1  26 512]})

This comes from the masking layer expecting 26 time steps. I can create another model without masking and load the weights into it whenever I want to generate text, but that is a very ugly workaround. Is this how it is meant to be?

Is this in TensorFlow or Theano? You should be able to pass None instead of 26 to your Masking layer and have it figure it out, if you're in Theano.

Of course in either case you could also just pass _in_ a 26-step-long input that's mostly masked, something like this:

input = np.zeros((1, 26, N))
input[0, -1, :] = Train_Data[0, 0, :]
out = model.predict(input)
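For the first suggestion, a minimal sketch of what a variable-length version of the model above might look like (not verified on this exact Keras/Theano version; vocabulary_size and Train_Data stand in for the names from the question):

from keras.models import Sequential
from keras.layers import Masking, LSTM, TimeDistributedDense, Activation

vocabulary_size = 1000  # hypothetical

model = Sequential()
# None as the number of timesteps lets the model accept sequences of any length
model.add(Masking(mask_value=0., input_shape=(None, vocabulary_size)))
model.add(LSTM(512, return_sequences=True))
model.add(TimeDistributedDense(vocabulary_size))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Predicting on a single timestep of shape (1, 1, vocabulary_size):
# out = model.predict(Train_Data[:1, :1, :])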

I'm working with Theano. I honestly didn't know that I could replace the time steps with None. I'll try that!

I used the other method you suggested, but I prefer the first one if it works for me :)

Hi @myhussien @wxs, do you mind giving an example of how to use sample_weight?
I have a similar problem to @myhussien: my inputs are variable-length sequences, and for each sequence the model returns an output at each time step (the output has the same number of timesteps as the input).

I read the docs but I'm still confused: how do I set sample_weight in model.fit()? (I will pad the variable-length sequences with zeros and use a Masking layer (masking with zeros) before the LSTMs.)

@pxlong This might help: https://github.com/EderSantana/seya/blob/master/examples/NTM.ipynb, by @EderSantana, from #957

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
