I'm trying to do an NLP task in which an LSTM layer takes as input the embeddings of a sequence of words. I have followed the instructions in #853 to set up the embedding layer, but how do I map a variable-length sequence of words to the input of the embedding layer? Here is the code (which fails to map words to embeddings and has some errors):
import numpy as np
import pickle as p  # assumed: 'p' is the pickle module
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

# src is a dict (fin is an already-open file handle, not shown here)
src = p.load(fin)
# word -> idx
src_word2idx = src['word2idx']
# idx -> embeddings
src_embeddings = src['embeddings']
# dimensionality of word embeddings
vocab_dim = 100
# adding 1 to account for the 0th index (reserved for masking)
n_symbols = len(src_word2idx) + 1
vocab_size = len(src_embeddings)
# prepend an all-zero row so index 0 maps to the padding/mask vector
newrow = np.zeros((1, vocab_dim))
idx2embeddings = np.concatenate((newrow, src_embeddings))
print('idx2embeddings = %d' % len(idx2embeddings))

model = Sequential()
# note: word indices are shifted by one relative to the original embedding matrix
model.add(Embedding(input_dim=vocab_size + 1, output_dim=vocab_dim,
                    mask_zero=True, weights=[idx2embeddings]))
model.add(LSTM(100, return_sequences=False))
model.add(Dropout(0.3))
model.add(Dense(n_symbols, activation='softmax'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
When I run the above code, I get the following error:
Using TensorFlow backend.
idx2embeddings = 209393
Traceback (most recent call last):
File "lstm_bin.py", line 36, in <module>
model.add(LSTM(100, return_sequences = False))
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 146, in add
output_tensor = layer(self.outputs[0])
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 485, in __call__
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 543, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 148, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/usr/local/lib/python2.7/dist-packages/keras/layers/recurrent.py", line 219, in call
': ' + str(input_shape))
Exception: When using TensorFlow, you should define explicitly the number of timesteps of your sequences.
If your first layer is an Embedding, make sure to pass it an "input_length" argument. Otherwise, make sure the first layer has an "input_shape" or "batch_input_shape" argument, including the time axis. Found input shape at layer lstm_1: (None, None, 100)
I'm confused: since my input is of variable length, why do I have to define input_length and the number of timesteps?
With the TensorFlow backend you have to provide the input length explicitly. To work around this, you need to pad your sequences to a fixed length. See http://keras.io/preprocessing/sequence/
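For example, a minimal sketch of the padding step (the sequences and maxlen below are illustrative; in the real code they would be the word-index sequences built with src_word2idx, shifted by one so that index 0 stays reserved for masking):

from keras.preprocessing.sequence import pad_sequences

# three variable-length sequences of word indices (illustrative values)
sequences = [[4, 12, 7], [9, 3], [15, 2, 8, 6, 1]]

# pad with zeros on the left up to maxlen; 0 is the index reserved for masking
maxlen = 5
padded = pad_sequences(sequences, maxlen=maxlen, padding='pre', value=0)
print(padded.shape)  # (3, 5)
print(padded[0])     # [ 0  0  4 12  7]

The resulting maxlen is what you would pass as input_length to the Embedding layer.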
@linxihui So the padded entries are just placeholders that have no effect?
@zhaopku
The padding has no effect if you have a Masking layer before the input is passed to the RNN. In your case, where the Embedding is right before the RNN, you only need to turn on masking with Embedding(mask_zero=True), which you already did.
There are two types of padding, pre-padding and post-padding ('pre' and 'post' in pad_sequences); either works as expected with masking. In either case, you get the desired (hidden) states.
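Putting the two points together, here is a minimal sketch of the model from the question with masking turned on and an explicit input_length (the sizes and the embedding matrix below are placeholders standing in for the values computed in the original snippet):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

# placeholders for the values from the earlier snippet
vocab_dim = 100
vocab_size = 10                                          # len(src_embeddings) in the real code
n_symbols = vocab_size + 1
idx2embeddings = np.zeros((vocab_size + 1, vocab_dim))   # zero row 0 + pretrained rows
maxlen = 5                                               # length the sequences were padded to

model = Sequential()
model.add(Embedding(input_dim=vocab_size + 1, output_dim=vocab_dim,
                    input_length=maxlen,   # explicit number of timesteps
                    mask_zero=True,        # skip padded (index 0) timesteps
                    weights=[idx2embeddings]))
model.add(LSTM(100, return_sequences=False))
model.add(Dropout(0.3))
model.add(Dense(n_symbols, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

With mask_zero=True, the LSTM skips the timesteps whose input index is 0, so the padded positions do not influence the final hidden state.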
Is there any way to avoid padding (right or left) altogether,
like in this paper: https://cs.stanford.edu/~quocle/paragraph_vector.pdf ?
Is variable-length sentence embedding possible?
@linxihui What is the default value of input_length when it is not passed to Embedding()?