I'm trying to do an NLP task in which an LSTM layer takes as input the embeddings of a sequence of words. I have followed the instructions in #853 to set up the embedding layer, but how do I map a variable-length sequence of words to the input of the embedding layer? Here is the code (which fails to map words to embeddings and has some errors):
import numpy as np
import pickle as p  # assumed: 'p' is the pickle module
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

# src is a dict (fin is an already-open file handle, not shown here)
src = p.load(fin)
# word -> idx
src_word2idx = src['word2idx']
# idx -> embeddings
src_embeddings = src['embeddings']
# dimensionality of word embeddings
vocab_dim = 100
# adding 1 to account for the 0th index (reserved for masking)
n_symbols = len(src_word2idx) + 1
vocab_size = len(src_embeddings)
# prepend an all-zero row so index 0 maps to the padding/mask vector
newrow = np.zeros((1, vocab_dim))
idx2embeddings = np.concatenate((newrow, src_embeddings))
print('idx2embeddings = %d' % len(idx2embeddings))

model = Sequential()
# note: word indices are shifted by one relative to the original embedding matrix
model.add(Embedding(input_dim=vocab_size + 1, output_dim=vocab_dim,
                    mask_zero=True, weights=[idx2embeddings]))
model.add(LSTM(100, return_sequences=False))
model.add(Dropout(0.3))
model.add(Dense(n_symbols, activation='softmax'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
When I run the above code, I get the following error:
Using TensorFlow backend.
idx2embeddings = 209393
Traceback (most recent call last):
File "lstm_bin.py", line 36, in <module>
model.add(LSTM(100, return_sequences = False))
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 146, in add
output_tensor = layer(self.outputs[0])
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 485, in __call__
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 543, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 148, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/usr/local/lib/python2.7/dist-packages/keras/layers/recurrent.py", line 219, in call
': ' + str(input_shape))
Exception: When using TensorFlow, you should define explicitly the number of timesteps of your sequences.
If your first layer is an Embedding, make sure to pass it an "input_length" argument. Otherwise, make sure the first layer has an "input_shape" or "batch_input_shape" argument, including the time axis. Found input shape at layer lstm_1: (None, None, 100)
I'm confused: since my input is of variable length, why do I have to define input_length and the number of timesteps?
With the TensorFlow backend you have to provide the input length explicitly. To work around this, you need to pad your sequences to a fixed length. See http://keras.io/preprocessing/sequence/
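For example, a minimal sketch of the padding step (the sequences and maxlen below are illustrative; in the real code they would be the word-index sequences built with src_word2idx, shifted by one so that index 0 stays reserved for masking):

from keras.preprocessing.sequence import pad_sequences

# three variable-length sequences of word indices (illustrative values)
sequences = [[4, 12, 7], [9, 3], [15, 2, 8, 6, 1]]

# pad with zeros on the left up to maxlen; 0 is the index reserved for masking
maxlen = 5
padded = pad_sequences(sequences, maxlen=maxlen, padding='pre', value=0)
print(padded.shape)  # (3, 5)
print(padded[0])     # [ 0  0  4 12  7]

The resulting maxlen is what you would pass as input_length to the Embedding layer.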
@linxihui So the padded entries are just placeholders that have no effect?
@zhaopku
The padding has no effect if you have a Masking layer before the input is passed to the RNN. In your case, where the Embedding is right before the RNN, you only need to turn on masking with Embedding(mask_zero=True), which you already did.
There are two types of padding, pre-padding and post-padding ('pre' and 'post' in pad_sequences); either works as expected with masking. In either case, you get the desired (hidden) states.
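Putting the two points together, here is a minimal sketch of the model from the question with masking turned on and an explicit input_length (the sizes and the embedding matrix below are placeholders standing in for the values computed in the original snippet):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

# placeholders for the values from the earlier snippet
vocab_dim = 100
vocab_size = 10                                          # len(src_embeddings) in the real code
n_symbols = vocab_size + 1
idx2embeddings = np.zeros((vocab_size + 1, vocab_dim))   # zero row 0 + pretrained rows
maxlen = 5                                               # length the sequences were padded to

model = Sequential()
model.add(Embedding(input_dim=vocab_size + 1, output_dim=vocab_dim,
                    input_length=maxlen,   # explicit number of timesteps
                    mask_zero=True,        # skip padded (index 0) timesteps
                    weights=[idx2embeddings]))
model.add(LSTM(100, return_sequences=False))
model.add(Dropout(0.3))
model.add(Dense(n_symbols, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

With mask_zero=True, the LSTM skips the timesteps whose input index is 0, so the padded positions do not influence the final hidden state.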
Is there any way to avoid padding (right or left) altogether,
like in this paper: https://cs.stanford.edu/~quocle/paragraph_vector.pdf ?
Is variable-length sentence embedding possible?
@linxihui What is the default value of input_length when it is not passed to Embedding()?