I've been following some related threads, such as #395, #2654, and #2403, but still can't work out how to get this running. The Keras API docs are quite dated, so they aren't much help for this issue.
So I want to use pretrained word2vec word representations + a Keras LSTM to do POS tagging.
My first question is: is there a better way to feed in the pretrained vector representations than the embedding_weights method mentioned in #853?
Say we embed using the method mentioned in #853 and get a (vocab_size+2) by embed_size embedding matrix. We also pad the variable-length sentences. Then we have
X_pad.shape = (M, N)
y_pad.shape = (M, N)
where M is the number of sentences in the corpus (in my case 18421) and N is the padded sentence length (the originals vary from 15 to 140, so here N=140).
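For reference, here is a minimal sketch of how I build these arrays; it assumes w2v is a dict-like lookup from word to its embed_size-dimensional word2vec vector and word_to_index maps words to indices 1..vocab_size-1, with 0 reserved for padding (both names are just placeholders for my own preprocessing):
import numpy as np
from keras.preprocessing.sequence import pad_sequences
# row i holds the word2vec vector for the word with index i;
# row 0 stays all zeros because index 0 is reserved for padding/masking
embedding_matrix = np.zeros((vocab_size, embed_size))
for word, idx in word_to_index.items():
    if word in w2v:
        embedding_matrix[idx] = w2v[word]
# X: list of sentences as lists of word indices
# y: list of sentences as lists of POS-tag indices
X_pad = pad_sequences(X, maxlen=N)  # shape (M, N)
y_pad = pad_sequences(y, maxlen=N)  # shape (M, N)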
Here is how I initialized the model
model = Sequential()
# first embedding layer
model.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=N, mask_zero=True, weights=[embedding_matrix]))
# hidden layer
model.add(LSTM(output_dim=hidden_dim, return_sequences=True))
# output layer
model.add(TimeDistributed(Dense(num_class, activation='softmax')))
# compile
model.compile(loss='categorical_crossentropy', optimizer='adam')
When I run model.fit(X_pad, y_pad), I get this error:
Exception: Error when checking model target: expected timedistributed_1 to have 3 dimensions, but got array with shape (18421, 140)
I've been stuck here for a while. Any suggestions are appreciated!
I ran across this problem as well. I'm still not sure why this happens or whether it is the desired behaviour, but I did manage to get around it by putting each output value in its own array, i.e.:
X = [[1, 2]]
X_padded = keras.preprocessing.sequence.pad_sequences(X, dtype='float32', maxlen=3)  # shape (1, 3)
Y = [[[1], [2]]]  # each target value wrapped in its own list
Y_padded = keras.preprocessing.sequence.pad_sequences(Y, maxlen=3, dtype='float32')  # shape (1, 3, 1), i.e. 3-D
See also #3855, which is about a different variable-length sequence-to-sequence learning problem but also mentions this issue.
@dieuwkehupkes thanks for the hint! Turns out one-hot encoding is needed.
And for people with similar issues: you can solve the problem by creating a 3-D y_pad_one_hot and feeding it to the model above:
import numpy as np
from keras.utils.np_utils import to_categorical
# y_pad_one_hot.shape: (M, N, nb_classes)
y_pad_one_hot = np.array([to_categorical(sent_label, nb_classes=nb_classes) for sent_label in y_pad])
model.fit(X_pad, y_pad_one_hot)
Still need to find the best way to mask the padding, though.
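One option I'm considering for the padding (just a sketch, not something I've fully verified): compile with sample_weight_mode='temporal' and pass a per-timestep sample_weight that is 0 at padded positions, so those timesteps don't contribute to the loss:
# zero out the loss at padded timesteps via temporal sample weights
model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')
# pad_sequences pads with 0 and real word indices start at 1,
# so nonzero entries of X_pad mark the real (non-padded) timesteps
sample_weight = (X_pad != 0).astype('float32')  # shape (M, N)
model.fit(X_pad, y_pad_one_hot, sample_weight=sample_weight)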
@ShuaiW can you explain what values "nb_classes" and "num_class" should take? I encountered the same problem, please help!
@yangxiufengsia num_class/nb_classes is the number of output classes (for POS tagging, the number of tags in your tagset).
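For example, in my setup (a sketch, assuming tag indices start at 1 and index 0 is reserved for padding):
# number of distinct POS tags plus one extra slot for the padding index 0
nb_classes = max(max(tags) for tags in y) + 1
num_class = nb_classes  # same value used for the final Dense layer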
@ShuaiW If the output is a set of words, num_class becomes the vocab_size. Assuming I am expecting an output of 20 words, a one-hot encoded Y becomes [vocab_size, max_output_words]. Is this correct?