In the fit_on_texts method of the Tokenizer class (keras.preprocessing.text.Tokenizer, line 209), there is a comment, shown on the fourth line of the snippet below:
wcounts = list(self.word_counts.items())
wcounts.sort(key=lambda x: x[1], reverse=True)
sorted_voc = [wc[0] for wc in wcounts]
# note that index 0 is reserved, never assigned to an existing word
self.word_index = dict(list(zip(sorted_voc, list(range(1, len(sorted_voc) + 1)))))
I am interested in what index 0 is reserved for. The only logical answer, also implied by the comment, would be the unknown-word token (given by the oov_token parameter), but this is not the case: the index of oov_token is 1 + the word count of the input texts. If this is somehow a mistake and the comment is legacy text that is out of date, then I suggest reserving index 0 for the oov_token.
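For reference, a minimal way to check this (where 'UNK' lands depends on the Keras version: older 2.x releases append it after all real words, newer ones assign it index 1, but index 0 never appears in word_index):

from keras.preprocessing.text import Tokenizer

tk = Tokenizer(oov_token='UNK')
tk.fit_on_texts(["my name is far faraway"])
print(tk.word_index)  # no word ever maps to index 0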
Because if you use pad_sequences to process the sequences, you will find that 0 is used as the padding value. In order to distinguish between PAD and UNKNOWN, Keras uses word_count + 1 as the index of UNKNOWN.
from keras.preprocessing.text import Tokenizer

num_words = 3
tk = Tokenizer(num_words=num_words+1, oov_token='UNK')
texts = ["my name is far faraway asdasd", "my name is","your name is"]
tk.fit_on_texts(texts)
# see #8092 below for why I do these two lines
tk.word_index = {e:i for e,i in tk.word_index.items() if i <= num_words} # <= because tokenizer is 1 indexed
tk.word_index[tk.oov_token] = num_words + 1
print(tk.word_index)
print(tk.texts_to_sequences(texts))
# output
{'name': 1, 'my': 3, 'is': 2, 'UNK': 4}
[[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
#8092
Then we use pad_sequences to pad the sentences to a fixed length. For example, we take 10 as the sequence length.
from keras.preprocessing.sequence import pad_sequences

sequences = tk.texts_to_sequences(texts)
print(sequences)
data = pad_sequences(sequences, maxlen=10, padding='post')
print(data)
#output
[[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
[[3 1 2 4 4 4 0 0 0 0]
[3 1 2 0 0 0 0 0 0 0]
[4 1 2 0 0 0 0 0 0 0]]
You can see that UNKNOWN and PAD are now clearly distinguishable.
BTW, if we use padding, we should set mask_zero=True in the Embedding layer. If we then use an LSTM, it will ignore the padded part. And for the vector of UNKNOWN, we cannot just use zeros, because UNKNOWN also carries some information. There are a few methods to try: average the vectors of many infrequent words, or use a random vector. You can find more info here
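A minimal sketch of those two points, continuing the example above (embedding_dim, the random matrix, and the infrequent-word vectors are placeholders, not values prescribed by Keras):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = num_words + 2   # 0 = PAD, 1..num_words = real words, num_words + 1 = UNK
embedding_dim = 50           # placeholder size

embedding_matrix = np.random.rand(vocab_size, embedding_dim)
embedding_matrix[0] = 0.0    # PAD row; masked anyway, kept at zero

# hypothetical pre-trained vectors for infrequent words; their mean is
# an informative starting point for UNK instead of all zeros
infrequent_vectors = np.random.rand(100, embedding_dim)
embedding_matrix[num_words + 1] = infrequent_vectors.mean(axis=0)

model = Sequential([
    Embedding(vocab_size, embedding_dim,
              weights=[embedding_matrix],
              mask_zero=True),   # index 0 is masked, so the LSTM skips padding
    LSTM(32),
    Dense(1, activation='sigmoid'),
])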
When they say PAD, did they mean sanitary napkin?