In the fit_on_texts method of the Tokenizer class (keras.preprocessing.text.Tokenizer, line 209), there is a comment, shown on the fourth line of the snippet below:
wcounts = list(self.word_counts.items())
wcounts.sort(key=lambda x: x[1], reverse=True)
sorted_voc = [wc[0] for wc in wcounts]
# note that index 0 is reserved, never assigned to an existing word
self.word_index = dict(list(zip(sorted_voc, list(range(1, len(sorted_voc) + 1)))))
I am interested in what index 0 is reserved for. The only logical answer, also implied by the comment, would be the unknown-word token (given by the oov_token parameter), but this is not the case: the index of oov_token is 1 + the word count of the input texts. If this is somehow a mistake and the comment is legacy text that is out of date, then I suggest reserving index 0 for the oov_token.
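For reference, a minimal way to check this (where 'UNK' lands depends on the Keras version: older 2.x releases append it after all real words, newer ones assign it index 1, but index 0 never appears in word_index):

from keras.preprocessing.text import Tokenizer

tk = Tokenizer(oov_token='UNK')
tk.fit_on_texts(["my name is far faraway"])
print(tk.word_index)  # no word ever maps to index 0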
Because if you use pad_sequences to process the sequences, you will find that 0 is used as the padding value. In order to distinguish between PAD and UNKNOWN, Keras uses word_count + 1 as the index of UNKNOWN.
from keras.preprocessing.text import Tokenizer

num_words = 3
tk = Tokenizer(num_words=num_words+1, oov_token='UNK')
texts = ["my name is far faraway asdasd", "my name is","your name is"]
tk.fit_on_texts(texts)
# see #8092 below for why I do these two lines
tk.word_index = {e:i for e,i in tk.word_index.items() if i <= num_words} # <= because tokenizer is 1 indexed
tk.word_index[tk.oov_token] = num_words + 1
print(tk.word_index)
print(tk.texts_to_sequences(texts))
# output
{'name': 1, 'my': 3, 'is': 2, 'UNK': 4}
[[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
#8092
Then we use pad_sequences to pad the sentences to a fixed length. For example, we take 10 as the sequence length.
from keras.preprocessing.sequence import pad_sequences

sequences = tk.texts_to_sequences(texts)
print(sequences)
data = pad_sequences(sequences, maxlen=10, padding='post')
print(data)
#output
[[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
[[3 1 2 4 4 4 0 0 0 0]
[3 1 2 0 0 0 0 0 0 0]
[4 1 2 0 0 0 0 0 0 0]]
You can see that UNKNOWN and PAD are now clearly distinguishable.
BTW, if we use padding, we should set mask_zero=True in the Embedding layer. If we then use an LSTM, it will ignore the padded part. And for the vector of UNKNOWN, we cannot just use zeros, because UNKNOWN also carries some information. There are a few methods to try: average the vectors of many infrequent words, or use a random vector. You can find more info here
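A minimal sketch of those two points, continuing the example above (embedding_dim, the random matrix, and the infrequent-word vectors are placeholders, not values prescribed by Keras):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = num_words + 2   # 0 = PAD, 1..num_words = real words, num_words + 1 = UNK
embedding_dim = 50           # placeholder size

embedding_matrix = np.random.rand(vocab_size, embedding_dim)
embedding_matrix[0] = 0.0    # PAD row; masked anyway, kept at zero

# hypothetical pre-trained vectors for infrequent words; their mean is
# an informative starting point for UNK instead of all zeros
infrequent_vectors = np.random.rand(100, embedding_dim)
embedding_matrix[num_words + 1] = infrequent_vectors.mean(axis=0)

model = Sequential([
    Embedding(vocab_size, embedding_dim,
              weights=[embedding_matrix],
              mask_zero=True),   # index 0 is masked, so the LSTM skips padding
    LSTM(32),
    Dense(1, activation='sigmoid'),
])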
When they say PAD, did they mean sanitary napkin?