Hi,
I am currently working with the Tokenizer class and I have a question about how num_words is used. The documentation suggests that when fit_on_texts is run, the Tokenizer will only keep the num_words most common words. My dataset contains 10358 unique words. When I create a Tokenizer with num_words=1000 and then inspect word_index, it still has a length of 10358. Does the Tokenizer build an index of the top 1000 words and then append the rest when running fit_on_texts?
Thanks
The Tokenizer stores everything in word_index during fit_on_texts. Then, when calling the texts_to_sequences method, only the top num_words are considered.
In [1]: from keras.preprocessing.text import Tokenizer
In [2]: texts = ['a a a', 'b b', 'c']
In [3]: tokenizer = Tokenizer(num_words=2)
In [4]: tokenizer.fit_on_texts(texts)
In [5]: tokenizer.word_index
Out[5]: {'a': 1, 'b': 2, 'c': 3}
In [6]: tokenizer.texts_to_sequences(texts)
Out[6]: [[1, 1, 1], [], []]
There's actually an off-by-one error as you can see; the output should be [[1, 1, 1], [2, 2], []]. I am fixing it, but in the meantime you can set your num_words to be one more than you intended.
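For reference, a minimal sketch of that workaround, reusing the toy texts from the example above; requesting num_words=3 instead of 2 keeps both 'a' and 'b' in the output:

from keras.preprocessing.text import Tokenizer

texts = ['a a a', 'b b', 'c']

# Ask for one more than the two words actually wanted,
# since only indices strictly below num_words survive texts_to_sequences.
tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(texts)

print(tokenizer.texts_to_sequences(texts))
# Expected: [[1, 1, 1], [2, 2], []]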
Actually, this might be by design. The Embedding layer expects input_dim to be vocabulary size + 1.
Thanks for the response and sorry for taking so long to get back! Yes, I think the Tokenizer reserves 0 as an "out of scope" index when comparing words in a dataset, which makes it easier to use embeddings where you can explicitly state that the first embedding is for unknown words or characters.
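A minimal sketch of that convention, assuming the toy texts from the example above: the Tokenizer never assigns index 0 to a word, pad_sequences fills with 0, and the "+ 1" mentioned above is exactly the slot that index 0 occupies, so input_dim = num_words already covers indices 0 through num_words - 1 for num_words-limited sequences:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding

texts = ['a a a', 'b b', 'c']
num_words = 3

tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# pad_sequences pads with 0, the index the Tokenizer reserves and never assigns to a word
data = pad_sequences(sequences, maxlen=3)

# Indices in data range from 0 (padding) to num_words - 1,
# so input_dim = num_words covers them all
embedding = Embedding(input_dim=num_words, output_dim=8, input_length=3)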
Any chance the off-by-one design could be explicitly mentioned in the Keras documentation?
https://keras.io/preprocessing/text/#tokenizer
I really wish the documentation was a bit more explicit about this.
So, what exactly is the bugfix?
embedding_layer = Embedding(num_words + 1, ....)
is also throwing an out-of-bounds error for the last element.
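One way to track that down, as a sketch with hypothetical data rather than a definitive fix: the Embedding layer requires 0 <= index < input_dim, so sizing input_dim from the largest index actually present in the padded sequences (instead of from num_words) shows whether the sequences were really limited to num_words:

import numpy as np
from keras.layers import Embedding

# Hypothetical padded sequences standing in for your own data
data = np.array([[0, 1, 5], [0, 2, 3]])

# input_dim has to be at least max index + 1 to avoid an out-of-bounds lookup
max_index = int(data.max())
embedding_layer = Embedding(input_dim=max_index + 1, output_dim=8)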
Another try:
from keras.datasets import imdb

vocabulary = imdb.get_word_index()                            # word -> integer rank, starting at 1
vocabulary_inv = dict((v, k) for k, v in vocabulary.items())  # invert to integer -> word
vocabulary_inv[0] = "<PAD/>"                                  # reserve index 0 for padding
Apparently vocabulary.items() starts at index 1. So should one fill index 0 with a placeholder string to prevent an out-of-bounds error?
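As a hedged sketch of how that inverse dictionary is typically used with the IMDB loader, assuming the default offsets documented for imdb.load_data (start_char=1, oov_char=2, index_from=3), the reserved low indices get placeholder labels and real words are shifted by 3:

from keras.datasets import imdb

(x_train, y_train), _ = imdb.load_data(num_words=10000)

word_index = imdb.get_word_index()                          # word -> rank, starting at 1
vocabulary_inv = {v + 3: k for k, v in word_index.items()}  # shift by index_from
vocabulary_inv[0] = "<PAD/>"
vocabulary_inv[1] = "<START>"
vocabulary_inv[2] = "<OOV>"

# Reconstruct the first review as text
print(" ".join(vocabulary_inv.get(i, "<OOV>") for i in x_train[0]))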
I am new to text classification; could anyone explain the whole concept, please?
Is there any plan for when this will be fixed in the code?
@kleysonr
Never, if I understood right, because this is a feature rather than a bug.
The documentation does not even mention the methods of this class.
@ialihaider75 Here, this might help:
https://www.kdnuggets.com/2020/03/tensorflow-keras-tokenization-text-data-prep.html
Are the word indexes sorted by the most frequent words, e.g. a: 1, b: 2, c: 3 in this case?
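One can check this directly with the toy example from earlier in the thread; as a small sketch, word_counts holds the raw frequencies and word_index is built from them in descending order of count:

from keras.preprocessing.text import Tokenizer

texts = ['a a a', 'b b', 'c']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

print(tokenizer.word_counts)  # OrderedDict([('a', 3), ('b', 2), ('c', 1)])
print(tokenizer.word_index)   # {'a': 1, 'b': 2, 'c': 3}: most frequent word gets index 1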