Hi,
I am currently working with the Tokenizer class and I have a question about how num_words is used. The documentation suggests that when fit_on_texts is run, the Tokenizer will only keep the num_words most common words. My dataset contains 10358 unique words. When I create a Tokenizer with num_words=1000 and then inspect word_index, it still has a length of 10358. Does the Tokenizer build an index of the top 1000 words and then append the rest when running fit_on_texts?
Thanks
The Tokenizer stores everything in word_index during fit_on_texts. Then, when calling the texts_to_sequences method, only the top num_words are considered.
In [1]: from keras.preprocessing.text import Tokenizer
In [2]: texts = ['a a a', 'b b', 'c']
In [3]: tokenizer = Tokenizer(num_words=2)
In [4]: tokenizer.fit_on_texts(texts)
In [5]: tokenizer.word_index
Out[5]: {'a': 1, 'b': 2, 'c': 3}
In [6]: tokenizer.texts_to_sequences(texts)
Out[6]: [[1, 1, 1], [], []]
There's actually an off-by-one error as you can see; the output should be [[1, 1, 1], [2, 2], []]. I am fixing it, but in the meantime you can set your num_words to be one more than you intended.
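For reference, a minimal sketch of that workaround, reusing the toy texts from the example above; requesting num_words=3 instead of 2 keeps both 'a' and 'b' in the output:

from keras.preprocessing.text import Tokenizer

texts = ['a a a', 'b b', 'c']

# Ask for one more than the two words actually wanted,
# since only indices strictly below num_words survive texts_to_sequences.
tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(texts)

print(tokenizer.texts_to_sequences(texts))
# Expected: [[1, 1, 1], [2, 2], []]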
Actually, this might be by design. The Embedding layer expects input_dim to be vocabulary size + 1.
Thanks for the response and sorry for taking so long to get back! Yes, I think the Tokenizer reserves 0 as an "out of scope" index when comparing words in a dataset, which makes it easier to use embeddings where you can explicitly state that the first embedding is for unknown words or characters.
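A minimal sketch of that convention, assuming the toy texts from the example above: the Tokenizer never assigns index 0 to a word, pad_sequences fills with 0, and the "+ 1" mentioned above is exactly the slot that index 0 occupies, so input_dim = num_words already covers indices 0 through num_words - 1 for num_words-limited sequences:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding

texts = ['a a a', 'b b', 'c']
num_words = 3

tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# pad_sequences pads with 0, the index the Tokenizer reserves and never assigns to a word
data = pad_sequences(sequences, maxlen=3)

# Indices in data range from 0 (padding) to num_words - 1,
# so input_dim = num_words covers them all
embedding = Embedding(input_dim=num_words, output_dim=8, input_length=3)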
Any chance the off-by-one design could be explicitly mentioned in the Keras documentation?
https://keras.io/preprocessing/text/#tokenizer
I really wish the documentation was a bit more explicit about this.
So, what exactly is the bugfix?
embedding_layer = Embedding(num_words + 1, ....)
is also throwing an out-of-bounds error for the last element.
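One way to track that down, as a sketch with hypothetical data rather than a definitive fix: the Embedding layer requires 0 <= index < input_dim, so sizing input_dim from the largest index actually present in the padded sequences (instead of from num_words) shows whether the sequences were really limited to num_words:

import numpy as np
from keras.layers import Embedding

# Hypothetical padded sequences standing in for your own data
data = np.array([[0, 1, 5], [0, 2, 3]])

# input_dim has to be at least max index + 1 to avoid an out-of-bounds lookup
max_index = int(data.max())
embedding_layer = Embedding(input_dim=max_index + 1, output_dim=8)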
Another try:
from keras.datasets import imdb

vocabulary = imdb.get_word_index()                            # word -> integer rank, starting at 1
vocabulary_inv = dict((v, k) for k, v in vocabulary.items())  # invert to integer -> word
vocabulary_inv[0] = "<PAD/>"                                  # reserve index 0 for padding
Apparently vocabulary.items() starts at index 1. So should one fill index 0 with a placeholder string to prevent an out-of-bounds error?
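As a hedged sketch of how that inverse dictionary is typically used with the IMDB loader, assuming the default offsets documented for imdb.load_data (start_char=1, oov_char=2, index_from=3), the reserved low indices get placeholder labels and real words are shifted by 3:

from keras.datasets import imdb

(x_train, y_train), _ = imdb.load_data(num_words=10000)

word_index = imdb.get_word_index()                          # word -> rank, starting at 1
vocabulary_inv = {v + 3: k for k, v in word_index.items()}  # shift by index_from
vocabulary_inv[0] = "<PAD/>"
vocabulary_inv[1] = "<START>"
vocabulary_inv[2] = "<OOV>"

# Reconstruct the first review as text
print(" ".join(vocabulary_inv.get(i, "<OOV>") for i in x_train[0]))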
I am new to text classification; could anyone explain the whole concept, please?
Is there any plan for when this will be fixed in the code?
@kleysonr
Never, if I understood right, because this is a feature rather than a bug.
The documentation does not even mention the methods of this class.
@ialihaider75 Here, this might help:
https://www.kdnuggets.com/2020/03/tensorflow-keras-tokenization-text-data-prep.html
Are the word indexes sorted by the most frequent words, e.g. a: 1, b: 2, c: 3 in this case?
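One can check this directly with the toy example from earlier in the thread; as a small sketch, word_counts holds the raw frequencies and word_index is built from them in descending order of count:

from keras.preprocessing.text import Tokenizer

texts = ['a a a', 'b b', 'c']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

print(tokenizer.word_counts)  # OrderedDict([('a', 3), ('b', 2), ('c', 1)])
print(tokenizer.word_index)   # {'a': 1, 'b': 2, 'c': 3}: most frequent word gets index 1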