Keras: How does Tokenizer work?

Created on 7 Aug 2017 · 13 Comments · Source: keras-team/keras

Hi,

I am currently working with the Tokenizer class and I have a question about what num_words actually does. The documentation suggests that when .fit_on_texts is run, the Tokenizer only keeps the num_words most common words. My dataset contains 10358 unique words, yet when I run the Tokenizer with num_words = 1000 and then look at the word index, it has a length of 10358. Does the Tokenizer build an index of the top 1000 words and then append all the others when fit_on_texts runs?

Thanks

stale


All 13 comments

The Tokenizer stores everything in the word_index during fit_on_texts. Then, when calling the texts_to_sequences method, only the top num_words are considered.

In [1]: from keras.preprocessing.text import Tokenizer
In [2]: texts = ['a a a', 'b b', 'c']
In [3]: tokenizer = Tokenizer(num_words=2)
In [4]: tokenizer.fit_on_texts(texts)
In [5]: tokenizer.word_index
Out[5]: {'a': 1, 'b': 2, 'c': 3}
In [6]: tokenizer.texts_to_sequences(texts)
Out[6]: [[1, 1, 1], [], []]

There's actually an off-by-one error as you can see; the output should be [[1, 1, 1], [2, 2], []]. I am fixing, but in the meantime you can set your num_words to be one more than you intended.
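A minimal sketch of that workaround, reusing the toy texts above and assuming the behavior shown in the transcript:

In [7]: tokenizer = Tokenizer(num_words=3)  # one more than the two words actually wanted
In [8]: tokenizer.fit_on_texts(texts)
In [9]: tokenizer.texts_to_sequences(texts)
Out[9]: [[1, 1, 1], [2, 2], []]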

Actually, this might be by design. The Embedding layer expects input_dim to be vocabulary size + 1.
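A minimal sketch of that sizing, assuming the toy tokenizer from the example above and that the full word_index is fed to the layer; the + 1 covers the reserved index 0:

    from keras.layers import Embedding

    # word indices start at 1, so add 1 for the reserved index 0
    vocab_size = len(tokenizer.word_index) + 1
    embedding = Embedding(input_dim=vocab_size, output_dim=8)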

Thanks for the response, and sorry for taking so long to get back! Yes, I think the Tokenizer reserves 0 as an "out of scope" index when comparing words in a dataset. That makes it easier to work with embeddings, since you can explicitly state that the first embedding is for unknown words or characters.
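For instance, pad_sequences fills with 0 by default, which is exactly the index the Tokenizer never assigns to a word (a sketch reusing the toy tokenizer above):

    from keras.preprocessing.sequence import pad_sequences

    # 0 appears only as padding; no real word ever receives index 0
    padded = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=3)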

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

Any chance the off-by-one design could be explicitly mentioned in the Keras documentation?
https://keras.io/preprocessing/text/#tokenizer

I really wish the documentation was a bit more explicit about this.

So, what exactly is the bugfix?

embedding_layer = Embedding(num_words + 1, ....) is also throwing an out-of-bounds error for the last element.

Another try:

        from keras.datasets import imdb

        # word -> index mapping; indices start at 1, so 0 stays free for padding
        vocabulary = imdb.get_word_index()
        vocabulary_inv = dict((v, k) for k, v in vocabulary.items())  # index -> word
        vocabulary_inv[0] = "<PAD/>"

Apparently vocabulary.items() starts at index 1. So should one fill index 0 with a random string to prevent an out-of-bounds error?
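For what it's worth, a hedged sketch of a sizing that avoids the out-of-bounds error: the Embedding only needs input_dim to be strictly larger than the largest index that actually appears in the data, and index 0 needs no string behind it at all, since it is never produced for a real word (the sequences variable below is assumed):

    from keras.layers import Embedding

    # `sequences` (lists of word indices) is assumed to exist already;
    # size the layer by the largest index actually present, plus 1
    max_index = max(max(seq) for seq in sequences if seq)
    embedding_layer = Embedding(input_dim=max_index + 1, output_dim=128)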

I am new to text classification. Could anyone explain the whole concept, please?

There's actually an off-by-one error as you can see; the output should be [[1, 1, 1], [2, 2], []]. I am fixing, but in the meantime you can set your num_words to be one more than you intended.

Is there any plan for when this will be fixed in the code?

@kleysonr

There's actually an off-by-one error as you can see; the output should be [[1, 1, 1], [2, 2], []]. I am fixing, but in the meantime you can set your num_words to be one more than you intended.

Is there any plan for when this will be fixed in the code?

Never, if I understood right, because this is a feature rather than a bug.

I really wish the documentation was a bit more explicit about this.

The documentation does not even mention the methods of this class.

I am new to text classification. Could anyone explain the whole concept, please?
@ialihaider75 Here, this might help:
https://www.kdnuggets.com/2020/03/tensorflow-keras-tokenization-text-data-prep.html

The Tokenizer stores everything in the word_index during fit_on_texts. Then, when calling the texts_to_sequences method, only the top num_words are considered.

In [1]: from keras.preprocessing.text import Tokenizer
In [2]: texts = ['a a a', 'b b', 'c']
In [3]: tokenizer = Tokenizer(num_words=2)
In [4]: tokenizer.fit_on_texts(texts)
In [5]: tokenizer.word_index
Out[5]: {'a': 1, 'b': 2, 'c': 3}
In [6]: tokenizer.texts_to_sequences(texts)
Out[6]: [[1, 1, 1], [], []]

There's actually an off-by-one error as you can see; the output should be [[1, 1, 1], [2, 2], []]. I am fixing, but in the meantime you can set your num_words to be one more than you intended.

Are the word indexes sorted by the most frequent words, e.g. a: 1, b: 2, c: 3 in this case?
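The transcript above already suggests they are (a: 1 with count 3, b: 2 with count 2, c: 3 with count 1); a small sketch to check it on the fitted toy tokenizer, using word_counts, the frequency dictionary that fit_on_texts builds:

    # word_counts holds raw frequencies; word_index assigns 1, 2, 3, ... in
    # descending frequency order
    print(tokenizer.word_counts)  # OrderedDict([('a', 3), ('b', 2), ('c', 1)])
    print(tokenizer.word_index)   # {'a': 1, 'b': 2, 'c': 3}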
