Hi, I am using Gensim to load fastText trained model using the code below:
from gensim.models.wrappers import FastText
self.word_model = FastText.load_fasttext_format(EMBEDDINGS_MODEL_PATH)
......
print("Embedding dictionary words")
for word in memn2n.general_config.dictionary.keys():
memn2n.dict_vectors[word] = memn2n.word_model[word]
When trying to assign vector to each word in the dictionary, I got the error below:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/ubuntu/BI_AI/MemN2N-tableQA/demo/qa.py", line 453, in <module>
run_web_demo(args.data_dir, args.model_file)
File "/home/ubuntu/BI_AI/MemN2N-tableQA/demo/qa.py", line 425, in run_web_demo
webapp.init(data_dir, model_file)
File "demo/web/webapp.py", line 28, in init
memn2n.dict_vectors[word] = memn2n.word_model[word]
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 1281, in __getitem__
return self.wv.__getitem__(words)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 589, in __getitem__
return self.word_vec(words)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 94, in word_vec
raise KeyError('all ngrams for word %s absent from model' % word)
KeyError: 'all ngrams for word 41010 absent from model'
I've tried every approach I could find or think of, but nothing worked.
Can anybody give a hand?
Much appreciated!
Hi @adamleo
This is because for the word '41010', none of the char-ngrams are present in the training vocabulary. Hence, the FastText model cannot return a meaningful word vector for the input word.
Hope this helps. If not, please feel free to ask me to clarify further.
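One common workaround (a sketch, not from this thread: it assumes you are fine simply skipping words the model cannot represent) is to guard the lookup and catch the KeyError the wrapper raises:

```python
def safe_vectors(word_model, words):
    """Collect vectors only for words the model can represent.

    Words for which the lookup raises KeyError (i.e. all of their
    char-ngrams are absent from the model) are skipped.
    """
    vectors = {}
    for word in words:
        try:
            vectors[word] = word_model[word]
        except KeyError:
            # No ngrams for this word in the training vocabulary; skip it.
            pass
    return vectors

# Stand-in for the loaded model: a plain dict raises KeyError the same way.
model = {"hello": [0.1, 0.2]}
print(safe_vectors(model, ["hello", "41010"]))  # {'hello': [0.1, 0.2]}
```

Instead of skipping, you could also substitute a zero vector or a learned unknown-word vector in the except branch, depending on what your downstream model expects.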
Hi @jayantj, thanks for the response. So if the text files I use for training contain numbers, how should I deal with them if I want to keep using gensim?
Just to clarify, neither Gensim nor FastText imposes any inherent constraints on numbers being present in your training data. If your training data contains either the token '41010' or any of the char-ngrams present in '41010', the model should be able to learn a vector for it.
As for whether you'd get meaningful vectors for them, that depends on your use-case. I guess the questions you need to answer are -
- whether the token '41010' has any semantic significance (if not, a common strategy is to simply replace any numbers in your training data with a special token '<NUMBER>')

Thanks a lot @jayantj, I appreciate the explanation.
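A minimal sketch of the '<NUMBER>' replacement strategy mentioned above (a simple regex over standalone digit runs is assumed; adjust for floats, signs, etc. as your data requires):

```python
import re

def replace_numbers(text, token="<NUMBER>"):
    """Replace standalone runs of digits with a special token
    before feeding the text to training."""
    return re.sub(r"\b\d+\b", token, text)

print(replace_numbers("room 41010 on floor 3"))
# room <NUMBER> on floor <NUMBER>
```

Applying this to the training corpus means the model learns one vector for the '<NUMBER>' token, and the same replacement at query time avoids the missing-ngrams lookup entirely.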
I was just wondering if I could get some clarification on this. I'm getting this error when querying for the word vector of a quote character (“). However, I do get a vector back from the same model using Facebook's Python wrapper. Is this expected behaviour?