Gensim: KeyError('all ngrams for word %s absent from model' % word)

Created on 1 Sep 2017 · 5 comments · Source: RaRe-Technologies/gensim

Hi, I am using Gensim to load a fastText-trained model with the code below:

from gensim.models.wrappers import FastText
self.word_model = FastText.load_fasttext_format(EMBEDDINGS_MODEL_PATH)

......

print("Embedding dictionary words")
for word in memn2n.general_config.dictionary.keys():
    memn2n.dict_vectors[word] = memn2n.word_model[word]

When trying to assign a vector to each word in the dictionary, I got the error below:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/ubuntu/BI_AI/MemN2N-tableQA/demo/qa.py", line 453, in <module>
    run_web_demo(args.data_dir, args.model_file)
  File "/home/ubuntu/BI_AI/MemN2N-tableQA/demo/qa.py", line 425, in run_web_demo
    webapp.init(data_dir, model_file)
  File "demo/web/webapp.py", line 28, in init
    memn2n.dict_vectors[word] = memn2n.word_model[word]
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 1281, in __getitem__
    return self.wv.__getitem__(words)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 589, in __getitem__
    return self.word_vec(words)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 94, in word_vec
    raise KeyError('all ngrams for word %s absent from model' % word)
KeyError: 'all ngrams for word 41010 absent from model'

I have tried every approach I could find or come up with, but nothing has worked.

Can anybody give a hand?

Much appreciated!


All 5 comments

Hi @adamleo

This is because none of the char n-grams of the word '41010' are present in the training vocabulary, so the FastText model cannot return a meaningful word vector for it.
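
One simple way to handle this (a rough sketch, reusing the names from your snippet) is to catch the KeyError and skip those words, or substitute a fallback vector of your choice:

print("Embedding dictionary words")
for word in memn2n.general_config.dictionary.keys():
    try:
        memn2n.dict_vectors[word] = memn2n.word_model[word]
    except KeyError:
        # None of this word's char n-grams are in the model, so no vector
        # can be computed; skip the word here (or assign a zero/fallback vector).
        continue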

Hope this helps. If not, please feel free to ask me to clarify further.

Hi @jayantj, thanks for the response. So if the text files I use for training contain numbers, how should I deal with them if I want to continue using gensim?

Just to clarify, neither Gensim nor FastText imposes any inherent constraints on numbers being present in your training data. If your training data contains either the token '41010' or any of the char-ngrams present in '41010', the model should be able to learn a vector for it.

As for whether you'd get meaningful vectors for them, that depends on your use case. I guess the questions you need to answer are:

  1. Does the actual identity of the tokens ('41010') have any semantic significance? If not, a common strategy is to simply replace any numbers in your training data with a special token such as '<NUMBER>'.
  2. Is there any semantic significance to numbers at all, in general? If not, you probably don't need vectors for numeric terms at all, and can simply skip over them when fetching vectors (a rough sketch of both options follows below).
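
For instance, something along these lines (plain Python; the helper names here are just illustrative):

import re

NUMBER_TOKEN = '<NUMBER>'

def replace_numbers(line):
    # Option 1: before training, map every purely numeric token to one
    # special token, so all numbers share a single learned vector.
    return re.sub(r'\b\d+\b', NUMBER_TOKEN, line)

def is_numeric(token):
    # Option 2: skip numeric tokens entirely when fetching vectors.
    return token.isdigit()

print(replace_numbers('row 41010 of the table'))  # -> 'row <NUMBER> of the table'
print(is_numeric('41010'))                        # -> True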

Thanks a lot @jayantj, I appreciate the explanation.

I was just wondering if I could get some clarification on this. I'm getting this error when querying for the word vector of the quote character “. However, I do get a vector back from the same model when using Facebook's Python wrapper. Is this expected behaviour?
