Hi, I am using Gensim to load fastText trained model using the code below:
from gensim.models.wrappers import FastText
self.word_model = FastText.load_fasttext_format(EMBEDDINGS_MODEL_PATH)
......
print("Embedding dictionary words")
for word in memn2n.general_config.dictionary.keys():
memn2n.dict_vectors[word] = memn2n.word_model[word]
When trying to assign vector to each word in the dictionary, I got the error below:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/ubuntu/BI_AI/MemN2N-tableQA/demo/qa.py", line 453, in <module>
run_web_demo(args.data_dir, args.model_file)
File "/home/ubuntu/BI_AI/MemN2N-tableQA/demo/qa.py", line 425, in run_web_demo
webapp.init(data_dir, model_file)
File "demo/web/webapp.py", line 28, in init
memn2n.dict_vectors[word] = memn2n.word_model[word]
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 1281, in __getitem__
return self.wv.__getitem__(words)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 589, in __getitem__
return self.word_vec(words)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 94, in word_vec
raise KeyError('all ngrams for word %s absent from model' % word)
KeyError: 'all ngrams for word 41010 absent from model'
I've tried every approach I could find or think of, but nothing worked.
Can anybody give a hand?
Much appreciated!
Hi @adamleo
This is because for the word '41010', none of the char-ngrams are present in the training vocabulary. Hence, the FastText model cannot return a meaningful word vector for the input word.
Hope this helps. If not, please feel free to ask me to clarify further.
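One common workaround (a sketch, not from this thread: it assumes you are fine simply skipping words the model cannot represent) is to guard the lookup and catch the KeyError the wrapper raises:

```python
def safe_vectors(word_model, words):
    """Collect vectors only for words the model can represent.

    Words for which the lookup raises KeyError (i.e. all of their
    char-ngrams are absent from the model) are skipped.
    """
    vectors = {}
    for word in words:
        try:
            vectors[word] = word_model[word]
        except KeyError:
            # No ngrams for this word in the training vocabulary; skip it.
            pass
    return vectors

# Stand-in for the loaded model: a plain dict raises KeyError the same way.
model = {"hello": [0.1, 0.2]}
print(safe_vectors(model, ["hello", "41010"]))  # {'hello': [0.1, 0.2]}
```

Instead of skipping, you could also substitute a zero vector or a learned unknown-word vector in the except branch, depending on what your downstream model expects.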
Hi @jayantj, thanks for the response. So if the text files I use for training contain numbers, how should I deal with them if I want to keep using gensim?
Just to clarify, neither Gensim nor FastText imposes any inherent constraints on numbers being present in your training data. If your training data contains either the token '41010' or any of the char-ngrams present in '41010', the model should be able to learn a vector for it.
As for whether you'd get meaningful vectors for them, that depends on your use-case. I guess the questions you need to answer are -
- whether the token '41010' has any semantic significance (if not, a common strategy is to simply replace any numbers in your training data with a special token '<NUMBER>')

Thanks a lot @jayantj, I appreciate the explanation.
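A minimal sketch of the '<NUMBER>' replacement strategy mentioned above (a simple regex over standalone digit runs is assumed; adjust for floats, signs, etc. as your data requires):

```python
import re

def replace_numbers(text, token="<NUMBER>"):
    """Replace standalone runs of digits with a special token
    before feeding the text to training."""
    return re.sub(r"\b\d+\b", token, text)

print(replace_numbers("room 41010 on floor 3"))
# room <NUMBER> on floor <NUMBER>
```

Applying this to the training corpus means the model learns one vector for the '<NUMBER>' token, and the same replacement at query time avoids the missing-ngrams lookup entirely.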
I was just wondering if I could get some clarification on this. I'm getting this error when querying for the word vector of a quote character (“). However, I do get a vector back from the same model using Facebook's Python wrapper. Is this expected behaviour?