Hi there,
Let's take a case where we train on a corpus that doesn't contain a given word (say "foo").
If this word shows up in a previously unseen test sentence, you generally get a KeyError for that word not being found in the model.
My question is: how does one get past this? I know that calling train() on a new example will not add the words to the vocabulary; it only updates the weights themselves.
One interesting option might be to return a zero vector for every unknown word. That way its contribution to the model would be minimal, but it would at least not break things.
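Something along these lines, perhaps (just a sketch; get_vector_or_zero is a hypothetical helper of mine, not an existing gensim API):

import numpy as np

def get_vector_or_zero(model, word):
    # Hypothetical fallback: return the word's vector if the model knows it,
    # otherwise an all-zeros vector of the same dimensionality.
    try:
        return model.wv[word]
    except KeyError:
        return np.zeros(model.wv.vector_size, dtype=np.float32)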
Thanks.
I generally use a filter:
for doc in labeled_corpus:
    # keep only the tokens the model actually has a vector for
    words = [word for word in doc.words if word in model.vocab]
This is one simple method for getting past the KeyError on unseen words. I think the KeyError makes the problem explicit, whereas returning a zero vector could make reasoning about the word vectors more difficult down the road. How would such a word differ from a word that is in the model but genuinely contributes little or nothing? Most (all?) of the other models assume that unseen words have been filtered out.
@cscorley Hmm, not sure I understood where you use this. Let me try to clarify.
import gensim

# Load a model that doesn't contain the word "foo" but does contain "bar"
model = gensim.models.Word2Vec.load(...)

# Query for "foo"
doc_tokens = ["foo", "bar"]
print(model.most_similar(positive=doc_tokens))
This will raise an error for "foo":
KeyError: u"word 'foo' not in vocabulary"
Now, if I understand you correctly, what you're saying is that you would instead do:
w = [word for word in doc_tokens if word in model.vocab]
print(model.most_similar(positive=w))
If this is the case, how do you handle the doc2vec case? Is each sentence that you add also filtered at the token level?
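Something like this, perhaps? (My own sketch of what I imagine you mean, assuming TaggedDocument-style documents with .words and .tags attributes:)

from gensim.models.doc2vec import TaggedDocument

# filter each document's tokens against the model vocabulary
filtered_corpus = [
    TaggedDocument(
        words=[w for w in doc.words if w in model.vocab],
        tags=doc.tags,
    )
    for doc in labeled_corpus
]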
@viksit You can try asking this general question on the mailing list. https://groups.google.com/forum/#!forum/gensim
I had thought about this as well for a problem I'm working on, where it is necessary to incorporate the whole corpus vocabulary.
My initial thought was: given an unseen term, find the word in your word2vec vocab with the minimum edit distance to it. Admittedly a naive approach.
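A rough sketch of that fallback (plain dynamic-programming Levenshtein distance; nearest_in_vocab is just an illustrative helper, and a linear scan like this would be slow for a large vocabulary):

def levenshtein(a, b):
    # classic edit distance, O(len(a) * len(b))
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def nearest_in_vocab(word, vocab):
    # map an unseen word to the closest in-vocabulary word
    return min(vocab, key=lambda v: levenshtein(word, v))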
@jamesoneill12 A slightly more sophisticated approach has been implemented in fastText (now also integrated into gensim): break the unknown word into smaller character n-grams, then assemble the word vector from the vectors of these n-grams.
The intuition is similar to your idea: find similarity in the surface form, and assume similarity on the semantic level from that.
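For instance (a minimal sketch with a toy corpus; the gensim 4.x API is assumed, where the parameter is vector_size rather than the older size):

from gensim.models import FastText

# tiny corpus that never contains "foo"; real use needs far more data
sentences = [["bar", "baz", "qux"], ["bar", "qux", "baz"]]
model = FastText(sentences, vector_size=32, window=2, min_count=1,
                 min_n=2, max_n=4)

# no KeyError: the vector is assembled from the character n-grams of "foo"
print(model.wv["foo"])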
@piskvorky That sounds reasonable. If I were to add to this: leveraging information about lexical units might be something to consider when choosing the character n-grams, although that might be too expensive for little gain compared to fixed n-grams.
@piskvorky Very glad to see this answer. It helps me a lot with handling words that are not in my trained model. Thanks a lot.
FastText worked perfectly for me on a text similarity/classification task, even on small custom corpora, especially with the vector size set to 100+.
I used this:
# `t` is assumed to be a fitted tokenizer (e.g. a Keras Tokenizer)
# mapping each word to an integer index
for word, i in t.word_index.items():
    try:
        embedding_vector = model.wv[word]
    except KeyError:
        print(word, 'not found')
A try/except that catches every out-of-vocabulary word and prints it when it is not found in the model.
The character level and the word level seem totally different. Why would simply adding character vectors together make a good representation for a word?
@jianwenl see the original FastText paper or (countless) online tutorials and expositions for an explanation.
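For reference, the key idea from the paper (Bojanowski et al. 2017) is that the n-gram vectors are not trained in isolation: a word's vector is defined as the sum of its character n-gram vectors, and that sum is what is trained (via the usual skip-gram objective) to predict the word's contexts, so the n-grams learn to be useful precisely as additive parts:

% FastText word representation: the vector for word w is the sum of the
% vectors z_g of its character n-grams G_w (the word itself included)
v_w = \sum_{g \in \mathcal{G}_w} z_g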