Hi there,
Let's take a case where we train on a corpus that doesn't contain a given word (say "foo").
If this word shows up in a previously unseen test sentence, you generally get a KeyError for that word not being found in the model.
My question is: how does one get past this? I know that calling train() on a new example will not add the words to the vocabulary; it only updates the weights themselves.
One interesting option might be to return a zero vector for every unknown word. That way its contribution to the model would be minimal, but it would at least not break things.
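Something along these lines, perhaps (just a sketch; get_vector_or_zero is a hypothetical helper of mine, not an existing gensim API):

import numpy as np

def get_vector_or_zero(model, word):
    # Hypothetical fallback: return the word's vector if the model knows it,
    # otherwise an all-zeros vector of the same dimensionality.
    try:
        return model.wv[word]
    except KeyError:
        return np.zeros(model.wv.vector_size, dtype=np.float32)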
Thanks.
I generally use a filter:
for doc in labeled_corpus:
    # keep only the tokens the model actually has a vector for
    words = [word for word in doc.words if word in model.vocab]
This is one simple method for getting past the KeyError on unseen words. I think the KeyError makes the problem explicit, whereas returning a zero vector could make reasoning about the word vectors more difficult down the road. How would such a word differ from a word that is in the model but genuinely contributes little or nothing? Most (all?) of the other models assume that unseen words have been filtered out.
@cscorley Hmm, not sure I understood where you use this. Let me try to clarify.
import gensim

# Load a model that doesn't contain the word "foo" but does contain "bar"
model = gensim.models.Word2Vec.load(...)

# Query for "foo"
doc_tokens = ["foo", "bar"]
print(model.most_similar(positive=doc_tokens))
This will raise an error for "foo":
KeyError: u"word 'foo' not in vocabulary"
Now, if I understand you correctly, what you're saying is that you would instead do:
w = [word for word in doc_tokens if word in model.vocab]
print(model.most_similar(positive=w))
If this is the case, how do you handle the doc2vec case? Is each sentence that you add also filtered at the token level?
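Something like this, perhaps? (My own sketch of what I imagine you mean, assuming TaggedDocument-style documents with .words and .tags attributes:)

from gensim.models.doc2vec import TaggedDocument

# filter each document's tokens against the model vocabulary
filtered_corpus = [
    TaggedDocument(
        words=[w for w in doc.words if w in model.vocab],
        tags=doc.tags,
    )
    for doc in labeled_corpus
]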
@viksit You can try asking this general question on the mailing list. https://groups.google.com/forum/#!forum/gensim
I had thought about this as well for a problem I'm working on, where it is necessary to incorporate the whole corpus vocabulary.
My initial thought was: given an unseen term, find the word in your word2vec vocab with the minimum edit distance to it. Admittedly a naive approach.
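A rough sketch of that fallback (plain dynamic-programming Levenshtein distance; nearest_in_vocab is just an illustrative helper, and a linear scan like this would be slow for a large vocabulary):

def levenshtein(a, b):
    # classic edit distance, O(len(a) * len(b))
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def nearest_in_vocab(word, vocab):
    # map an unseen word to the closest in-vocabulary word
    return min(vocab, key=lambda v: levenshtein(word, v))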
@jamesoneill12 A slightly more sophisticated approach has been implemented in fastText (now also integrated into gensim): break the unknown word into smaller character n-grams, then assemble the word vector from the vectors of these n-grams.
The intuition is similar to your idea: find similarity in the surface form, and assume similarity on the semantic level from that.
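For instance (a minimal sketch with a toy corpus; the gensim 4.x API is assumed, where the parameter is vector_size rather than the older size):

from gensim.models import FastText

# tiny corpus that never contains "foo"; real use needs far more data
sentences = [["bar", "baz", "qux"], ["bar", "qux", "baz"]]
model = FastText(sentences, vector_size=32, window=2, min_count=1,
                 min_n=2, max_n=4)

# no KeyError: the vector is assembled from the character n-grams of "foo"
print(model.wv["foo"])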
@piskvorky That sounds reasonable. If I were to add to this: leveraging information about lexical units might be something to consider when choosing the character n-grams, although that might be too expensive for little gain compared to fixed n-grams.
@piskvorky Very glad to see this answer. It helps me a lot with handling words that are not in my trained model. Thanks a lot.
FastText worked perfectly for me on a text similarity/classification task, even on small custom corpora, especially with the vector size set to 100+.
I used this:
# `t` is assumed to be a fitted tokenizer (e.g. a Keras Tokenizer)
# mapping each word to an integer index
for word, i in t.word_index.items():
    try:
        embedding_vector = model.wv[word]
    except KeyError:
        print(word, 'not found')
A try/except that catches every out-of-vocabulary word and prints it when it is not found in the model.
The character level and the word level seem totally different. Why would simply adding character vectors together make a good representation for a word?
@jianwenl see the original FastText paper or (countless) online tutorials and expositions for an explanation.
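For reference, the key idea from the paper (Bojanowski et al. 2017) is that the n-gram vectors are not trained in isolation: a word's vector is defined as the sum of its character n-gram vectors, and that sum is what is trained (via the usual skip-gram objective) to predict the word's contexts, so the n-grams learn to be useful precisely as additive parts:

% FastText word representation: the vector for word w is the sum of the
% vectors z_g of its character n-grams G_w (the word itself included)
v_w = \sum_{g \in \mathcal{G}_w} z_g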