Spacy: Unintuitive behavior of vocab

Created on 3 May 2017 · 3 Comments · Source: explosion/spaCy

Hi! The following behavior is strange:

import spacy
nlp = spacy.load('en')
doc = nlp("let's get schwifty")
print([t.is_oov for t in doc])
# [False, False, False, True]

nlp.vocab.__contains__('schwifty')
# False

nlp.vocab['schwifty'] # a new lexeme is created and stored
nlp.vocab.__contains__('schwifty')
# True

print([t.is_oov for t in doc])
# [False, False, False, True]  # Why is 'schwifty' still OOV?

My Environment

spacy.info()

Info about spaCy

    spaCy version      1.8.2          
    Location           /home/dan/miniconda/envs/py36/lib/python3.6/site-packages/spacy
    Platform           Linux-4.8.0-49-generic-x86_64-with-debian-stretch-sid
    Python version     3.6.0          
    Installed models    en, en_core_web_md

usage

TropComplique

Most helpful comment

Hello,

Here's where I think the problem comes from:

In vocab.pyx, __getitem__ creates a new lexeme for the OOV word and stores it in the lexicon. Here's the beginning of the docstring:

Retrieve a lexeme, given an int ID or a unicode string. If a previously unseen unicode string is given, a new lexeme is created and stored.

Here's the code:

if type(id_or_string) == unicode:
    orth = self.strings[id_or_string]
else:
    orth = id_or_string
return Lexeme(self, orth)

and Lexeme.__init__ doesn't touch the OOV flag, or indeed any flag. The OOV flag is set to False for the words in the built-in vocabulary once, when the model is loaded, right @ines? It seems to me that this issue makes a fair point from the standpoint of intuition: once a lexeme has been added to the vocabulary, it should no longer be OOV.

@TropComplique, if you want to add a new word to the lexicon, it seems to me the only way is to set is_oov to False yourself.
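
To make the mechanism concrete, here is a toy model in plain Python. It is only a sketch of the behavior described above, not spaCy's implementation: ToyLexeme and ToyVocab are hypothetical stand-ins, and the assumption that a lexeme created on the fly starts with is_oov set to True is taken from the output in the original report rather than from the spaCy source.

```python
class ToyLexeme:
    """Hypothetical stand-in for a vocabulary entry (not spaCy's Lexeme)."""

    def __init__(self, text):
        self.text = text
        # Assumption: entries created on the fly start out flagged as OOV.
        self.is_oov = True


class ToyVocab:
    """Hypothetical stand-in for a vocabulary (not spaCy's Vocab)."""

    def __init__(self, builtin_words):
        self._lexemes = {}
        for word in builtin_words:
            # The flag is cleared only here, once, at "model load" time.
            self[word].is_oov = False

    def __contains__(self, text):
        return text in self._lexemes

    def __getitem__(self, text):
        # An unseen string gets a new entry that is created and stored,
        # but nothing in this lookup path changes its flags.
        if text not in self._lexemes:
            self._lexemes[text] = ToyLexeme(text)
        return self._lexemes[text]


vocab = ToyVocab(["let", "'s", "get"])

in_before = "schwifty" in vocab   # not stored yet
lexeme = vocab["schwifty"]        # lookup creates and stores the entry
in_after = "schwifty" in vocab    # now stored
still_oov = lexeme.is_oov         # flag is untouched by the lookup

# The workaround suggested above: clear the flag by hand.
lexeme.is_oov = False
flags = [vocab[w].is_oov for w in ["let", "'s", "get", "schwifty"]]
```

In this toy version, membership in the vocabulary and the OOV flag are two independent pieces of state, which is why the lookup alone does not change is_oov. Whether the real Lexeme in a given spaCy version exposes a writable is_oov should be checked against that version's source.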

DuyguA on 11 May 2017

All 3 comments