Hi! The following behavior is strange:
import spacy
nlp = spacy.load('en')
doc = nlp("let's get schwifty")
print([t.is_oov for t in doc])
# [False, False, False, True]
nlp.vocab.__contains__('schwifty')
# False
nlp.vocab['schwifty'] # a new lexeme is created and stored
nlp.vocab.__contains__('schwifty')
# True
print([t.is_oov for t in doc])
# [False, False, False, True] # Why is 'schwifty' still OOV?
spacy.info()
Info about spaCy
spaCy version 1.8.2
Location /home/dan/miniconda/envs/py36/lib/python3.6/site-packages/spacy
Platform Linux-4.8.0-49-generic-x86_64-with-debian-stretch-sid
Python version 3.6.0
Installed models en, en_core_web_md
Hello,
Here's where I think the problem comes from:
In vocab.pyx, __getitem__ creates a new lexeme for the OOV word and stores it in the lexicon. Here's the beginning of the docstring:
Retrieve a lexeme, given an int ID or a unicode string. If a previously unseen unicode string is given, a new lexeme is created and stored.
Here's the code:
if type(id_or_string) == unicode:
    orth = self.strings[id_or_string]
else:
    orth = id_or_string
return Lexeme(self, orth)
and Lexeme.__init__ doesn't touch the OOV flag, or indeed any flag. The OOV flags are initialized to False for the words in the built-in vocabulary once, when the model is loaded, right @ines? It seems to me that this issue has a fair point intuitively: if a lexeme gets into the vocabulary, it is no longer OOV.
@TropComplique, if you want to add a new word to the lexicon, it seems to me the only way is to set is_oov to False yourself.
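To make the point concrete, here is a small pure-Python toy model of the behavior described above. It is not spaCy's implementation: ToyVocab and this Lexeme class are hypothetical stand-ins, and the assumption that a freshly created lexeme starts out flagged as OOV is taken from the output in the original report. It only shows that membership in the vocabulary and the OOV flag are tracked separately, so a lookup changes the first without touching the second.

```python
class Lexeme:
    """Toy stand-in for a vocabulary entry (hypothetical, not spaCy's Lexeme)."""

    def __init__(self, text, is_oov=True):
        self.text = text
        # Assumption: entries created on the fly start out flagged as OOV.
        self.is_oov = is_oov


class ToyVocab:
    """Toy stand-in for a vocabulary (hypothetical, not spaCy's Vocab)."""

    def __init__(self, builtin_words):
        # Flags for the built-in words are set once, when the vocab is built.
        self._lexemes = {w: Lexeme(w, is_oov=False) for w in builtin_words}

    def __contains__(self, word):
        return word in self._lexemes

    def __getitem__(self, word):
        # An unseen string is created and stored, but no flag is updated.
        if word not in self._lexemes:
            self._lexemes[word] = Lexeme(word)
        return self._lexemes[word]


vocab = ToyVocab(["let", "'s", "get"])

in_before = "schwifty" in vocab       # not stored yet
lex = vocab["schwifty"]               # lookup creates and stores the entry
in_after = "schwifty" in vocab        # now stored
oov_after_lookup = lex.is_oov         # flag is unchanged by the lookup

lex.is_oov = False                    # the manual fix suggested above
oov_after_fix = vocab["schwifty"].is_oov

print(in_before, in_after, oov_after_lookup, oov_after_fix)
```

In this sketch the word is reported as present after the lookup while its flag still says OOV, and the flag only changes once it is set by hand.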
Sorry for only getting back to this now. @DuyguA's analysis is correct. Now that the new version is out, we've been thinking about adding another method to Vocab, for example, Vocab.add() that lets you add lexemes to the vocabulary. This would also integrate well with the new Vectors class.