spaCy: Unintuitive behavior of vocab

Created on 3 May 2017 · 3 Comments · Source: explosion/spaCy

Hi! The following behavior is strange:

import spacy
nlp = spacy.load('en')
doc = nlp("let's get schwifty")
print([t.is_oov for t in doc])
# [False, False, False, True]

print('schwifty' in nlp.vocab)
# False

nlp.vocab['schwifty']  # a new lexeme is created and stored
print('schwifty' in nlp.vocab)
# True

print([t.is_oov for t in doc])
# [False, False, False, True]  # Why is 'schwifty' still OOV?

My Environment

spacy.info()

Info about spaCy

    spaCy version      1.8.2          
    Location           /home/dan/miniconda/envs/py36/lib/python3.6/site-packages/spacy
    Platform           Linux-4.8.0-49-generic-x86_64-with-debian-stretch-sid
    Python version     3.6.0          
    Installed models    en, en_core_web_md

Most helpful comment

Hello,

Here's where I think the problem comes from:

In vocab.pyx, __getitem__ creates a new lexeme for the OOV word and stores it in the lexicon. Here's the beginning of the docstring:

Retrieve a lexeme, given an int ID or a unicode string. If a previously unseen unicode string is given, a new lexeme is created and stored.

Here's the code:

if type(id_or_string) == unicode:
    orth = self.strings[id_or_string]
else:
    orth = id_or_string
return Lexeme(self, orth)

and Lexeme.__init__ doesn't touch the OOV flag, or indeed any flag. The OOV flags are initialized to False for the words in the built-in vocabulary once, when the model is loaded, right @ines? I think this issue makes a fair point from the standpoint of intuition: once a lexeme has been added to the vocabulary, it should no longer count as OOV.

@TropComplique, if you want to add a new word to the lexicon, it seems to me the only way is to set is_oov to False yourself.
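
To make the mechanism concrete, here is a minimal, self-contained sketch in plain Python. It does not use spaCy: ToyLexeme and ToyVocab are hypothetical stand-ins, and the default of is_oov=True for newly created lexemes is an assumption chosen to match the output in the original report. The sketch only illustrates the idea that a lookup can store a lexeme without ever updating its OOV flag, and that setting the flag by hand is what changes it.

```python
class ToyLexeme:
    """Hypothetical stand-in for a lexeme; not spaCy's Lexeme class."""

    def __init__(self, text, is_oov=True):
        self.text = text
        # Assumption: a freshly created lexeme is flagged OOV by default.
        self.is_oov = is_oov


class ToyVocab:
    """Hypothetical stand-in for a vocabulary; not spaCy's Vocab class."""

    def __init__(self, known_words):
        # Words shipped with the "model" get is_oov=False once, at load time.
        self._lexemes = {w: ToyLexeme(w, is_oov=False) for w in known_words}

    def __contains__(self, text):
        return text in self._lexemes

    def __getitem__(self, text):
        # An unseen string creates and stores a lexeme, but nothing here
        # touches the OOV flag, so the lexeme keeps its default value.
        if text not in self._lexemes:
            self._lexemes[text] = ToyLexeme(text)
        return self._lexemes[text]


vocab = ToyVocab(["let", "'s", "get"])

before = "schwifty" in vocab          # not stored yet
lexeme = vocab["schwifty"]            # lookup creates and stores the lexeme
after = "schwifty" in vocab           # now stored
flag_after_lookup = lexeme.is_oov     # flag was never updated by the lookup

lexeme.is_oov = False                 # the manual workaround suggested above
flag_after_fix = vocab["schwifty"].is_oov

print(before, after, flag_after_lookup, flag_after_fix)
# False True True False
```

In this toy model, membership in the vocabulary and the OOV flag are two independent pieces of state, which is why `in` starts returning True while is_oov stays True until it is set explicitly.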

All 3 comments

Sorry for only getting back to this now. @DuyguA's analysis is correct. Now that the new version is out, we've been thinking about adding another method to Vocab, for example, Vocab.add() that lets you add lexemes to the vocabulary. This would also integrate well with the new Vectors class.
