Great job on spaCy, fantastic dependency parser!
Question: is there a way to test whether words are in the (English) vocabulary?
It's actually really obvious:
s in nlp.vocab
Not working
nlp = spacy.load('en')
doc = nlp('I am sflmgmavknsaccasas')
for token in doc:
    print(token in nlp.vocab)
Error:
TypeError: an integer is required
Also, is_oov is broken:
for token in doc:
    print(token.is_oov)
True
True
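The TypeError appears to happen because a Token object is being passed to the vocab's membership check, which expects a string (or an integer hash). Passing the token's text should avoid the error; a minimal sketch, assuming the same 'en' model as above:

import spacy

nlp = spacy.load('en')
doc = nlp('I am sflmgmavknsaccasas')

for token in doc:
    # pass the token's string, not the Token object itself
    print(token.text, token.text in nlp.vocab)

Note that the vocab can grow as text is processed, so membership alone may not be a reliable test for whether a word is "real".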
Same issue here: I'm trying to use this to keep only real words in scraped text. The in nlp.vocab approach throws an error, and every real word I've tested comes back True for is_oov.
doc = nlp('I am sflmgmavknsaccasas dog cat bird bulbasaur')
[tok for tok in doc if tok in nlp.vocab]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "vocab.pyx", line 194, in spacy.vocab.Vocab.__contains__
TypeError: an integer is required
[tok.is_oov for tok in doc]
[True, True, True, True, True, True, True]
@ghonk here is a workaround:
for token in 'k8s sdjhsd horse hit'.split(' '):
    print(nlp.vocab.has_vector(token))
but it only makes sense if you're using a model with word vectors
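Applied to a parsed Doc instead of a plain split, the same workaround looks roughly like this (again assuming a model that ships vectors, e.g. en_core_web_md):

import spacy

nlp = spacy.load('en_core_web_md')  # assumed vectors model, not from this thread
doc = nlp('I am sflmgmavknsaccasas dog cat bird bulbasaur')

# keep only tokens whose text has an entry in the vectors table
real_words = [tok.text for tok in doc if nlp.vocab.has_vector(tok.text)]
print(real_words)

Treating "has a vector" as "is a real word" will still miss rare but legitimate words that fall outside the vectors table, so it's only an approximation.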