Great job on spaCy, fantastic dependency parser!
Question: is there a way to test whether words are in the (English) vocabulary?
It's actually really obvious:
s in nlp.vocab
Not working
nlp = spacy.load('en')
doc = nlp('I am sflmgmavknsaccasas')
for token in doc:
    print(token in nlp.vocab)
Error:
TypeError: an integer is required
Also, is_oov is broken:
for token in doc:
    print(token.is_oov)
True
True
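The TypeError appears to happen because a Token object is being passed to the vocab's membership check, which expects a string (or an integer hash). Passing the token's text should avoid the error; a minimal sketch, assuming the same 'en' model as above:

import spacy

nlp = spacy.load('en')
doc = nlp('I am sflmgmavknsaccasas')

for token in doc:
    # pass the token's string, not the Token object itself
    print(token.text, token.text in nlp.vocab)

Note that the vocab can grow as text is processed, so membership alone may not be a reliable test for whether a word is "real".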
Same issue here: I'm trying to use this to keep only real words in scraped text. The in nlp.vocab approach throws an error, and every real word I've tested comes back True for is_oov.
doc = nlp('I am sflmgmavknsaccasas dog cat bird bulbasaur')
[tok for tok in doc if tok in nlp.vocab]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "vocab.pyx", line 194, in spacy.vocab.Vocab.__contains__
TypeError: an integer is required
[tok.is_oov for tok in doc]
[True, True, True, True, True, True, True]
@ghonk here is a workaround:
for token in 'k8s sdjhsd horse hit'.split(' '):
    print(nlp.vocab.has_vector(token))
but it only makes sense if you're using a model with word vectors
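Applied to a parsed Doc instead of a plain split, the same workaround looks roughly like this (again assuming a model that ships vectors, e.g. en_core_web_md):

import spacy

nlp = spacy.load('en_core_web_md')  # assumed vectors model, not from this thread
doc = nlp('I am sflmgmavknsaccasas dog cat bird bulbasaur')

# keep only tokens whose text has an entry in the vectors table
real_words = [tok.text for tok in doc if nlp.vocab.has_vector(tok.text)]
print(real_words)

Treating "has a vector" as "is a real word" will still miss rare but legitimate words that fall outside the vectors table, so it's only an approximation.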