From the docs, nlp.vocab should support lookup by string as well as by integer ID. However, running this code:
from spacy.en import English
nlp = English()
lexeme_name = nlp.vocab[1000].orth_
# >> situation
lexeme = nlp.vocab['situation']
# >> TypeError: an integer is required
Shouldn't the appropriate lexeme be returned for a string lookup argument?
Hey, same issue here. This is what I get when running the same code as @RomHartmann above:

File "spacy/vocab.pyx", line 211, in spacy.vocab.Vocab.__getitem__ (spacy/vocab.cpp:6216)
TypeError: an integer is required

Thanks in advance.
As of 7/26 the same bug still exists. Bumping this a bit.
This is actually a bad error message (and probably a bad API) rather than an outright bug. You're passing in a byte string, which fails the type check; if you pass a unicode string, it'll work.
Here's the method implementation, so you can see the problem.
def __getitem__(self, id_or_string):
    '''Retrieve a lexeme, given an int ID or a unicode string. If a previously
    unseen unicode string is given, a new lexeme is created and stored.

    Args:
        id_or_string (int or unicode):
            The integer ID of a word, or its unicode string. If an int >= Lexicon.size,
            IndexError is raised. If id_or_string is neither an int nor a unicode string,
            ValueError is raised.

    Returns:
        lexeme (Lexeme):
            An instance of the Lexeme Python class, with data copied on
            instantiation.
    '''
    cdef attr_t orth
    if type(id_or_string) == unicode:
        orth = self.strings[id_or_string]
    else:
        orth = id_or_string
    return Lexeme(self, orth)
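In Python 2 a plain str is bytes, so it fails the type(id_or_string) == unicode check and falls through to the integer branch, where assigning it to attr_t raises the TypeError. A minimal sketch of both paths, assuming the same old spacy.en API used in this thread:

from spacy.en import English

nlp = English()
nlp.vocab[u'situation']   # unicode: looked up in the strings table, returns a Lexeme
nlp.vocab['situation']    # Python 2 str is bytes: falls into the else branch,
                          # and coercing bytes to attr_t raises
                          # TypeError: an integer is required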
I actually figured out a workaround. With the OP's code, if you change it to

lexeme = nlp.vocab[unicode("situation")]

it should work.
Well... that might work for now. But if you do it that way, you'll be in for no end of frustration as you process more text.
You should work through the unicode/bytes difference in Python 2. It's pretty important if you're going to do NLP.
Best practices: make sure all your files have from __future__ import unicode_literals at the top, and always read in files using io.open(loc, encoding='utf8'). This will go most of the way to making things work by default.
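For example, a minimal Python 2 script following these practices (the file name here is just a placeholder):

from __future__ import unicode_literals   # all string literals are now unicode
import io

from spacy.en import English

nlp = English()

# io.open decodes to unicode on read, unlike Python 2's built-in open
with io.open('document.txt', encoding='utf8') as f:
    text = f.read()

doc = nlp(text)
lexeme = nlp.vocab['situation']   # works: the literal is unicode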