Spacy: lexeme not loaded with string lookup

Created on 6 Jun 2016 · 6 Comments · Source: explosion/spaCy

From the docs:
[screenshot of the Vocab documentation showing that a lexeme can be retrieved by integer ID or by string]

However, following this code:

    from spacy.en import English

    nlp = English()
    lexeme_name = nlp.vocab[1000].orth_
    # >>> situation
    lexeme = nlp.vocab['situation']
    # >>> TypeError: an integer is required

Shouldn't the string lookup return the appropriate lexeme?


All 6 comments

Hey, same issue here. This is what I get when running the same code as @RomHartmann:

    File "spacy/vocab.pyx", line 211, in spacy.vocab.Vocab.__getitem__ (spacy/vocab.cpp:6216)
    TypeError: an integer is required

Thanks in advance

As of 7/26 the same bug still exists. Bumping this up a bit.

This is actually a bad error message (and arguably a bad API) rather than an outright bug. You're passing in a byte string, which fails a type check. If you pass a unicode string, it'll work.

Here's the method implementation, so you can see the problem.

    def __getitem__(self,  id_or_string):
        '''Retrieve a lexeme, given an int ID or a unicode string.  If a previously
        unseen unicode string is given, a new lexeme is created and stored.

        Args:
            id_or_string (int or unicode):
              The integer ID of a word, or its unicode string.  If an int >= Lexicon.size,
              IndexError is raised. If id_or_string is neither an int nor a unicode string,
              ValueError is raised.

        Returns:
            lexeme (Lexeme):
              An instance of the Lexeme Python class, with data copied on
              instantiation.
        '''
        cdef attr_t orth
        if type(id_or_string) == unicode:
            orth = self.strings[id_or_string]
        else:
            orth = id_or_string
        return Lexeme(self, orth)
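
To illustrate the branch above, here is a minimal sketch (assuming Python 2 and the old spacy.en.English loader from that era) of the unicode-vs-bytes difference in the lookup:

    # -*- coding: utf-8 -*-
    # Minimal sketch (Python 2, old spacy.en loader): a unicode string is
    # looked up through the StringStore, while a byte string falls through
    # to the integer branch and raises the TypeError above.
    from spacy.en import English

    nlp = English()

    lex = nlp.vocab[u'situation']    # unicode -> works
    print(lex.orth_)                 # situation

    try:
        nlp.vocab[b'situation']      # bytes (a plain Python 2 str) -> not unicode
    except TypeError as err:
        print(err)                   # an integer is required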

I actually figured out a workaround. With the OP's code, if you change it to

    lexeme = nlp.vocab[unicode("situation")]

it should work.

Well... that might work for now. But if you do it that way, you'll be in for no end of frustration as you process more text.

You should work through the unicode/bytes difference in Python 2. It's pretty important if you're going to do NLP.

Best practices: make sure all your files have from __future__ import unicode_literals at the top, and always read in files using io.open(loc, encoding='utf8'). This will go most of the way to making things work by default.
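
As a rough sketch of putting those two recommendations together (the file path here is hypothetical):

    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals   # plain string literals become unicode
    import io

    from spacy.en import English

    nlp = English()

    # io.open with an explicit encoding yields unicode text in Python 2.
    with io.open('corpus.txt', encoding='utf8') as f:   # hypothetical file
        text = f.read()

    doc = nlp(text)
    lexeme = nlp.vocab['situation']   # 'situation' is now unicode, so the lookup works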

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
