Spacy: lexeme not loaded with string lookup

Created on 6 Jun 2016 · 6 Comments · Source: explosion/spaCy

From the docs:
[screenshot of the Vocab documentation showing that a lexeme can be retrieved by integer ID or by string]

However, following this code:

    from spacy.en import English

    nlp = English()
    lexeme_name = nlp.vocab[1000].orth_
    # >>> situation
    lexeme = nlp.vocab['situation']
    # >>> TypeError: an integer is required

Shouldn't the string lookup return the appropriate lexeme?


All 6 comments

Hey, same issue here. This is what I get when running the same code as @RomHartmann:

    File "spacy/vocab.pyx", line 211, in spacy.vocab.Vocab.__getitem__ (spacy/vocab.cpp:6216)
    TypeError: an integer is required

Thanks in advance

As of 7/26 the same bug still exists. Bumping this up a bit.

This is actually a bad error message (and arguably a bad API) rather than an outright bug. You're passing in a byte string, which fails a type check. If you pass a unicode string, it'll work.

Here's the method implementation, so you can see the problem.

    def __getitem__(self,  id_or_string):
        '''Retrieve a lexeme, given an int ID or a unicode string.  If a previously
        unseen unicode string is given, a new lexeme is created and stored.

        Args:
            id_or_string (int or unicode):
              The integer ID of a word, or its unicode string.  If an int >= Lexicon.size,
              IndexError is raised. If id_or_string is neither an int nor a unicode string,
              ValueError is raised.

        Returns:
            lexeme (Lexeme):
              An instance of the Lexeme Python class, with data copied on
              instantiation.
        '''
        cdef attr_t orth
        if type(id_or_string) == unicode:
            orth = self.strings[id_or_string]
        else:
            orth = id_or_string
        return Lexeme(self, orth)
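
To illustrate the branch above, here is a minimal sketch (assuming Python 2 and the old spacy.en.English loader from that era) of the unicode-vs-bytes difference in the lookup:

    # -*- coding: utf-8 -*-
    # Minimal sketch (Python 2, old spacy.en loader): a unicode string is
    # looked up through the StringStore, while a byte string falls through
    # to the integer branch and raises the TypeError above.
    from spacy.en import English

    nlp = English()

    lex = nlp.vocab[u'situation']    # unicode -> works
    print(lex.orth_)                 # situation

    try:
        nlp.vocab[b'situation']      # bytes (a plain Python 2 str) -> not unicode
    except TypeError as err:
        print(err)                   # an integer is required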

I actually figured out a workaround. With the OP's code, if you change it to

    lexeme = nlp.vocab[unicode("situation")]

it should work.

Well... that might work for now. But if you do it that way, you'll be in for no end of frustration as you process more text.

You should work through the unicode/bytes difference in Python 2. It's pretty important if you're going to do NLP.

Best practices: make sure all your files have from __future__ import unicode_literals at the top, and always read in files using io.open(loc, encoding='utf8'). This will go most of the way to making things work by default.
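
As a rough sketch of putting those two recommendations together (the file path here is hypothetical):

    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals   # plain string literals become unicode
    import io

    from spacy.en import English

    nlp = English()

    # io.open with an explicit encoding yields unicode text in Python 2.
    with io.open('corpus.txt', encoding='utf8') as f:   # hypothetical file
        text = f.read()

    doc = nlp(text)
    lexeme = nlp.vocab['situation']   # 'situation' is now unicode, so the lookup works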

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
