spaCy: KeyError when looking up a hash in vocab.strings from vocab.vectors

Created on 14 Feb 2018 · 9 Comments · Source: explosion/spaCy

Is this a bug or a feature? With the en_vectors_web_lg model, I get a KeyError when I take a key from vocab.vectors and look it up in vocab.strings.

import spacy
nlp = spacy.load('en_vectors_web_lg')
for key, vector in nlp.vocab.vectors.items():
    print(key, nlp.vocab.strings[key])

I get

[...]
(6292516164439924713, u'MAKE-WAR-NOT-LAW')
(16510263609655693249, u'make-war-not-law')
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "strings.pyx", line 118, in spacy.strings.StringStore.__getitem__
KeyError: 4035656307355538346

It may be related to #1427 (the 4035656307355538346 key is the same), but I have

>>> import spacy.symbols
>>> 'LAW' in spacy.symbols.IDS
True

and nlp.vocab.strings.add('LAW') does not help.

Info about spaCy

  • Python version: 2.7.14
  • Platform: Linux-4.13.0-32-generic-x86_64-with-Ubuntu-17.10-artful
  • spaCy version: 2.0.7
  • Models: en_core_web_lg, en, en_vectors_web_lg, xx
Labels: bug, feat / vectors, models

All 9 comments

I find these two keys missing: 4035656307355538346 and 9493573674140310719

import spacy
nlp = spacy.load('en_vectors_web_lg')
vector_words = []
for key, vector in nlp.vocab.vectors.items():
    try:
        vector_words.append(nlp.vocab.strings[key])
    except KeyError:
        print(key)

Thanks for noting this, I'll fix it in the next model release. I'm not immediately sure what the missing keys might be.

I can still replicate this issue, so will label it as bug to try and get it a bit higher up on our TODO list ;-)

I am going through the spaCy learning video https://course.spacy.io/en/chapter2 and trying out the following lines of code

import spacy
from spacy.lang.en import English
nlp = English()

coffee_hash = nlp.vocab.strings['coffee']
print('coffee hash:', coffee_hash)

coffee_string = nlp.vocab.strings[coffee_hash]
print('coffee string:', coffee_string)

I get the following error on the coffee_string = nlp.vocab.strings[coffee_hash] line:

KeyError: "[E018] Can't retrieve string for hash '3197928453018144401'. This usually refers to an issue with the Vocab or StringStore."

Note: This works fine if I use doc.vocab.strings.

@vish0701:

Thanks for the report! This is a place where the API is a little inconsistent and this example should be clarified.

nlp.vocab.strings["coffee"] calculates the hash but does not add it to the StringStore. To add it, you have to use nlp.vocab.strings.add("coffee").

Confusingly, nlp.vocab["coffee"] does add it to the StringStore as part of creating and adding the lexeme to the vocab.

If you've already processed a document containing the word "coffee", then the hash will be stored in the StringStore under nlp.vocab.strings or doc.vocab.strings (the two Vocabs will be same object if you created the doc with this pipeline).
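
To make the difference concrete, here is a minimal pure-Python sketch of the behaviour described above. ToyStringStore is a hypothetical stand-in written only for illustration; it is not spaCy's StringStore, and its hash function is an arbitrary stable one, so the hash values will not match spaCy's.

```python
import hashlib


class ToyStringStore:
    """Hypothetical hash-keyed string table (NOT spaCy's implementation)."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(text):
        # Assumption: any stable 64-bit hash works for the illustration.
        # spaCy's real hash function is different.
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash: computed on the fly, nothing is stored.
            return self._hash(key)
        # Hash -> string: only works for strings that were added earlier.
        return self._by_hash[key]

    def add(self, text):
        # Store the string under its hash and return the hash.
        key = self._hash(text)
        self._by_hash[key] = text
        return key


strings = ToyStringStore()
coffee_hash = strings["coffee"]  # hash is computed, but not stored

try:
    strings[coffee_hash]
    lookup_failed = False
except KeyError:
    lookup_failed = True  # same kind of KeyError as in the report

strings.add("coffee")  # explicit add makes the reverse lookup work
coffee_string = strings[coffee_hash]
print(lookup_failed, coffee_string)
```

In this sketch the first reverse lookup fails and the second succeeds, which is the same pattern as calling nlp.vocab.strings.add("coffee") before looking up the hash.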

Not sure if this is the right issue, but I am seeing something similar with spaCy 2.2.4 (model en_core_web_md 2.2.5): there is one vector hash for which the string is missing (raised by https://github.com/SeldonIO/alibi/issues/275):

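import spacy

# Model named in this report
nlp = spacy.load('en_core_web_md')

found = []
missing = []

# Try to resolve every vector key to a lexeme; collect keys with no string
for key in nlp.vocab.vectors:
    try:
        found.append(nlp.vocab[key])
    except KeyError:
        missing.append(key)

print(missing) # [11580349482641876976]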

I can confirm that that key is missing in en_core_web_md 2.2.5. I'm not quite sure which change in v2.3 made a difference in the 2.3.x models, though, which is a little disconcerting. Also, if you use en_vectors_web_lg there are four keys missing in both v2.2 and v2.3, one of which is the same key as here.

I am going through the spaCy learning video https://course.spacy.io/en/chapter2 and trying out the following lines of code

import spacy
from spacy.lang.en import English
nlp = English()

coffee_hash = nlp.vocab.strings['coffee']
print('coffee hash:', coffee_hash)

coffee_string = nlp.vocab.strings[coffee_hash]
print('coffee string:', coffee_string)

I get the following error on the coffee_string = nlp.vocab.strings[coffee_hash] line:

KeyError: "[E018] Can't retrieve string for hash '3197928453018144401'. This usually refers to an issue with the Vocab or StringStore."

Note: This works fine if I use doc.vocab.strings.

I have the exact same issue, from the same source, with the same code.
(This is to bump it up in the issue queue, as it seems to be a recurring thing: I googled it and found posts from over a year ago already mentioning the issue :)


EDIT

Another way to get the word back from the hash

Also from your course: it seems that skipping .strings and indexing nlp.vocab directly does the trick

doc = nlp("I love coffee")
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True

lexeme.orth gives the hash. To get the word back:
print(nlp.vocab[lexeme.orth].text)

coffee

Don't know if this helps in any way for debugging, but at least people encountering the same problem have a way of going through the exercise :)
