spaCy: KeyError during stream processing

Created on 8 Nov 2017 · 13 comments · Source: explosion/spaCy

@ligser writes on Gitter:

Hello all! I'm having some trouble with spacy-nightly==2.0.0rc1 (a18 behaves the same way) and the en_core_web_lg model. When I run nlp.pipe with a generator of texts I get the exception KeyError: 4405041669077156115.
The exception is raised after a number of texts have been processed (around 10,000 on average).
The stacktrace looks like this:
    nlp.pipe((c.content_text for c in texts), batch_size=24, n_threads=8)
  File "doc.pyx", line 375, in spacy.tokens.doc.Doc.text.__get__
  File "doc.pyx", line 232, in __iter__ 
  File "token.pyx", line 178, in spacy.tokens.token.Token.text_with_ws.__get__
  File "strings.pyx", line 116, in spacy.strings.StringStore.__getitem__
KeyError: 4405041669077156115
That looks like a bug in the StringStore cleanup or something related (maybe a shared string store being cleaned up by one of the threads?).
My code just fetches rows from MySQL, splits them into texts and ids, and does: for id, doc in zip(ids_gen, nlp.pipe(docs_gen, ...)).
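
For context, the pattern boils down to something like the sketch below (the row data here is a placeholder; only the zip over nlp.pipe and its batch_size/n_threads arguments come from the report):

import spacy

nlp = spacy.load('en_core_web_lg')

# rows: list of (id, text) pairs, e.g. fetched from MySQL (placeholder data here)
rows = [(1, 'First document text.'), (2, 'Second document text.')]
ids_gen = (row[0] for row in rows)
docs_gen = (row[1] for row in rows)

for doc_id, doc in zip(ids_gen, nlp.pipe(docs_gen, batch_size=24, n_threads=8)):
    print(doc_id, doc.text)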

I think this is likely due to the solution added in spaCy 2 to address memory growth when streaming data.

bug

Most helpful comment

(nlp.tokenizer(doc) for doc in corpus) doesn't clean up the StringStore, so it looks like that cleanup is what causes the error.

All 13 comments

I ran some experiments and reproduced a similar error as a _test case_. I'm very new to Python, and the code looks ugly and runs slowly, but it causes an error that looks the same (though with a different stacktrace):

# coding: utf8
from __future__ import unicode_literals

import random
import string

# relative import as used inside spaCy's own test suite
from ...lang.en import English


def test_issue1506():
    nlp = English()

    def random_string_generator(string_length, limit):
        for _ in range(limit):
            yield ''.join(
                random.choice(string.digits + string.ascii_letters + '. ') for _ in range(string_length))

    # 20,007 docs is enough to trigger the periodic StringStore cleanup,
    # which happens roughly every 10,000 docs
    for i, d in zip(range(20007), nlp.pipe(random_string_generator(600, 20007))):
        str(d.text)

Info about spaCy

  • spaCy version: 2.0.2.dev0
  • Platform: Darwin-17.2.0-x86_64-i386-64bit
  • Python version: 3.6.2

Stacktrace:

spacy/language.py:554: in pipe
    for doc in docs:
spacy/language.py:534: in <genexpr>
    docs = (self.make_doc(text) for text in texts)
spacy/language.py:357: in make_doc
    return self.tokenizer(text)
tokenizer.pyx:106: in spacy.tokenizer.Tokenizer.__call__
    ???
tokenizer.pyx:156: in spacy.tokenizer.Tokenizer._tokenize
    ???
tokenizer.pyx:235: in spacy.tokenizer.Tokenizer._attach_tokens
    ???
doc.pyx:547: in spacy.tokens.doc.Doc.push_back
    ???
morphology.pyx:81: in spacy.morphology.Morphology.assign_untagged
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   KeyError: 10868232842057966403

strings.pyx:116: KeyError

Also, when I revert the language.py changes from PR #1424, that test goes green.

I'm having a similar issue when accessing token.lemma_ for some tokens.

@andharris Do you still have an example text by any chance?

@ines Here's a reproducible script:

import spacy
import thinc.extra.datasets


def main():
    nlp = spacy.blank('en')
    data, _ = thinc.extra.datasets.imdb()
    corpus = (i[0] for i in data)
    docs = nlp.pipe(corpus)
    lemmas = [[token.lemma_ for token in doc] for doc in docs]
    print("Parsed lemmas for {} docs in corpus".format(len(lemmas)))


if __name__ == '__main__':
    main()

Info:

  • spacy: 2.0.2
  • python: 3.6.2

Stacktrace:

Traceback (most recent call last):
  File "spacy_bug.py", line 15, in <module>
    main()
  File "spacy_bug.py", line 10, in main
    lemmas = [[token.lemma_ for token in doc] for doc in docs]
  File "spacy_bug.py", line 10, in <listcomp>
    lemmas = [[token.lemma_ for token in doc] for doc in docs]
  File ".../venv/lib/python3.6/site-packages/spacy/language.py", line 554, in pipe
    for doc in docs:
  File ".../venv/lib/python3.6/site-packages/spacy/language.py", line 534, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File ".../venv/lib/python3.6/site-packages/spacy/language.py", line 357, in make_doc
    return self.tokenizer(text)
  File "tokenizer.pyx", line 106, in spacy.tokenizer.Tokenizer.__call__
  File "tokenizer.pyx", line 156, in spacy.tokenizer.Tokenizer._tokenize
  File "tokenizer.pyx", line 235, in spacy.tokenizer.Tokenizer._attach_tokens
  File "doc.pyx", line 547, in spacy.tokens.doc.Doc.push_back
  File "morphology.pyx", line 81, in spacy.morphology.Morphology.assign_untagged
  File "strings.pyx", line 116, in spacy.strings.StringStore.__getitem__
KeyError: 5846064049184721376

Interestingly, if I replace docs = nlp.pipe(corpus) with docs = (nlp.tokenizer(doc) for doc in corpus), I no longer get the error. Not sure why this works while the other fails, though.
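
For reference, the workaround is just this substitution inside main() from the script above (everything else unchanged):

    # original line, which eventually raises KeyError:
    # docs = nlp.pipe(corpus)
    # tokenizer-only variant that avoids the error:
    docs = (nlp.tokenizer(doc) for doc in corpus)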

(nlp.tokenizer(doc) for doc in corpus) doesn't clean up the StringStore, so it looks like that cleanup is what causes the error.

Also, I tried to work around this case, and it looks like it's working...

        # recent_refs and old_refs are the sets of Doc references that
        # Language.pipe() already maintains (not shown in this snippet)
        original_strings_data = self.vocab.strings.to_bytes()
        nr_seen = 0
        for doc in docs:
            yield doc
            recent_refs.add(doc)
            if nr_seen < 10000:
                old_refs.add(doc)
                nr_seen += 1
            elif len(old_refs) == 0:
                # All the docs in the 'old' set have expired, so the only
                # difference between the backup strings and the current
                # string-store should be obsolete. We therefore swap out the
                # old strings data.
                old_refs, recent_refs = recent_refs, old_refs
                tmp = self.vocab.strings.to_bytes()
                self.vocab.strings.from_bytes(original_strings_data)
                original_strings_data = tmp
                nr_seen = 0

I tried not to track strings manually and instead just swap out the whole store via its low-level data.

Maybe the problem is that this:

for word in doc:
    recent_strings.add(word.text)

doesn't track all the strings? (It looks like it doesn't track lemmas at all.)

Or maybe I'm missing something wrong in my code, and it just never cleans up?

@ligser : You're exactly right. It's not adding the lemmas or other new strings --- just the word text.

Periodically we need to do:

current = original + recent

Currently we're getting recent by just tracking doc.text. It might be best to add something to the StringStore, but I'm worried that this adds more state that can be lost in serialisation, causing confusing results.

What if we had:

recent = current - previous
current, previous = (original + recent), current

This seems like it should work.
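
Spelled out as plain Python sets, the proposal could look roughly like this sketch (names like handle and time_to_clean_up are illustrative, not from language.py; the next comment implements essentially this inside pipe() and explains why it still drops strings that live Docs need):

# strings present before streaming began
original = set(nlp.vocab.strings)
previous = set(original)

for doc in nlp.pipe(corpus):
    handle(doc)                      # hypothetical downstream processing
    if time_to_clean_up():           # e.g. every ~10,000 docs, once old Docs have expired
        current = set(nlp.vocab.strings)
        recent = current - previous  # strings added since the last swap
        nlp.vocab.strings._reset_and_load(list(original | recent))
        previous = current           # diff against this snapshot next time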

If I understand you correctly, this code does that:

        origin_strings = list(self.vocab.strings)
        previous_strings = list()
        nr_seen = 0
        for doc in docs:
            yield doc
            recent_refs.add(doc)
            if nr_seen < 10000:
                old_refs.add(doc)
                nr_seen += 1
            elif len(old_refs) == 0:
                # All the docs in the 'old' set have expired, so the only
                # difference between the backup strings and the current
                # string-store should be obsolete. We therefore swap out the
                # old strings data.
                old_refs, recent_refs = recent_refs, old_refs
                current_strings = list(self.vocab.strings)
                recent_strings = [item for item in current_strings if item not in previous_strings]
                self.vocab.strings._reset_and_load(recent_strings + origin_strings)
                previous_strings = current_strings
                nr_seen = 0

But that doesn't work.

Because if I subtract previous from current, I lose strings that are present in both (created during the previous window and still used in the current one), and I shouldn't drop those. I've tried to think a bit more about which strings can safely be wiped out; it looks like I was taking things too literally.

It looks like at that level pipe() just cannot decide which strings are fresh and which are obsolete.
In your solution you track the truly recent strings; the only problem is that the list of tracked words is incomplete, because of lemmas and other additions to the StringStore. That solution can work if you know how to track all of the strings.
My solution with the tmp variable just never triggers a real cleanup: in the Nth iteration I'm working with the (N-1)th string store, and it only works by luck.

I think we could use another version of the StringStore class (a PipedStringStore, for example) that holds two different stores, «old» and «new», and knows about iterations.
When we start a new iteration, it swaps the stores and cleans up the «new» one. While working with the PipedStringStore, lookups first try the «new» store; if the key doesn't exist there, they fall back to «old» and copy over any value that exists in «old» but not in «new».
At the end of an iteration it discards «old», together with the values that weren't used during that iteration.
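
A rough sketch of that idea in plain Python (this PipedStringStore is hypothetical; it does not exist in spaCy and only illustrates the two-generation lookup-and-promote scheme described above, with dicts standing in for the real hash-to-string tables):

def hash_string(s):
    # stand-in for spaCy's 64-bit string hash
    return hash(s)


class PipedStringStore(object):
    """Two generations of strings: 'new' is the active table for the current
    iteration, 'old' holds the previous iteration's strings as a fallback."""

    def __init__(self, initial_strings=()):
        self.old = {}
        self.new = {hash_string(s): s for s in initial_strings}

    def add(self, string):
        key = hash_string(string)
        self.new[key] = string
        return key

    def __getitem__(self, key):
        # Try the active store first; on a miss, promote the string from the
        # previous generation instead of raising KeyError straight away.
        if key in self.new:
            return self.new[key]
        string = self.old[key]   # raises KeyError only if the key is truly unknown
        self.new[key] = string   # copy values that exist in 'old' but not in 'new'
        return string

    def start_iteration(self):
        # Starting a new iteration swaps the generations and clears the active
        # table; whatever was never looked up last time is discarded with 'old'.
        self.old = self.new
        self.new = {}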

