The function below works in Spark with spaCy 2.1.0 but crashes with spaCy 2.1.1. Rolling back to 2.1.0 gets rid of the error.
import pandas as pd
from multiprocessing import cpu_count  # needed for the cpu_count() call below

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BinaryType
import spacy

nlp = None

def get_spacy(spacy_model='en', spacy_disable=('parser',)):
    """
    Args:
        spacy_model (str, optional, defaults to 'en'): Name of the spaCy model.
            This argument is used only if the spaCy model
            hasn't yet been initialised globally.
        spacy_disable (iterable of str, optional, defaults to ('parser',)):
            Components of the spaCy pipeline that should be disabled.
            When unneeded components are disabled, spaCy can process documents faster.
    Returns:
        Language: spaCy Language object
    """
    global nlp
    if nlp is None:
        nlp = spacy.load(spacy_model, disable=spacy_disable)
    return nlp

@pandas_udf(returnType=BinaryType())
def get_docs_bytes(series: pd.Series) -> pd.Series:
    nlp = get_spacy()
    preprocessed = series.apply(preprocess_text)  # this just does some string preprocessing
    docs = nlp.pipe(preprocessed, n_threads=cpu_count(), batch_size=10000)
    docs_bytes = pd.Series([doc.to_bytes() for doc in docs])  # this is where the error happens
    return docs_bytes
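For context, here is a minimal sketch of how the UDF would be applied (the DataFrame and column name here are hypothetical, not from the original job):

df = spark.createDataFrame([('hello world',), ('',)], ['text'])
df = df.withColumn('doc_bytes', get_docs_bytes('text'))
df.collect()  # the UDF, and hence nlp.pipe(), only runs when the job executes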
Here is the error:
"/mnt/yarn/usercache/hadoop/appcache/application_1553100215590_0001/container_1553100215590_0001_01_000001/nlp_insights.zip/nlp_insights/core.py", line 116, in get_docs_bytes
File "/mnt/yarn/usercache/hadoop/appcache/application_1553100215590_0001/container_1553100215590_0001_01_000001/nlp_insights.zip/nlp_insights/core.py", line 116, in <listcomp>
File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 711, in pipe
for doc in docs:
File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 898, in _pipe
for doc in docs:
File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 898, in _pipe
for doc in docs:
File "nn_parser.pyx", line 221, in pipe
File "/usr/local/lib64/python3.6/site-packages/spacy/util.py", line 457, in minibatch
batch = list(itertools.islice(items, int(batch_size)))
File "pipes.pyx", line 380, in pipe
File "pipes.pyx", line 392, in spacy.pipeline.pipes.Tagger.predict
File "/usr/local/lib64/python3.6/site-packages/thinc/neural/_classes/model.py", line 165, in __call__
return self.predict(x)
File "/usr/local/lib64/python3.6/site-packages/thinc/neural/_classes/feed_forward.py", line 40, in predict
X = layer(X)
File "/usr/local/lib64/python3.6/site-packages/thinc/neural/_classes/model.py", line 165, in __call__
return self.predict(x)
File "/usr/local/lib64/python3.6/site-packages/thinc/api.py", line 280, in predict
return layer.ops.unflatten(X, lengths, pad=pad)
File "ops.pyx", line 138, in thinc.neural.ops.Ops.unflatten
AssertionError
Please let me know if you need more information.
Thinc's version (7.0.4) is the same for both spaCy 2.1.0 and 2.1.1.
Could this bug fix have caused it?
@mwakaba2 I think you're right, thanks for the help. If you have time to make a PR that would be great.
@honnibal thanks, I'll give it a try. 👍
@mwakaba2 I'm digging a bit deeper into this and I'm no longer sure that's the problem.
@notnami I think the doc.to_bytes() is a red herring here, because what's actually failing is the prediction in nlp.pipe(). It's a generator, so it only gets called as the data streams through the loop in your code.
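To illustrate with a minimal (hypothetical) example:

# nlp.pipe() is lazy: it returns a generator, so no prediction runs here.
docs = nlp.pipe(['some text', ''])
# Prediction only happens as the generator is consumed, so a failure inside
# the tagger surfaces at this loop rather than at the pipe() call itself.
for doc in docs:
    data = doc.to_bytes()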
Obviously a naked assert isn't great. Here's the function and the error we're hitting:
def unflatten(self, X, lengths, pad=0):
    unflat = []
    pad = int(pad)
    for length in lengths:
        length = int(length)
        if pad >= 1 and length != 0:
            X = X[pad:]
        unflat.append(X[:length])
        X = X[length:]
    if pad >= 1 and length != 0:
        X = X[pad:]
    assert len(X) == 0
    assert len(unflat) == len(lengths)
    return unflat
The function performs a common data transformation for Thinc: it converts from a flattened representation, where we concatenate the arrays together and we keep the lengths separately, to a normal list of arrays.
We're failing at the assert len(X) == 0 check, which makes sure we've actually consumed all the data from X. This suggests we're somehow ending up with a lengths array that's wrong for the data array. It's not obvious to me how that could happen. Perhaps an empty string? I thought we tested for that, but maybe not in the tagger.
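As a minimal NumPy sketch of that flattened representation (ignoring padding; the array shapes are made up for illustration):

import numpy as np

# Several arrays are concatenated into one flat array, and the per-item
# lengths are kept separately.
arrays = [np.ones((2, 4)), np.ones((0, 4)), np.ones((3, 4))]
lengths = [a.shape[0] for a in arrays]  # [2, 0, 3]
flat = np.concatenate(arrays)           # shape (5, 4)

# unflatten() slices flat back into pieces of those lengths; the asserts
# check that the lengths consume exactly all rows of flat.
offsets = [sum(lengths[:i]) for i in range(len(lengths))]
pieces = [flat[o:o + n] for o, n in zip(offsets, lengths)]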
It would be very helpful to know if it still fails if you just call nlp(text) instead of piping, and if so, which doc it fails on.
Edit1: Actually I bet it won't fail on nlp.__call__(). It will need a batch of examples for the lengths to be misaligned. If there's only one doc in the batch, the unflattening is trivial.
Edit2: I'm having trouble reproducing this. I've tried empty strings, and it all seems to work fine. Could you try to set a low batch size so that you can isolate the failing data?
Edit3: Your batch size is super high. Maybe something silently fails due to that? Btw, the n_threads argument is deprecated; we only single-thread now. So you can set it lower and use multiprocessing instead.
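One hypothetical way to bisect the failing data (texts stands in for your preprocessed strings):

# Process in small batches so a failing batch can be narrowed down.
batch_size = 2
for i in range(0, len(texts), batch_size):
    batch = texts[i:i + batch_size]
    try:
        list(nlp.pipe(batch, batch_size=batch_size))
    except AssertionError:
        print('failing batch:', batch)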
Btw, you might find the code in https://github.com/explosion/spaCy/blob/master/spacy/tokens/_serialize.py useful. We haven't finalised this yet, because backwards incompatibilities in serialization really suck once you've parsed a lot of text. But I've been using it a bit and it works quite well.
Here is an easy way to recreate this error:
import spacy

nlp = spacy.load('en_core_web_sm')
list(nlp.pipe(['hi', '']))
It might in fact be an empty string because when we drop all empty strings, the error no longer appears.
Anton
On Mar 23, 2019, 14:46 -0500, jeffoneill wrote:
Here is an easy way to recreate this error:
nlp = spacy.load('en_core_web_sm')
list(nlp.pipe(['hi', '']))
• spaCy version: 2.1.2
• Platform: Darwin-17.7.0-x86_64-i386-64bit
• Python version: 3.6.5
Hi @honnibal, yes you're right. I'm able to reproduce the error with jeffoneill's example.
Shall I close this ticket and create a new one for the thinc/empty-string issue?
@honnibal So here's an update. I think the issue is in the thinc function thinc.neural.ops.Ops.unflatten() in ops.pyx.
I looked at the X and lengths parameters when I pass the following to nlp.pipe:
sentences = ["hi", ""]
docs = nlp.pipe(sentences, batch_size=2)
docs_list = list(docs)
layer.ops.unflatten(X, lengths, pad=pad) is getting the following inputs:
X.shape --> (9, 96)
lengths --> [1, 0]
pad --> 4
The error happens on the following line:
print(len(X), len(unflat), len(lengths)) # This outputs 4, 2, 2
assert len(X) == 0
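Tracing the unflatten() quoted above with exactly these inputs shows where the leftover rows come from:

# X: 9 rows, lengths = [1, 0], pad = 4
# length = 1: strip 4 pad rows (9 -> 5 left), take 1 row (5 -> 4 left)
# length = 0: `length != 0` is False, so nothing is stripped or taken
# after the loop: the trailing `if pad >= 1 and length != 0:` sees the last
#   length (0), so the final 4 pad rows are never consumed
# => len(X) == 4 and `assert len(X) == 0` fails, matching the 4, 2, 2 print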
I started working on a test in the thinc repository. Hopefully I can get a PR ready today.
Note
I tried the following sentences:
sentences = ["hi", ""]  # error
sentences_2 = ["", "hi"]  # no error
sentences_3 = ["hi", "", "yo"]  # error
This error doesn't show up when the first sentence is an empty string.
Should now be fixed, with v7.0.5 of Thinc.
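For anyone else hitting this: upgrading Thinc should pick up the fix, e.g. pip install -U 'thinc>=7.0.5'.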
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.