The function below works in Spark with spaCy 2.1.0 but crashes with spaCy 2.1.1. Rolling back to 2.1.0 gets rid of the error.
import pandas as pd
from multiprocessing import cpu_count  # needed for the cpu_count() call below

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BinaryType
import spacy

nlp = None

def get_spacy(spacy_model='en', spacy_disable=('parser',)):
    """
    Args:
        spacy_model (str, optional, defaults to 'en'): Name of the spaCy model.
            This argument is used only if the spaCy model
            hasn't yet been initialised globally.
        spacy_disable (iterable of str, optional, defaults to ('parser',)):
            Components of the spaCy pipeline that should be disabled.
            When unneeded components are disabled, spaCy can process documents faster.
    Returns:
        Language: spaCy Language object
    """
    global nlp
    if nlp is None:
        nlp = spacy.load(spacy_model, disable=spacy_disable)
    return nlp

@pandas_udf(returnType=BinaryType())
def get_docs_bytes(series: pd.Series) -> pd.Series:
    nlp = get_spacy()
    preprocessed = series.apply(preprocess_text)  # this just does some string preprocessing
    docs = nlp.pipe(preprocessed, n_threads=cpu_count(), batch_size=10000)
    docs_bytes = pd.Series([doc.to_bytes() for doc in docs])  # this is where the error happens
    return docs_bytes
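For context, here is a minimal sketch of how the UDF would be applied (the DataFrame and column name here are hypothetical, not from the original job):

df = spark.createDataFrame([('hello world',), ('',)], ['text'])
df = df.withColumn('doc_bytes', get_docs_bytes('text'))
df.collect()  # the UDF, and hence nlp.pipe(), only runs when the job executes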
Here is the error:
"/mnt/yarn/usercache/hadoop/appcache/application_1553100215590_0001/container_1553100215590_0001_01_000001/nlp_insights.zip/nlp_insights/core.py", line 116, in get_docs_bytes
File "/mnt/yarn/usercache/hadoop/appcache/application_1553100215590_0001/container_1553100215590_0001_01_000001/nlp_insights.zip/nlp_insights/core.py", line 116, in <listcomp>
File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 711, in pipe
for doc in docs:
File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 898, in _pipe
for doc in docs:
File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 898, in _pipe
for doc in docs:
File "nn_parser.pyx", line 221, in pipe
File "/usr/local/lib64/python3.6/site-packages/spacy/util.py", line 457, in minibatch
batch = list(itertools.islice(items, int(batch_size)))
File "pipes.pyx", line 380, in pipe
File "pipes.pyx", line 392, in spacy.pipeline.pipes.Tagger.predict
File "/usr/local/lib64/python3.6/site-packages/thinc/neural/_classes/model.py", line 165, in __call__
return self.predict(x)
File "/usr/local/lib64/python3.6/site-packages/thinc/neural/_classes/feed_forward.py", line 40, in predict
X = layer(X)
File "/usr/local/lib64/python3.6/site-packages/thinc/neural/_classes/model.py", line 165, in __call__
return self.predict(x)
File "/usr/local/lib64/python3.6/site-packages/thinc/api.py", line 280, in predict
return layer.ops.unflatten(X, lengths, pad=pad)
File "ops.pyx", line 138, in thinc.neural.ops.Ops.unflatten
AssertionError
Please let me know if you need more information.
Thinc's version (7.0.4) is the same for both spaCy 2.1.0 and 2.1.1.
Could this bug fix have caused it?
@mwakaba2 I think you're right, thanks for the help. If you have time to make a PR that would be great.
@honnibal thanks, I'll give it a try. 👍
@mwakaba2 I'm digging a bit deeper into this and I'm no longer sure that's the problem.
@notnami I think the doc.to_bytes() is a red herring here, because what's actually failing is the prediction in nlp.pipe(). It's a generator, so it only gets called as the data streams through the loop in your code.
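To illustrate with a minimal (hypothetical) example:

# nlp.pipe() is lazy: it returns a generator, so no prediction runs here.
docs = nlp.pipe(['some text', ''])
# Prediction only happens as the generator is consumed, so a failure inside
# the tagger surfaces at this loop rather than at the pipe() call itself.
for doc in docs:
    data = doc.to_bytes()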
Obviously a naked assert isn't great. Here's the function and the error we're hitting:
def unflatten(self, X, lengths, pad=0):
    unflat = []
    pad = int(pad)
    for length in lengths:
        length = int(length)
        if pad >= 1 and length != 0:
            X = X[pad:]
        unflat.append(X[:length])
        X = X[length:]
    if pad >= 1 and length != 0:
        X = X[pad:]
    assert len(X) == 0
    assert len(unflat) == len(lengths)
    return unflat
The function performs a common data transformation for Thinc: it converts from a flattened representation, where we concatenate the arrays together and we keep the lengths separately, to a normal list of arrays.
We're failing at the assert len(X) == 0 check, which makes sure we've actually consumed all the data from X. This suggests we're somehow ending up with a lengths array that's wrong for the data array. It's not obvious to me how that could happen. Perhaps an empty string? I thought we tested for that, but maybe not in the tagger.
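As a minimal NumPy sketch of that flattened representation (ignoring padding; the array shapes are made up for illustration):

import numpy as np

# Several arrays are concatenated into one flat array, and the per-item
# lengths are kept separately.
arrays = [np.ones((2, 4)), np.ones((0, 4)), np.ones((3, 4))]
lengths = [a.shape[0] for a in arrays]  # [2, 0, 3]
flat = np.concatenate(arrays)           # shape (5, 4)

# unflatten() slices flat back into pieces of those lengths; the asserts
# check that the lengths consume exactly all rows of flat.
offsets = [sum(lengths[:i]) for i in range(len(lengths))]
pieces = [flat[o:o + n] for o, n in zip(offsets, lengths)]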
It would be very helpful to know if it still fails if you just call nlp(text) instead of piping, and if so, which doc it fails on.
Edit1: Actually I bet it won't fail on nlp.__call__(). It will need a batch of examples for the lengths to be misaligned. If there's only one doc in the batch, the unflattening is trivial.
Edit2: I'm having trouble reproducing this. I've tried empty strings, and it all seems to work fine. Could you try to set a low batch size so that you can isolate the failing data?
Edit3: Your batch size is super high. Maybe something silently fails due to that? Btw, the n_threads argument is deprecated; we only single-thread now. So you can set it lower and use multiprocessing instead.
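One hypothetical way to bisect the failing data (texts stands in for your preprocessed strings):

# Process in small batches so a failing batch can be narrowed down.
batch_size = 2
for i in range(0, len(texts), batch_size):
    batch = texts[i:i + batch_size]
    try:
        list(nlp.pipe(batch, batch_size=batch_size))
    except AssertionError:
        print('failing batch:', batch)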
Btw, you might find the code in https://github.com/explosion/spaCy/blob/master/spacy/tokens/_serialize.py useful. We haven't finalised this yet, because backwards incompatibilities in serialization really suck once you've parsed a lot of text. But I've been using it a bit and it works quite well.
Here is an easy way to recreate this error:
import spacy

nlp = spacy.load('en_core_web_sm')
list(nlp.pipe(['hi', '']))
It might in fact be an empty string because when we drop all empty strings, the error no longer appears.
Anton
On Mar 23, 2019, 14:46 -0500, jeffoneill wrote:
Here is an easy way to recreate this error:
nlp = spacy.load('en_core_web_sm')
list(nlp.pipe(['hi', '']))
• spaCy version: 2.1.2
• Platform: Darwin-17.7.0-x86_64-i386-64bit
• Python version: 3.6.5
Hi @honnibal, yes you're right. I'm able to reproduce the error with jeffoneill's example.
Shall I close this ticket and create a new one for the thinc/empty-string issue?
@honnibal So here's an update. I think the issue is in the thinc function thinc.neural.ops.Ops.unflatten() in ops.pyx.
I looked at the X and lengths parameters when I pass the following to nlp.pipe:
sentences = ["hi", ""]
docs = nlp.pipe(sentences, batch_size=2)
docs_list = list(docs)
layer.ops.unflatten(X, lengths, pad=pad) is getting the following inputs:
X.shape --> (9, 96)
lengths --> [1, 0]
pad --> 4
The error happens on the following line:
print(len(X), len(unflat), len(lengths)) # This outputs 4, 2, 2
assert len(X) == 0
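Tracing the unflatten() quoted above with exactly these inputs shows where the leftover rows come from:

# X: 9 rows, lengths = [1, 0], pad = 4
# length = 1: strip 4 pad rows (9 -> 5 left), take 1 row (5 -> 4 left)
# length = 0: `length != 0` is False, so nothing is stripped or taken
# after the loop: the trailing `if pad >= 1 and length != 0:` sees the last
#   length (0), so the final 4 pad rows are never consumed
# => len(X) == 4 and `assert len(X) == 0` fails, matching the 4, 2, 2 print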
I started working on a test in the thinc repository. Hopefully I can get a PR ready today.
Note
I tried the following sentences:
sentences = ["hi", ""]  # error
sentences_2 = ["", "hi"]  # no error
sentences_3 = ["hi", "", "yo"]  # error
This error doesn't show up when the first sentence is an empty string.
Should now be fixed, with v7.0.5 of Thinc.
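For anyone else hitting this: upgrading Thinc should pick up the fix, e.g. pip install -U 'thinc>=7.0.5'.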
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.