I'm getting the following error when trying to parse empty strings in spaCy 2.0.0a6:
(env) ~ » python [18:39:06]
Python 3.6.2 (v3.6.2:5fd33b5926, Jul 16 2017, 20:11:06)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import en_core_web_sm
>>> en_nlp = en_core_web_sm.load()
>>> doc = en_nlp('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/melanietosik/env/lib/python3.6/site-packages/spacy/language.py", line 274, in __call__
    doc = proc(doc)
  File "spacy/pipeline.pyx", line 256, in spacy.pipeline.NeuralTagger.__call__ (spacy/pipeline.cpp:12954)
  File "spacy/pipeline.pyx", line 268, in spacy.pipeline.NeuralTagger.predict (spacy/pipeline.cpp:13618)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 125, in predict
    y, _ = self.begin_update(X)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/api.py", line 235, in begin_update
    drop=drop)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/feed_forward.py", line 36, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/maxout.py", line 69, in begin_update
    best__bo, which__bo = self.ops.maxout(output__boc)
  File "thinc/neural/ops.pyx", line 357, in thinc.neural.ops.NumpyOps.maxout (thinc/neural/ops.cpp:13789)
IndexError: Out of bounds on buffer access (axis 0)
I had previously (successfully) installed both spaCy and the English language model with pip:
$ pip install spacy-nightly==2.0.0a6
$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0-alpha/en_core_web_sm-2.0.0-alpha.tar.gz
For reference, here's the full list of installed modules in my current environment:
(env) ~ » pip freeze [18:42:27]
certifi==2017.7.27.1
chainer==1.24.0
chardet==3.0.4
cymem==1.31.2
cytoolz==0.8.2
dill==0.2.7.1
en-core-web-sm==2.0.0a0
filelock==2.0.11
ftfy==4.4.3
html5lib==0.999999999
idna==2.5
msgpack-numpy==0.4.1
msgpack-python==0.4.8
murmurhash==0.28.0
nose==1.3.7
numpy==1.13.1
pathlib==1.0.1
plac==0.9.6
preshed==1.0.0
protobuf==3.3.0
regex==2017.4.5
requests==2.18.3
six==1.10.0
spacy-nightly==2.0.0a6
termcolor==1.1.0
thinc==6.8.0
toolz==0.8.2
tqdm==4.15.0
ujson==1.35
urllib3==1.22
wcwidth==0.1.7
webencodings==0.5.1
wrapt==1.10.10
I can easily check for empty strings before parsing, but you might still want to consider failing more gracefully, so just a heads up.
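For anyone hitting the same error, the check mentioned above could look roughly like this. It is only a sketch of a caller-side workaround: `parse_or_none` is a hypothetical helper name, not part of spaCy, and returning `None` for empty input is an assumption about what the calling code wants.

```python
def parse_or_none(nlp, text):
    """Return nlp(text), or None when text is the empty string.

    `nlp` is any callable pipeline, e.g. the object returned by
    en_core_web_sm.load(). The helper name and the None return value
    are illustrative choices, not spaCy API.
    """
    # Skip the pipeline entirely for empty input, since that is the
    # case that raises the IndexError in the traceback above.
    if not text:
        return None
    return nlp(text)
```

With the model from the session above, this would be called as `doc = parse_or_none(en_nlp, '')`, and the caller then has to handle the `None` case.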
spaCy version: 2.0.0a6
Platform: Darwin-16.7.0-x86_64-i386-64bit
Python version: 3.6.2
Operating system: macOS Sierra (10.12.6)
Environment: $ virtualenv -p python3 env
I thought we had tests for the empty string -- failing here is definitely an error. Thanks for the report!
Just had a quick look and I think we're currently only testing empty strings in the tokenizer, not the model pipeline. So this should definitely be updated!
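A pipeline-level check along these lines might cover the gap. This is a rough sketch, not the project's actual test: `check_empty_string` is a hypothetical helper, and it assumes that the expected result for an empty string is an object of length zero.

```python
def check_empty_string(nlp):
    """Run the whole pipeline on '' and verify the result is empty.

    `nlp` is any callable pipeline. An exception raised by a pipeline
    component (such as the IndexError reported here) propagates to the
    caller, so a test using this helper fails loudly.
    """
    doc = nlp("")
    # Assumption: an empty input should yield zero tokens.
    assert len(doc) == 0
    return doc
```

With the model from the report, the call would be `check_empty_string(en_core_web_sm.load())`.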