spaCy: Issue with handling empty strings in spaCy 2.0.0a6

Created on 5 Aug 2017 · 3 comments · Source: explosion/spaCy

I'm getting the following error when trying to parse empty strings in spaCy 2.0.0a6:

(env) ~ » python
Python 3.6.2 (v3.6.2:5fd33b5926, Jul 16 2017, 20:11:06)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import en_core_web_sm
>>> en_nlp = en_core_web_sm.load()
>>> doc = en_nlp('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/melanietosik/env/lib/python3.6/site-packages/spacy/language.py", line 274, in __call__
    doc = proc(doc)
  File "spacy/pipeline.pyx", line 256, in spacy.pipeline.NeuralTagger.__call__ (spacy/pipeline.cpp:12954)
  File "spacy/pipeline.pyx", line 268, in spacy.pipeline.NeuralTagger.predict (spacy/pipeline.cpp:13618)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 125, in predict
    y, _ = self.begin_update(X)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/api.py", line 235, in begin_update
    drop=drop)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/feed_forward.py", line 36, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/maxout.py", line 69, in begin_update
    best__bo, which__bo = self.ops.maxout(output__boc)
  File "thinc/neural/ops.pyx", line 357, in thinc.neural.ops.NumpyOps.maxout (thinc/neural/ops.cpp:13789)
IndexError: Out of bounds on buffer access (axis 0)

I had previously (successfully) installed both spaCy and the English language model with pip:

$ pip install spacy-nightly==2.0.0a6
$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0-alpha/en_core_web_sm-2.0.0-alpha.tar.gz

For reference, here's the full list of installed modules in my current environment:

(env) ~ » pip freeze
certifi==2017.7.27.1
chainer==1.24.0
chardet==3.0.4
cymem==1.31.2
cytoolz==0.8.2
dill==0.2.7.1
en-core-web-sm==2.0.0a0
filelock==2.0.11
ftfy==4.4.3
html5lib==0.999999999
idna==2.5
msgpack-numpy==0.4.1
msgpack-python==0.4.8
murmurhash==0.28.0
nose==1.3.7
numpy==1.13.1
pathlib==1.0.1
plac==0.9.6
preshed==1.0.0
protobuf==3.3.0
regex==2017.4.5
requests==2.18.3
six==1.10.0
spacy-nightly==2.0.0a6
termcolor==1.1.0
thinc==6.8.0
toolz==0.8.2
tqdm==4.15.0
ujson==1.35
urllib3==1.22
wcwidth==0.1.7
webencodings==0.5.1
wrapt==1.10.10

I can easily check for empty strings before parsing, but you may still want to consider failing more gracefully, so just a heads-up.
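A minimal sketch of the caller-side check mentioned above. The helper name `parse_nonempty` is hypothetical (it is not part of spaCy), and `nlp` stands for any callable pipeline, such as the `en_nlp` object loaded in the session above. It only skips the pipeline for empty input; it does not fix the underlying error.

```python
def parse_nonempty(nlp, text):
    """Run `nlp` on `text`, or return None when `text` is empty.

    `parse_nonempty` is a hypothetical helper, not a spaCy API.
    `nlp` is assumed to be any callable that accepts a string.
    """
    if not text:
        # Skip the pipeline entirely so the tagger never sees empty input.
        return None
    return nlp(text)


# Usage with the model from the report (assumes it is installed):
#   import en_core_web_sm
#   en_nlp = en_core_web_sm.load()
#   doc = parse_nonempty(en_nlp, '')   # None instead of a traceback
```

Returning `None` is just one option; a caller could equally raise a clearer exception or substitute a placeholder, depending on what downstream code expects.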

Environment information

spaCy version: 2.0.0a6
Platform: Darwin-16.7.0-x86_64-i386-64bit
Python version: 3.6.2
OS: macOS Sierra (10.12.6)
Environment: created with $ virtualenv -p python3 env

Labels: bug, tests

melanietosik

All 3 comments

I thought we had tests for the empty string -- failing here is definitely an error. Thanks for the report!

honnibal on 5 Aug 2017

👍2

Just had a quick look and I think we're currently only testing empty strings in the tokenizer, not the model pipeline. So this should definitely be updated!
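A rough sketch of what a pipeline-level check for this case could look like. The function name `assert_handles_empty_string` is hypothetical, and the expectation that an empty string should yield a zero-length result is an assumption; the thread only says that failing is an error, not what the result should be.

```python
def assert_handles_empty_string(nlp):
    """Check that a pipeline accepts '' without raising.

    Hypothetical helper, not taken from spaCy's test suite.
    Assumes the desired result for '' is an object of length 0.
    """
    doc = nlp('')  # must not raise, unlike the traceback in the report
    assert len(doc) == 0
    return doc


# Possible use in a test, assuming the model from the report is installed:
#   import en_core_web_sm
#   assert_handles_empty_string(en_core_web_sm.load())
```

Running the check against the full loaded model, not just the tokenizer, is what would exercise the tagger step where the traceback above originates.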