spaCy: Issue with handling empty strings in spaCy 2.0.0a6

Created on 5 Aug 2017 · 3 Comments · Source: explosion/spaCy

I'm getting the following error when trying to parse empty strings in spaCy 2.0.0a6:

(env) ~ » python                                                                          [18:39:06]
Python 3.6.2 (v3.6.2:5fd33b5926, Jul 16 2017, 20:11:06)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import en_core_web_sm
>>> en_nlp = en_core_web_sm.load()
>>> doc = en_nlp('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/melanietosik/env/lib/python3.6/site-packages/spacy/language.py", line 274, in __call__
    doc = proc(doc)
  File "spacy/pipeline.pyx", line 256, in spacy.pipeline.NeuralTagger.__call__ (spacy/pipeline.cpp:12954)
  File "spacy/pipeline.pyx", line 268, in spacy.pipeline.NeuralTagger.predict (spacy/pipeline.cpp:13618)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 125, in predict
    y, _ = self.begin_update(X)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/api.py", line 235, in begin_update
    drop=drop)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/feed_forward.py", line 36, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/Users/melanietosik/env/lib/python3.6/site-packages/thinc/neural/_classes/maxout.py", line 69, in begin_update
    best__bo, which__bo = self.ops.maxout(output__boc)
  File "thinc/neural/ops.pyx", line 357, in thinc.neural.ops.NumpyOps.maxout (thinc/neural/ops.cpp:13789)
IndexError: Out of bounds on buffer access (axis 0)

I had previously (successfully) installed both spaCy and the English language model with pip:

$ pip install spacy-nightly==2.0.0a6
$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0-alpha/en_core_web_sm-2.0.0-alpha.tar.gz

For reference, here's the full list of installed modules in my current environment:

(env) ~ » pip freeze                                                                      [18:42:27]
certifi==2017.7.27.1
chainer==1.24.0
chardet==3.0.4
cymem==1.31.2
cytoolz==0.8.2
dill==0.2.7.1
en-core-web-sm==2.0.0a0
filelock==2.0.11
ftfy==4.4.3
html5lib==0.999999999
idna==2.5
msgpack-numpy==0.4.1
msgpack-python==0.4.8
murmurhash==0.28.0
nose==1.3.7
numpy==1.13.1
pathlib==1.0.1
plac==0.9.6
preshed==1.0.0
protobuf==3.3.0
regex==2017.4.5
requests==2.18.3
six==1.10.0
spacy-nightly==2.0.0a6
termcolor==1.1.0
thinc==6.8.0
toolz==0.8.2
tqdm==4.15.0
ujson==1.35
urllib3==1.22
wcwidth==0.1.7
webencodings==0.5.1
wrapt==1.10.10

I can easily check for empty strings before parsing, but you might still want to consider failing more gracefully, so just a heads-up.
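For anyone hitting the same error, the check mentioned above could look roughly like the sketch below. The helper name `parse_if_nonempty` and the choice to return `None` for empty input are assumptions made for illustration, not part of spaCy; `nlp` stands for any callable that maps a string to a parsed document, such as the `en_nlp` object loaded above.

```python
def parse_if_nonempty(nlp, text):
    """Run `nlp` on `text`, but skip the pipeline entirely for empty input.

    `nlp` is any callable taking a string (for example the object returned
    by en_core_web_sm.load()). Returning None for an empty string is a
    hypothetical convention chosen for this sketch.
    """
    if not text:
        # Empty string: never reaches the tagger, so no IndexError.
        return None
    return nlp(text)
```

With this in place, `parse_if_nonempty(en_nlp, '')` would return `None` without calling the pipeline, while non-empty strings are parsed as usual. Callers then need to handle the `None` case themselves.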

Environment information

  • spaCy version: 2.0.0a6
  • Platform: Darwin-16.7.0-x86_64-i386-64bit
  • Python version: 3.6.2
  • OS: macOS Sierra (10.12.6)
  • Environment: created with $ virtualenv -p python3 env

Labels: bug, tests

All 3 comments

I thought we had tests for the empty string -- failing here is definitely an error. Thanks for the report!

Just had a quick look and I think we're currently only testing empty strings in the tokenizer, not the model pipeline. So this should definitely be updated!
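A pipeline-level regression check for this could be sketched as follows. This is only an illustration, not the test that was actually added to spaCy: the function name is hypothetical, and it assumes the desired behaviour is that the full pipeline returns a document with zero tokens for an empty string.

```python
def check_pipeline_handles_empty_string(nlp):
    """Regression check: the full pipeline should accept '' without raising.

    `nlp` is any callable mapping a string to a document that supports
    len(), e.g. a loaded model such as en_core_web_sm.load().
    Assumption: an empty input should yield a document with no tokens.
    """
    doc = nlp("")          # previously raised IndexError in the tagger
    assert len(doc) == 0   # nothing to tokenize, so nothing to tag
    return doc
```

In a real test suite this would be called with a loaded model (for example from a pytest fixture), so that the tagger and any later pipeline components are exercised rather than just the tokenizer.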

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
