spaCy: vector_norm throws an error for unusual text in sentences with more than one word.

Created on 19 Nov 2019 · 8 comments · Source: explosion/spaCy

How to reproduce the behaviour

I found a bug that I'm not sure resides in spaCy or cupy. It only appears on GPU instances, and only when you get vectors from a multi-word document containing non-standard words. Any help tracking it down, with a potential fix, would be fantastic.

import en_core_web_md
import spacy
spacy.prefer_gpu()
nlp = en_core_web_md.load()

doc = nlp("somerandomword")
doc.vector_norm
# works

doc = nlp("somerandomword.")
doc.vector_norm
# throws type error

doc = nlp("The somerandomword")
doc.vector_norm
# throws type error
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-96-d19c7b8f8943> in <module>()
----> 1 doc.vector_norm
doc.pyx in spacy.tokens.doc.Doc.vector_norm.__get__()
doc.pyx in __iter__()
cupy/core/core.pyx in cupy.core.core.ndarray.__add__()
cupy/core/_kernel.pyx in cupy.core._kernel.ufunc.__call__()
cupy/core/_kernel.pyx in cupy.core._kernel._preprocess_args()
TypeError: Unsupported type <class 'numpy.ndarray'>

Your Environment

  • Operating System: Ubuntu 18.04.3
  • Python Version Used: 3.6.8
  • spaCy Version Used: 2.2.2
  • Environment Information: Running on Google Colab but also experienced it on other GPU instances.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
!pip install spacy==2.2.2
!pip install chainer
!pip install thinc_gpu_ops thinc
!python -m spacy download en_core_web_md 


All 8 comments

For context: I found this cupy PR, which was merged recently, and thought it would fix the issue, but building from source didn't appear to resolve it.
https://github.com/cupy/cupy/pull/2611

I think it has to do with this part specifically:
sum(t.vector for t in self)
https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L439

Found it. There is a case where a token will have an empty numpy.ndarray instead of a cupy.core.core.ndarray, which produces incompatible types when you try to sum them.

[type(t.vector) for t in nlp("somerandomword.")]
# [numpy.ndarray, cupy.core.core.ndarray]
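The failure mode can be simulated on CPU with a small stand-in class (illustrative only, not the real cupy API) that, like cupy's kernels, refuses to operate on a numpy array:

```python
import numpy

class FakeGpuArray:
    """Illustrative stand-in for cupy.ndarray (not the real cupy API)."""
    __array_ufunc__ = None  # make numpy defer to us instead of coercing

    def __init__(self, data):
        self.data = numpy.asarray(data, dtype="f")

    def __add__(self, other):
        if isinstance(other, FakeGpuArray):
            return FakeGpuArray(self.data + other.data)
        if isinstance(other, (int, float)):  # sum() starts from 0
            return FakeGpuArray(self.data + other)
        # mimic cupy's _preprocess_args rejecting a numpy array
        raise TypeError("Unsupported type %s" % type(other))

    __radd__ = __add__

token_vectors = [
    numpy.zeros(3, dtype="f"),    # OOV token: zero numpy array
    FakeGpuArray(numpy.ones(3)),  # in-vocab token: "GPU" array
]
try:
    sum(token_vectors)  # mirrors sum(t.vector for t in self) in doc.pyx
except TypeError as err:
    print(err)  # Unsupported type <class 'numpy.ndarray'>
```

The mixed-type sum fails exactly like the traceback above: numpy defers to the GPU array's `__radd__`, which rejects the numpy operand.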

Vocab's get_vector defaults to a numpy array, so if the word does not exist, the result stays a zero numpy array. I think this is the bug. https://github.com/explosion/spaCy/blob/master/spacy/vocab.pyx#L364
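A minimal sketch of the fix idea: allocate the OOV zero vector with the same array module as the stored vectors, so summing token vectors never mixes array types. (Here `vectors` is a plain dict and `xp` defaults to numpy purely for illustration; on GPU, `xp` would be cupy, and spaCy's Vocab stores vectors differently.)

```python
import numpy

def get_vector(vectors, key, width, xp=numpy):
    """Return the stored vector for `key`, or an OOV zero vector
    allocated with the same array module `xp` as the stored vectors."""
    if key in vectors:
        return vectors[key]
    return xp.zeros((width,), dtype="f")

vectors = {"the": numpy.ones((4,), dtype="f")}
doc_vec = sum(get_vector(vectors, w, 4) for w in ["the", "somerandomword"])
# every vector shares one array type, so the sum succeeds
```

With a single array backend for both known and OOV tokens, `sum(t.vector for t in self)` in doc.pyx no longer hits cupy's type check.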

Attempting to create a PR for this fix, but I'm unsure how to test it since it's cupy-related.

Fixed by #4680.

Awesome!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
