spaCy: vector_norm throws an error for unusual text in sentences with more than one word.

Created on 19 Nov 2019 · 8 comments · Source: explosion/spaCy

How to reproduce the behaviour

I found a bug that I'm not sure resides in spaCy or cupy. It only appears on GPU instances, and only when you get vectors from a multi-word document containing non-standard words. Any help tracking it down, with a potential fix, would be fantastic.

import en_core_web_md
import spacy
spacy.prefer_gpu()
nlp = en_core_web_md.load()

doc = nlp("somerandomword")
doc.vector_norm
# works

doc = nlp("somerandomword.")
doc.vector_norm
# throws type error

doc = nlp("The somerandomword")
doc.vector_norm
# throws type error
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-96-d19c7b8f8943> in <module>()
----> 1 doc.vector_norm
doc.pyx in spacy.tokens.doc.Doc.vector_norm.__get__()
doc.pyx in __iter__()
cupy/core/core.pyx in cupy.core.core.ndarray.__add__()
cupy/core/_kernel.pyx in cupy.core._kernel.ufunc.__call__()
cupy/core/_kernel.pyx in cupy.core._kernel._preprocess_args()
TypeError: Unsupported type <class 'numpy.ndarray'>

Your Environment

  • Operating System: Ubuntu 18.04.3
  • Python Version Used: 3.6.8
  • spaCy Version Used: 2.2.2
  • Environment Information: Running on Google Colab but also experienced it on other GPU instances.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
!pip install spacy==2.2.2
!pip install chainer
!pip install thinc_gpu_ops thinc
!python -m spacy download en_core_web_md 


All 8 comments

For context: I found this cupy PR, which was merged recently, and thought it would fix the issue, but building from source didn't appear to resolve it.
https://github.com/cupy/cupy/pull/2611

I think it has to do with this part specifically:
sum(t.vector for t in self)
https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L439

Found it. There is a case where a token will have an empty numpy.ndarray instead of a cupy.core.core.ndarray, which produces incompatible types when you try to sum them.

[type(t.vector) for t in nlp("somerandomword.")]
# [numpy.ndarray, cupy.core.core.ndarray]
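The failure mode can be simulated on CPU with a small stand-in class (illustrative only, not the real cupy API) that, like cupy's kernels, refuses to operate on a numpy array:

```python
import numpy

class FakeGpuArray:
    """Illustrative stand-in for cupy.ndarray (not the real cupy API)."""
    __array_ufunc__ = None  # make numpy defer to us instead of coercing

    def __init__(self, data):
        self.data = numpy.asarray(data, dtype="f")

    def __add__(self, other):
        if isinstance(other, FakeGpuArray):
            return FakeGpuArray(self.data + other.data)
        if isinstance(other, (int, float)):  # sum() starts from 0
            return FakeGpuArray(self.data + other)
        # mimic cupy's _preprocess_args rejecting a numpy array
        raise TypeError("Unsupported type %s" % type(other))

    __radd__ = __add__

token_vectors = [
    numpy.zeros(3, dtype="f"),    # OOV token: zero numpy array
    FakeGpuArray(numpy.ones(3)),  # in-vocab token: "GPU" array
]
try:
    sum(token_vectors)  # mirrors sum(t.vector for t in self) in doc.pyx
except TypeError as err:
    print(err)  # Unsupported type <class 'numpy.ndarray'>
```

The mixed-type sum fails exactly like the traceback above: numpy defers to the GPU array's `__radd__`, which rejects the numpy operand.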

Vocab's get_vector defaults to a numpy array, so if the word does not exist, the result stays a zero numpy array. I think this is the bug. https://github.com/explosion/spaCy/blob/master/spacy/vocab.pyx#L364
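A minimal sketch of the fix idea: allocate the OOV zero vector with the same array module as the stored vectors, so summing token vectors never mixes array types. (Here `vectors` is a plain dict and `xp` defaults to numpy purely for illustration; on GPU, `xp` would be cupy, and spaCy's Vocab stores vectors differently.)

```python
import numpy

def get_vector(vectors, key, width, xp=numpy):
    """Return the stored vector for `key`, or an OOV zero vector
    allocated with the same array module `xp` as the stored vectors."""
    if key in vectors:
        return vectors[key]
    return xp.zeros((width,), dtype="f")

vectors = {"the": numpy.ones((4,), dtype="f")}
doc_vec = sum(get_vector(vectors, w, 4) for w in ["the", "somerandomword"])
# every vector shares one array type, so the sum succeeds
```

With a single array backend for both known and OOV tokens, `sum(t.vector for t in self)` in doc.pyx no longer hits cupy's type check.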

Attempting to create a PR for this fix, but I'm unsure how to test it since it's cupy-related.

Fixed by #4680.

Awesome!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
