I found a bug that I'm not sure resides in spaCy or cupy. It only appears on GPU instances, and only when you try to get vectors from a multi-word document containing non-standard words. Any help tracking it down, with a potential fix, would be fantastic.
import en_core_web_md
import spacy
spacy.prefer_gpu()
nlp = en_core_web_md.load()
doc = nlp("somerandomword")
doc.vector_norm
# works
doc = nlp("somerandomword.")
doc.vector_norm
# throws type error
doc = nlp("The somerandomword")
doc.vector_norm
# throws type error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-96-d19c7b8f8943> in <module>()
----> 1 doc.vector_norm
doc.pyx in spacy.tokens.doc.Doc.vector_norm.__get__()
doc.pyx in __iter__()
cupy/core/core.pyx in cupy.core.core.ndarray.__add__()
cupy/core/_kernel.pyx in cupy.core._kernel.ufunc.__call__()
cupy/core/_kernel.pyx in cupy.core._kernel._preprocess_args()
TypeError: Unsupported type <class 'numpy.ndarray'>
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 34C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
!pip install spacy==2.2.2
!pip install chainer
!pip install thinc_gpu_ops thinc
!python -m spacy download en_core_web_md
For context: I found this cupy PR, which was merged recently, and thought it would fix the issue, but building cupy from source didn't appear to help.
https://github.com/cupy/cupy/pull/2611
I think it has to do with this part specifically:
sum(t.vector for t in self)
https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L439
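A note on why that line blows up: the built-in sum starts from the integer 0 and left-folds __add__ across the token vectors, so once a cupy array has entered the running total, adding a plain numpy array goes through cupy's ufunc dispatch and raises the TypeError in the traceback above. A minimal sketch of that folding behaviour (numpy only, since reproducing the cross-type failure needs a GPU):

```python
import operator
from functools import reduce

import numpy as np

# sum() over arrays is a left fold starting at 0, exactly like
# reduce(operator.add, ...). In the GPU case, once the running total is
# a cupy array, the next `total + numpy_vector` step lands in
# cupy.core._kernel._preprocess_args and raises the TypeError.
vectors = [np.ones(3), np.full(3, 2.0)]
total = reduce(operator.add, vectors, 0)

assert np.array_equal(total, sum(vectors))
print(total)  # [3. 3. 3.]
```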
Found it. There is a case where a token will have an empty numpy.ndarray instead of a cupy.core.core.ndarray, which produces incompatible types when you try to sum them.
[type(t.vector) for t in nlp("somerandomword.")]
# [numpy.ndarray, cupy.core.core.ndarray]
Vocab's get_vector defaults to a numpy array, so if the word does not exist in the vectors table, the result stays a numpy zero array even on GPU. I think this is the bug. https://github.com/explosion/spaCy/blob/master/spacy/vocab.pyx#L364
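One possible direction for the fix, just a sketch rather than spaCy's actual patch (safe_sum_vectors and its fallback logic are names I made up): coerce every vector onto the same array module before summing, so a numpy zero vector for an out-of-vocabulary token can't poison the sum on GPU:

```python
import numpy as np

def safe_sum_vectors(vectors):
    # Hypothetical helper, not spaCy API: move every vector onto the
    # same array module before summing, so a stray numpy zero vector
    # (an OOV token on a GPU run) can't clash with cupy arrays.
    vectors = list(vectors)
    try:
        import cupy
        # cupy.get_array_module returns cupy if any argument lives on
        # the GPU, otherwise numpy.
        xp = cupy.get_array_module(*vectors)
    except ImportError:
        xp = np  # CPU-only install: everything is numpy already
    return sum(xp.asarray(v) for v in vectors)

print(safe_sum_vectors([np.zeros(3), np.ones(3)]))  # [1. 1. 1.]
```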
I'm attempting to create a PR for this fix but unsure how to test it, since it's cupy-related.
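On the testing question: one common pattern (a sketch using unittest's skip decorators, nothing spaCy-specific; the test name and model import are assumptions) is to gate the GPU assertions behind a cupy availability check, so the suite still passes on CPU-only machines:

```python
import importlib.util
import unittest

# True only when cupy is importable, i.e. on a GPU-capable install.
HAS_CUPY = importlib.util.find_spec("cupy") is not None

class TestDocVectorOnGPU(unittest.TestCase):
    # Skipped entirely on machines without cupy, so CPU-only CI stays green.
    @unittest.skipUnless(HAS_CUPY, "requires cupy / a GPU instance")
    def test_oov_token_vector_is_on_gpu(self):
        import cupy
        import spacy
        import en_core_web_md  # assumes the model package is installed

        spacy.prefer_gpu()
        nlp = en_core_web_md.load()
        doc = nlp("somerandomword.")
        # After the fix, every token vector should live on the GPU,
        # and vector_norm should no longer raise TypeError.
        for token in doc:
            self.assertIsInstance(token.vector, cupy.ndarray)
        self.assertGreaterEqual(doc.vector_norm, 0.0)
```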
Fixed by #4680.
Awesome!