I am using spaCy with GPU and multiprocessing like this (docs is a list of input texts):
import multiprocessing
import spacy

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    spacy.require_gpu()
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'textcat'])
    docs = ["This is a sentence."] * 100  # placeholder; docs is a list of input texts
    for _ in nlp.pipe(docs, batch_size=16, n_process=2):
        pass
and I am getting this error:
Traceback (most recent call last):
File "benchmark.py", line 117, in <module>
for _ in nlp.pipe(docs, batch_size=16, n_process=2):
File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 816, in pipe
for doc in docs:
File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 859, in _multiprocessing_pipe
proc.start()
File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "vectors.pyx", line 129, in spacy.vectors.Vectors.__reduce__
File "vectors.pyx", line 464, in spacy.vectors.Vectors.to_bytes
File "/usr/local/lib/python3.6/dist-packages/spacy/util.py", line 625, in to_bytes
serialized[key] = getter()
File "vectors.pyx", line 458, in spacy.vectors.Vectors.to_bytes.serialize_weights
File "/usr/local/lib/python3.6/dist-packages/srsly/_msgpack_api.py", line 16, in msgpack_dumps
return msgpack.dumps(data, use_bin_type=True)
File "/usr/local/lib/python3.6/dist-packages/srsly/msgpack/__init__.py", line 40, in packb
return Packer(**kwargs).pack(o)
File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
File "_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'cupy.core.core.ndarray' object
NOTE: if I use n_process=1, the code runs fine.
If I am not using the GPU, the code also runs fine.
It is the combination of GPU and multiprocessing that leads to this error.
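For reference, either of the following runs without the error on the same setup (a minimal sketch, reusing the placeholder docs from above):

# GPU, but single-process: works
import spacy

spacy.require_gpu()
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'textcat'])
docs = ["This is a sentence."] * 100  # placeholder input
for _ in nlp.pipe(docs, batch_size=16, n_process=1):
    pass

# CPU with n_process=2 also works, as long as spacy.require_gpu()
# is never called in that interpreter.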
INFO:
Python: 3.6.9
CuPy: cupy-cuda100==7.0.0rc1
spaCy: spacy==2.2.3
This is possibly related to https://github.com/explosion/spacy-transformers/issues/99?
It might be related (the call stack looks very similar), though in my example I am just using the 'standard' spaCy pipeline, not any custom operations like calling the to_bytes() function explicitly.
You can fix the immediate problem above for small models with something like this in vectors.pyx:
def to_bytes(self, **kwargs):
    """Serialize the current state to a binary string.

    exclude (list): String names of serialization fields to exclude.
    RETURNS (bytes): The serialized form of the `Vectors` object.

    DOCS: https://spacy.io/api/vectors#to_bytes
    """
    def serialize_weights():
        if hasattr(self.data, "to_bytes"):
            return self.data.to_bytes()
        else:
            if isinstance(Model.ops, CupyOps):
                # copy the array from the GPU to host memory before packing
                data = self.data.get()
            else:
                data = self.data
            return srsly.msgpack_dumps(data)
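The underlying idea is just to copy the GPU array back to host memory before packing it, since srsly's msgpack at this point only handles numpy arrays, and cupy's ndarray.get() returns a numpy copy. A standalone sketch of that conversion outside vectors.pyx (the to_cpu helper is illustrative, not spaCy or srsly API):

import numpy
import srsly

try:
    import cupy
except ImportError:
    cupy = None

def to_cpu(array):
    # cupy.ndarray.get() copies the data from GPU memory into a numpy array,
    # which srsly.msgpack_dumps knows how to serialize.
    if cupy is not None and isinstance(array, cupy.ndarray):
        return array.get()
    return array

data = numpy.zeros((3, 300), dtype="float32")  # stand-in for Vectors.data
payload = srsly.msgpack_dumps(to_cpu(data))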
But in general, this is similar to the problems with underscore and spawn: there is global vector state that isn't shared with the subprocesses. For models with vectors, this does not work with or without GPU:
import spacy
import multiprocessing

if __name__ == '__main__':
    texts = ["This is a sentence."] * 10
    multiprocessing.set_start_method('spawn')
    nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'textcat'])
    for _ in nlp.pipe(texts, batch_size=16, n_process=2):
        pass
Error:
OSError: [E050] Can't find model 'en_core_web_md.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
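As a workaround until this is fixed, you can avoid sending the pipeline to the children at all and instead load the model once inside each worker process. This is a generic multiprocessing pattern rather than anything spaCy-specific; the init_worker and process names below are made up:

import multiprocessing
import spacy

nlp = None  # one pipeline per worker process

def init_worker():
    # Load the model in the child instead of pickling it from the parent,
    # so the vectors never have to be serialized.
    global nlp
    nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'textcat'])

def process(text):
    doc = nlp(text)
    return [token.text for token in doc]

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    texts = ["This is a sentence."] * 10
    with multiprocessing.Pool(2, initializer=init_worker) as pool:
        results = pool.map(process, texts)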
The mentioned PR doesn't seem to fix this issue.
I am still getting the serialization error when multiprocessing with GPU.
Ah, I think this was closed by mistake. One issue is that vectors weren't available when multiprocessing (on CPU or GPU) and the other is that the GPU vectors couldn't be serialized while multiprocessing. #5081 fixed the first problem and a PR to srsly (https://github.com/explosion/srsly/pull/21) should fix the second problem. We need to provide a new release of srsly to fix this.
srsly v1.0.2 should fix this I think - @fredybotas can you double-check?
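If it helps, a quick way to check which version is actually installed (assuming a standard install where srsly exposes __version__):

import srsly

print(srsly.__version__)  # should be 1.0.2 or later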
Sure, I can confirm that srsly v1.0.2 fixes the issue.
Thanks, happy to hear it!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.