spaCy: GPU and multiprocessing lead to `TypeError: can not serialize 'cupy.core.core.ndarray' object`

Created on 27 Nov 2019 · 9 comments · Source: explosion/spaCy

I am using spaCy with GPU and multiprocessing like this:

import multiprocessing

import spacy

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    spacy.require_gpu()
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'textcat'])
    docs = ["This is a sentence."] * 1000  # placeholder; the real input texts are not shown
    for _ in nlp.pipe(docs, batch_size=16, n_process=2):
        pass

and I am getting this as an error :

Traceback (most recent call last):
  File "benchmark.py", line 117, in <module>
    for _ in nlp.pipe(docs, batch_size=16, n_process=2):
  File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 816, in pipe
    for doc in docs:
  File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 859, in _multiprocessing_pipe
    proc.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "vectors.pyx", line 129, in spacy.vectors.Vectors.__reduce__
  File "vectors.pyx", line 464, in spacy.vectors.Vectors.to_bytes
  File "/usr/local/lib/python3.6/dist-packages/spacy/util.py", line 625, in to_bytes
    serialized[key] = getter()
  File "vectors.pyx", line 458, in spacy.vectors.Vectors.to_bytes.serialize_weights
  File "/usr/local/lib/python3.6/dist-packages/srsly/_msgpack_api.py", line 16, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/usr/local/lib/python3.6/dist-packages/srsly/msgpack/__init__.py", line 40, in packb
    return Packer(**kwargs).pack(o)
  File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'cupy.core.core.ndarray' object
NOTE: with n_process=1 the code runs fine. Without the GPU the code also runs fine. It is the combination of GPU and multiprocessing that leads to this error.
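
For reference, a minimal sketch of the two configurations that do work according to the note above (the docs list is again a placeholder, not the reporter's actual input):

import multiprocessing

import spacy

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    docs = ["This is a sentence."] * 1000  # placeholder texts

    # Works: CPU pipeline with multiprocessing (no spacy.require_gpu() call).
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'textcat'])
    for _ in nlp.pipe(docs, batch_size=16, n_process=2):
        pass

    # Also works: GPU pipeline in a single process, i.e. call spacy.require_gpu()
    # before spacy.load() and pass n_process=1 to nlp.pipe().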

INFO:
Python 3.6.9
CuPy: cupy-cuda100==7.0.0rc1
spaCy: spacy==2.2.3

Labels: bug, more-info-needed, scaling

All 9 comments

It might be related (the call stacks look very similar), though in my example I am just using the 'standard' spaCy pipeline, not performing any custom operations like calling the to_bytes() function explicitly.

You can fix the immediate problem above for small models with something like this in vectors.pyx:

    def to_bytes(self, **kwargs):
        """Serialize the current state to a binary string.

        exclude (list): String names of serialization fields to exclude.
        RETURNS (bytes): The serialized form of the `Vectors` object.

        DOCS: https://spacy.io/api/vectors#to_bytes
        """
        def serialize_weights():
            if hasattr(self.data, "to_bytes"):
                return self.data.to_bytes()
            else:
                if isinstance(Model.ops, CupyOps):
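                    # The vectors live on the GPU here; copying them back to host
                    # memory gives msgpack a plain numpy array it can serialize.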
                    data = self.data.get()
                else:
                    data = self.data
                return srsly.msgpack_dumps(data)
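
The core of the patch is just moving the vector table off the device before handing it to msgpack. A minimal standalone sketch of the same idea, assuming cupy and srsly are installed (on srsly versions without the fix discussed below, serializing the GPU array directly raises the same TypeError as in the traceback above):

import srsly

try:
    import cupy
except ImportError:
    cupy = None  # no GPU build available; nothing to demonstrate

if cupy is not None:
    gpu_vectors = cupy.zeros((100, 300), dtype="float32")  # stand-in for Vectors.data on the GPU
    try:
        srsly.msgpack_dumps(gpu_vectors)
    except TypeError as err:
        print("GPU array could not be serialized:", err)
    # .get() copies the array to host memory as numpy, which msgpack handles.
    cpu_vectors = gpu_vectors.get()
    payload = srsly.msgpack_dumps(cpu_vectors)
    print("serialized", len(payload), "bytes")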

But in general, this is similar to the problems with underscore extensions and spawn: there is global vector state that isn't shared with the subprocesses. For models with vectors, this does not work with or without a GPU:

import spacy
import multiprocessing

if __name__ == '__main__':
    texts = ["This is a sentence."] * 10
    multiprocessing.set_start_method('spawn')
    nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'textcat'])
    for _ in nlp.pipe(texts, batch_size=16, n_process=2):
        pass

Error:

OSError: [E050] Can't find model 'en_core_web_md.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
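
Not stated in this thread, but on Linux one way to sidestep the spawn-related problem for CPU pipelines was simply to keep the default fork start method, so the workers inherit the parent's loaded state instead of re-importing it. Note that, per the comment below, vector data was not properly available to worker processes until #5081 in any case, so treat this as a sketch rather than a recommendation:

import multiprocessing

import spacy

if __name__ == '__main__':
    texts = ["This is a sentence."] * 10
    # fork (the Linux default) does not need to pickle the pipeline for the
    # workers, so the E050 error above is avoided.
    multiprocessing.set_start_method('fork')
    nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'textcat'])
    for _ in nlp.pipe(texts, batch_size=16, n_process=2):
        pass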

The mentioned PR doesn't seem to fix this issue.

I am still getting the serialization error when multiprocessing with a GPU.

Ah, I think this was closed by mistake. One issue is that vectors weren't available when multiprocessing (on CPU or GPU) and the other is that the GPU vectors couldn't be serialized while multiprocessing. #5081 fixed the first problem and a PR to srsly (https://github.com/explosion/srsly/pull/21) should fix the second problem. We need to provide a new release of srsly to fix this.

srsly v1.0.2 should fix this, I think - @fredybotas, can you double-check?

Sure, I can confirm that srsly v1.0.2 fixes the issue.

Thanks, happy to hear it!
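
For anyone landing here later: a quick version check before re-running the repro at the top of the thread (a sketch; per the comments above, the cupy serialization fix shipped in srsly v1.0.2):

import spacy
import srsly

print("spacy", spacy.__version__)
print("srsly", srsly.__version__)  # expect >= 1.0.2 for the cupy fix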

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
