Spacy: Multiprocessing in pipe() not working with custom attributes

Created on 13 Jan 2020  ·  14 Comments  ·  Source: explosion/spaCy

I am using custom attributes and a custom pipeline for the first time, so it might very well be that there is a mistake on my part. However, the code works fine when not using the n_process argument.

The problem: calling nlp.pipe(text, n_process=2) throws an AttributeError complaining that I am assigning a value to an unregistered extension attribute. The error is not thrown without the n_process argument.

Trace:

Process Process-1:
Traceback (most recent call last):
  File "C:\Python\Python37\Lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
  File "C:\Python\Python37\Lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\language.py", line 1124, in _apply_pipes
    sender.send([doc.to_bytes() for doc in docs])
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\language.py", line 1124, in <listcomp>
    sender.send([doc.to_bytes() for doc in docs])
  File "nn_parser.pyx", line 248, in pipe
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\util.py", line 481, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\language.py", line 1106, in _pipe
    doc = proc(doc, **kwargs)
  File "C:\Users\bmvroy\.PyCharm2019.2\config\scratches\scratch_33.py", line 15, in __call__
    sent._.set('my_ext', sent_ext)
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\tokens\underscore.py", line 71, in set
    return self.__setattr__(name, value)
  File "C:\Users\bmvroy\.virtualenvs\spacy_conll-PtwHJ_vN\lib\site-packages\spacy\tokens\underscore.py", line 63, in __setattr__
    raise AttributeError(Errors.E047.format(name=name))
AttributeError: [E047] Can't assign a value to unregistered extension attribute 'my_ext'. Did you forget to call the `set_extension` method?

My guess would be that n_process creates new processes that recreate the nlp instance, but does not reinitialize its pipes - but that's just a guess.

How to reproduce the behaviour

import spacy
from spacy.tokens import Span, Doc

class CustomPipe:
    name = 'my_pipe'

    def __init__(self):
        Span.set_extension('my_ext', getter=self._get_my_ext)
        Doc.set_extension('my_ext', default=None)

    def __call__(self, doc):
        gathered_ext = []
        for sent in doc.sents:
            sent_ext = self._get_my_ext(sent)
            sent._.set('my_ext', sent_ext)
            gathered_ext.append(sent_ext)

        doc._.set('my_ext', '\n'.join(gathered_ext))

        return doc

    @staticmethod
    def _get_my_ext(span):
        return str(span.end)


if __name__ == '__main__':
    nlp = spacy.load('en_core_web_sm')
    custom_component = CustomPipe()
    nlp.add_pipe(custom_component, after='parser')

    text = ['I like bananas.', 'Do you like them?', 'No, I prefer wasabi.']
    # works without 'n_process' 
    for doc in nlp.pipe(text, n_process=2):
        print(doc)

Your Environment

  • spaCy version: 2.2.3
  • Platform: Windows-10-10.0.18362-SP0
  • Python version: 3.7.3
  • Models: en
Labels: bug, compat, feat / doc, scaling

All 14 comments

I think the not-particularly-satisfying solution is to check that the extensions are registered in __call__, see: https://github.com/explosion/spaCy/issues/4737#issuecomment-561053480

You can check whether they're already there (if not Token.has_extension...) so that if you're using Linux, you don't notice much difference.

Ah yes, I thought the issue was familiar but I couldn't find it. Apparently the linked issue was closed, but I think this issue should stay open. It would be nice to get a permanent fix for this.

Not very satisfying, indeed, but if it works, I am satisfied with a dirty workaround for now. I'm not sure whether checking if not Token.has_extension first makes a difference? How would the behaviour differ, then, between Linux and Windows? Thanks.

If they've been set before in __init__, you'll get errors unless you have force=True, so you want to check before setting them again. (I'm assuming checking with if for an existing extension is faster than setting with force=True, but I haven't actually tested it.)

Seems like using force=True or manually using if not Token.has_extension is actually practically the same in terms of if-checking, so I'll stick with force=True. Thanks, didn't know that force existed.

https://github.com/explosion/spaCy/blob/f2d224756b95e6351b4dbff3367a6f823156c010/spacy/tokens/token.pyx#L52-L53

I am really curious whether the underlying issue can actually be fixed, i.e. that also on Windows the Token.set_extension() in __init__ would work, and what is actually causing it.
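The workaround discussed above can be sketched roughly as follows. This is a minimal sketch, not an official fix: ensure_extensions is a hypothetical helper name, and the getter is moved to a module-level function purely for illustration. It relies only on the has_extension and set_extension class methods mentioned in this thread. Calling set_extension with force=True instead of checking first should behave much the same, as noted above.

```python
from spacy.tokens import Doc, Span


def _get_my_ext(span):
    # Same toy getter as in the original report.
    return str(span.end)


def ensure_extensions():
    """Hypothetical helper: register the attributes if this process lacks them."""
    if not Span.has_extension("my_ext"):
        Span.set_extension("my_ext", getter=_get_my_ext)
    if not Doc.has_extension("my_ext"):
        Doc.set_extension("my_ext", default=None)


class CustomPipe:
    name = "my_pipe"

    def __init__(self):
        # Registers the attributes in the parent process.
        ensure_extensions()

    def __call__(self, doc):
        # A spawned worker starts without the parent's registrations,
        # so check again before touching the attributes.
        ensure_extensions()
        gathered_ext = []
        for sent in doc.sents:
            sent_ext = _get_my_ext(sent)
            sent._.set("my_ext", sent_ext)
            gathered_ext.append(sent_ext)
        doc._.set("my_ext", "\n".join(gathered_ext))
        return doc
```

The check in __call__ runs once per document, which is why the thread calls the approach not particularly satisfying; it trades a small repeated check for working under every start method.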

Can replicate this on my system (though as a unit test, the code hangs), definitely looks like a bug on Windows.

Hm, that's odd. Just now, I copy-pasted the snippet from my OP into a new environment and it runs fine (when you leave out the n_process argument), and throws an error when including the argument. You are right though, in that when the error is thrown the interpreter doesn't exit(). My guess is that the error is thrown in a child process, but that the other process(es) or the main process then just blocks - something along those lines.

EDIT: Oops, I was replying to your previous entry about hanging and reproducibility.

Yep I assume it's blocking on a child process, but I could still reproduce the error so there's a chance of debugging ;-)

@svlandeg: I don't think you need to do any particular Windows debugging here. You can see the same behavior on Linux if you use spawn instead of fork.

I don't see an obvious way to fix this without completely changing how custom extensions are implemented. Maybe useful warnings are possible, though?

So do I understand it correctly that it's a Windows-specific bug in spaCy because Windows defaults to spawn, and Linux to fork?

I'd say that it looks like a Windows-specific bug because of the multiprocessing defaults, but it's really an issue with spawn and spaCy's global variables.
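The interaction between spawn and module-level globals can be illustrated without spaCy at all. The sketch below uses a plain dictionary, REGISTRY, as a hypothetical stand-in for a global extension table; none of these names come from spaCy. The child reports through its exit code whether it sees a registration made in the parent. Under fork the child should inherit the entry, while under spawn the module is re-imported and the table should start out empty.

```python
import multiprocessing as mp
import sys

# Hypothetical stand-in for a library's module-level extension table.
REGISTRY = {}


def register(name, default=None):
    """Record an extension name in this process's global table."""
    REGISTRY[name] = default


def _child(name):
    # Exit code 0: the name is registered in this process; 1: it is missing.
    sys.exit(0 if name in REGISTRY else 1)


def registered_in_child(start_method, name):
    """Start one child with the given start method and report what it sees."""
    ctx = mp.get_context(start_method)
    proc = ctx.Process(target=_child, args=(name,))
    proc.start()
    proc.join()
    return proc.exitcode == 0


if __name__ == "__main__":
    register("my_ext")  # happens only in the parent process
    for method in ("fork", "spawn"):
        if method in mp.get_all_start_methods():
            print(method, registered_in_child(method, "my_ext"))
```

On Windows only spawn is available, so only that line would be printed. To try the original spaCy snippet under spawn on Linux, calling multiprocessing.set_start_method("spawn") at the top of the main block should be enough.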

Ok I looked into this some more, and from what I've read, it's bad practice to rely on the current state to be transferred to the workers (and it'll only work on Linux, not macOS or Windows). Instead, the required state should be transferred, cf https://docs.python.org/3/library/multiprocessing.html#programming-guidelines:

On Unix using the fork start method, a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.
Apart from making the code (potentially) compatible with Windows and the other start methods this also ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.

Fixed in PR https://github.com/explosion/spaCy/pull/5006 by specifically loading the Underscore "state".
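The general idea of transferring the required state explicitly, rather than relying on inherited globals, can be sketched with a toy registry. This is an illustration of the pattern only and not the code from the PR: REGISTRY, get_state, load_state and the other names here are hypothetical. The parent takes a snapshot of its global table, passes it to the worker as an argument, and the worker restores it before doing anything else.

```python
import multiprocessing as mp
import sys

# Hypothetical stand-in for a library's module-level extension table.
REGISTRY = {}


def get_state():
    """Snapshot the global table so it can be pickled and sent to a worker."""
    return dict(REGISTRY)


def load_state(state):
    """Restore a snapshot inside the worker before any work is done."""
    REGISTRY.update(state)


def _worker(state, name):
    load_state(state)
    # Exit code 0: the name is registered after restoring; 1: it is missing.
    sys.exit(0 if name in REGISTRY else 1)


def registered_in_worker(name, start_method="spawn"):
    """Pass the parent's state as an argument and report what the worker sees."""
    ctx = mp.get_context(start_method)
    proc = ctx.Process(target=_worker, args=(get_state(), name))
    proc.start()
    proc.join()
    return proc.exitcode == 0


if __name__ == "__main__":
    REGISTRY["my_ext"] = None  # registered only in the parent process
    print(registered_in_worker("my_ext"))
```

Because the snapshot travels as a pickled argument, everything stored in it has to be picklable; in this toy version that is trivially true, but callables such as lambdas would need more care.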

Thanks again for the report and the helpful code snippet, @BramVanroy !

Ah, that's great! I feared this issue was going to be ignored because "meh, Windows". I am glad it did get more attention. Thanks @svlandeg! You can close this if you want (I didn't yet because the tests in the PR failed but feel free).

No that's fine, this issue will close automatically if/when the PR gets merged ;-)

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
