Allennlp: spacy Tokens are not pickleable, which breaks the multiprocess dataset reader

Created on 10 Oct 2018 · 9Comments · Source: allenai/allennlp

the multiprocess dataset reader uses multiple processes to generate allennlp Instances and stick them on an output queue (which implicitly involves pickling them)

it turns out (have I ever mentioned how much I hate debugging multiprocessing code) that spacy Tokens are not pickleable, which causes the queue.put() operation to silently fail, which causes the multiprocess dataset reader to wait forever on an empty queue.

we didn't discover this when we implemented it because the test case used the SequenceTaggingDatasetReader, which assumes pre-tokenized text and so never generates spacy tokens.

I ended up with the following ugly workaround in my dataset reader:

tokenized = [Token(token.text) for token in self._tokenizer.tokenize(sentence)]

(that's the allennlp Token replacement there)

anyway, I don't know if there is a good solution other than not using spacy tokens in this case, but I'm opening an issue in case anyone else runs into this.

In [1]: import spacy

In [2]: nlp = spacy.load('en_core_web_sm')

In [3]: doc = nlp("can't pickle this")

In [4]: token = doc[0]

In [5]: import pickle

In [6]: pickle.dumps(token)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-81a5065b9c9e> in <module>()
----> 1 pickle.dumps(token)

~/anaconda3/envs/allennlp/lib/python3.6/site-packages/spacy/tokens/token.cpython-36m-darwin.so in spacy.tokens.token.Token.__reduce_cython__()

TypeError: no default __reduce__ due to non-trivial __cinit__

cf https://github.com/explosion/spaCy/issues/606

Source

joelgrus

Most helpful comment

at the very least I think we should add a unspacify_tokens: bool = False parameter to the SpacyWordSplitter

joelgrus on 10 Oct 2018

👍6

All 9 comments

we could modify the SpacyWordSplitter to return our Token type (this would make reasoning about the code simpler), is there ever a good reason to hang onto the spacy type?

joelgrus on 10 Oct 2018

On a somewhat unrelated note: Is there any reason to not support multiprocess reader through experiment config? That would make it easier for people to use it and such issues can be known earlier. Something like ....

"dataset_reader": {
    "type": "...",
    ...
    "multiprocess": {
        "num_workers": 5,
        "epochs_per_read": 3
    }
},

.... ?

HarshTrivedi on 10 Oct 2018

I vaguely remember using dill in the past for something like this. I'm not sure if it actually solves the problem, though, especially as you say this happens implicitly.

As for just returning our tokens, the drawbacks are (1) the extra time and space for constructing new objects and copying stuff over, and (2) you lose information that someone might want access to at some point (say if they're implementing their own TokenIndexer). The benefits might outweigh those drawbacks, but those are the drawbacks I can think of.

And @HarshTrivedi, can you open a new issue for that? I agree with you that something like that would be nice, but it should have its own issue for tracking it.

matt-gardner on 10 Oct 2018

Sure!

HarshTrivedi on 10 Oct 2018

it should already be supported through experiment config -- you just have to wrap your dataset reader config like

"dataset_reader": {
    "type": "multiprocess",
    "base_reader" : { ... base reader config goes here ... },
    "num_workers": 4
}

if that doesn't work then open a bug

joelgrus on 10 Oct 2018

@joelgrus Ahh great! Didn't realize that earlier! Thanks!

HarshTrivedi on 10 Oct 2018

at the very least I think we should add a unspacify_tokens: bool = False parameter to the SpacyWordSplitter

joelgrus on 10 Oct 2018

👍6

Hi, @joelgrus

It seems like the solution now (allennlp Tokens) still can't not fix this bug when i want to use not only dependency label (dep_) but also dependency link of SpaCy Token, for example, head, children attributes of SpaCy Token, which is incompatible with allennlp Tokens.