the multiprocess dataset reader uses multiple processes to generate allennlp Instances and stick them on an output queue (which implicitly involves pickling them)
it turns out (have I ever mentioned how much I hate debugging multiprocessing code) that spacy Tokens are not pickleable, which causes the queue.put() operation to silently fail, which causes the multiprocess dataset reader to wait forever on an empty queue.
we didn't discover this when we implemented it because the test case used the SequenceTaggingDatasetReader, which assumes pre-tokenized text and so never generates spacy tokens.
I ended up with the following ugly workaround in my dataset reader:
tokenized = [Token(token.text) for token in self._tokenizer.tokenize(sentence)]
(that's the allennlp Token replacement there)
anyway, I don't know if there is a good solution other than not using spacy tokens in this case, but I'm opening an issue in case anyone else runs into this.
In [1]: import spacy
In [2]: nlp = spacy.load('en_core_web_sm')
In [3]: doc = nlp("can't pickle this")
In [4]: token = doc[0]
In [5]: import pickle
In [6]: pickle.dumps(token)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-81a5065b9c9e> in <module>()
----> 1 pickle.dumps(token)
~/anaconda3/envs/allennlp/lib/python3.6/site-packages/spacy/tokens/token.cpython-36m-darwin.so in spacy.tokens.token.Token.__reduce_cython__()
TypeError: no default __reduce__ due to non-trivial __cinit__
we could modify the SpacyWordSplitter to return our Token type (this would make reasoning about the code simpler), is there ever a good reason to hang onto the spacy type?
On a somewhat unrelated note: Is there any reason to not support multiprocess reader through experiment config? That would make it easier for people to use it and such issues can be known earlier. Something like ....
"dataset_reader": {
"type": "...",
...
"multiprocess": {
"num_workers": 5,
"epochs_per_read": 3
}
},
.... ?
I vaguely remember using dill in the past for something like this. I'm not sure if it actually solves the problem, though, especially as you say this happens implicitly.
As for just returning our tokens, the drawbacks are (1) the extra time and space for constructing new objects and copying stuff over, and (2) you lose information that someone might want access to at some point (say if they're implementing their own TokenIndexer). The benefits might outweigh those drawbacks, but those are the drawbacks I can think of.
And @HarshTrivedi, can you open a new issue for that? I agree with you that something like that would be nice, but it should have its own issue for tracking it.
Sure!
it should already be supported through experiment config -- you just have to wrap your dataset reader config like
"dataset_reader": {
"type": "multiprocess",
"base_reader" : { ... base reader config goes here ... },
"num_workers": 4
}
if that doesn't work then open a bug
@joelgrus Ahh great! Didn't realize that earlier! Thanks!
at the very least I think we should add a unspacify_tokens: bool = False parameter to the SpacyWordSplitter
Hi, @joelgrus
It seems like the solution now (allennlp Tokens) still can't not fix this bug when i want to use not only dependency label (dep_) but also dependency link of SpaCy Token, for example, head, children attributes of SpaCy Token, which is incompatible with allennlp Tokens.
@charosen, please open a new issue instead of commenting on an old one - it is much easier for us to track and get you the help you need.
Most helpful comment
at the very least I think we should add a
unspacify_tokens: bool = Falseparameter to theSpacyWordSplitter