@joelgrus, I am unable to see the gains by using the MultiProcessDatasetReader. In fact it's taking 7x more time to dry-run in this particular case. I could be doing something wrong here, so please let me know if so.
I have created a minimally reproducible branch here. It runs ESIM textual entailment (one already present in allennlp) by reading SNLI train dataset
Edit: Actually esim is only for borrowing config file, I am only doing dry-run here.
DatasetReader config is as follows:
"dataset_reader": {
"type": "multiprocess",
"base_reader" : {
"type": "snli",
"token_indexers": {
"tokens": {
"type": "single_id",
"lowercase_tokens": true
}
}
},
"num_workers": 10,
"epochs_per_read": 3
},
To reproduce, just run following on that branch:
time python allennlp/run.py dry-run experiment_single_process.jsonnet -s single_process_serialization
114.71s user 8.85s system 85% cpu 2:24.73 total
vs
time python allennlp/run.py dry-run experiment_multi_process.jsonnet -s multi_process_serialization
734.79s user 119.35s system 134% cpu 10:33.51 total
Btw, I did change snli reader to use allennlp Token instead of spacy tokens as in #1887. You can check that in last commit on that branch.
@HarshTrivedi how large was your original dataset? We have not tested this code thoroughly, but we're going through it soon so, if there are issues, we might sort them out soon.
@schmmd My original dataset is about ~2M pairs of sentences. I wasn't able to see gain in that so I made this small case with only 50K entailment pairs (from snli) - with and without multiprocess-reader enabled. This is easily reproducible w/o any part of my code ... It might help debug the problem.
Thanks @HarshTrivedi for putting together the repro case. That said, troubleshooting this issue isn't a high priority for our team right now--but if there are any issues we should run into them with some of our current work for the quarter.
I'm hitting this now with the Calypso port. Investigating.
One issue is that you'll also likely need the multiprocess iterator. Add something like the following to your config:
"iterator": {
"type": "multiprocess",
"iterator": {
"type": "basic",
"batch_size": 32
},
"num_workers": 8,
},
I'm still studying this code to fully understand the interaction between the reader and iterator. Should have more for you soon.
@brendan-ai2 Thanks for looking into this! Can you confirm that you could reproduce the issue? Unfortunately, I am out of town till Thursday at least and won't be able try your suggestions till then.
Btw, I am not using MultiProcessDatasetReader now. I tried using MultiprocessIterator alone and with MultiProcessDatasetReader. If I remember correctly, One / some of these 3 things did save me time, but don't remember what exactly. However, it was yet not clear why using only MultiProcessDatasetReader should take up so much more time.
In my particular case, since dataset fits in memory, changing datasetreader to use multiprocessing map turned out to work well ( @matt-gardner suggested it in #1891 ).
@HarshTrivedi, I have not attempted to reproduce with your exact code, but I have reproduced with my own which is fairly similar. My suspicion for dry-run in particular is simply that we don't take advantage of multiprocessing when building the vocab. So we pay overhead to put the instances in a synchronized queue, but we count the tokens in a single thread.
Quick update, I've written a version of the MultiProcessDatasetReader and Vocabulary that truly parallelizes counting the tokens. (Token count dictionaries are accumulated for each shard in parallel and then merged afterwards.) Now I'm looking into the indexing. This could be a bit trickier. Overall, a fix for this is likely to somewhat involved.
is there such a strong need to parallelize counting the tokens? it only happens once?
I thought the main idea of the multiprocess stuff was to parallelize the tensor generation (which happens over and over again), but I could be missing something.
Yeah, I guess I'd like to see absolute numbers on the performance difference before being sure we should invest heavily in fixing the vocab creation part. If it really is a large bottleneck, it might be worth it, I just don't know how bad the problem is.
The token counting was proceeding at a speed that it would have taken about 5-6 hours to complete. Processing the shards independently brought that down to about 30 minutes. Yes, in the ideal case that cost is borne only once, but it's still a pretty unfortunate user experience and will be incurred whenever the dataset changes or a vocab parameter is tweaked.
I'm still trying to grok how the current setup is intending to parallelize the tensor generation. My concern is that the queue is going to be a bottleneck even when it's properly exposed as a queue instead of an iterable. I'll be able to verify this once my branch is cleaned up. Effectively I need to pass the underlying queue through everywhere and remove any code that combines datasets as iterables naively.
@brendan-ai2 Any updates on this?
Hi @rangwani-harsh, thanks for the ping. For now I've discarded my code that parallelizes counting the tokens. Getting an API for that that fits with the existing AllenNLP codebase is challenging. Instead I've pre-generated a vocabulary (which can be done with dry-run or make-vocab) and I'm pointing my model at it during training time with the directory_path option. See https://github.com/allenai/allennlp/blob/master/training_config/bidirectional_lm.jsonnet#L55 for an example.
I'd suggest taking the same approach for now. So, when pre-generating the vocabulary _don't_ use the MultiprocessDatasetReader. Just use your underlying reader. Then, during training, you can use MultiprocessDatasetReader along with the MultiprocessIterator. One thing to keep in mind when using the multiprocess tools is that your bottleneck may very well not be related to data loading. For instance, in the language modeling use case I was working on it was more important to pack the batches with a variable number of sentences to maximize utilization of GPU memory.
I hope that helps. Let me know if you have any followup questions.
Hi,
The main issue here is that using dry-run or make-vocab doesn't make use of the laziness in the dataset reader. So I can't prepare my vocab with dry-run or make-vocab commands. As my model cannot fit into the memory.
I am trying to get through it by setting the num_epochs = 0 while training.
But I feel that laziness should be used anywhere we are using the dataset_reader and should not depend on what we are trying to do (i.e. train or evaluate or dry-run).
In the PR that introduced lazy dataset reader, there is an assumption that laziness is only required for training. Can you please share some insight on it @joelgrus. And is it not possible for the dry-run and make-vocab to work with laziness.
Thanks
@brendan-ai2 I still can't get through as allennlp train breaks down when we specify 0 as num_of_epochs. This can be fixed if we just introduce this check if trainer._num_epochs :
before loading the best weights.
With the merge yesterday, can this be closed? @DeNeutoy, @brendan-ai2?
@matt-gardner, yes we should close this. https://github.com/allenai/allennlp/pull/3119 addressed one fundamental problem with the multiprocess data/iterator setup, but we're trying to move away from that anyway with @DeNeutoy's new iterator PR: https://github.com/allenai/allennlp/pull/3386. Of course, that PR is still in flight, but it's the likely path forward. We're not intending to invest more effort into MultiprocessDatasetReader.
Superseding issue: https://github.com/allenai/allennlp/issues/3079