Allennlp: Unable to see the gains by using `MultiProcessDatasetReader`

Created on 10 Oct 2018 · 18Comments · Source: allenai/allennlp

@joelgrus, I am unable to see the gains by using the MultiProcessDatasetReader. In fact it's taking 7x more time to dry-run in this particular case. I could be doing something wrong here, so please let me know if so.

I have created a minimally reproducible branch here. It runs ESIM textual entailment (one already present in allennlp) by reading SNLI train dataset

top 50 K pairs (single process) vs
top 50 K pairs (multi process where each file contains 10K splits)

Edit: Actually esim is only for borrowing config file, I am only doing dry-run here.

DatasetReader config is as follows:

    "dataset_reader": {
        "type": "multiprocess",
        "base_reader" : {
            "type": "snli",
            "token_indexers": {
                "tokens": {
                    "type": "single_id",
                    "lowercase_tokens": true
                }
            }
        },
        "num_workers": 10,
        "epochs_per_read": 3
    },

To reproduce, just run following on that branch:

time python allennlp/run.py dry-run experiment_single_process.jsonnet -s single_process_serialization

114.71s user 8.85s system 85% cpu 2:24.73 total

time python allennlp/run.py dry-run experiment_multi_process.jsonnet -s multi_process_serialization

734.79s user 119.35s system 134% cpu 10:33.51 total

Btw, I did change snli reader to use allennlp Token instead of spacy tokens as in #1887. You can check that in last commit on that branch.

Source

HarshTrivedi

All 18 comments

@HarshTrivedi how large was your original dataset? We have not tested this code thoroughly, but we're going through it soon so, if there are issues, we might sort them out soon.

schmmd on 12 Oct 2018

@schmmd My original dataset is about ~2M pairs of sentences. I wasn't able to see gain in that so I made this small case with only 50K entailment pairs (from snli) - with and without multiprocess-reader enabled. This is easily reproducible w/o any part of my code ... It might help debug the problem.

HarshTrivedi on 13 Oct 2018

Thanks @HarshTrivedi for putting together the repro case. That said, troubleshooting this issue isn't a high priority for our team right now--but if there are any issues we should run into them with some of our current work for the quarter.

schmmd on 15 Oct 2018

👍1

I'm hitting this now with the Calypso port. Investigating.

brendan-ai2 on 30 Oct 2018

👍1

One issue is that you'll also likely need the multiprocess iterator. Add something like the following to your config:

  "iterator": {
    "type": "multiprocess",
    "iterator": {
      "type": "basic",
      "batch_size": 32
    },
    "num_workers": 8,
  },

I'm still studying this code to fully understand the interaction between the reader and iterator. Should have more for you soon.

brendan-ai2 on 31 Oct 2018

@brendan-ai2 Thanks for looking into this! Can you confirm that you could reproduce the issue? Unfortunately, I am out of town till Thursday at least and won't be able try your suggestions till then.

Btw, I am not using MultiProcessDatasetReader now. I tried using MultiprocessIterator alone and with MultiProcessDatasetReader. If I remember correctly, One / some of these 3 things did save me time, but don't remember what exactly. However, it was yet not clear why using only MultiProcessDatasetReader should take up so much more time.

In my particular case, since dataset fits in memory, changing datasetreader to use multiprocessing map turned out to work well ( @matt-gardner suggested it in #1891 ).

HarshTrivedi on 31 Oct 2018

@HarshTrivedi, I have not attempted to reproduce with your exact code, but I have reproduced with my own which is fairly similar. My suspicion for dry-run in particular is simply that we don't take advantage of multiprocessing when building the vocab. So we pay overhead to put the instances in a synchronized queue, but we count the tokens in a single thread.

brendan-ai2 on 2 Nov 2018

👍1

Quick update, I've written a version of the MultiProcessDatasetReader and Vocabulary that truly parallelizes counting the tokens. (Token count dictionaries are accumulated for each shard in parallel and then merged afterwards.) Now I'm looking into the indexing. This could be a bit trickier. Overall, a fix for this is likely to somewhat involved.

brendan-ai2 on 3 Nov 2018

👍1

is there such a strong need to parallelize counting the tokens? it only happens once?

I thought the main idea of the multiprocess stuff was to parallelize the tensor generation (which happens over and over again), but I could be missing something.

joelgrus on 3 Nov 2018

Yeah, I guess I'd like to see absolute numbers on the performance difference before being sure we should invest heavily in fixing the vocab creation part. If it really is a large bottleneck, it might be worth it, I just don't know how bad the problem is.

matt-gardner on 3 Nov 2018

The token counting was proceeding at a speed that it would have taken about 5-6 hours to complete. Processing the shards independently brought that down to about 30 minutes. Yes, in the ideal case that cost is borne only once, but it's still a pretty unfortunate user experience and will be incurred whenever the dataset changes or a vocab parameter is tweaked.

I'm still trying to grok how the current setup is intending to parallelize the tensor generation. My concern is that the queue is going to be a bottleneck even when it's properly exposed as a queue instead of an iterable. I'll be able to verify this once my branch is cleaned up. Effectively I need to pass the underlying queue through everywhere and remove any code that combines datasets as iterables naively.

brendan-ai2 on 5 Nov 2018

@brendan-ai2 Any updates on this?

rangwani-harsh on 14 Dec 2018

Hi @rangwani-harsh, thanks for the ping. For now I've discarded my code that parallelizes counting the tokens. Getting an API for that that fits with the existing AllenNLP codebase is challenging. Instead I've pre-generated a vocabulary (which can be done with dry-run or make-vocab) and I'm pointing my model at it during training time with the directory_path option. See https://github.com/allenai/allennlp/blob/master/training_config/bidirectional_lm.jsonnet#L55 for an example.

I'd suggest taking the same approach for now. So, when pre-generating the vocabulary _don't_ use the MultiprocessDatasetReader. Just use your underlying reader. Then, during training, you can use MultiprocessDatasetReader along with the MultiprocessIterator. One thing to keep in mind when using the multiprocess tools is that your bottleneck may very well not be related to data loading. For instance, in the language modeling use case I was working on it was more important to pack the batches with a variable number of sentences to maximize utilization of GPU memory.

I hope that helps. Let me know if you have any followup questions.

brendan-ai2 on 14 Dec 2018

Hi,
The main issue here is that using dry-run or make-vocab doesn't make use of the laziness in the dataset reader. So I can't prepare my vocab with dry-run or make-vocab commands. As my model cannot fit into the memory.
I am trying to get through it by setting the num_epochs = 0 while training.
But I feel that laziness should be used anywhere we are using the dataset_reader and should not depend on what we are trying to do (i.e. train or evaluate or dry-run).

In the PR that introduced lazy dataset reader, there is an assumption that laziness is only required for training. Can you please share some insight on it @joelgrus. And is it not possible for the dry-run and make-vocab to work with laziness.

Thanks

rangwani-harsh on 15 Dec 2018

@brendan-ai2 I still can't get through as allennlp train breaks down when we specify 0 as num_of_epochs. This can be fixed if we just introduce this check if trainer._num_epochs :
before loading the best weights.

rangwani-harsh on 15 Dec 2018

With the merge yesterday, can this be closed? @DeNeutoy, @brendan-ai2?

matt-gardner on 17 Dec 2019

@matt-gardner, yes we should close this. https://github.com/allenai/allennlp/pull/3119 addressed one fundamental problem with the multiprocess data/iterator setup, but we're trying to move away from that anyway with @DeNeutoy's new iterator PR: https://github.com/allenai/allennlp/pull/3386. Of course, that PR is still in flight, but it's the likely path forward. We're not intending to invest more effort into MultiprocessDatasetReader.

brendan-ai2 on 17 Dec 2019

Superseding issue: https://github.com/allenai/allennlp/issues/3079

brendan-ai2 on 17 Dec 2019

Was this page helpful?

0 / 5 - 0 ratings