Transformers: Parallel data preprocessing for distillation

Created on 29 Oct 2019 · 5 comments · Source: huggingface/transformers

🚀 Feature

Use the multiprocessing.Pool function to parallelize the text tokenization and uint16 conversion in transformers/examples/distillation/scripts/binarized_data.py.

Motivation

I tried to preprocess a 2.6 GB text file with the script, but the estimated time was 2.4 hours. After parallelizing it myself, the total time dropped to 10 minutes on my server.

Additional context

My code is something like this:

from multiprocessing import Pool

import numpy as np

def process_data(text):
    # tokenizer, bos and sep come from the surrounding script (binarized_data.py)
    return tokenizer.encode(f'{bos} {text.strip()} {sep}')

pool = Pool()                            # one worker per CPU by default
rslt = pool.map(process_data, data)      # tokenize every line in parallel
rslt_ = pool.map(np.uint16, rslt)        # cast each list of ids to a uint16 array

All 5 comments

What is your suggestion, then? Adding an mp_encode function? Perhaps this is something that should stay on the user's side.
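For illustration, a hypothetical mp_encode helper (the name, signature and worker-initializer pattern are all assumptions here, not an existing transformers API) could look roughly like this:

from multiprocessing import Pool

_worker_tokenizer = None

def _init_worker(tokenizer_name):
    # Each worker builds its own tokenizer once, so no large object has to be
    # pickled and shipped for every single task.
    global _worker_tokenizer
    from transformers import BertTokenizer
    _worker_tokenizer = BertTokenizer.from_pretrained(tokenizer_name)

def _encode_one(text):
    return _worker_tokenizer.encode(text.strip())

def mp_encode(tokenizer_name, texts, processes=None, chunksize=1000):
    # Hypothetical helper: encode a list of strings in parallel.
    with Pool(processes, initializer=_init_worker, initargs=(tokenizer_name,)) as pool:
        return pool.map(_encode_one, texts, chunksize=chunksize)

Usage would then be something like ids = mp_encode('bert-base-uncased', lines); the chunksize argument is there to cut down on inter-process communication overhead when the individual lines are short.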

Hello @jianwolf,
Yes indeed, I've never taken the time to do it (mainly because most of the pre-processing I do is one-shot: I launch it before leaving the office).
If you feel like opening a pull request with your suggestion, I would be happy to add it.

@BramVanroy do you see any drawbacks of having parallelized pre-processing by default?

I tried to integrate your few lines and got this error:

  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.process_data'

It seems like process_data should be defined outside of main; that shouldn't be too complicated.
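A minimal sketch of that restructuring, assuming the tokenizer and the bos/sep special tokens are set up at module level (the model name and input file below are placeholders, not taken from binarized_data.py):

from multiprocessing import Pool

import numpy as np
from transformers import BertTokenizer

# Module-level objects: forked workers inherit them, spawned workers rebuild them on import.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')   # placeholder model name
bos, sep = tokenizer.cls_token, tokenizer.sep_token

# Defined at module level rather than inside main(), so Pool can pickle it by reference.
def process_data(text):
    return tokenizer.encode(f'{bos} {text.strip()} {sep}')

def main():
    with open('dump.txt', encoding='utf-8') as f:   # placeholder input file
        data = f.readlines()
    with Pool() as pool:
        rslt = pool.map(process_data, data)
    return [np.uint16(ids) for ids in rslt]

if __name__ == '__main__':
    main()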

(Also, how many parallel processes/CPUs do you have on your server to get this order-of-magnitude reduction?)

Victor

@VictorSanh At first reading I thought the suggestion was to implement default multiprocessing encoding for tokenizers. That would be a large change that needs a lot of testing across multiple platforms (note the difference between the fork and spawn start methods), as well as a possible reproducibility issue when retrieving results from different worker processes, and thus in different batch orders. Of course these problems could be mitigated, but it seemed like a lot of work to suddenly overhaul all tokenizers in this way.

Now that it's clear that it's only for the distillation script, I'm sure there's no big issue here, even though I would like to see this implemented in a deterministic way, i.e. the order of the return values should always be identical.
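For what it's worth, determinism should come for free here: multiprocessing.Pool.map returns results in the order of the input iterable, no matter which worker finishes first; only imap_unordered drops that guarantee. A small self-contained check (unrelated to the distillation script itself):

from multiprocessing import Pool
import random
import time

def slow_square(x):
    # Sleep a random amount so workers finish out of order.
    time.sleep(random.random() / 100)
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        ordered = pool.map(slow_square, range(10))                      # always [0, 1, 4, ..., 81]
        unordered = list(pool.imap_unordered(slow_square, range(10)))   # arbitrary completion order
    print(ordered)
    print(unordered)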

Hi! Yeah I will create a pull request for this code! On my machine there are 80 CPU threads available!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
