Hi,
I want to fine-tune the gpt2 model with a very large corpus (~9GB of text data).
However, the tokenization in run_lm_finetuning.py takes forever (which is not surprising with a 9GB text file).
My question is: is there any way to speed up the tokenization, e.g. with multiprocessing, or do I have to break up my training file and train on a sample?
Best regards
Hi, with the current implementation of the run_lm_finetuning.py
file there is no way to speed up the tokenization. It is an example meant to showcase how to use the library and is therefore not fully optimized, especially with regard to the data pre-processing.
You could modify the script a bit to set up multiprocessing and tokenize the whole dataset at once. You could then re-use these features and fine-tune your model on them.
Perhaps something can be done with the DataLoader's num_workers and collate_fn.
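A rough, untested sketch of the multiprocessing idea, assuming the GPT-2 tokenizer from the library and a placeholder file name; splitting the corpus into lines is just one possible chunking:

import multiprocessing as mp

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def encode_chunk(text):
    # tokenize and numericalize one chunk of raw text
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

if __name__ == "__main__":
    with open("train.txt", encoding="utf-8") as f:
        # split the corpus into line-based chunks so each worker gets a piece
        chunks = f.read().splitlines()
    with mp.Pool(processes=8) as pool:
        encoded = pool.map(encode_chunk, chunks, chunksize=1000)

The encoded chunks could then be cached to disk and re-used for fine-tuning so the tokenization only has to run once.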
@EndruK I'm actually working on applying multiprocessing to parallelize the tokenization process of transformers workflows as well. I can share my fork with you as soon as I get this started.
Nice,
I'm also working on a multiprocessing approach.
Looking forward to sharing it when it's done.
@BramVanroy How are you thinking about using collate_fn? The bottleneck, from my understanding, is at the tokenization and numericalization step, which happens before the data is converted to a tensor, so the speedup would have to be implemented pre-DataLoader.
Well, collate_fn is basically a callback between loading the data and returning it. I admit I haven't looked into this in detail, but from my brief reading it should be possible to do some processing in there. Something like this (pseudo-code, untested):
def collate_fn(batch):
    # batch is a list of raw text strings coming from the dataset
    tokens = [tokenizer.tokenize(text) for text in batch]
    # convert_tokens_to_ids accepts a whole list of tokens at once
    ids = [tokenizer.convert_tokens_to_ids(seq) for seq in tokens]
    return ids
See this section for more information. A typical use-case for collate_fn, according to the documentation, is padding a sequence up to some max_len. Therefore I'd think that it's also useful for tokenisation and other things.
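To make this concrete: when num_workers > 0, the collate_fn runs inside the DataLoader worker processes, so the tokenisation above would effectively be parallelised across CPU cores. A minimal sketch, assuming a dataset (here called raw_text_dataset, a placeholder) that yields raw strings:

from torch.utils.data import DataLoader

# collate_fn is executed in the worker processes when num_workers > 0,
# so the per-batch tokenisation is spread over several cores
loader = DataLoader(raw_text_dataset, batch_size=32,
                    collate_fn=collate_fn, num_workers=4)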
Got it, yes, this makes sense.
Would love to see the multiprocessing fork as well.
Hi @enzoampil @BramVanroy, I need to speed up the tokenization process too. I'm not a PyTorch guy and I'm not sure about the things you mentioned. Could you please explain a little more? Thanks!
I haven't done anything like this since I didn't have a performance issue, but theoretically you can add a custom collate function to your DataLoader. A batch will then be passed to that collate_fn and its result will be returned. The following is an example, but it's untested.
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def tokenize(batch):
    # batch is a list of (sentence, label) pairs coming from the dataset
    sentences, labels = zip(*batch)
    # encode each sentence, pad to the longest in the batch, mask out the padding
    encoded = [torch.tensor(tokenizer.encode(s)) for s in sentences]
    input_ids = pad_sequence(encoded, batch_first=True)
    mask_ids = torch.tensor([[1] * len(e) + [0] * (input_ids.size(1) - len(e)) for e in encoded])
    return input_ids, mask_ids, torch.tensor(labels)

DataLoader(dataset, batch_size=64, collate_fn=tokenize, num_workers=4)
Of course, what exactly gets passed to the collate_fn depends on your dataset.
Rapids AI cuDF GPU data science library? https://github.com/rapidsai/cudf
Perhaps elaborate on how this is useful in this context?
GPU-accelerated word tokenization. Expand on this basic example:
https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801
High-speed data loading & processing of textual dataframes on the GPU with CUDA. Moving pandas DataFrames to the GPU takes a few lines of code, or the data can be loaded straight onto the GPU. The stand-alone string library cuStrings and its Python wrapper nvStrings are also available: https://github.com/rapidsai/custrings
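For reference, a minimal sketch of the word-count style whitespace tokenization from that article, assuming cuDF is installed and the corpus fits in GPU memory (the file name is a placeholder); note this is plain whitespace tokenization, not the BPE tokenization GPT-2 needs:

import cudf

# load the raw lines and move them into a GPU-backed string column
with open("corpus.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
texts = cudf.Series(lines)

# split every line on whitespace into one long series of tokens, then count them
tokens = texts.str.tokenize()
word_counts = tokens.value_counts()
print(word_counts.head(10))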
I should mention that I'm trying to fine-tune distilgpt2 on my 880MB dataset, and for that I use run_lm_finetuning.py. It takes a very long time to tokenize and I'd say it is stuck there. It's been 20 hours and I'm still waiting. I know something is wrong and it shouldn't take this long, because I tokenized a 470MB dataset before via gpt2-simple and it took less than 5 minutes.
I ran run_lm_finetuning.py with a truncated 1MB version of my dataset and it took ~1 minute. But when I tried a 50MB version, it already exceeded 30 minutes. That means something is causing tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)) to take far more time than a linear scaling would predict.
Do you perhaps have any strange data? Sentences that are particularly long or contain strange characters, stuff like that?
What would count as strange characters? I scanned for non-ASCII characters and found nothing; it's all ASCII, which I think makes it fairly normal :) (Btw, the dataset just consists of emails.)
Any other suggestions? Because this is really annoying.
Hm, no. No idea. You can try profiling and see where it goes wrong.
I dug into the transformers codebase and found the problem:
https://github.com/huggingface/transformers/blob/master/transformers/tokenization_utils.py#L644
That for loop takes almost forever. It seems to just split the text into tokens. How could we optimize it?
Okay, here are more details. This function takes most of the time:
https://github.com/huggingface/transformers/blob/155c782a2ccd103cf63ad48a2becd7c76a7d2115/transformers/tokenization_gpt2.py#L183
That means the BPE step takes a long time. Here is a quick benchmark on my 4th-gen i7 CPU:
0 0.002872943878173828
100 0.2857849597930908
200 0.46935296058654785
300 0.7295417785644531
400 0.8204867839813232
500 0.965552806854248
600 1.0516178607940674
700 1.1927227973937988
800 1.3081107139587402
900 1.354628086090088
1000 1.4476778507232666
The first column is the iteration number and the second one is the elapsed time in seconds. 1000 iterations take 1.44 seconds. Given that I have 2,068,444 tokens, it will take ~50 hours. Hasn't anyone tried to train on such a big (?) dataset?
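For anyone who wants to reproduce this kind of measurement, a minimal timing sketch (assuming transformers is installed; the model name and train.txt are placeholders) that times tokenization of increasingly large slices of the corpus:

import time

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
with open("train.txt", encoding="utf-8") as f:
    text = f.read()

# tokenize progressively larger slices to see how the cost scales with input size
for n_chars in (100_000, 1_000_000, 10_000_000):
    start = time.time()
    tokenizer.tokenize(text[:n_chars])
    print(n_chars, round(time.time() - start, 2))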
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Please check out our tokenizers repo: we rebuilt the tokenizers from scratch in Rust for performance and extensibility.
Feedback (and contributions) welcome 🤗
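For illustration, a minimal sketch of using the Rust-backed fast tokenizer through transformers (class and file names are assumptions; check what your installed version provides):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

with open("train.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()

# encode every line; the heavy lifting happens in the Rust backend
encodings = [tokenizer.encode(line) for line in lines]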
I used multiprocessing to tokenize my dataset, and after adding tokens to the vocab it took nearly 6 hours to tokenize ~2 million sentences, while without the added vocab it took only 2 minutes.
@DarshanPatel11 Can you share the code for how you did it?
What exactly do you need the code for?
For multiprocessing, here is the code:
https://www.ppaste.org/XbVqp6VzJ
Btw, you should now use the fast tokenizers only; they are insanely fast.
@DarshanPatel11 what do you mean by "adding tokens in vocab"?
By "adding tokens in vocab", I meant Adding my custom domain-specific words into the existing vocabulary.
@DarshanPatel11 Running into the same problem. It is odd that using the default tokenizer seems to be much faster than using the same tokenizer, but with an expanded vocabulary.
My training file is around 880 MB, but when I train a tokenizer (BPE) on it, the process halts and "Killed" is printed on the terminal. Any suggestions?
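"Killed" on the terminal usually means the process was stopped by the out-of-memory killer. In case the problem is a pure-Python training loop, a minimal sketch of training a byte-level BPE tokenizer with the standalone tokenizers library (file name, vocabulary size, and output directory are placeholders; whether it fits in memory still depends on your machine):

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
# train directly on the raw text file(s)
tokenizer.train(files=["train.txt"], vocab_size=50257, min_frequency=2,
                special_tokens=["<|endoftext|>"])
# writes vocab.json and merges.txt into the output directory
tokenizer.save_model("tokenizer_out")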
I had a similar experience with the XLM-R tokenizer:
I wanted to build an XLM-R Longformer following https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb, working with a training text file of around 1GB. The issue was that tokenization got stuck at some point, and even after several days there was no sign of progress. According to my tracking it got stuck in the split_on_token function, in the split here: tokenization_utils.py#L287, even though there should not be any of the special tokens in my text. In the end I processed the text line by line (as in the minimal example below), which did the trick for me.
Note: the conversion guide above requires version 3.0.2 of transformers, but the same thing seems to happen with the new version as well; see the minimal example for illustration: https://colab.research.google.com/drive/1gIfcQ4XcWCRrPfGCGF8rHR6UViZAgoIS?usp=sharing
At first it seemed to me that it was just incredibly slow, but I am still suspicious that something is off. Any explanation/comment would be appreciated! :)
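A rough sketch of the line-by-line workaround described above, assuming the XLM-R tokenizer from transformers and a placeholder file name:

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

input_ids = []
with open("train.txt", encoding="utf-8") as f:
    for line in f:
        # encode short pieces instead of handing the tokenizer one huge string
        input_ids.append(tokenizer.encode(line, add_special_tokens=False))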