Hi,
I want to fine-tune the gpt2 model with a very large corpus (~9GB of text data).
However, the tokenization in run_lm_finetuning.py takes forever (which is not surprising with a 9GB text file).
My question is: is there any way to speed up the tokenization, e.g. with multiprocessing, or do I have to break up my training file and train on a sample?
Best regards
Hi, with the current implementation of the run_lm_finetuning.py
file there is no way to speed up the tokenization. It is an example meant to showcase how to use the library and is therefore not fully optimized, especially with regard to the data pre-processing.
You could modify the script a bit to set up multiprocessing and tokenize the whole dataset at once. You could then re-use these features and fine-tune your model on them.
Perhaps something can be done with the DataLoader's num_workers and collate_fn.
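A rough, untested sketch of the multiprocessing idea, assuming the GPT-2 tokenizer from the library and a placeholder file name; splitting the corpus into lines is just one possible chunking:

import multiprocessing as mp

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def encode_chunk(text):
    # tokenize and numericalize one chunk of raw text
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

if __name__ == "__main__":
    with open("train.txt", encoding="utf-8") as f:
        # split the corpus into line-based chunks so each worker gets a piece
        chunks = f.read().splitlines()
    with mp.Pool(processes=8) as pool:
        encoded = pool.map(encode_chunk, chunks, chunksize=1000)

The encoded chunks could then be cached to disk and re-used for fine-tuning so the tokenization only has to run once.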
@EndruK I'm actually working on applying multiprocessing to parallelize the tokenization process of transformers workflows as well. I can share my fork with you as soon as I get this started.
Nice,
I'm also working on a multiprocessing approach.
Looking forward to sharing it when it's done.
@BramVanroy How are you thinking about using collate_fn? The bottleneck, from my understanding, is at the tokenization and numericalization step, which happens before the data is converted to a tensor, so the speedup would have to be implemented pre-DataLoader.
Well, collate_fn is basically a callback between loading the data and returning it. I admit I haven't looked into this in detail, but from my brief reading it should be possible to do some processing in there. Something like this (pseudo-code, untested):
def collate_fn(batch):
    # batch is a list of raw text strings coming from the dataset
    tokens = [tokenizer.tokenize(text) for text in batch]
    # convert_tokens_to_ids accepts a whole list of tokens at once
    ids = [tokenizer.convert_tokens_to_ids(seq) for seq in tokens]
    return ids
See this section for more information. A typical use-case for collate_fn, according to the documentation, is padding a sequence up to some max_len. Therefore I'd think that it's also useful for tokenisation and other things.
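To make this concrete: when num_workers > 0, the collate_fn runs inside the DataLoader worker processes, so the tokenisation above would effectively be parallelised across CPU cores. A minimal sketch, assuming a dataset (here called raw_text_dataset, a placeholder) that yields raw strings:

from torch.utils.data import DataLoader

# collate_fn is executed in the worker processes when num_workers > 0,
# so the per-batch tokenisation is spread over several cores
loader = DataLoader(raw_text_dataset, batch_size=32,
                    collate_fn=collate_fn, num_workers=4)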
Got it, yes, this makes sense.
Would love to see the multiprocessing fork as well.
Hi @enzoampil @BramVanroy, I need to speed up the tokenization process too. I'm not a PyTorch guy and I'm not sure about the things you mentioned. Could you please explain a little more? Thanks!
I haven't done anything like this since I didn't have a performance issue, but theoretically you can add a custom collate function to your DataLoader. A batch will then be passed to that collate_fn and its result will be returned. The following is an example, but it's untested.
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def tokenize(batch):
    # batch is a list of (sentence, label) pairs coming from the dataset
    sentences, labels = zip(*batch)
    # encode each sentence, pad to the longest in the batch, mask out the padding
    encoded = [torch.tensor(tokenizer.encode(s)) for s in sentences]
    input_ids = pad_sequence(encoded, batch_first=True)
    mask_ids = torch.tensor([[1] * len(e) + [0] * (input_ids.size(1) - len(e)) for e in encoded])
    return input_ids, mask_ids, torch.tensor(labels)

DataLoader(dataset, batch_size=64, collate_fn=tokenize, num_workers=4)
Of course, what exactly gets passed to the collate_fn depends on your dataset.
Rapids AI cuDF GPU data science library? https://github.com/rapidsai/cudf
Perhaps elaborate on how this is useful in this context?
GPU-accelerated word tokenization. Expand on this basic example:
https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801
High-speed data loading & processing of textual dataframes on the GPU with CUDA. Moving pandas DataFrames to the GPU takes a few lines of code, or the data can be loaded straight onto the GPU. The stand-alone string library cuStrings and its Python wrapper nvStrings are also available: https://github.com/rapidsai/custrings
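For reference, a minimal sketch of the word-count style whitespace tokenization from that article, assuming cuDF is installed and the corpus fits in GPU memory (the file name is a placeholder); note this is plain whitespace tokenization, not the BPE tokenization GPT-2 needs:

import cudf

# load the raw lines and move them into a GPU-backed string column
with open("corpus.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
texts = cudf.Series(lines)

# split every line on whitespace into one long series of tokens, then count them
tokens = texts.str.tokenize()
word_counts = tokens.value_counts()
print(word_counts.head(10))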
I should mention that I'm trying to fine-tune distilgpt2 on my 880MB dataset, and for that I use run_lm_finetuning.py. It takes a very long time to tokenize and I'd say it is stuck there. It's been 20 hours and I'm still waiting. I know something is wrong and it shouldn't take this long, because I tokenized a 470MB dataset before via gpt2-simple and it took less than 5 minutes.
I ran run_lm_finetuning.py with a truncated 1MB version of my dataset and it took ~1 minute. But when I tried a 50MB version, it already exceeded 30 minutes. That means something is causing tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)) to take far more time than a linear scaling would predict.
Do you perhaps have any strange data? Sentences that are particularly long or contain strange characters, stuff like that?
What would count as strange characters? I scanned for non-ASCII characters and found nothing; it's all ASCII, which I think makes it fairly normal :) (Btw, the dataset just consists of emails.)
Any other suggestions? Because this is really annoying.
Hm, no. No idea. You can try profiling and see where it goes wrong.
I dug into the transformers codebase and found the problem:
https://github.com/huggingface/transformers/blob/master/transformers/tokenization_utils.py#L644
That for loop takes almost forever. It seems to just split the text into tokens. How could we optimize it?
Okay, here are more details. This function takes most of the time:
https://github.com/huggingface/transformers/blob/155c782a2ccd103cf63ad48a2becd7c76a7d2115/transformers/tokenization_gpt2.py#L183
That means the BPE step takes a long time. Here is a quick benchmark on my 4th-gen i7 CPU:
0 0.002872943878173828
100 0.2857849597930908
200 0.46935296058654785
300 0.7295417785644531
400 0.8204867839813232
500 0.965552806854248
600 1.0516178607940674
700 1.1927227973937988
800 1.3081107139587402
900 1.354628086090088
1000 1.4476778507232666
The first column is the iteration number and the second one is the elapsed time in seconds. 1000 iterations take 1.44 seconds. Given that I have 2,068,444 tokens, it will take ~50 hours. Hasn't anyone tried to train on such a big (?) dataset?
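For anyone who wants to reproduce this kind of measurement, a minimal timing sketch (assuming transformers is installed; the model name and train.txt are placeholders) that times tokenization of increasingly large slices of the corpus:

import time

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
with open("train.txt", encoding="utf-8") as f:
    text = f.read()

# tokenize progressively larger slices to see how the cost scales with input size
for n_chars in (100_000, 1_000_000, 10_000_000):
    start = time.time()
    tokenizer.tokenize(text[:n_chars])
    print(n_chars, round(time.time() - start, 2))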
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Please check out our tokenizers repo: we rebuilt the tokenizers from scratch in Rust for performance and extensibility.
Feedback (and contributions) welcome 🤗
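For illustration, a minimal sketch of using the Rust-backed fast tokenizer through transformers (class and file names are assumptions; check what your installed version provides):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

with open("train.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()

# encode every line; the heavy lifting happens in the Rust backend
encodings = [tokenizer.encode(line) for line in lines]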
I used multiprocessing to tokenize my dataset, and after adding tokens to the vocab it took nearly 6 hours to tokenize ~2 million sentences, while without the added vocab it took only 2 minutes.
@DarshanPatel11 Can you share the code for how you did it?
What exactly do you need the code for?
For multiprocessing, here is the code:
https://www.ppaste.org/XbVqp6VzJ
Btw, you should now use the fast tokenizers only; they are insanely fast.
@DarshanPatel11 what do you mean by "adding tokens in vocab"?
By "adding tokens in vocab", I meant Adding my custom domain-specific words into the existing vocabulary.
@DarshanPatel11 Running into the same problem. It is odd that using the default tokenizer seems to be much faster than using the same tokenizer, but with an expanded vocabulary.
My training file is around 880 MB, but when I train a tokenizer (BPE) on it, the process halts and "Killed" is printed on the terminal. Any suggestions?
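"Killed" on the terminal usually means the process was stopped by the out-of-memory killer. In case the problem is a pure-Python training loop, a minimal sketch of training a byte-level BPE tokenizer with the standalone tokenizers library (file name, vocabulary size, and output directory are placeholders; whether it fits in memory still depends on your machine):

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
# train directly on the raw text file(s)
tokenizer.train(files=["train.txt"], vocab_size=50257, min_frequency=2,
                special_tokens=["<|endoftext|>"])
# writes vocab.json and merges.txt into the output directory
tokenizer.save_model("tokenizer_out")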
I had a similar experience with the XLM-R tokenizer:
I wanted to build an XLM-R Longformer following https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb, working with a training text file of around 1GB. The issue was that tokenization got stuck at some point, and even after several days there was no sign of progress. According to my tracking it got stuck in the split_on_token function, in the split here: tokenization_utils.py#L287, even though there should not be any of the special tokens in my text. In the end I processed the text line by line (as in the minimal example below), which did the trick for me.
Note: the conversion guide above requires version 3.0.2 of transformers, but the same thing seems to happen with the new version as well; see the minimal example for illustration: https://colab.research.google.com/drive/1gIfcQ4XcWCRrPfGCGF8rHR6UViZAgoIS?usp=sharing
At first it seemed to me that it was just incredibly slow, but I am still suspicious that something is off. Any explanation/comment would be appreciated! :)
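A rough sketch of the line-by-line workaround described above, assuming the XLM-R tokenizer from transformers and a placeholder file name:

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

input_ids = []
with open("train.txt", encoding="utf-8") as f:
    for line in f:
        # encode short pieces instead of handing the tokenizer one huge string
        input_ids.append(tokenizer.encode(line, add_special_tokens=False))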