Transformers: GPT-2 Training on non-English text

Created on 3 Oct 2019  ·  35 Comments  ·  Source: huggingface/transformers

❓ Questions & Help

I wish to train GPT-2 on different languages, like Portuguese, and maybe on some programming languages like C++ (and play with token predictions).

But I could not find any examples of how to take a dataset X (like C++ source files), create tokens from it, and train GPT-2 to predict new tokens based on that dataset.

Is this even possible? (if yes, how could one do this?)

Thanks!

wontfix

Most helpful comment

I'm training a Russian GPT-2 at the moment. I've tried to make the README useful.

All 35 comments

Hi! By "GPT-2 training" two different methods can be understood: training from scratch, and fine-tuning.

If you're looking at training GPT-2 on a different language such as Portuguese, then training from scratch seems necessary. You could use the language modeling fine-tuning example as a start, but please be aware that training such a language model from scratch takes a huge amount of compute and data, which would cost a lot. I can point you to this issue, which discusses training such a model for French.
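For concreteness, here is a rough sketch of what "training from scratch" looks like with the library: a fresh `GPT2Config` sized to a new-language tokenizer and a randomly initialized `GPT2LMHeadModel`. The tokenizer path is a placeholder, and a reasonably recent `transformers` version is assumed.

```python
# Sketch only: an untrained GPT-2 for a new language.
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# A tokenizer trained on the target-language corpus is assumed to exist
# at this (placeholder) path.
tokenizer = GPT2TokenizerFast.from_pretrained("./my-portuguese-tokenizer")

# A fresh config means randomly initialized weights, sized to the new vocabulary.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = GPT2LMHeadModel(config)  # no pre-trained weights at all
print(f"{model.num_parameters():,} parameters to train from scratch")
```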

If you're looking at training your model on programming languages that have a lot of overlapping vocabulary with English (say Python with a lot of documentation), maybe you could fine-tune the original GPT-2 on your dataset (still using the LM fine-tuning example), but I'm not sure about the results.
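As an illustration of the fine-tuning route, here is a loose sketch of what the LM fine-tuning example does (this is not the script itself; `train.txt` is a placeholder and a recent `transformers` version is assumed):

```python
# Loose sketch: fine-tune the pre-trained English GPT-2 on your own corpus
# (e.g. C++ or Python source files concatenated into train.txt).
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = open("train.txt", encoding="utf-8").read()
ids = tokenizer(text, return_tensors="pt").input_ids[0]  # one long token stream
block_size = 512
blocks = [ids[i:i + block_size] for i in range(0, len(ids) - block_size, block_size)]
loader = DataLoader(blocks, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for batch in loader:
    # For a causal LM, labels == input_ids; the model shifts them internally.
    loss = model(input_ids=batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```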

I'm training a Russian GPT-2 at the moment. I've tried to make the README useful.

If you're looking at training GPT-2 on a different language such as Portuguese, then training from scratch seems necessary.

It is definitely not necessary to start from scratch. I'd argue the opposite: it would be useful to start with the pre-trained GPT-2 even if you're replacing the whole vocabulary (English -> Portuguese).
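A minimal sketch of that idea (an illustration, not mgrankin's actual code; the Russian tokenizer path is a placeholder): keep the pre-trained transformer blocks and swap in a new-language tokenizer, resizing the embedding matrix to match.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")                      # English weights
new_tokenizer = GPT2TokenizerFast.from_pretrained("./ru-tokenizer")  # placeholder path

# Resize the (tied) input/output embeddings to the new vocabulary size.
# With a completely new vocabulary the old rows carry no meaning, so the
# embedding matrix is effectively retrained, while the transformer blocks
# keep their pre-trained weights.
model.resize_token_embeddings(len(new_tokenizer))
```

From there, the new embeddings are trained first and the rest of the network is unfrozen gradually (see the progressive-unfreezing discussion further down).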

Alright @mgrankin, that's good to know, thanks!

If you're looking at training GPT-2 on a different language such as Portuguese, then training from scratch seems necessary.

It is definitely not necessary to start from scratch. I'd argue the opposite: it would be useful to start with the pre-trained GPT-2 even if you're replacing the whole vocabulary (English -> Portuguese).

But in $$$ terms it would still be closer to training from scratch than fine-tuning, right?

I'm training a Russian GPT-2 at the moment. I've tried to make the README useful.

Thank you @mgrankin for sharing your steps. I plan to do the same for the Hindi language.
How much is it costing you to train?

How much is it costing you to train?

It's hard to tell the overall cost because the training is still in progress. I've got a workstation with 4 Titan RTX GPUs and I don't use cloud GPUs at the moment. I use one GPU per model. The training has already lasted about two weeks now, and gpt2-medium gives me a perplexity of 21 on my validation set.

Since PyTorch 1.3 was recently released with TPU support, I'm thinking of trying TPUs to speed up the training. I will update the repo in the next few days if it works out.
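For reference, a perplexity figure like the 21 quoted above is usually just the exponential of the mean token-level cross-entropy loss on the validation set. A minimal sketch (ignoring per-batch token-count weighting for brevity):

```python
import math
import torch

@torch.no_grad()
def validation_perplexity(model, val_loader):
    # Perplexity = exp(mean cross-entropy loss) over the validation set.
    model.eval()
    losses = []
    for batch in val_loader:
        out = model(input_ids=batch, labels=batch)
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))
```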

But in $$$ terms it would still be closer to training from scratch than fine-tuning, right?

Actually, in terms of quality, it would be great if somebody compared training GPT-2 on Portuguese from scratch vs fine-tuning from the pre-trained English model. My guess that fine-tuning is better is based on the intuition that non-random weights can be reused. Also, English is probably the most resource-rich language and WebText is a great dataset. If you can build a dataset of the same or better quality, you can give it a shot and train GPT-2 from scratch.

In terms of money, it should be way cheaper to fine-tune. But I'll only be able to say that with confidence once I've finished the Russian GPT-2.

Thanks for the answer @mgrankin, I'm eager to see your results!

I'm training a Russian GPT-2 at the moment. I've tried to make the README useful.

@mgrankin Could you explain to me how you trained your model from scratch with BERT?

I would like to train BERT from scratch on a PT-BR text corpus (8 GB of data). Is it possible to use the run_lm_finetuning.py code for this without using the multilingual BERT model?

I already have a vocab.txt for the PT-BR corpus and I don't want to load any initial weights.

Is there any script or tutorial to perform this process step by step?
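Not an official tutorial, but a rough sketch of what "BERT from scratch with a custom vocab.txt" looks like with the library (file paths and model sizes below are placeholders):

```python
from transformers import BertConfig, BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer(vocab_file="vocab.txt")  # your PT-BR WordPiece vocab

config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)
model = BertForMaskedLM(config)  # randomly initialized, no multilingual weights loaded

# Training then proceeds with the masked-LM objective on the 8 GB corpus,
# e.g. via the language-modeling example script or a custom training loop.
```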

Hi, I also have a repo which allows training a GPT-2 language model on non-English text with a custom BPE tokenizer. But it uses a different GPT-2 implementation, so currently it's unable to use the pre-trained GPT-2 (although a conversion script should be possible, because it's a port of the original TF implementation). Here is the link: https://github.com/lopuhin/transformer-lm
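As a general illustration of building a custom BPE tokenizer for a non-English corpus (shown here with HuggingFace's `tokenizers` library, assuming a recent version; the linked repo uses its own tokenizer implementation, and `corpus.txt` is a placeholder):

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE vocabulary on the raw text corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("./my-tokenizer")  # writes vocab.json and merges.txt
```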

Hello, this thread (along with the one about GPT-2 and BERT in French) is what I'm looking for, but I'm not sure I found the answer to my questions:

  • how long does it take to train GPT-2 on non-English text?
  • what GPU configuration is needed?
  • what corpus size is needed?

Many thanks in advance for your answers!

Why don't you use the CamemBERT model, which is dedicated to the French language? It's available in HuggingFace's Transformers too (since a few days ago, so try it out :D)! If you absolutely want to use a GPT-2 model, I can answer you too!

Hello, this thread (along with the one about GPT-2 and BERT in French) is what I'm looking for, but I'm not sure I found the answer to my questions:

  • how long does it take to train GPT-2 on non-English text?
  • what GPU configuration is needed?
  • what corpus size is needed?

Many thanks in advance for your answers!

Hi @piegu, please do not post the same message in two issues (that are linked with one another)

Hi @piegu, please do not post the same message in two issues (that are linked with one another)

Hello @julien-c. OK, but then I'll have to update my question in this thread to cover French and Portuguese (the same 3 questions about fine-tuning GPT-2 and BERT). Thank you.

Why don't you use the CamemBERT model, which is dedicated to the French language? It's available in HuggingFace's Transformers too (since a few days ago, so try it out :D)! If you absolutely want to use a GPT-2 model, I can answer you too!

Thanks @TheEdoardo93. I will definitely test CamemBERT, but it doesn't answer my 3 questions :-) It would be great if you could answer about GPT-2 at least. Thank you.

Hi @nikhilno1, did you manage to train it on Hindi?

Hi @GladiatorX, No I didn't. Life got in the way. :)
Would you like to work on it together?

@mgrankin Out of curiosity, how did you collect your 230 GB Russian dataset?
I would love to do something similar for another language, and I'm looking for tips.

@BoxxiDev you can use a scraper/crawler like Scrapy on Russian sites, and then use something like AWS Comprehend (or a language detector you build yourself) to identify the language and filter only the Russian results.

To get large amounts of data, run a distributed scraper on a cloud service like AWS.
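If you'd rather not depend on a hosted service, a crude script-based filter already goes a long way. A heuristic sketch (the threshold is an arbitrary placeholder):

```python
def looks_russian(text: str, threshold: float = 0.6) -> bool:
    """Keep only pages whose alphabetic characters are mostly Cyrillic."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    cyrillic = sum(1 for c in letters if "\u0400" <= c <= "\u04FF")
    return cyrillic / len(letters) >= threshold

print(looks_russian("Это пример русского текста."))  # True
print(looks_russian("This is English text."))        # False
```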

@BoxxiDev Online library projects have been running in Russia for a very long time, and they publish a torrent file with all their contents in FB2 format. example

Hi @nikhilno1, +1.
Did you manage to train it on Hindi?

If you're looking at training GPT-2 on a different language such as Portuguese, then training from scratch seems necessary.

It is definitely not necessary to start from scratch. I'd argue the opposite: it would be useful to start with the pre-trained GPT-2 even if you're replacing the whole vocabulary (English -> Portuguese).

@mgrankin you say that it is not necessary to train from scratch, but assuming the vocabularies do not overlap (let's say English and Russian), how do you do it?

Also, someone else is talking about BERT-based models (like the French model CamemBERT), but those are [MASK]-token-based models, so they would need a different approach for text generation à la GPT-2.
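To make that difference concrete, here is a small illustration using standard hub checkpoints (purely as examples): GPT-2 generates left to right with a causal LM head, while a BERT-style model only predicts tokens at [MASK] positions.

```python
from transformers import pipeline

# Causal LM: continue a prompt token by token.
generator = pipeline("text-generation", model="gpt2")
print(generator("The weather today is", max_length=20)[0]["generated_text"])

# Masked LM: only fills in the blank; no natural left-to-right generation.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The weather today is [MASK].")[0]["token_str"])
```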

@loretoparisi

By using progressive unfreezing. This is a technique from transfer learning. First, you freeze all layers and unfreeze only those you expect to change the most (the embeddings and the layers adjacent to them), you train those, then you unfreeze a few more layers, and repeat. I'd advise taking a course.fast.ai course to get very comfortable with the concept.

You can look at the code here.
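For readers unfamiliar with the technique, a minimal sketch of progressive unfreezing for GPT-2 (an illustration only, not the code from the linked repo; module names follow the `transformers` GPT-2 implementation, where `transformer.wte`/`wpe` are the embeddings and `transformer.h` is the stack of blocks):

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

def freeze_all(model):
    for p in model.parameters():
        p.requires_grad = False

def unfreeze(modules):
    for m in modules:
        for p in m.parameters():
            p.requires_grad = True

# Stage 1: train only the (new) embeddings and the blocks next to them.
freeze_all(model)
unfreeze([model.transformer.wte, model.transformer.wpe,
          model.transformer.h[0], model.transformer.h[-1]])
# ...train for a while...

# Stage 2: unfreeze a few more blocks at each end and train again;
# repeat until the whole network is trainable.
unfreeze(model.transformer.h[:3])
unfreeze(model.transformer.h[-3:])
```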

@mgrankin thank you. In the meantime, I'm following this approach: BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model

By using progressive unfreezing. This is a technique from transfer learning. First, you freeze all layers and unfreeze only those you expect to change the most (the embeddings and the layers adjacent to them), you train those, then you unfreeze a few more layers, and repeat. I'd advise taking a course.fast.ai course to get very comfortable with the concept.

You can look at the code here.

Hi Mikhail. In your (great) code, you unfreeze groups of 3 layers (see the code below). Is there a specific reason, or is it the result of your tests? Thanks.

need_grads = set(flat[:i_start+args.unfreeze_level*3]) | set(flat[-(i_end+args.unfreeze_level*3):])

@piegu that's a heuristic, feel free to experiment with the number.

Hi @nikhilno1, +1.
Did you manage to train it on Hindi?

Starting it now. Let me know if you want to work together.

Hi @nikhilno1, +1.
Did you manage to train it on Hindi?

Starting it now. Let me know if you want to work together.

@nikhilno1 I'm interested in doing this for Tamil. Were you able to figure out Hindi?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Is any official tutorial already available? We would really like to see a way to train this on our language, using the official framework.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Hi @nikhilno1, +1.
Did you manage to train it on Hindi?

Starting it now. Let me know if you want to work together.

Were you able to train GPT-2 on Hindi?

Hi @nikhilno1, +1.
Did you manage to train it on Hindi?

Starting it now. Let me know if you want to work together.

@nikhilno1 I'm interested in doing this for Tamil. Were you able to figure out Hindi?

Did you try Tamil?

Has anyone tried using GPT on multi-script text, like mixed Tamil + Devanagari + Roman script text? That is, the language of WhatsApp or Twitter messages from Indian users.
