Transformers: GPT-2 Training on non-English text

Created on 3 Oct 2019  ·  35 Comments  ·  Source: huggingface/transformers

❓ Questions & Help

I wish to train GPT-2 on different languages, like Portuguese, and maybe on some programming languages like C++ (and play with token predictions).

But I could not find any examples of how to take a dataset X (like C++ source files), create tokens from it, and train GPT-2 to predict new tokens based on that dataset.

Is this even possible? (if yes, how could one do this?)

Thanks!

wontfix

Most helpful comment

I'm training a Russian GPT-2 at the moment. I've tried to make the README useful.

All 35 comments

Hi! By "GPT-2 training" two different methods can be understood: training from scratch, and fine-tuning.

If you're looking at training GPT-2 on a different language such as Portuguese, then training from scratch seems necessary. You could use the language modeling fine-tuning example as a start, but please be aware that training such a language model from scratch takes a huge amount of compute and data, which would cost a lot. I can point you to this issue, which discusses training such a model for French.
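For concreteness, here is a rough sketch of what "training from scratch" looks like with the library: a fresh `GPT2Config` sized to a new-language tokenizer and a randomly initialized `GPT2LMHeadModel`. The tokenizer path is a placeholder, and a reasonably recent `transformers` version is assumed.

```python
# Sketch only: an untrained GPT-2 for a new language.
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# A tokenizer trained on the target-language corpus is assumed to exist
# at this (placeholder) path.
tokenizer = GPT2TokenizerFast.from_pretrained("./my-portuguese-tokenizer")

# A fresh config means randomly initialized weights, sized to the new vocabulary.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = GPT2LMHeadModel(config)  # no pre-trained weights at all
print(f"{model.num_parameters():,} parameters to train from scratch")
```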

If you're looking at training your model on programming languages that have a lot of overlapping vocabulary with English (say Python with a lot of documentation), maybe you could fine-tune the original GPT-2 on your dataset (still using the LM fine-tuning example), but I'm not sure about the results.
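As an illustration of the fine-tuning route, here is a loose sketch of what the LM fine-tuning example does (this is not the script itself; `train.txt` is a placeholder and a recent `transformers` version is assumed):

```python
# Loose sketch: fine-tune the pre-trained English GPT-2 on your own corpus
# (e.g. C++ or Python source files concatenated into train.txt).
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = open("train.txt", encoding="utf-8").read()
ids = tokenizer(text, return_tensors="pt").input_ids[0]  # one long token stream
block_size = 512
blocks = [ids[i:i + block_size] for i in range(0, len(ids) - block_size, block_size)]
loader = DataLoader(blocks, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for batch in loader:
    # For a causal LM, labels == input_ids; the model shifts them internally.
    loss = model(input_ids=batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```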

I'm training a Russian GPT-2 at the moment. I've tried to make the README useful.

If you're looking at training GPT-2 on a different language such as Portuguese, then training from scratch seems necessary.

It is definitely not necessary to start from scratch. I'd argue the opposite: it would be useful to start with the pre-trained GPT-2 even if you're replacing the whole vocabulary (English -> Portuguese).
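A minimal sketch of that idea (an illustration, not mgrankin's actual code; the Russian tokenizer path is a placeholder): keep the pre-trained transformer blocks and swap in a new-language tokenizer, resizing the embedding matrix to match.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")                      # English weights
new_tokenizer = GPT2TokenizerFast.from_pretrained("./ru-tokenizer")  # placeholder path

# Resize the (tied) input/output embeddings to the new vocabulary size.
# With a completely new vocabulary the old rows carry no meaning, so the
# embedding matrix is effectively retrained, while the transformer blocks
# keep their pre-trained weights.
model.resize_token_embeddings(len(new_tokenizer))
```

From there, the new embeddings are trained first and the rest of the network is unfrozen gradually (see the progressive-unfreezing discussion further down).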

Alright @mgrankin, that's good to know, thanks!

If you're looking at training GPT-2 on a different language such as Portuguese, then training from scratch seems necessary.

It is definitely not necessary to start from scratch. I'd argue the opposite: it would be useful to start with the pre-trained GPT-2 even if you're replacing the whole vocabulary (English -> Portuguese).

But in $$$ terms it would still be closer to training from scratch than fine-tuning, right?

I'm training a Russian GPT-2 at the moment. I've tried to make the README useful.

Thank you @mgrankin for sharing your steps. I plan to do the same for the Hindi language.
How much is it costing you to train?

How much is it costing you to train?

It's hard to tell the overall cost because the training is still in progress. I've got a workstation with 4 Titan RTX GPUs and I don't use cloud GPUs at the moment. I use one GPU per model. The training has already lasted about two weeks now, and gpt2-medium gives me a perplexity of 21 on my validation set.

Since PyTorch 1.3 was recently released with TPU support, I'm thinking of trying TPUs to speed up the training. I will update the repo in the next few days if it works out.
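For reference, a perplexity figure like the 21 quoted above is usually just the exponential of the mean token-level cross-entropy loss on the validation set. A minimal sketch (ignoring per-batch token-count weighting for brevity):

```python
import math
import torch

@torch.no_grad()
def validation_perplexity(model, val_loader):
    # Perplexity = exp(mean cross-entropy loss) over the validation set.
    model.eval()
    losses = []
    for batch in val_loader:
        out = model(input_ids=batch, labels=batch)
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))
```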

But in $$$ terms it would still be closer to training from scratch than fine-tuning, right?

Actually, in terms of quality, it would be great if somebody compared training GPT-2 on Portuguese from scratch vs fine-tuning from the pre-trained English model. My guess that fine-tuning is better is based on the intuition that non-random weights can be reused. Also, English is probably the most resource-rich language and WebText is a great dataset. If you can build a dataset of the same or better quality, you can give it a shot and train GPT-2 from scratch.

In terms of money, it should be way cheaper to fine-tune. But I'll only be able to say that with confidence once I've finished the Russian GPT-2.

Thanks for the answer @mgrankin, I'm eager to see your results!

I'm training a Russian GPT-2 at the moment. I've tried to make the README useful.

@mgrankin Could you explain to me how you trained your model from scratch with BERT?

I would like to train BERT from scratch on a PT-BR text corpus (8 GB of data). Is it possible to use the run_lm_finetuning.py code for this without using the multilingual BERT model?

I already have a vocab.txt for the PT-BR corpus and I don't want to load any initial weights.

Is there any script or tutorial to perform this process step by step?
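Not an official tutorial, but a rough sketch of what "BERT from scratch with a custom vocab.txt" looks like with the library (file paths and model sizes below are placeholders):

```python
from transformers import BertConfig, BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer(vocab_file="vocab.txt")  # your PT-BR WordPiece vocab

config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)
model = BertForMaskedLM(config)  # randomly initialized, no multilingual weights loaded

# Training then proceeds with the masked-LM objective on the 8 GB corpus,
# e.g. via the language-modeling example script or a custom training loop.
```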

Hi, I also have a repo which allows training a GPT-2 language model on non-English text with a custom BPE tokenizer. But it uses a different GPT-2 implementation, so currently it's unable to use the pre-trained GPT-2 (although a conversion script should be possible, because it's a port of the original TF implementation). Here is the link: https://github.com/lopuhin/transformer-lm
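As a general illustration of building a custom BPE tokenizer for a non-English corpus (shown here with HuggingFace's `tokenizers` library, assuming a recent version; the linked repo uses its own tokenizer implementation, and `corpus.txt` is a placeholder):

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE vocabulary on the raw text corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("./my-tokenizer")  # writes vocab.json and merges.txt
```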

Hello, this thread (along with the one about GPT-2 and BERT in French) is what I'm looking for, but I'm not sure I found the answer to my questions:

  • how long does it take to train GPT-2 on non-English text?
  • what GPU configuration is needed?
  • what corpus size is needed?

Many thanks in advance for your answers!

Why don't you use the CamemBERT model, which is dedicated to the French language? It's available in HuggingFace's Transformers too (since a few days ago, so try it out :D)! If you absolutely want to use a GPT-2 model, I can answer you too!

Hello, this thread (along with the one about GPT-2 and BERT in French) is what I'm looking for, but I'm not sure I found the answer to my questions:

  • how long does it take to train GPT-2 on non-English text?
  • what GPU configuration is needed?
  • what corpus size is needed?

Many thanks in advance for your answers!

Hi @piegu, please do not post the same message in two issues (that are linked with one another)

Hi @piegu, please do not post the same message in two issues (that are linked with one another)

Hello @julien-c. OK, but then I'll have to update my question in this thread to cover French and Portuguese (the same 3 questions about fine-tuning GPT-2 and BERT). Thank you.

Why don't you use the CamemBERT model, which is dedicated to the French language? It's available in HuggingFace's Transformers too (since a few days ago, so try it out :D)! If you absolutely want to use a GPT-2 model, I can answer you too!

Thanks @TheEdoardo93. I will definitely test CamemBERT, but it doesn't answer my 3 questions :-) It would be great if you could answer about GPT-2 at least. Thank you.

Hi @nikhilno1, did you manage to train it on Hindi?

Hi @GladiatorX, No I didn't. Life got in the way. :)
Would you like to work on it together?

@mgrankin Out of curiosity, how did you collect your 230 GB Russian dataset?
I would love to do something similar for another language, and I'm looking for tips.

@BoxxiDev you can use a scraper/crawler like Scrapy on Russian sites, and then use something like AWS Comprehend (or a language detector you build yourself) to identify the language and filter only the Russian results.

To get large amounts of data, run a distributed scraper on a cloud service like AWS.
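If you'd rather not depend on a hosted service, a crude script-based filter already goes a long way. A heuristic sketch (the threshold is an arbitrary placeholder):

```python
def looks_russian(text: str, threshold: float = 0.6) -> bool:
    """Keep only pages whose alphabetic characters are mostly Cyrillic."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    cyrillic = sum(1 for c in letters if "\u0400" <= c <= "\u04FF")
    return cyrillic / len(letters) >= threshold

print(looks_russian("Это пример русского текста."))  # True
print(looks_russian("This is English text."))        # False
```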

@BoxxiDev Online library projects have been running in Russia for a very long time, and they publish a torrent file with all their contents in FB2 format. example

Hi @nikhilno1, +1.
Did you manage to train it on Hindi?

If you're looking at training GPT-2 on a different language such as Portuguese, then training from scratch seems necessary.

It is definitely not necessary to start from scratch. I'd argue the opposite: it would be useful to start with the pre-trained GPT-2 even if you're replacing the whole vocabulary (English -> Portuguese).

@mgrankin you say that it is not necessary to train from scratch, but assuming the vocabularies do not overlap (let's say English and Russian), how do you do it?

Also, someone else is talking about BERT-based models (like the French model CamemBERT), but those are [MASK]-token-based models, so they would need a different approach for text generation à la GPT-2.
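To make that difference concrete, here is a small illustration using standard hub checkpoints (purely as examples): GPT-2 generates left to right with a causal LM head, while a BERT-style model only predicts tokens at [MASK] positions.

```python
from transformers import pipeline

# Causal LM: continue a prompt token by token.
generator = pipeline("text-generation", model="gpt2")
print(generator("The weather today is", max_length=20)[0]["generated_text"])

# Masked LM: only fills in the blank; no natural left-to-right generation.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The weather today is [MASK].")[0]["token_str"])
```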

@loretoparisi

By using progressive unfreezing. This is a technique from transfer learning. First, you freeze all layers and unfreeze only those you expect to change the most (the embeddings and the layers adjacent to them), you train those, then you unfreeze a few more layers, and repeat. I'd advise taking a course.fast.ai course to get very comfortable with the concept.

You can look at the code here.
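For readers unfamiliar with the technique, a minimal sketch of progressive unfreezing for GPT-2 (an illustration only, not the code from the linked repo; module names follow the `transformers` GPT-2 implementation, where `transformer.wte`/`wpe` are the embeddings and `transformer.h` is the stack of blocks):

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

def freeze_all(model):
    for p in model.parameters():
        p.requires_grad = False

def unfreeze(modules):
    for m in modules:
        for p in m.parameters():
            p.requires_grad = True

# Stage 1: train only the (new) embeddings and the blocks next to them.
freeze_all(model)
unfreeze([model.transformer.wte, model.transformer.wpe,
          model.transformer.h[0], model.transformer.h[-1]])
# ...train for a while...

# Stage 2: unfreeze a few more blocks at each end and train again;
# repeat until the whole network is trainable.
unfreeze(model.transformer.h[:3])
unfreeze(model.transformer.h[-3:])
```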

@mgrankin thank you. In the meantime, I'm following this approach: BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model

By using progressive unfreezing. This is a technique from transfer learning. First, you freeze all layers and unfreeze only those you expect to change the most (the embeddings and the layers adjacent to them), you train those, then you unfreeze a few more layers, and repeat. I'd advise taking a course.fast.ai course to get very comfortable with the concept.

You can look at the code here.

Hi Mikhail. In your (great) code, you unfreeze groups of 3 layers (see the code below). Is there a specific reason, or is it the result of your tests? Thanks.

need_grads = set(flat[:i_start+args.unfreeze_level*3]) | set(flat[-(i_end+args.unfreeze_level*3):])

@piegu that's a heuristic, feel free to experiment with the number.

Hi @nikhilno1, +1.
Did you manage to train it on Hindi?

Starting it now. Let me know if you want to work together.

Hi @nikhilno1, +1.
Did you manage to train it on Hindi?

Starting it now. Let me know if you want to work together.

@nikhilno1 I'm interested in doing this for Tamil. Were you able to figure out Hindi?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Is any official tutorial already available? We would really like to see a way to train this on our language, using the official framework.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Hi @nikhilno1, +1.
Did you manage to train it on Hindi?

Starting it now. Let me know if you want to work together.

Were you able to train GPT-2 on Hindi?

Hi @nikhilno1, +1.
Did you manage to train it on Hindi?

Starting it now. Let me know if you want to work together.

@nikhilno1 I'm interested in doing this for Tamil. Were you able to figure out Hindi?

Did you try Tamil?

Has anyone tried using GPT on multi-script text, like mixed Tamil + Devanagari + Roman script text? That is, the language of WhatsApp or Twitter messages from Indian users.
