It would be very useful to have documentation on how to train different models, not necessarily with transformers itself, but with external libraries (the original BERT code, fairseq, etc.).
Maybe another repository with READMEs or docs containing recipes from people who have already pretrained their models, so that the procedure can be reproduced for other languages or domains.
There are many external resources (blog posts, arXiv articles), but they lack details and are very often not reproducible.
The goal is a proven recipe for training a model that makes it easy for others to train custom ones. The community could then easily train language- or domain-specific models, and more models would become available in the transformers library.
There are many issues related to this.
Hi @ksopyla that's a great – but very broad – question.
We just wrote a blogpost that might be helpful: https://huggingface.co/blog/how-to-train
The post itself is on GitHub so feel free to improve/edit it too.
Thank you @julien-c. It will help with adding new models to the transformers model repository :)
Hi,
the blog post is nice, but it is NOT an end-to-end solution. I've been trying to learn how to use the Hugging Face "ecosystem" to build an LM from scratch on a novel dataset, and the blog post is not enough. Adding a Jupyter notebook to the blog post would make it very easy for users to learn how to run things end to end (versus "put in a Dataset type here" and "then run one of the scripts"). :)
@ddofer You are right, this is in process of being addressed at https://github.com/huggingface/blog/issues/3
Feel free to help :)
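In the meantime, here is a rough sketch of the first step the blog post walks through: training a byte-level BPE tokenizer with the tokenizers library. The file paths, vocab size, and special tokens are placeholders to adapt to your own corpus, and depending on your tokenizers version the final call is save_model or save:

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on your raw text corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["path/to/corpus.txt"],  # placeholder: one or more plain-text files
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt into the output directory
# (older tokenizers versions use tokenizer.save("path/to/output_dir") instead).
tokenizer.save_model("path/to/output_dir")

The resulting vocab.json and merges.txt are what the language-modeling script then loads as the tokenizer.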
@julien-c Is it possible to add another example that uses BERT to pretrain the LM instead of RoBERTa? I followed the steps, but it doesn't seem to work when I change the model_type to bert.
I am a new contributor and thought this might be a reasonable issue to start with.
I'm happy to add an additional example of using bert rather than roberta to pretrain the LM.
Please let me know if this would be helpful, and/or if starting elsewhere would be better.
Great that you want to contribute; any help is welcome! Fine-tuning and pretraining BERT already seem to be covered by run_language_modeling.py, though, so your contribution should differ significantly from that functionality. Perhaps it could be written in a more educational rather than production-ready way? That would definitely be useful: explaining all the concepts from scratch and so on. (Not an easy task, though.)
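One guess about why swapping model_type to bert alone did not work earlier: BERT expects a WordPiece vocabulary (vocab.txt), not the byte-level BPE files (vocab.json and merges.txt) trained for RoBERTa. A hedged sketch of training one with the tokenizers library, with the path and sizes as placeholders:

from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocabulary, which is what the BERT tokenizer expects.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["path/to/corpus.txt"],  # placeholder path
    vocab_size=30_522,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt (older tokenizers versions use tokenizer.save(...) instead).
tokenizer.save_model("path/to/bert-tokenizer")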
First version of a notebook is up over at https://github.com/huggingface/blog/tree/master/notebooks
(thanks @aditya-malte for the help)
I'll give it a shot :)
hey @laurenmoos,
A general community request is to work on a Keras-like wrapper for Transformers. It would be great if you could do that.
model = Roberta()
model.pretrain(lm_data)
model.finetune(final_data)
model.predict(XYZ)
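To make the idea concrete, here is a purely hypothetical sketch of what such a wrapper might look like if built on top of the existing Trainer API. The class and method names are invented for illustration, the datasets are assumed to be already-tokenized torch datasets, and a recent transformers version is assumed:

import torch
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

class KerasStyleRoberta:
    # Hypothetical convenience wrapper; nothing here exists in transformers itself.

    def __init__(self, tokenizer_dir, output_dir="./wrapper-output"):
        self.tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_dir)
        self.output_dir = output_dir
        self.model = RobertaForMaskedLM(RobertaConfig(vocab_size=self.tokenizer.vocab_size))

    def pretrain(self, lm_dataset, epochs=1):
        # Masked-language-model pretraining on an already-tokenized dataset.
        collator = DataCollatorForLanguageModeling(
            tokenizer=self.tokenizer, mlm=True, mlm_probability=0.15
        )
        args = TrainingArguments(output_dir=self.output_dir, num_train_epochs=epochs)
        Trainer(model=self.model, args=args, data_collator=collator, train_dataset=lm_dataset).train()

    def finetune(self, labeled_dataset, num_labels=2, epochs=1):
        # Swap the MLM head for a classification head, reusing the pretrained encoder weights.
        self.model.save_pretrained(self.output_dir)
        self.model = RobertaForSequenceClassification.from_pretrained(self.output_dir, num_labels=num_labels)
        args = TrainingArguments(output_dir=self.output_dir, num_train_epochs=epochs)
        Trainer(model=self.model, args=args, train_dataset=labeled_dataset).train()

    def predict(self, texts):
        # Batch inference on raw strings; returns predicted class indices.
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = self.model(**inputs).logits
        return logits.argmax(dim=-1)

With something like that, usage would boil down to the four lines sketched above.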
@aditya-malte I'd love to!
I will work on that and evaluate the request for additional documentation afterwards. Is there an issue to jump on?
Let me know if you’re interested. I’d be excited to collaborate!
@aditya-malte yes!
Hi,
Did we make any progress on the feature discussed above? A Keras-like wrapper sounds awesome for Transformers. I would like to contribute to its development.
@julien-c Thanks for the notebook. I have a question regarding the special_tokens_map.json file. When I use just the vocab.json and merges.txt from the tokenizer, run_language_modeling.py shows the following info message:
05/01/2020 17:44:01 - INFO - transformers.tokenization_utils - Didn't find file /<path-to-my-output-dir>/special_tokens_map.json. We won't load it.
This is not mentioned in the tutorial. Should we create this mapping file too?
Hi @dashayushman,
The message you’ve shown is not an error/warning as such but is just an INFO message.
As far as I remember, the BPE model should work just fine with the vocab and merges file. You can ignore the message.
Thanks
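If you do want the file to exist (for example just to silence that INFO message), one way, assuming vocab.json and merges.txt already sit in your output directory, is to round-trip the tokenizer through transformers; save_pretrained writes special_tokens_map.json and tokenizer_config.json next to them:

from transformers import RobertaTokenizerFast

# Load from the directory containing vocab.json and merges.txt, then save back:
# this also writes special_tokens_map.json and tokenizer_config.json.
tokenizer = RobertaTokenizerFast.from_pretrained("path/to/output_dir")  # placeholder path
tokenizer.save_pretrained("path/to/output_dir")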
@julien-c @aditya-malte
From the blog post:
If your dataset is very large, you can opt to load and tokenize examples on the fly, rather than as a preprocessing step.
How can I do that? And how can I save the tokenized data?
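For anyone with the same question: this is not the script's own mechanism, but one way to sketch it is a small PyTorch Dataset that keeps the raw lines and only tokenizes inside __getitem__, so tokenization happens on the fly during training. The class name and block size below are made up; the tokenizer can be any transformers tokenizer:

import torch
from torch.utils.data import Dataset

class LazyLineByLineDataset(Dataset):
    # Keeps raw text in memory and tokenizes each example only when it is requested.

    def __init__(self, file_path, tokenizer, block_size=128):
        self.tokenizer = tokenizer
        self.block_size = block_size
        with open(file_path, encoding="utf-8") as f:
            self.lines = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # Tokenization happens here, on the fly, instead of as a preprocessing step.
        encoding = self.tokenizer(self.lines[idx], truncation=True, max_length=self.block_size)
        return torch.tensor(encoding["input_ids"], dtype=torch.long)

To cache the tokenized data instead, you could precompute the tensors once, torch.save the resulting list to disk, and torch.load it in later runs.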
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @BramVanroy @julien-c
Continuing #1999: it seems run_language_modeling.py is PyTorch-only, and there is no example script yet for fine-tuning a masked language model with TensorFlow. Is there any plan to add a TensorFlow version of the script, or guidance on how to modify the current run_language_modeling.py so it can be used with TensorFlow too? Thank you.
I would also like to see an example of how to train a language model (like BERT) from scratch with TensorFlow on my own dataset, so I can fine-tune it later on a specific task.
ping @jplu ;)
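Until an official TensorFlow example lands, here is a rough sketch of a custom masked-LM training step with TFBertForMaskedLM. The data pipeline and the masking itself are omitted, and it is assumed that labels use -100 for positions that should not contribute to the loss, mirroring the convention of the PyTorch data collator:

import tensorflow as tf
from transformers import BertConfig, TFBertForMaskedLM

# Fresh BERT initialized from a config, i.e. trained from scratch.
config = BertConfig(vocab_size=30_522)
model = TFBertForMaskedLM(config)
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(input_ids, attention_mask, labels):
    with tf.GradientTape() as tape:
        outputs = model(input_ids, attention_mask=attention_mask, training=True)
        logits = outputs[0]  # shape: (batch, sequence_length, vocab_size)
        # Compute the loss only on masked positions (labels == -100 everywhere else).
        mask = tf.not_equal(labels, -100)
        loss = loss_fn(tf.boolean_mask(labels, mask), tf.boolean_mask(logits, mask))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss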
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.