It would be very useful to have documentation on how to train different models, not necessarily with transformers itself, but with external libraries (the original BERT code, fairseq, etc.).
Maybe another repository with READMEs or docs containing recipes from people who have already pretrained their models, so that the procedure can be reproduced for other languages or domains.
There are many external resources (blog posts, arXiv articles), but they lack details and are very often not reproducible.
The goal is a proven recipe for training a model that makes it easy for others to train custom ones. The community could then easily train language- or domain-specific models, and more models would become available in the transformers library.
There are many issues related to this.
Hi @ksopyla that's a great – but very broad – question.
We just wrote a blogpost that might be helpful: https://huggingface.co/blog/how-to-train
The post itself is on GitHub so feel free to improve/edit it too.
Thank you @julien-c. It will help with adding new models to the transformers model repository :)
Hi,
the blog post is nice, but it is NOT an end-to-end solution. I've been trying to learn how to use the Hugging Face "ecosystem" to build an LM from scratch on a novel dataset, and the blog post is not enough. Adding a Jupyter notebook to the blog post would make it very easy for users to learn how to run things end to end (versus "put in a Dataset type here" and "then run one of the scripts"). :)
@ddofer You are right, this is in process of being addressed at https://github.com/huggingface/blog/issues/3
Feel free to help :)
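In the meantime, here is a rough sketch of the first step the blog post walks through: training a byte-level BPE tokenizer with the tokenizers library. The file paths, vocab size, and special tokens are placeholders to adapt to your own corpus, and depending on your tokenizers version the final call is save_model or save:

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on your raw text corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["path/to/corpus.txt"],  # placeholder: one or more plain-text files
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt into the output directory
# (older tokenizers versions use tokenizer.save("path/to/output_dir") instead).
tokenizer.save_model("path/to/output_dir")

The resulting vocab.json and merges.txt are what the language-modeling script then loads as the tokenizer.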
@julien-c Is it possible to add another example that uses BERT to pretrain the LM instead of RoBERTa? I followed the steps, but it doesn't seem to work when I change the model_type to bert.
I am a new contributor and thought this might be a reasonable issue to start with.
I'm happy to add an additional example of using bert rather than roberta to pretrain the LM.
Please let me know if this would be helpful, and/or if starting elsewhere would be better.
Great that you want to contribute; any help is welcome! Fine-tuning and pretraining BERT already seem to be covered by run_language_modeling.py, though, so your contribution should differ significantly from that functionality. Perhaps it could be written in a more educational rather than production-ready way? That would definitely be useful: explaining all the concepts from scratch and so on. (Not an easy task, though.)
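One guess about why swapping model_type to bert alone did not work earlier: BERT expects a WordPiece vocabulary (vocab.txt), not the byte-level BPE files (vocab.json and merges.txt) trained for RoBERTa. A hedged sketch of training one with the tokenizers library, with the path and sizes as placeholders:

from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocabulary, which is what the BERT tokenizer expects.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["path/to/corpus.txt"],  # placeholder path
    vocab_size=30_522,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt (older tokenizers versions use tokenizer.save(...) instead).
tokenizer.save_model("path/to/bert-tokenizer")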
First version of a notebook is up over at https://github.com/huggingface/blog/tree/master/notebooks
(thanks @aditya-malte for the help)
I'll give it a shot :)
hey @laurenmoos,
A general community request is to work on a Keras-like wrapper for Transformers. It would be great if you could do that.
model = Roberta()
model.pretrain(lm_data)
model.finetune(final_data)
model.predict(XYZ)
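To make the idea concrete, here is a purely hypothetical sketch of what such a wrapper might look like if built on top of the existing Trainer API. The class and method names are invented for illustration, the datasets are assumed to be already-tokenized torch datasets, and a recent transformers version is assumed:

import torch
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

class KerasStyleRoberta:
    # Hypothetical convenience wrapper; nothing here exists in transformers itself.

    def __init__(self, tokenizer_dir, output_dir="./wrapper-output"):
        self.tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_dir)
        self.output_dir = output_dir
        self.model = RobertaForMaskedLM(RobertaConfig(vocab_size=self.tokenizer.vocab_size))

    def pretrain(self, lm_dataset, epochs=1):
        # Masked-language-model pretraining on an already-tokenized dataset.
        collator = DataCollatorForLanguageModeling(
            tokenizer=self.tokenizer, mlm=True, mlm_probability=0.15
        )
        args = TrainingArguments(output_dir=self.output_dir, num_train_epochs=epochs)
        Trainer(model=self.model, args=args, data_collator=collator, train_dataset=lm_dataset).train()

    def finetune(self, labeled_dataset, num_labels=2, epochs=1):
        # Swap the MLM head for a classification head, reusing the pretrained encoder weights.
        self.model.save_pretrained(self.output_dir)
        self.model = RobertaForSequenceClassification.from_pretrained(self.output_dir, num_labels=num_labels)
        args = TrainingArguments(output_dir=self.output_dir, num_train_epochs=epochs)
        Trainer(model=self.model, args=args, train_dataset=labeled_dataset).train()

    def predict(self, texts):
        # Batch inference on raw strings; returns predicted class indices.
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = self.model(**inputs).logits
        return logits.argmax(dim=-1)

With something like that, usage would boil down to the four lines sketched above.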
@aditya-malte I'd love to!
I will work on that and evaluate the request for additional documentation afterwards. Is there an issue to jump on?
Let me know if you’re interested. I’d be excited to collaborate!
@aditya-malte yes!
Hi,
Did we make any progress on the feature discussed above? A Keras-like wrapper sounds awesome for Transformers. I would like to contribute to its development.
@julien-c Thanks for the notebook. I have a question regarding the special_tokens_map.json file. When I use just the vocab.json and merges.txt from the tokenizer, run_language_modeling.py shows the following info message:
05/01/2020 17:44:01 - INFO - transformers.tokenization_utils - Didn't find file /<path-to-my-output-dir>/special_tokens_map.json. We won't load it.
This is not mentioned in the tutorial. Should we create this mapping file too?
Hi @dashayushman,
The message you’ve shown is not an error/warning as such but is just an INFO message.
As far as I remember, the BPE model should work just fine with the vocab and merges file. You can ignore the message.
Thanks
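If you do want the file to exist (for example just to silence that INFO message), one way, assuming vocab.json and merges.txt already sit in your output directory, is to round-trip the tokenizer through transformers; save_pretrained writes special_tokens_map.json and tokenizer_config.json next to them:

from transformers import RobertaTokenizerFast

# Load from the directory containing vocab.json and merges.txt, then save back:
# this also writes special_tokens_map.json and tokenizer_config.json.
tokenizer = RobertaTokenizerFast.from_pretrained("path/to/output_dir")  # placeholder path
tokenizer.save_pretrained("path/to/output_dir")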
@julien-c @aditya-malte
From the blog post:
If your dataset is very large, you can opt to load and tokenize examples on the fly, rather than as a preprocessing step.
How can I do that? And how can I save the tokenized data?
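For anyone with the same question: this is not the script's own mechanism, but one way to sketch it is a small PyTorch Dataset that keeps the raw lines and only tokenizes inside __getitem__, so tokenization happens on the fly during training. The class name and block size below are made up; the tokenizer can be any transformers tokenizer:

import torch
from torch.utils.data import Dataset

class LazyLineByLineDataset(Dataset):
    # Keeps raw text in memory and tokenizes each example only when it is requested.

    def __init__(self, file_path, tokenizer, block_size=128):
        self.tokenizer = tokenizer
        self.block_size = block_size
        with open(file_path, encoding="utf-8") as f:
            self.lines = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # Tokenization happens here, on the fly, instead of as a preprocessing step.
        encoding = self.tokenizer(self.lines[idx], truncation=True, max_length=self.block_size)
        return torch.tensor(encoding["input_ids"], dtype=torch.long)

To cache the tokenized data instead, you could precompute the tensors once, torch.save the resulting list to disk, and torch.load it in later runs.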
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @BramVanroy @julien-c
Continuing #1999: it seems run_language_modeling.py is PyTorch-only, and there is no example script yet for fine-tuning a masked language model with TensorFlow. Is there any plan to add a TensorFlow version of the script, or guidance on how to modify the current run_language_modeling.py so it can be used with TensorFlow too? Thank you.
I would also like to see an example of how to train a language model (like BERT) from scratch with TensorFlow on my own dataset, so I can fine-tune it later on a specific task.
ping @jplu ;)
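Until an official TensorFlow example lands, here is a rough sketch of a custom masked-LM training step with TFBertForMaskedLM. The data pipeline and the masking itself are omitted, and it is assumed that labels use -100 for positions that should not contribute to the loss, mirroring the convention of the PyTorch data collator:

import tensorflow as tf
from transformers import BertConfig, TFBertForMaskedLM

# Fresh BERT initialized from a config, i.e. trained from scratch.
config = BertConfig(vocab_size=30_522)
model = TFBertForMaskedLM(config)
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(input_ids, attention_mask, labels):
    with tf.GradientTape() as tape:
        outputs = model(input_ids, attention_mask=attention_mask, training=True)
        logits = outputs[0]  # shape: (batch, sequence_length, vocab_size)
        # Compute the loss only on masked positions (labels == -100 everywhere else).
        mask = tf.not_equal(labels, -100)
        loss = loss_fn(tf.boolean_mask(labels, mask), tf.boolean_mask(logits, mask))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss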
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.