Hello,
I'm looking into this great repo, and I'm wondering whether there is a feature that would allow me to train, say, a GPT-2 model on a custom dataset of sequences.
Is this already provided in your codebase? Otherwise I'll tinker with the code on my own.
Thanks in advance, and again, great job on the repo, which is super useful.
This depends on the model you're interested in. For GPT2, for example, there's a class called GPT2LMHeadModel that you could use for pretraining with minimal modifications. For XLNet, the implementation in this repo is missing some key functionality (the permutation generation function and an analogue of the dataset record generator) which you'd have to implement yourself. For the BERT model in this repo, there appears to be a class explicitly designed for this (BertForPreTraining).
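For reference, here is a minimal sketch (not an official script) of what language-model training on a custom list of sequences could look like with GPT2LMHeadModel. It assumes the pytorch-transformers package and a forward pass that, given `labels`, returns the LM loss as its first output; the texts, batching, and hyperparameters are placeholders.

```python
import torch
from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # swap in a fresh config for from-scratch training
model.train()

texts = ["first custom sequence ...", "second custom sequence ..."]  # placeholder dataset
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

for epoch in range(3):
    for text in texts:
        input_ids = torch.tensor([tokenizer.encode(text)])
        # Passing labels=input_ids makes the model compute the next-token LM loss
        loss = model(input_ids, labels=input_ids)[0]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

In practice you would batch the sequences, pad or pack them, and feed them through a DataLoader, but the core call is the same.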
Hi, we don't provide efficient scripts for training from scratch, but you can have a look at what Microsoft did, for instance: https://azure.microsoft.com/en-us/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/
They shared all the recipes they used for training a full-scale BERT based on this library. Kudos to them!
I'd like to see efficient scripts for training from scratch too, please. The Azure repo looks interesting, but it is very Azure-specific, and also BERT-specific. It would be nice to have training scripts within the Hugging Face repo itself.
(In addition to being able to train standard BERT etc. on proprietary data, it would also be nice to be able to easily experiment with training variations of the standard BERT etc. models from scratch on the existing public datasets.)
@hughperkins
I wrote this post when I modified the code to run a BERT model on a (custom) IMDB dataset: https://medium.com/dsnet/running-pytorch-transformers-on-custom-datasets-717fd9e10fe2
Not sure if this helps you.
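In the same spirit, here is a rough sketch (not taken from the post above) of one way to wrap a custom IMDB-style classification dataset for BERT fine-tuning. Class names come from pytorch-transformers; the reviews, labels, padding scheme, and hyperparameters are placeholders.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from pytorch_transformers import BertTokenizer, BertForSequenceClassification

class ReviewDataset(Dataset):
    """Toy wrapper around a list of (text, label) pairs."""
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts, self.labels = texts, labels
        self.tokenizer, self.max_len = tokenizer, max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        ids = self.tokenizer.encode(self.texts[idx], add_special_tokens=True)[: self.max_len]
        ids = ids + [0] * (self.max_len - len(ids))  # 0 is BERT's [PAD] id
        return torch.tensor(ids), torch.tensor(self.labels[idx])

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

loader = DataLoader(ReviewDataset(["great movie", "terrible movie"], [1, 0], tokenizer), batch_size=2)
for input_ids, labels in loader:
    loss = model(input_ids, labels=labels)[0]  # classification loss is the first output
```

A real script would also pass an attention mask for the padded positions and run an optimizer step, but this shows where a custom dataset plugs in.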
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@thomwolf Indeed this seems very Azure-specific and not very helpful. What would be helpful is a set of minimal scripts for training transformers, say GPT-2, on custom datasets from scratch. Training from scratch is a basic prerequisite for this library to be used in fundamental research, as opposed to tweaking / fine-tuning existing models.
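For what it's worth, the building block for "from scratch" does seem to be there: instantiating the model from a config instead of `from_pretrained` gives randomly initialized weights. A sketch, with made-up config values:

```python
from pytorch_transformers import GPT2Config, GPT2LMHeadModel

# A hypothetical small GPT-2 variant; n_layer / n_head / n_embd are illustrative values
config = GPT2Config(n_layer=6, n_head=8, n_embd=512)
model = GPT2LMHeadModel(config)  # randomly initialized weights, no pretrained checkpoint
# ... then reuse the same training loop as for fine-tuning, just starting from random weights
```

What is missing is an efficient, maintained training script around this, which is what this issue is asking for.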