Hello,
I'm looking into this great repo, and I'm wondering whether there is a feature that would allow me to train, say, a GPT-2 model on a custom dataset of sequences.
Is this already provided in your codebase? Otherwise I'll tinker with the code on my own.
Thanks in advance, and again, great job on the repo, which is super useful.
This depends on the model you're interested in. For GPT2, for example, there's a class called GPT2LMHeadModel that you could use for pretraining with minimal modifications. For XLNet, the implementation in this repo is missing some key functionality (the permutation generation function and an analogue of the dataset record generator) which you'd have to implement yourself. For the BERT model in this repo, there appears to be a class explicitly designed for this (BertForPreTraining).
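For reference, here is a minimal sketch (not an official script) of what language-model training on a custom list of sequences could look like with GPT2LMHeadModel. It assumes the pytorch-transformers package and a forward pass that, given `labels`, returns the LM loss as its first output; the texts, batching, and hyperparameters are placeholders.

```python
import torch
from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # swap in a fresh config for from-scratch training
model.train()

texts = ["first custom sequence ...", "second custom sequence ..."]  # placeholder dataset
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

for epoch in range(3):
    for text in texts:
        input_ids = torch.tensor([tokenizer.encode(text)])
        # Passing labels=input_ids makes the model compute the next-token LM loss
        loss = model(input_ids, labels=input_ids)[0]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

In practice you would batch the sequences, pad or pack them, and feed them through a DataLoader, but the core call is the same.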
Hi, we don't provide efficient scripts for training from scratch, but you can have a look at what Microsoft did, for instance: https://azure.microsoft.com/en-us/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/
They shared all the recipes they used for training a full-scale BERT based on this library. Kudos to them!
I'd like to see efficient scripts for training from scratch too, please. The Azure repo looks interesting, but it is very Azure-specific, and also BERT-specific. It would be nice to have training scripts within the Hugging Face repo itself.
(In addition to being able to train standard BERT etc. on proprietary data, it would also be nice to be able to easily experiment with training variations of the standard BERT etc. models from scratch on the existing public datasets.)
@hughperkins
I wrote this post when I modified the code to run a BERT model on a (custom) IMDB dataset: https://medium.com/dsnet/running-pytorch-transformers-on-custom-datasets-717fd9e10fe2
Not sure if this helps you.
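In the same spirit, here is a rough sketch (not taken from the post above) of one way to wrap a custom IMDB-style classification dataset for BERT fine-tuning. Class names come from pytorch-transformers; the reviews, labels, padding scheme, and hyperparameters are placeholders.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from pytorch_transformers import BertTokenizer, BertForSequenceClassification

class ReviewDataset(Dataset):
    """Toy wrapper around a list of (text, label) pairs."""
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts, self.labels = texts, labels
        self.tokenizer, self.max_len = tokenizer, max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        ids = self.tokenizer.encode(self.texts[idx], add_special_tokens=True)[: self.max_len]
        ids = ids + [0] * (self.max_len - len(ids))  # 0 is BERT's [PAD] id
        return torch.tensor(ids), torch.tensor(self.labels[idx])

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

loader = DataLoader(ReviewDataset(["great movie", "terrible movie"], [1, 0], tokenizer), batch_size=2)
for input_ids, labels in loader:
    loss = model(input_ids, labels=labels)[0]  # classification loss is the first output
```

A real script would also pass an attention mask for the padded positions and run an optimizer step, but this shows where a custom dataset plugs in.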
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@thomwolf Indeed this seems very Azure-specific and not very helpful. What would be helpful is a set of minimal scripts for training transformers, say GPT-2, on custom datasets from scratch. Training from scratch is a basic prerequisite for this library to be used in fundamental research, as opposed to tweaking / fine-tuning existing models.
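For what it's worth, the building block for "from scratch" does seem to be there: instantiating the model from a config instead of `from_pretrained` gives randomly initialized weights. A sketch, with made-up config values:

```python
from pytorch_transformers import GPT2Config, GPT2LMHeadModel

# A hypothetical small GPT-2 variant; n_layer / n_head / n_embd are illustrative values
config = GPT2Config(n_layer=6, n_head=8, n_embd=512)
model = GPT2LMHeadModel(config)  # randomly initialized weights, no pretrained checkpoint
# ... then reuse the same training loop as for fine-tuning, just starting from random weights
```

What is missing is an efficient, maintained training script around this, which is what this issue is asking for.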