Transformers: pre-training a BERT from scratch

Created on 17 Mar 2019 · 26 Comments · Source: huggingface/transformers

I am wondering whether I can train a new BERT from scratch with this PyTorch BERT implementation.

Labels: Discussion, wontfix

Most helpful comment

NVIDIA recently released TensorFlow and PyTorch code to pretrain BERT from scratch. I wrapped it in a script to launch on multiple machines on AWS here. Currently I'm still figuring out why the 64-GPU AWS throughput is 2x worse than what they are getting locally.

All 26 comments

We can't right now; the code is still incomplete. Has this become possible recently? I really want to help, but I'm not familiar with TensorFlow.

A related issue is #376.

However, pytorch-pretrained-BERT was mostly designed to provide easy and fast access to pretrained models.

If you want to train a BERT model from scratch you will need a more robust code base for training and data-processing than the simple examples that are provided in this repo.

I would probably advise moving to a more integrated codebase like the nice XLM repo of @glample and @aconneau.

I've been able to use the codebase for this and didn't see many issues; however, I might be overlooking something. If you construct and initialize a new model instead of loading from pretrained, you can use the simple_lm_finetuning script to train on new data.

Thomas, did you have any specific other issues in mind?

NVIDIA recently released TensorFlow and PyTorch code to pretrain BERT from scratch. I wrapped it in a script to launch on multiple machines on AWS here. Currently I'm still figuring out why the 64-GPU AWS throughput is 2x worse than what they are getting locally.

Thanks @yaroslavvb!

Thanks! @yaroslavvb

@yaroslavvb this article explains why cloud computing can have inconsistent throughput. I think it's a great read, and I've been working on setting up my own rig.

I see in the script that you're using 8 GPUs. How long is the pretraining taking with that? I'm not sure whether to go with gcloud TPUs or AWS. The BERT README said that a single TPU will take up to 2 weeks to finish pretraining.

@yaroslavvb Hi, did you train BERT successfully? I trained it with https://github.com/NVIDIA/Megatron-LM/scripts/pretrain_bert_tfrecords_distributed.sh on 2 machines with 16 GPUs, but it stopped after '> number of parameters: 336226108' and I got nothing else after that; the GPU utilization is 0%.

@MarvinLong yes, I was able to launch it on multiple machines and observe the model training, at about 600ms per step. I did not try training it to completion, as the scaling efficiency on p3dn instances on AWS is currently only about 50% because of an NCCL bug. I'm wondering if your machines can't communicate with each other on the right ports. @jrc2139 I have not observed inconsistent throughput; I've used this codebase to train ImageNet in 19 minutes on 64 GPUs on AWS p3 instances.
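If node-to-node communication is the suspect, a quick sanity check before a full Megatron-LM run is a bare torch.distributed all-reduce across the machines. A minimal sketch, assuming PyTorch with the NCCL backend and the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables (for example set by torch.distributed.launch --use_env on each node):

import os
import torch
import torch.distributed as dist

def main():
    # Rendezvous via the env:// method; a hang here or at the all-reduce below
    # usually means the nodes cannot reach each other on the chosen ports,
    # which shows up as 0% GPU utilization after model construction.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # the sum equals the world size if all ranks are connected
    print("rank {}/{} sees {}".format(dist.get_rank(), dist.get_world_size(), x.item()))

if __name__ == "__main__":
    main()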

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I'm trying to train on my own custom data and I'm a bit confused about how to "construct and initialize a new model", i.e., when not working with pretrained models. Any help appreciated.
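For what it's worth, "construct and initialize a new model" just means building a config and passing it to the model class, rather than calling from_pretrained. A minimal sketch with today's transformers API (the vocab path and hyperparameters below are illustrative, not prescriptive):

from transformers import BertConfig, BertForMaskedLM, BertTokenizer

# Tokenizer over your own vocabulary (path is hypothetical)
tokenizer = BertTokenizer("./data/vocab.txt")

# BERT-base-sized configuration; shrink these numbers for a smaller model
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)

# Randomly initialized weights; no pretrained checkpoint involved
model = BertForMaskedLM(config)

# Loading pretrained weights, by contrast, would be:
# model = BertForMaskedLM.from_pretrained("bert-base-uncased")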

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@yaroslavvb Hi, I can launch Megatron-LM to pretrain BERT, but my MLM loss stays around 6.8. How about you? Were you able to pretrain BERT successfully?

I was able to pre-train using this repo [https://github.com/google-research/bert]. However, even with one million steps, the MLM accuracy was 64.69% and its loss was 2.4. I am eager to know if someone else has pre-trained and got an MLM accuracy higher than this.

According to the pretraining log from gluon-nlp (https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/bert_base_pretrain.log), your MLM accuracy seems right, though with a higher loss. I think you can try to check it with fine-tuning.

@ibrahimishag I want to know whether you pretrained your BERT on BookCorpus; I cannot find a copy of it. For my pretraining, the BERT loss is decreasing very slowly after removing clip-grad-norm. There must be something wrong on my end.
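For context, global-norm gradient clipping in PyTorch is a single call placed between the backward pass and the optimizer step. A generic toy sketch (not the Megatron-LM training loop):

import torch

# Toy training step showing where clip-grad-norm sits in a PyTorch loop
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10)
loss = model(x).pow(2).mean()
loss.backward()

# Rescale gradients so their global norm is at most 1.0, then step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()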

@JF-D I pre-trained on other domain-specific corpus.

Can someone please explain why Thomas mentions/refers to the XLM repo from Facebook? Is there some shortcoming in the huggingface code? I thought I would just use the Hugging Face repo without using the pretrained parameters they generously provided for us.

I'm just struggling with Facebook's "SpanBERT" repo, and it seems hard to even run it because of a distributed-launch issue. I hope it's OK to use Hugging Face's implementation to reproduce the paper's results.

Is it possible to train from scratch using the run_language_modeling.py code? Does Hugging Face support training from scratch? I looked at this example, https://huggingface.co/blog/how-to-train, but this thread is hinting that training from scratch is not currently supported.

Any update on training from scratch BERT-like models with huggingface?

Yes this has been supported for close to a year now ;)

@julien-c Thanks. I really appreciate the prompt response.

Is there any tutorial/example specifically for BERT (/ALBERT) pretraining ?

Pretraining from scratch is an essential need for users.

Waiting for an example.

This is all there is to pretraining:

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # restrict training to a single GPU

from transformers import (
    BertTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Custom, fixed vocabulary carried over from a previous project (see note below)
tokenizer = BertTokenizer('./data/vocab.txt')

# Quick sanity check that the tokenizer encodes a sample line
tokens = tokenizer.encode("b140 m33 c230")
print('token ids: {}'.format(tokens))

# Small RoBERTa-style configuration (far smaller than roberta-base)
config = RobertaConfig(
    vocab_size=1458,
    max_position_embeddings=130,  # block_size (128) + 2 for RoBERTa's position-id offset
    hidden_size=384,
    intermediate_size=1536,
    num_attention_heads=4,
    num_hidden_layers=4,
    type_vocab_size=1,
)

# FROM SCRATCH
model = RobertaForMaskedLM(config=config)

# CONTINUE TRAINING -- i.e., just load your saved model using "from_pretrained"
# model = RobertaForMaskedLM.from_pretrained('./trained_model')

print(model.num_parameters())

# We should save this dataset since it's a bit slow to build each time
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/my_data.txt",
    block_size=128,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./out/my_run",
    overwrite_output_dir=True,
    num_train_epochs=100,
    per_device_train_batch_size=128,
    save_steps=100,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,
)

trainer.train()

trainer.save_model("./trained_model")

Note that this is a small model, with a specialized, fixed vocabulary, so I'm using the old BERT tokenizer I had working from a previous project. For "real" languages you'd use one of the RobertaTokenizer options.
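For reference, training such a tokenizer is short. A sketch assuming the huggingface tokenizers package (corpus path, vocabulary size, and output directory below are illustrative):

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast

# Train a byte-level BPE vocabulary on a raw-text corpus
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["./data/my_corpus.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
bpe.save_model("./my-tokenizer")  # writes vocab.json and merges.txt

# Load it back through transformers for use with RobertaForMaskedLM
tokenizer = RobertaTokenizerFast.from_pretrained("./my-tokenizer")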

I'm just getting back to this project after being away for a while, and I notice I'm getting a warning about switching to the Datasets library. I'll do that at some point, but it's working for now, so I won't mess with it.
Also, I'm curious if anyone can tell me how to set the maximum length of inputs, so that longer inputs truncate?

UPDATE: Duh, sorry, looks like tokenizer.encode() takes max_length and truncation parameters. Simple.
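For example, against the tokenizer defined in the script above (128 is an arbitrary limit here):

tokens = tokenizer.encode("b140 m33 c230", max_length=128, truncation=True)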

One question: I'm noticing that creating the dataset...

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/my_data.txt",
    block_size=128,
)

...is taking a long time. Is it possible to save that as a file, to avoid the wait when I (re)run training?
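One simple workaround, sketched below, is to pickle the built dataset with torch.save and reload it on later runs. This assumes the tokenizer, data file, and block_size are unchanged between runs, since the cached examples are already tokenized; the cache path is hypothetical.

import os
import torch
from transformers import LineByLineTextDataset

CACHE = "./data/my_data.dataset.pt"  # hypothetical cache location

if os.path.exists(CACHE):
    dataset = torch.load(CACHE)
else:
    dataset = LineByLineTextDataset(
        tokenizer=tokenizer,  # the tokenizer from the script above
        file_path="./data/my_data.txt",
        block_size=128,
    )
    torch.save(dataset, CACHE)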
