Transformers: pre-training a BERT from scratch

Created on 17 Mar 2019 · 26 Comments · Source: huggingface/transformers

I am wondering whether I can train a new BERT from scratch with this PyTorch BERT implementation.

Labels: Discussion, wontfix

Most helpful comment

NVIDIA recently released TensorFlow and PyTorch code to pretrain BERT from scratch. I wrapped it in a script to launch on multiple machines on AWS here. Currently I'm still figuring out why the 64-GPU AWS throughput is 2x worse than what they are getting locally.

All 26 comments

We can't right now; the code is still incomplete. Has this become possible recently? I really want to help, but I'm not familiar with TensorFlow.

A related issue is #376.

However, pytorch-pretrained-BERT was mostly designed to provide easy and fast access to pretrained models.

If you want to train a BERT model from scratch you will need a more robust code base for training and data-processing than the simple examples that are provided in this repo.

I would probably advise moving to a more integrated codebase like the nice XLM repo of @glample and @aconneau.

I've been able to use the codebase for this and didn't see many issues; however, I might be overlooking something. If you construct and initialize a new model instead of loading from pretrained, you can use the simple_lm_finetuning script to train on new data.

Thomas, did you have any specific other issues in mind?

NVIDIA recently released TensorFlow and PyTorch code to pretrain BERT from scratch. I wrapped it in a script to launch on multiple machines on AWS here. Currently I'm still figuring out why the 64-GPU AWS throughput is 2x worse than what they are getting locally.

Thanks @yaroslavvb!

Thanks! @yaroslavvb

@yaroslavvb this article explains why cloud computing can have inconsistent throughput. I think it's a great read, and I've been working on setting up my own rig.

I see in the script that you're using 8 GPUs. How long is the pretraining taking with that? I'm not sure whether to go with gcloud TPUs or AWS. The BERT README said that a single TPU will take up to 2 weeks to finish pretraining.

@yaroslavvb Hi, did you train BERT successfully? I trained it with https://github.com/NVIDIA/Megatron-LM/scripts/pretrain_bert_tfrecords_distributed.sh on 2 machines with 16 GPUs, but it stopped after '> number of parameters: 336226108' and I got nothing else after that; the GPU utilization is 0%.

@MarvinLong yes, I was able to launch it on multiple machines and observe the model training, at about 600ms per step. I did not try training it to completion, as the scaling efficiency on p3dn instances on AWS is currently only about 50% because of an NCCL bug. I'm wondering if your machines can't communicate with each other on the right ports. @jrc2139 I have not observed inconsistent throughput; I've used this codebase to train ImageNet in 19 minutes on 64 GPUs on AWS p3 instances.
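If node-to-node communication is the suspect, a quick sanity check before a full Megatron-LM run is a bare torch.distributed all-reduce across the machines. A minimal sketch, assuming PyTorch with the NCCL backend and the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables (for example set by torch.distributed.launch --use_env on each node):

import os
import torch
import torch.distributed as dist

def main():
    # Rendezvous via the env:// method; a hang here or at the all-reduce below
    # usually means the nodes cannot reach each other on the chosen ports,
    # which shows up as 0% GPU utilization after model construction.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # the sum equals the world size if all ranks are connected
    print("rank {}/{} sees {}".format(dist.get_rank(), dist.get_world_size(), x.item()))

if __name__ == "__main__":
    main()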

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I'm trying to train on my own custom data and I'm a bit confused about how to "construct and initialize a new model", i.e., when not working with pretrained models. Any help appreciated.
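For what it's worth, "construct and initialize a new model" just means building a config and passing it to the model class, rather than calling from_pretrained. A minimal sketch with today's transformers API (the vocab path and hyperparameters below are illustrative, not prescriptive):

from transformers import BertConfig, BertForMaskedLM, BertTokenizer

# Tokenizer over your own vocabulary (path is hypothetical)
tokenizer = BertTokenizer("./data/vocab.txt")

# BERT-base-sized configuration; shrink these numbers for a smaller model
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)

# Randomly initialized weights; no pretrained checkpoint involved
model = BertForMaskedLM(config)

# Loading pretrained weights, by contrast, would be:
# model = BertForMaskedLM.from_pretrained("bert-base-uncased")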

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@yaroslavvb Hi, I can launch Megatron-LM to pretrain BERT, but my MLM loss stays around 6.8. How about you? Were you able to pretrain BERT successfully?

I was able to pre-train using this repo [https://github.com/google-research/bert]. However, even with one million steps, the MLM accuracy was 64.69% and its loss was 2.4. I am eager to know if someone else has pre-trained and got an MLM accuracy higher than this.

According to the pretraining log from gluon-nlp (https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/bert_base_pretrain.log), your MLM accuracy seems right, though with a higher loss. I think you can try to check it with fine-tuning.

@ibrahimishag I want to know whether you pretrained your BERT on BookCorpus; I cannot find a copy of it. For my pretraining, the BERT loss is decreasing very slowly after removing clip-grad-norm. There must be something wrong on my end.
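For context, global-norm gradient clipping in PyTorch is a single call placed between the backward pass and the optimizer step. A generic toy sketch (not the Megatron-LM training loop):

import torch

# Toy training step showing where clip-grad-norm sits in a PyTorch loop
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10)
loss = model(x).pow(2).mean()
loss.backward()

# Rescale gradients so their global norm is at most 1.0, then step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()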

@JF-D I pre-trained on other domain-specific corpus.

Can someone please explain why Thomas mentions/refers to the XLM repo from Facebook? Is there some shortcoming in the huggingface code? I thought I would just use the Hugging Face repo without using the pretrained parameters they generously provided for us.

I'm just struggling with Facebook's "SpanBERT" repo, and it seems hard to even run it because of a distributed-launch issue. I hope it's OK to use Hugging Face's implementation to reproduce the paper's results.

Is it possible to train from scratch using the run_language_modeling.py code? Does Hugging Face support training from scratch? I looked at this example, https://huggingface.co/blog/how-to-train, but this thread is hinting that training from scratch is not currently supported.

Any update on training from scratch BERT-like models with huggingface?

Yes this has been supported for close to a year now ;)

@julien-c Thanks. I really appreciate the prompt response.

Is there any tutorial/example specifically for BERT (/ALBERT) pretraining ?

Pretraining from scratch is an essential need for users.

Waiting for an example.

This is all there is to pretraining:

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # restrict training to a single GPU

from transformers import (
    BertTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Custom, fixed vocabulary carried over from a previous project (see note below)
tokenizer = BertTokenizer('./data/vocab.txt')

# Quick sanity check that the tokenizer encodes a sample line
tokens = tokenizer.encode("b140 m33 c230")
print('token ids: {}'.format(tokens))

# Small RoBERTa-style configuration (far smaller than roberta-base)
config = RobertaConfig(
    vocab_size=1458,
    max_position_embeddings=130,  # block_size (128) + 2 for RoBERTa's position-id offset
    hidden_size=384,
    intermediate_size=1536,
    num_attention_heads=4,
    num_hidden_layers=4,
    type_vocab_size=1,
)

# FROM SCRATCH
model = RobertaForMaskedLM(config=config)

# CONTINUE TRAINING -- i.e., just load your saved model using "from_pretrained"
# model = RobertaForMaskedLM.from_pretrained('./trained_model')

print(model.num_parameters())

# We should save this dataset since it's a bit slow to build each time
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/my_data.txt",
    block_size=128,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./out/my_run",
    overwrite_output_dir=True,
    num_train_epochs=100,
    per_device_train_batch_size=128,
    save_steps=100,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,
)

trainer.train()

trainer.save_model("./trained_model")

Note that this is a small model, with a specialized, fixed vocabulary, so I'm using the old BERT tokenizer I had working from a previous project. For "real" languages you'd use one of the RobertaTokenizer options.
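For reference, training such a tokenizer is short. A sketch assuming the huggingface tokenizers package (corpus path, vocabulary size, and output directory below are illustrative):

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast

# Train a byte-level BPE vocabulary on a raw-text corpus
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["./data/my_corpus.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
bpe.save_model("./my-tokenizer")  # writes vocab.json and merges.txt

# Load it back through transformers for use with RobertaForMaskedLM
tokenizer = RobertaTokenizerFast.from_pretrained("./my-tokenizer")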

I'm just getting back to this project after being away for a while, and I notice I'm getting a warning about switching to the Datasets library. I'll do that at some point, but it's working for now, so I won't mess with it.
Also, I'm curious if anyone can tell me how to set the maximum length of inputs, so that longer inputs truncate?

UPDATE: Duh, sorry, looks like tokenizer.encode() takes max_length and truncation parameters. Simple.
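For example, against the tokenizer defined in the script above (128 is an arbitrary limit here):

tokens = tokenizer.encode("b140 m33 c230", max_length=128, truncation=True)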

One question: I'm noticing that creating the dataset...

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/my_data.txt",
    block_size=128,
)

...is taking a long time. Is it possible to save that as a file, to avoid the wait when I (re)run training?
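One simple workaround, sketched below, is to pickle the built dataset with torch.save and reload it on later runs. This assumes the tokenizer, data file, and block_size are unchanged between runs, since the cached examples are already tokenized; the cache path is hypothetical.

import os
import torch
from transformers import LineByLineTextDataset

CACHE = "./data/my_data.dataset.pt"  # hypothetical cache location

if os.path.exists(CACHE):
    dataset = torch.load(CACHE)
else:
    dataset = LineByLineTextDataset(
        tokenizer=tokenizer,  # the tokenizer from the script above
        file_path="./data/my_data.txt",
        block_size=128,
    )
    torch.save(dataset, CACHE)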
