Transformers: training a new BERT from scratch doesn't seem to work

Created on 18 Jan 2019 · 16 comments · Source: huggingface/transformers

I tried to train a BERT model from scratch with "run_lm_finetuning.py" on the toy training data (samples/sample.txt) by changing the following:

from pytorch_pretrained_bert import BertConfig, BertForPreTraining

# model = BertForPreTraining.from_pretrained(args.bert_model)
# build an untrained model from a fresh config instead of loading pre-trained weights
bert_config = BertConfig.from_json_file('bert_config.json')
model = BertForPreTraining(bert_config)

where the JSON file comes from the BERT-Base, Multilingual Cased checkpoint.

To check the correctness of training, I printed the sequence-relationship scores (used for the next-sentence-prediction task) in "pytorch_pretrained_bert/modeling.py":
prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)
print(seq_relationship_score)

The result was the following (picking one example from a single batch):

tensor([[-0.1078, -0.2696],
[-0.1425, -0.3207],
[-0.0179, -0.2271],
[-0.0260, -0.2963],
[-0.1410, -0.2506],
[-0.0566, -0.3013],
[-0.0874, -0.3330],
[-0.1568, -0.2580],
[-0.0144, -0.3072],
[-0.1527, -0.3178],
[-0.1288, -0.2998],
[-0.0439, -0.3267],
[-0.0641, -0.2566],
[-0.1496, -0.3696],
[ 0.0286, -0.2495],
[-0.0922, -0.3002]], device='cuda:0', grad_fn=AddmmBackward)

Notice that the score in the first column is higher than the score in the second column for every row, i.e. the model predicted the same next-sentence label for every example in the batch. This was the case for every batch, which I feel shouldn't happen.
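For reference, a minimal standalone sketch of how such logits can be turned into hard predictions to see that a single class is chosen for the whole batch (it uses the first two rows printed above; as far as I understand this codebase, label 0 means "is the next sentence"):

import torch

# standalone illustration using the first two rows of the tensor above
seq_relationship_score = torch.tensor([[-0.1078, -0.2696],
                                       [-0.1425, -0.3207]])
probs = torch.softmax(seq_relationship_score, dim=-1)   # per-class probabilities
preds = seq_relationship_score.argmax(dim=-1)           # 0 = "is next", 1 = "not next"
print(preds)                        # tensor([0, 0]) -> the same label for every example
print(preds.bincount(minlength=2))  # class counts; an all-one-class batch is suspicious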

All 16 comments

Hi @UCJerryDong,

Training BERT from scratch takes a (very) long time (see the paper for TPU training figures; a rough estimate for GPUs is about a week on 64 GPUs). This script is intended more for fine-tuning (using the pre-training objective) than for training from scratch.

Did you monitor the losses during training and wait for convergence?

Hi, I am trying to do something similar :) My guess is that sample.txt is too small.

@thomwolf Just to confirm, the above code should produce a new BERT model from scratch, based on the existing vocab file, right? Thanks!

It seems to be problematic to generate new samples every epoch, at least for such a small corpus.
The model converged for me with --num_train_epochs 50.0 if I reuse the same train_dataset, by adding train_dataset = [train_dataset[i] for i in range(len(train_dataset))] to the code.
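For context, a minimal self-contained sketch of what that caching does (ToyMaskedDataset is a hypothetical stand-in for the script's dataset class, which regenerates masked examples on every access):

import random
from torch.utils.data import Dataset

class ToyMaskedDataset(Dataset):
    """Stand-in for the script's dataset: __getitem__ re-masks on every call."""
    def __init__(self, n=36):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        return random.randint(0, 9)   # fresh random "masking" on every access

train_dataset = ToyMaskedDataset()

# The trick: index every example once and keep the results, so that later
# epochs see exactly the same (already masked) instances.
train_dataset = [train_dataset[i] for i in range(len(train_dataset))]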

Hi @thomwolf,

I trained the model for an hour, but the loss stays around 0.6-0.8 and never converges. I know it's computationally expensive to train BERT; that's why I chose a very small dataset (sample.txt, which only has 36 lines).

The main issue is that I have tried the same dataset with the original TensorFlow BERT, and it converges within 5 minutes:

next_sentence_accuracy = 1.0
next_sentence_loss = 0.00012585879

That's why I'm wondering if something is wrong with the model. I have also checked the output of each forward step and found that the rows of the encoded_layers matrix are very similar to each other:
encoded_layers = self.encoder(embedding_output, extended_attention_mask, output_all_encoded_layers=output_all_encoded_layers)
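A hedged sketch of one way to quantify that observation (the helper below is hypothetical; in modeling.py, encoded_layers[-1] would be the [batch, seq_len, hidden] output of the last layer):

import torch
import torch.nn.functional as F

def mean_pairwise_cosine(hidden_states):
    """Average pairwise cosine similarity between the rows of a [seq_len, hidden] matrix."""
    normed = F.normalize(hidden_states, dim=-1)
    return (normed @ normed.t()).mean()

# Example with random features; values near 1.0 would mean the encoder is
# producing nearly identical vectors for every position.
print(mean_pairwise_cosine(torch.randn(16, 768)))
# Inside the model, one could check e.g. mean_pairwise_cosine(encoded_layers[-1][0]).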

Ok, that's strange indeed. Can you share your code? I can have a look.

I haven't tried the pre-training script myself yet.

Thanks for helping! I have created a GitHub repo with my modified code. Also, I have tried what @nhatchan suggested (thanks!) and it does work.

But I feel that shouldn't be the final solution, since it keeps every example in memory, which would require too much memory when training on a real dataset.

Thanks, I'll have a look. Can you also show me what you did with the TensorFlow model so I can compare the behavior in the two cases?

I just followed the instructions under the section Pre-training with BERT.

But I feel that shouldn't be the final solution, since it keeps every example in memory, which would require too much memory when training on a real dataset.

@UCJerryDong Yes, I just pointed out one of the differences from the TensorFlow version, which is why I didn't send a PR addressing it. I'm not even sure whether it affects model performance when you train on a real dataset.

Incidentally, I'm also trying to do something similar with real data, but the losses still seem higher than those of the TensorFlow version. I suspect some of the minor differences (like this one, and issues 195 and 38), but I haven't figured it out yet.

Hi guys,

see the paper for TPU training figures; a rough estimate for GPUs is about a week on 64 GPUs

Btw, there is an article on this topic http://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/

I was wondering whether someone has tried tweaking some parameters of the transformer so that it converges much faster (perhaps at the expense of accuracy), e.g.:

  • Initializing the embedding layer with FastText / your embeddings of choice - in our tests this boosted accuracy and convergence for plainer models;
  • Using a more standard 200- or 300-dimensional embedding instead of 768 (also tweaking the hidden size accordingly) - see the sketch after this list.
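A hedged sketch of both ideas combined, assuming pytorch_pretrained_bert's BertConfig/BertForPreTraining and a FastText matrix already aligned to the BERT vocabulary (the vocab size, layer counts, and the random placeholder matrix below are illustrative assumptions):

import numpy as np
import torch
from pytorch_pretrained_bert import BertConfig, BertForPreTraining

# Smaller model: 300-d embeddings / hidden size instead of 768.
config = BertConfig(vocab_size_or_config_json_file=30522,
                    hidden_size=300,          # must stay divisible by num_attention_heads
                    num_hidden_layers=6,
                    num_attention_heads=6,
                    intermediate_size=1200)
model = BertForPreTraining(config)

# Placeholder for a [vocab_size, 300] FastText matrix aligned with the BERT vocab.
fasttext_vectors = np.random.randn(30522, 300).astype("float32")
with torch.no_grad():
    model.bert.embeddings.word_embeddings.weight.copy_(torch.from_numpy(fasttext_vectors))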

Personally, the allure of the transformer for me is not really the state-of-the-art accuracy, but having the same architecture applicable to any sort of NLP task (whereas e.g. QA or SQuAD-like objectives may require custom engineering or non-transferable models).

Hi, I have a question: which line of code leaves the pretrained model frozen (for fine-tuning) rather than trainable?

Hi @snakers4 and @BITLsy, please open new issues for your problems and discussion.

Hi @thomwolf Do you have any update on this? Is the issue resolved?

Hi @ntomita, yes, this is just a difference in behavior between the TensorFlow and PyTorch training code:

  • the original TensorFlow code does static masking, in which the masking of the training dataset is computed once and for all, so you can quickly overfit a small training set in a few epochs;
  • our code uses dynamic masking, where the masking is generated on the fly, so overfitting a single batch takes more epochs (see the sketch below).
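A minimal, framework-free sketch of the difference (simplified: real BERT masking also uses random/unchanged replacements and a per-sequence masking budget):

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Simplified MLM masking: replace a random subset of tokens with [MASK]."""
    return [mask_token if random.random() < mask_prob else t for t in tokens]

corpus = [["the", "cat", "sat", "on", "the", "mat"]]

# Static masking (original TF pre-processing): mask once, reuse every epoch.
static_data = [mask_tokens(sent) for sent in corpus]
for epoch in range(3):
    batch = static_data                               # identical masked positions each epoch

# Dynamic masking (this repo / RoBERTa): re-mask on the fly each epoch.
for epoch in range(3):
    batch = [mask_tokens(sent) for sent in corpus]    # new masked positions each time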

The recent RoBERTa paper (http://arxiv.org/abs/1907.11692) compares the two approaches (see section 4.1) and concludes that dynamic masking is comparable to or slightly better than static masking (as expected, I would say).

Hi @thomwolf, that's awesome! I had been working for quite a while on pretraining a modified BERT model with this library on our own data, struggled with convergence, and wondered whether I should try other libraries like the original TF implementation or fairseq, since other people reported slower convergence with this library. I use dynamic masking, so what you're saying is reasonable. I also saw recently that the Microsoft Azure group successfully pretrained their models, which are implemented with this library. Since you keep telling people that this library is not meant for pretraining, I thought there might be some critical bugs in the models or the optimization code. I needed some confidence to keep working with this library, so thanks for your follow-up!

No "critical bugs" indeed lol :-)
You can use this library as the basis for training from scratch (like Microsoft and NVIDIA did).
We just don't provide training scripts (at the current stage; maybe we'll add some later, but I would like to keep them simple if we do).
