Transformers: xlm-roberta (large/base): run_language_modeling.py cannot start training

Created on 23 Apr 2020 · 5 comments · Source: huggingface/transformers

Hi HuggingFace, thank you very much for your great contribution.

โ“ Questions & Help

My problem is: run_language_modeling.py takes an abnormally long time with xlm-roberta-large and xlm-roberta-base _before_ training starts. It got stuck at the following step for 7 hours (so I eventually gave up):

transformers.data.datasets.language_modeling - Creating features from dataset file at ./

I have successfully run gpt2-large and distilbert-base-multilingual-cased with exactly the same command below (only changing the model), and they start training within just 2-3 minutes. At first I thought it was due to the large size of XLM-RoBERTa, but gpt2-large has a similar size, so could there be a problem with fine-tuning XLM-RoBERTa specifically (maybe a bug in the current version)?

I also tried rerunning the same command on another machine and it got stuck in the same way (which is not the case for gpt2-large and distilbert-base-multilingual-cased).

Update: the same thing happens with xlm-roberta-base.

Command Details I used

Machine: AWS p3.2xlarge (V100, 64 GB RAM)
Training file size: around 60 MB

```bash
!python transformers/examples/run_language_modeling.py \
    --model_type=xlm-roberta \
    --model_name_or_path=xlm-roberta-large \
    --do_train \
    --mlm \
    --per_gpu_train_batch_size=1 \
    --gradient_accumulation_steps=8 \
    --train_data_file={TRAIN_FILE} \
    --num_train_epochs=2 \
    --block_size=225 \
    --output_dir=output_lm \
    --save_total_limit=1 \
    --save_steps=10000 \
    --cache_dir=output_lm \
    --overwrite_cache \
    --overwrite_output_dir
```

All 5 comments

Have you tried launching a debugger to see exactly what takes a long time?

I would use VS Code remote debugging.
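
For anyone following along, one way to set this up is to pause the script and attach the editor remotely. This is only a sketch: debugpy is the current name of the package behind VS Code's Python remote debugging (ptvsd was its predecessor around the time of this issue), and the host/port are placeholders.

```python
# Hedged sketch: add near the top of run_language_modeling.py, then attach
# from VS Code with a "Remote Attach" configuration pointing at port 5678.
import debugpy

debugpy.listen(("0.0.0.0", 5678))   # open a debug server on all interfaces
print("Waiting for VS Code to attach...")
debugpy.wait_for_client()           # block until the editor connects
debugpy.breakpoint()                # pause here so you can step into the slow part
```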

I would guess that your tokenization process takes too long. If you're training a new LM from scratch, I would recommend using the fast Tokenizers library written in Rust. You can initialize a new ByteLevelBPETokenizer instance in your LineByLineTextDataset class and encode_batch your text with it.
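
For reference, a minimal sketch of that suggestion using the standalone `tokenizers` package, assuming a from-scratch byte-level BPE vocabulary trained on a hypothetical `train.txt` (the file name, vocab size, and special tokens are placeholders, not taken from this issue):

```python
from tokenizers import ByteLevelBPETokenizer

# Train a new byte-level BPE tokenizer on the raw training text.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["train.txt"],
    vocab_size=52000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# encode_batch tokenizes all lines in one call on the Rust side,
# which is much faster than encoding each line from Python.
with open("train.txt", encoding="utf-8") as f:
    lines = [line for line in f if len(line.strip()) > 0]

encodings = tokenizer.encode_batch(lines)
input_ids = [enc.ids for enc in encodings]  # one list of token ids per line
```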

Thank you guys, I finally managed to fine-tune XLM-RoBERTa-Large, but I had to wait 11 hours before training started!

Since I did not want to train from scratch, I took a tip from @mfilipav and converted the pretrained tokenizer to a fast tokenizer (and since it's SentencePiece, I had to use sentencepiece_extractor.py), then set use_fast = True in run_language_modeling.py. However, since it was still 11 hours of waiting, maybe this didn't help.
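
For context, the `use_fast=True` change mentioned above amounts to something like the following where run_language_modeling.py loads the tokenizer; this is a sketch, and whether a fast XLM-R tokenizer is actually picked up depends on the transformers/tokenizers versions and on the converted vocab files:

```python
from transformers import AutoTokenizer

# Ask AutoTokenizer for the Rust-backed ("fast") implementation where one
# is registered for the model; otherwise the slow Python tokenizer is used.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large", use_fast=True)
```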

UPDATE: By adding the --line_by_line option, training starts very quickly. Closing the issue!
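
For anyone hitting the same wall, here is a sketch of what `--line_by_line` switches between (class names as in `transformers.data.datasets.language_modeling` at the time of this issue; the file path and block size below just mirror the command above):

```python
from transformers import AutoTokenizer, LineByLineTextDataset, TextDataset

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

# Default path: TextDataset reads the whole file as one string, tokenizes it
# in one go and slices it into block_size chunks -- this is the step that
# hangs at "Creating features from dataset file" when the tokenizer is slow.
# dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=225)

# With --line_by_line: each line becomes one example and all lines are
# tokenized with a single batch_encode_plus call, which is much faster.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="train.txt", block_size=225
)
```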

@ratthachat how fast did it become after enabling "--line_by_line true"? I have been waiting for almost 1 hour. My training set is 11 GB, and here are my parameters:
```bash
export TRAIN_FILE=/hdd/sifat/NLP/intent_classification/bert_train.txt
export TEST_FILE=/hdd/sifat/NLP/intent_classification/data_corpus/test.txt

python examples/run_language_modeling.py \
    --output_dir ./bert_output \
    --model_type=bert \
    --model_name_or_path=bert-base-multilingual-cased \
    --mlm \
    --line_by_line true \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --learning_rate 1e-4 \
    --num_train_epochs 3 \
    --save_total_limit 2 \
    --save_steps 2000 \
    --per_gpu_train_batch_size 5 \
    --evaluate_during_training \
    --seed 42
```

Zaowad, your training file is much bigger than mine, so I guess 1 hour is not bad ;) You could also try the fp16 option as well.
