I am training a RoBERTa model by running the script examples/run_language_modeling.py.
The following error occurs when I try to resume training:
Traceback (most recent call last):
  File "examples/run_language_modeling.py", line 284, in <module>
    main()
  File "examples/run_language_modeling.py", line 254, in main
    trainer.train(model_path=model_path)
  File "/home/socian-pc1/anaconda3/envs/XformerEnv/lib/python3.6/site-packages/transformers/trainer.py", line 326, in train
    optimizer.step()
  File "/home/socian-pc1/anaconda3/envs/XformerEnv/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/socian-pc1/anaconda3/envs/XformerEnv/lib/python3.6/site-packages/transformers/optimization.py", line 155, in step
    exp_avg.mul_(beta1).add_(1.0 - beta1, grad)
RuntimeError: expected device cpu but got device cuda:0
My config:
python examples/run_language_modeling.py \
--train_data_file $TRAIN_FILE \
--eval_data_file $TEST_FILE \
--output_dir ./MyRobertaOutput \
--model_name_or_path ./MyRoBERTa/checkpoint-570000 \
--config_name ../xformer_output \
--tokenizer_name ../xformer_output \
--mlm \
--do_train \
--do_eval \
--line_by_line \
--learning_rate 1e-5 \
--num_train_epochs 2 \
--save_total_limit 20 \
--save_steps 5000 \
--per_gpu_train_batch_size 6 \
--warmup_steps=10000 \
--logging_steps=100 \
--gradient_accumulation_steps=4 \
--seed 666 --block_size=512
Try initializing the model in the Trainer script from transformers with self.model = model.cuda().
I am getting the same error. Is there a workaround?
What @tebandesade mentioned didn't work for me.
I faced the same problem with RoBERTa pretraining; however, inserting the line model = model.cuda() before the trainer is created in run_language_modeling.py fixed it for me.
@tebandesade, Thank you!
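The ordering fix described above (move the model to the GPU before the optimizer is built and its saved state is restored) can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not the Trainer's code: torch.optim.AdamW stands in for the AdamW in transformers/optimization.py, a toy nn.Linear stands in for RoBERTa, and make_saved_state and resume_on_device are hypothetical helpers. It relies on the fact that Optimizer.load_state_dict casts the loaded state tensors to the device of the parameters at load time.

```python
import torch
from torch import nn


def make_saved_state():
    """Produce an optimizer state_dict, standing in for a checkpoint's optimizer.pt."""
    model = nn.Linear(4, 2)  # hypothetical stand-in for the RoBERTa model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model(torch.randn(3, 4)).sum().backward()
    optimizer.step()  # creates exp_avg / exp_avg_sq state on the CPU
    return optimizer.state_dict()


def resume_on_device(saved_state, device):
    """Rebuild the model/optimizer pair with the model already on `device`.

    Because the model is moved before the optimizer is created and its state
    is loaded, load_state_dict casts the saved state tensors to the device
    the parameters already live on.
    """
    model = nn.Linear(4, 2).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    optimizer.load_state_dict(saved_state)
    return model, optimizer


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, optimizer = resume_on_device(make_saved_state(), device)
    model(torch.randn(3, 4, device=device)).sum().backward()
    optimizer.step()  # params and exp_avg share a device, so no mismatch
    param = next(model.parameters())
    print(param.device, optimizer.state[param]["exp_avg"].device)
```

With the Trainer, the equivalent is what the comment above does: call model.cuda() (or model.to(device)) before handing the model to the trainer.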
Hello! I'm having trouble reproducing that on master. Do you mind installing from source and letting me know if you still have the issue? Thank you!
Hi @LysandreJik, installing from source doesn't fix the issue, though @tebandesade's suggestion works fine.
@octalpixel, try editing this line; it should work:
https://github.com/huggingface/transformers/blob/62427d0815825436fa55b43725f44776e94abb65/src/transformers/trainer.py#L145
I am getting this error too.
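I think the issue is that the optimizer is set up from self.model while the model is still on the CPU, and the model is only moved to the device afterwards, so the optimizer state restored from the checkpoint stays on the CPU. That is why self.model = model.cuda() fixes the error.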
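The failure mode described in this comment can be reproduced in a few lines of PyTorch. This is a hedged sketch, assuming torch.optim.AdamW as a stand-in for the AdamW in transformers/optimization.py and a toy nn.Linear in place of RoBERTa; resume_in_wrong_order and state_tensor_devices are hypothetical helpers. The optimizer state is loaded while the model is on the CPU, and the model is moved only afterwards.

```python
import torch
from torch import nn


def state_tensor_devices(optimizer):
    """Return the devices of the per-parameter state tensors (exp_avg, ...).

    The scalar "step" entry is skipped: recent PyTorch versions deliberately
    keep it on the CPU unless the optimizer is capturable or fused.
    """
    return {
        value.device
        for state in optimizer.state.values()
        for key, value in state.items()
        if torch.is_tensor(value) and key != "step"
    }


def resume_in_wrong_order(device):
    # Train one step on the CPU so there is optimizer state to "checkpoint".
    model = nn.Linear(4, 2)  # hypothetical stand-in for the RoBERTa model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model(torch.randn(3, 4)).sum().backward()
    optimizer.step()
    saved_state = optimizer.state_dict()

    # Resume the way the comment describes: the optimizer is built and its
    # state loaded while the model is still on the CPU ...
    resumed = nn.Linear(4, 2)
    resumed_optimizer = torch.optim.AdamW(resumed.parameters(), lr=1e-5)
    resumed_optimizer.load_state_dict(saved_state)  # state cast to CPU params
    # ... and only then is the model moved. Module.to() updates the parameters
    # in place, so the optimizer still tracks them, but its state stays behind.
    resumed.to(device)
    return resumed, resumed_optimizer


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, optimizer = resume_in_wrong_order(device)
    print("params:", next(model.parameters()).device)
    print("optimizer state:", state_tensor_devices(optimizer))
```

On a GPU machine this should report cuda:0 for the parameters but cpu for the optimizer state, and a subsequent optimizer.step() is expected to fail with a device-mismatch RuntimeError like the one in the traceback.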
Should be fixed on master thanks to @shaoyent. Give it a try and let us know :)
I am getting this error too.
I think the issue is that the optimizer is set up from self.model while it is still on the CPU, but the model is moved to the device afterwards. That is why self.model = model.cuda() fixes the error.
It works if you have not installed the transformers package; otherwise, you will need to modify the file inside the installed package.
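If you do edit the installed package, an alternative to moving the model earlier is to move the optimizer state onto the model's device after it has been loaded. This is a minimal sketch, assuming torch.optim.AdamW as a stand-in for the AdamW in transformers/optimization.py and a toy nn.Linear model. move_optimizer_state is a hypothetical helper, not a PyTorch or transformers API, and where to call it depends on your installed trainer.py (presumably right after the optimizer state is loaded from the checkpoint).

```python
import torch
from torch import nn


def move_optimizer_state(optimizer, device):
    """Move the optimizer's per-parameter state tensors onto `device` in place.

    Hypothetical helper. The scalar "step" counter is left alone, mirroring
    Optimizer.load_state_dict in recent PyTorch versions, which keeps it on
    the CPU unless capturable/fused is set.
    """
    for state in optimizer.state.values():
        for key, value in list(state.items()):
            if torch.is_tensor(value) and key != "step":
                state[key] = value.to(device)


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(4, 2)  # hypothetical stand-in for the RoBERTa model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model(torch.randn(3, 4)).sum().backward()
    optimizer.step()  # state is created on the CPU
    model.to(device)  # model moved afterwards, as in the bug
    move_optimizer_state(optimizer, device)
    optimizer.zero_grad()
    model(torch.randn(3, 4, device=device)).sum().backward()
    optimizer.step()  # state and params now share a device
```

Moving the model before the optimizer is created is the simpler fix; this helper is mainly useful when the optimizer has already been built and loaded.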