Error message when fine-tuning BERT or XLNet on SQuAD1.1 or 2.0 with dual 1080Ti GPUs:
_"RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1"_
Model I am using: BERT & XLNet
Language I am using the model on: English
The problem arises when using:
The task I am working on is:
One shell script (there are others) that had worked before:
SQUAD_DIR=/media/dn/dssd/nlp/squad1.1
python ./run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--do_lower_case \
--train_file=${SQUAD_DIR}/train-v1.1.json \
--predict_file=${SQUAD_DIR}/dev-v1.1.json \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--gradient_accumulation_steps=1 \
--learning_rate=3e-5 \
--num_train_epochs=2 \
--max_seq_length=384 \
--doc_stride=128 \
--save_steps=2000 \
--output_dir=./runs/bert_base_squad1_dp_ft_3
Runs are in a dedicated environment with only the following packages:
python 3.7.4
pytorch 1.3.0, install includes cudatoolkit 10.1
tensorflow_gpu 2.0 and dependencies
apex 0.1
transformers 2.1.1
Complete terminal output:
In the run_*.py script, change the line
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
to
device = torch.device("cuda:0" if torch.cuda.is_available() and not args.no_cuda else "cpu")
This works in my environment.
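For reference, here is a minimal sketch of what the two device strings resolve to. The helper name pick_device and its arguments are placeholders (no_cuda stands in for the script's args.no_cuda), and constructing a torch.device object does not itself require a GPU.

```python
import torch

def pick_device(no_cuda: bool = False, pinned: bool = True) -> torch.device:
    """Mirror the device line from run_*.py (hypothetical helper)."""
    use_cuda = torch.cuda.is_available() and not no_cuda
    if not use_cuda:
        return torch.device("cpu")
    # pinned=True corresponds to the suggested "cuda:0" variant
    return torch.device("cuda:0" if pinned else "cuda")

# "cuda" carries no explicit index; "cuda:0" names the first GPU
print(torch.device("cuda").index)
print(torch.device("cuda:0").index)
print(pick_device(no_cuda=True))
```

The only difference this shows is the explicit index. Whether that index is what resolves the error on a given setup is not something the sketch can confirm.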
It seems that all GPUs are still used even if we specify "cuda:0" here, but I am not sure how much the other GPUs contribute to the computation. In my case I have eight 1080Ti cards, and the other seven are rarely fully loaded.
Has anyone compared the training speed with and without this error?
In my case, the solution is changing
if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)
to
if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
    model = torch.nn.DataParallel(model)
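A minimal sketch of the guard above, assuming the error comes from the model being wrapped a second time (for example, once in train and again in evaluate). The helper name wrap_for_multi_gpu and the nn.Linear stand-in are illustrative and not part of the scripts. The snippet only checks the wrapping logic, so it also runs on a machine without GPUs.

```python
import torch
from torch import nn

def wrap_for_multi_gpu(model: nn.Module, n_gpu: int) -> nn.Module:
    """Wrap in DataParallel at most once (hypothetical helper)."""
    if n_gpu > 1 and not isinstance(model, nn.DataParallel):
        model = nn.DataParallel(model)
    return model

model = nn.Linear(4, 2)                     # stand-in for BERT/XLNet
once = wrap_for_multi_gpu(model, n_gpu=2)   # e.g. wrapped during training
twice = wrap_for_multi_gpu(once, n_gpu=2)   # e.g. reached again in evaluation

print(twice is once)                              # second call is a no-op
print(isinstance(once.module, nn.DataParallel))   # inner module is not re-wrapped
```

Without the isinstance check, the second call would produce a DataParallel around a DataParallel, which is the situation the guard is meant to rule out.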
Changing this in the evaluate function fixes the error when I run with --evaluate_during_training.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Agreed. Also notice that
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
is now multiplied by n_gpu again, which is undesirable.
Thanks! I hit the same error in the evaluation function, and this works for me.
This solution fixed the issue for me. I am observing this while training a new LM using transformers 2.5.1. The issue happened during evaluation.
One more comment about this fix. With run_language_modeling.py, if the validation set has an odd number of instances, the line
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
raises an error. This happens because the parallel GPUs need two instances to be fed in.
I don't know how to fix this properly. All I do is append a copy of the last instance so that the count meets the requirement.
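The workaround described above can be sketched as follows. The helper name pad_to_multiple is hypothetical, and whether the right multiple is the number of GPUs or the batch size depends on the setup, so treat that argument as an assumption.

```python
from typing import List, TypeVar

T = TypeVar("T")

def pad_to_multiple(examples: List[T], multiple: int) -> List[T]:
    """Repeat the last example until the length is a multiple of `multiple`."""
    if not examples or multiple <= 1:
        return list(examples)
    remainder = len(examples) % multiple
    if remainder == 0:
        return list(examples)
    # Append copies of the final example to fill the last group
    return list(examples) + [examples[-1]] * (multiple - remainder)

padded = pad_to_multiple(["a", "b", "c"], multiple=2)
print(padded)  # ['a', 'b', 'c', 'c']
```

Note that duplicated instances are counted twice in the evaluation metrics. Passing drop_last=True to the DataLoader is another option, at the cost of skipping the incomplete final batch.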