Transformers: Fine-tuning with run_squad.py, Transformers 2.1.1 & PyTorch 1.3.0 Data Parallel Error

Created on 12 Oct 2019 · 11 Comments · Source: huggingface/transformers

🐛 Bug

Error message when fine-tuning BERT or XLNet on SQuAD1.1 or 2.0 with dual 1080Ti GPUs:

_"RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1"_

Model I am using: BERT & XLNet

Language I am using the model on: English

The problem arises when using:

[X] my own modified scripts: example script file below which ran successfully under previous PyTorch, PyTorch-Transformers, & Transformers versions.

The task I am working on is:

[X] an official GLUE/SQuAD task: (give the name) SQuAD 1.1 & 2.0

To Reproduce

One shell script (there are others) that had worked before:

SQUAD_DIR=/media/dn/dssd/nlp/squad1.1

python ./run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--do_lower_case \
--train_file=${SQUAD_DIR}/train-v1.1.json \
--predict_file=${SQUAD_DIR}/dev-v1.1.json \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--gradient_accumulation_steps=1 \
--learning_rate=3e-5 \
--num_train_epochs=2 \
--max_seq_length=384 \
--doc_stride=128 \
--save_steps=2000 \
--output_dir=./runs/bert_base_squad1_dp_ft_3

Environment

OS: Ubuntu 18.04, Linux kernel 4.15.0-65-generic
Python version: 3.7.4
PyTorch version: 1.3.0
Transformers version: 2.1.1 built from latest source
Using GPU? NVIDIA 1080Ti x 2
Distributed or parallel setup? Data Parallel
Any other relevant information: I have had many successful SQuAD fine-tuning runs on PyTorch 1.2.0 with PyTorch-Transformers 1.2.0, possibly even Transformers 2.0.0, and Apex 0.1. A new environment built with the latest versions (PyTorch 1.3.0, Transformers 2.1.1) produces the data-parallel error above.

wontfix

Source

ahotrod

Most helpful comment

In my case, the solution is changing

if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)

to

if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
    model = torch.nn.DataParallel(model)

shuaihuaiyi on 6 Dec 2019

👍15 🎉8
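
The guard in the comment above can be sketched as a small helper that wraps a model at most once, which matters when the same model object reaches the wrapping code more than once (for example, in both the training and evaluation functions). This is a minimal sketch, not code from run_squad.py: the helper name wrap_data_parallel is hypothetical, and a small torch.nn.Linear stands in for BERT or XLNet.

```python
import torch


def wrap_data_parallel(model: torch.nn.Module, n_gpu: int) -> torch.nn.Module:
    """Wrap `model` in DataParallel at most once (hypothetical helper)."""
    if n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
        model = torch.nn.DataParallel(model)
    return model


# A small stand-in model; the scripts in this thread use BERT or XLNet here.
base = torch.nn.Linear(4, 2)

# The first call wraps the model; the second call leaves the wrapper untouched.
wrapped_once = wrap_data_parallel(base, n_gpu=2)
wrapped_twice = wrap_data_parallel(wrapped_once, n_gpu=2)

print(type(wrapped_once).__name__)    # DataParallel
print(wrapped_twice is wrapped_once)  # True
```

Without the isinstance check, the second call would nest one DataParallel inside another, which is the situation the guard is meant to avoid.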

All 11 comments

Runs are in a dedicated environment with only the following packages:

python 3.7.4
pytorch 1.3.0, install includes cudatoolkit 10.1
tensorflow_gpu 2.0 and dependencies
apex 0.1
transformers 2.1.1

Complete terminal output:

output_term_ERROR.TXT

ahotrod on 12 Oct 2019

Change the line in run_**.py
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
to
device = torch.device("cuda:0" if torch.cuda.is_available() and not args.no_cuda else "cpu")

In my environment, it works.

h-sugi on 8 Nov 2019

👍1
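
As a rough illustration of the suggestion above, the two device strings differ only in whether a GPU index is set: torch.device("cuda") carries no index, while torch.device("cuda:0") pins index 0, the device that the error message names as device_ids[0]. The helper name pick_device is hypothetical and its no_cuda parameter mirrors args.no_cuda; this is a sketch, not the script's actual code.

```python
import torch


def pick_device(no_cuda: bool = False) -> torch.device:
    """Return cuda:0 when CUDA is usable, otherwise the CPU (hypothetical helper)."""
    use_cuda = torch.cuda.is_available() and not no_cuda
    return torch.device("cuda:0" if use_cuda else "cpu")


# Constructing a device object does not require a GPU to be present.
print(torch.device("cuda").index)    # None: no explicit GPU index
print(torch.device("cuda:0").index)  # 0: pinned to the first GPU

device = pick_device()
model = torch.nn.Linear(4, 2).to(device)  # stand-in for the real model
print(next(model.parameters()).device == device)
```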

Change the line in run_**.py
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
to
device = torch.device("cuda:0" if torch.cuda.is_available() and not args.no_cuda else "cpu")

In my environment, it works.

It seems that all GPUs are still used even if we specify "cuda:0" here, but I am not sure how much the other GPUs contribute to the computation. In my case, I have eight 1080Ti GPUs, and the other seven are far from fully loaded.

Has anyone compared the training speed with/without this error?

loveritsu929 on 19 Nov 2019
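
One rough way to look into the question above is to print how much memory PyTorch has allocated on each visible GPU during a run. This is only a sketch: allocated memory is a proxy, not a measure of compute utilization (a tool such as nvidia-smi reports that), and the helper name gpu_memory_report is hypothetical.

```python
import torch


def gpu_memory_report() -> list:
    """Return (index, name, MiB allocated by PyTorch) for each visible GPU."""
    report = []
    for i in range(torch.cuda.device_count()):
        allocated_mib = torch.cuda.memory_allocated(i) / (1024 ** 2)
        report.append((i, torch.cuda.get_device_name(i), allocated_mib))
    return report


# Call this inside the training loop; on a CPU-only machine the list is empty.
for index, name, allocated_mib in gpu_memory_report():
    print(f"cuda:{index} ({name}): {allocated_mib:.1f} MiB allocated")
```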

In my case, the solution is changing

if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)

to

if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
    model = torch.nn.DataParallel(model)

Changing this in the evaluate function fixes the error when I run with --evaluate_during_training.

shyamrallapalli on 15 Dec 2019

👍8

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 13 Feb 2020

In my case, the solution is changing

if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)

to

if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
    model = torch.nn.DataParallel(model)

Agreed. Also note that
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
is now multiplied by n_gpu again, which is undesired.

eyal-orbach on 14 Feb 2020
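
For reference, the batch-size expression quoted above can be worked through with the values from the original report (per-GPU batch size 8, two GPUs). The sketch below only evaluates that expression and does not reproduce the behavior the comment describes. The remark that DataParallel splits each batch along its first dimension is general PyTorch behavior rather than something stated in this thread, and the function name is hypothetical.

```python
def total_eval_batch_size(per_gpu_eval_batch_size: int, n_gpu: int) -> int:
    """Total evaluation batch size, following the expression quoted above."""
    return per_gpu_eval_batch_size * max(1, n_gpu)


# Values from the report: --per_gpu_eval_batch_size=8 on two 1080Ti GPUs.
print(total_eval_batch_size(8, 2))  # 16, which DataParallel splits across the 2 GPUs
print(total_eval_batch_size(8, 0))  # 8, the CPU case (max(1, 0) == 1)
```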

In my case, the solution is changing

if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)

to

if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
    model = torch.nn.DataParallel(model)

Thanks! I met the same error in the evaluation function. It works for me.

zhanlaoban on 2 Mar 2020

👍2

In my case, the solution is changing
if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)
to
if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
    model = torch.nn.DataParallel(model)
Changing this in the evaluate function fixes the error when I run with --evaluate_during_training.

This solution fixed the issue for me. I am observing this while training a new LM using transformers 2.5.1. The issue happened during evaluation.

maxsonate on 16 Mar 2020

One more comment about this fix. If you use a validation set with an odd number of instances, it will raise an error on the line
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
when using run_language_modeling.py. This happens because the parallel GPUs need two instances to be fed in.

I don't know how to fix this properly. All I do is append a copy of the last instance to meet the required count.

In my case, the solution is changing

if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)

to

if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
    model = torch.nn.DataParallel(model)

zixiliuUSC on 18 Apr 2020
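
The workaround described above (repeating the last instance so the count divides evenly) might look like the sketch below. The helper name pad_to_multiple is hypothetical, the examples are plain Python objects rather than the script's dataset, and the duplicated instance will be counted twice in any evaluation metric.

```python
from typing import List, TypeVar

T = TypeVar("T")


def pad_to_multiple(examples: List[T], multiple: int) -> List[T]:
    """Repeat the last example until len(examples) is a multiple of `multiple`."""
    if not examples or multiple <= 1:
        return list(examples)
    remainder = len(examples) % multiple
    if remainder == 0:
        return list(examples)
    return list(examples) + [examples[-1]] * (multiple - remainder)


# An odd-sized validation set on two GPUs gains one copy of its last instance.
print(pad_to_multiple(["a", "b", "c"], 2))  # ['a', 'b', 'c', 'c']
print(pad_to_multiple(["a", "b"], 2))       # ['a', 'b']
```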

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 17 Jun 2020
