Transformers: Fine-tuning with run_squad.py, Transformers 2.1.1 & PyTorch 1.3.0 Data Parallel Error

Created on 12 Oct 2019 · 11 Comments · Source: huggingface/transformers

🐛 Bug

Error message when fine-tuning BERT or XLNet on SQuAD1.1 or 2.0 with dual 1080Ti GPUs:

_"RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1"_

Model I am using: BERT & XLNet

Language I am using the model on: English

The problem arises when using:

  • [X] my own modified scripts: example script file below which ran successfully under previous PyTorch, PyTorch-Transformers, & Transformers versions.

The task I am working on is:

  • [X] an official GLUE/SQuAD task: (give the name) SQuAD 1.1 & 2.0

To Reproduce

One shell script (there are others) that had worked before:

SQUAD_DIR=/media/dn/dssd/nlp/squad1.1

python ./run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--do_lower_case \
--train_file=${SQUAD_DIR}/train-v1.1.json \
--predict_file=${SQUAD_DIR}/dev-v1.1.json \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--gradient_accumulation_steps=1 \
--learning_rate=3e-5 \
--num_train_epochs=2 \
--max_seq_length=384 \
--doc_stride=128 \
--save_steps=2000 \
--output_dir=./runs/bert_base_squad1_dp_ft_3

Environment

  • OS: Ubuntu 18.04, Linux kernel 4.15.0-65-generic
  • Python version: 3.7.4
  • PyTorch version: 1.3.0
  • Transformers version: 2.1.1 built from latest source
  • Using GPU? NVIDIA 1080Ti x 2
  • Distributed or parallel setup? Data Parallel
  • Any other relevant information: I have had many successful SQuAD fine-tuning runs on PyTorch 1.2.0 with PyTorch-Transformers 1.2.0 (and possibly Transformers 2.0.0) and Apex 0.1. A new environment built with the latest versions (PyTorch 1.3.0, Transformers 2.1.1) produces the data-parallel error above.
wontfix

Most helpful comment

In my case, the solution is changing

if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)

to

if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
    model = torch.nn.DataParallel(model)

All 11 comments

Runs are in a dedicated environment with only the following packages:

python 3.7.4
pytorch 1.3.0, install includes cudatoolkit 10.1
tensorflow_gpu 2.0 and dependencies
apex 0.1
transformers 2.1.1

Complete terminal output:

output_term_ERROR.TXT

Change the line in run_**.py
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
to
device = torch.device("cuda:0" if torch.cuda.is_available() and not args.no_cuda else "cpu")

In my environment, it works.

It seems that all GPUs are still used even if we specify "cuda:0" here, but I am not sure how much the other GPUs contribute to the computation. In my case, I have eight 1080Ti GPUs, and the other seven are barely loaded.

Has anyone compared the training speed with and without this change?

In my case, the solution is changing

if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)

to

if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
    model = torch.nn.DataParallel(model)

Making this change in the evaluate function fixes the error when I run with --evaluate_during_training.
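A minimal sketch of the guard described above, written as a standalone helper. It assumes the error comes from the evaluate function wrapping a model that the training loop has already wrapped in torch.nn.DataParallel, which is what the --evaluate_during_training report suggests, although the thread does not confirm the root cause. The helper name wrap_for_data_parallel is hypothetical and is not part of run_squad.py.

```python
import torch


def wrap_for_data_parallel(model, n_gpu):
    """Wrap model in DataParallel at most once (hypothetical helper)."""
    # Skip wrapping when the caller already passed a DataParallel model,
    # e.g. when evaluate() is called from inside the training loop.
    if n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
        model = torch.nn.DataParallel(model)
    return model


if __name__ == "__main__":
    net = torch.nn.Linear(4, 2)
    wrapped = wrap_for_data_parallel(net, n_gpu=2)
    wrapped_again = wrap_for_data_parallel(wrapped, n_gpu=2)
    # The second call returns the same object instead of nesting wrappers.
    print(wrapped_again is wrapped)
```

Calling the helper from both the training and evaluation code paths keeps the wrapping idempotent, so it does not matter which path sees the model first.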

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Agreed. Also note that
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
is now multiplied by n_gpu again, which is undesirable.
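For reference, the arithmetic behind that line can be sketched in plain Python. The function name total_eval_batch_size is hypothetical; the formula is the one quoted above. torch.nn.DataParallel splits each input batch along its first dimension across the replicas, so with a single level of wrapping each GPU should see roughly the per-GPU batch size.

```python
def total_eval_batch_size(per_gpu_eval_batch_size, n_gpu):
    """Total batch size handed to the (possibly DataParallel) model."""
    # max(1, n_gpu) keeps the CPU-only case (n_gpu == 0) at the per-GPU size.
    return per_gpu_eval_batch_size * max(1, n_gpu)


if __name__ == "__main__":
    for n_gpu in (0, 1, 2):
        total = total_eval_batch_size(8, n_gpu)
        # Share of the batch each replica receives after the split.
        per_replica = total // max(1, n_gpu)
        print(n_gpu, total, per_replica)
```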

Thanks! I hit the same error in the evaluation function, and this fix works for me.

This solution fixed the issue for me. I am observing this while training a new LM using transformers 2.5.1. The issue happened during evaluation.

One more comment about this fix. If you use a validation set with an odd number of instances and run_language_modeling.py, it will raise an error on the line
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
This happens because the parallel GPUs need two instances to be fed in.

I don't know how to fix this properly. All I do is append a copy of the last instance to meet the count requirement.
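A minimal sketch of that padding workaround in plain Python, assuming the validation examples are held in a list and that the count should be a multiple of the number of GPUs. The function name pad_to_multiple is hypothetical and does not exist in run_language_modeling.py.

```python
def pad_to_multiple(examples, multiple):
    """Repeat the last example until len(examples) is a multiple of `multiple`."""
    # Nothing to pad for an empty list or a nonsensical multiple.
    if not examples or multiple < 1:
        return list(examples)
    remainder = len(examples) % multiple
    if remainder == 0:
        return list(examples)
    # Append copies of the final example to fill the last group.
    padding = [examples[-1]] * (multiple - remainder)
    return list(examples) + padding


if __name__ == "__main__":
    print(pad_to_multiple(["a", "b", "c"], 2))
```

Note that each duplicated example is counted again in the evaluation metrics, so the reported loss shifts slightly compared with the unpadded set.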
