Transformers: PyTorch 1.5 DataParallel

Created on 24 Apr 2020 · 22 comments · Source: huggingface/transformers

🐛 Bug

Information

Can't run forward() in PyTorch 1.5.0; it works fine in 1.4.0.

Model I am using (Bert, XLNet ...): XLNet

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [X] my own modified scripts: (give details below)

Transformer + custom head + custom losses + differential learning rates; I don't think the specifics matter.

The task I am working on is:

  • [ ] an official GLUE/SQUaD task: (give the name)
  • [X] my own task or dataset: (give details below)

Custom news classification

To reproduce

Steps to reproduce the behavior:

  1. Install PyTorch 1.5.0
  2. Run forward on XLNet
  3. Observe the following traceback:

  File "transformers/modeling_xlnet.py", line 761, in forward
    dtype_float = next(self.parameters()).dtype
StopIteration
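For context, the failing line asks the module for its first parameter in order to learn the model's dtype. Under torch.nn.DataParallel on PyTorch 1.5, the replicas created during forward no longer expose their weights through parameters() (they become plain tensor attributes), so next(self.parameters()) raises StopIteration inside the worker thread. A minimal sketch of the same pattern (the Probe class is made up for illustration, and it assumes at least two visible GPUs):

import torch

class Probe(torch.nn.Module):
    # Mimics the dtype lookup that modeling_xlnet.py performs in forward().
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        # Empty iterator on a torch 1.5 DataParallel replica -> StopIteration
        dtype_float = next(self.parameters()).dtype
        return self.linear(x.to(dtype_float))

model = torch.nn.DataParallel(Probe().cuda(), device_ids=[0, 1])
model(torch.ones(2, 4).cuda())  # StopIteration: Caught StopIteration in replica 0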

Expected behavior

Runs forward

Environment info

  • transformers version: 2.8.0
  • Platform: Ubuntu 18.04
  • Python version: 3.7 (Anaconda)
  • PyTorch version (GPU?): 1.5.0, Yes
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes (PyTorch DataParallel)

Most helpful comment

Same problem here, running BERT.

torch==1.5.0
transformers==2.8.0

I'm running on GPUs, using export CUDA_VISIBLE_DEVICES=5,6,7 before running (I have 8 1080TIs on this server).

run_language_modeling.py --output_dir=models --model_type=bert --model_name_or_path=bert-base-uncased --do_train --train_data_file=Vol45.sample --mlm --save_steps=2000 --line_by_line --per_gpu_train_batch_size=8

Vol45.sample is a .txt with one doc per line

EDIT: It seems to work if I downgrade pytorch to 1.4

All 22 comments

I'm experiencing the same problem running transformers/examples/run_language_modeling.py with RoBERTa. It works fine with PyTorch 1.4.0, though.

Also happens with RoBERTa, but only in multi-GPU mode (only tested with DataParallel so far)

@Rizhiy, do you mind putting a code example? I can't reproduce on master by doing an inference through the model. Thanks.

@LysandreJik I will try to put one together, but it's a bit odd: so far I've only observed it when using a transformer model via Lightning in DataParallel mode.

Ah, got it:

import transformers
import torch

m = transformers.AutoModel.from_pretrained("roberta-base")
m.to("cuda:0")
# Replication across two GPUs is what triggers the failure
k = torch.nn.DataParallel(m, device_ids=[0, 1])
k(m.dummy_inputs["input_ids"])  # same failure via k.forward(...)

gives

StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "<snip>/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "<snip>/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/<snip>/lib/python3.8/site-packages/transformers/modeling_bert.py", line 707, in forward
    attention_mask, input_shape, self.device
  File "<snip>/lib/python3.8/site-packages/transformers/modeling_utils.py", line 113, in device
    return next(self.parameters()).device
StopIteration

Using torch 1.5.0 and something like yesterday's transformers master

The same for me with BERT.

transformers==2.3.0
torch==1.5.0

Hello everybody, is there an update on this? Today I managed to gather some more data to train a RoBERTa model from scratch. I had been running experiments on PyTorch 1.4, and I hit this bug today after updating to PyTorch 1.5.

Same problem here, running BERT.

torch==1.5.0
transformers==2.8.0

I'm running on GPUs, using export CUDA_VISIBLE_DEVICES=5,6,7 before running (I have 8 1080TIs on this server).

run_language_modeling.py --output_dir=models --model_type=bert --model_name_or_path=bert-base-uncased --do_train --train_data_file=Vol45.sample --mlm --save_steps=2000 --line_by_line --per_gpu_train_batch_size=8

Vol45.sample is a .txt with one doc per line

EDIT: It seems to work if I downgrade pytorch to 1.4

Same here.
This might have to do with the first issue listed under Known Issues in the PyTorch 1.5 changelog, i.e. the recent change to torch.nn.parallel.replicate.
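To see the replicate() change in isolation, here is a small sketch (an assumption-laden illustration: it needs torch 1.5.0 and at least one CUDA device); on 1.4.x the replica still reports its parameters, on 1.5.0 the list comes back empty:

import torch

m = torch.nn.Linear(2, 2).cuda()
replica = torch.nn.parallel.replicate(m, [0])[0]
# torch 1.4.x: [Parameter containing: tensor(...)]
# torch 1.5.0: [] -- the weights became plain attributes of the replica,
# so next(replica.parameters()) raises StopIteration
print(list(replica.parameters()))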

> Same problem here, running BERT. […] EDIT: It seems to work if I downgrade pytorch to 1.4

Thanks. It also works for me!

torch==1.4.0
transformers==2.8.0

The same issue: #4189

Just to scope this bug a little bit better, all of you are using torch.nn.DataParallel (not DistributedDataParallel or single-GPU), correct?

> Just to scope this bug a little bit better, all of you are using torch.nn.DataParallel (not DistributedDataParallel or single-GPU), correct?

Sure, please use the following code to reproduce the error:

import torch
import transformers

model = transformers.AutoModel.from_pretrained("bert-base-multilingual-cased")
# Wrapping before .cuda() is fine; the wrapper moves the inner module.
# At least two visible GPUs are needed to hit the replication path.
model = torch.nn.DataParallel(model)
model = model.cuda()
input_ids = torch.ones([16, 10], dtype=torch.long).cuda()
model(input_ids)  # StopIteration on torch 1.5.0

> Just to scope this bug a little bit better, all of you are using torch.nn.DataParallel (not DistributedDataParallel or single-GPU), correct?

I was using the run_language_modeling.py script, which AFAIK uses torch.nn.DataParallel.
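For reference, the multi-GPU branch in those example scripts follows the usual pattern (paraphrased from memory, so treat the exact variable names as approximate):

# Approximate sketch of the wrapping done by run_language_modeling.py;
# args.n_gpu is set by the script from the visible CUDA devices.
if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)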

This seems to be due to https://github.com/pytorch/pytorch/pull/33907

Still looking for the most correct fix on our side.
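One workaround pattern, sketched here as a general idea rather than the exact code that eventually landed, is to fall back to scanning plain tensor attributes when parameters() comes back empty, since the 1.5 replicas keep their weights as ordinary tensors in each submodule's __dict__:

import torch

def get_module_dtype(module: torch.nn.Module) -> torch.dtype:
    # Hypothetical helper, not transformers API.
    try:
        return next(module.parameters()).dtype
    except StopIteration:
        # DataParallel replicas on torch 1.5 hold their weights as plain
        # tensor attributes rather than registered Parameters.
        for submodule in module.modules():
            for value in vars(submodule).values():
                if torch.is_tensor(value):
                    return value.dtype
        raise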

Worked for me when I downgraded to torch==1.4.0.

Can you guys take a look at https://github.com/huggingface/transformers/issues/4657 and suggest what environment I should use? I've tried several with no luck.

Can you install the repo from source and try again? There have been some issues with PyTorch upstream that Julien addressed here: #4300. So you can try with the latest master branch.

Can confirm that installing from source (2.10) solves the issue.

Hello!

Just for the record, this seems to be solved with the latest release of transformers (3.0.1, with PyTorch 1.5.1 and CUDA 10.1). At least the provided MWE does not fail.

Best,

> Just for the record, this seems to be solved with the latest release of transformers (3.0.1, with PyTorch 1.5.1 and CUDA 10.1). At least the provided MWE does not fail.

As per the previous comments: this was probably already fixed in 2.10.

