Transformers: PyTorch 1.5 DataParallel

Created on 24 Apr 2020 · 22 comments · Source: huggingface/transformers

🐛 Bug

Information

Can't run forward() in PyTorch 1.5.0; it works fine in 1.4.0.

Model I am using (Bert, XLNet ...): XLNet

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [X] my own modified scripts: (give details below)

Transformer + custom head + custom losses + differential learning rates; I don't think the specifics matter.

The task I am working on is:

  • [ ] an official GLUE/SQUaD task: (give the name)
  • [X] my own task or dataset: (give details below)

Custom news classification

To reproduce

Steps to reproduce the behavior:

  1. Install PyTorch 1.5.0
  2. Run forward on XLNet
  3. Observe the following traceback:

  File "transformers/modeling_xlnet.py", line 761, in forward
    dtype_float = next(self.parameters()).dtype
StopIteration
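For context, the failing line asks the module for its first parameter in order to learn the model's dtype. Under torch.nn.DataParallel on PyTorch 1.5, the replicas created during forward no longer expose their weights through parameters() (they become plain tensor attributes), so next(self.parameters()) raises StopIteration inside the worker thread. A minimal sketch of the same pattern (the Probe class is made up for illustration, and it assumes at least two visible GPUs):

import torch

class Probe(torch.nn.Module):
    # Mimics the dtype lookup that modeling_xlnet.py performs in forward().
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        # Empty iterator on a torch 1.5 DataParallel replica -> StopIteration
        dtype_float = next(self.parameters()).dtype
        return self.linear(x.to(dtype_float))

model = torch.nn.DataParallel(Probe().cuda(), device_ids=[0, 1])
model(torch.ones(2, 4).cuda())  # StopIteration: Caught StopIteration in replica 0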

Expected behavior

Runs forward

Environment info

  • transformers version: 2.8.0
  • Platform: Ubuntu 18.04
  • Python version: 3.7 (Anaconda)
  • PyTorch version (GPU?): 1.5.0, Yes
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes (PyTorch DataParallel)

Most helpful comment

Same problem here, running BERT.

torch==1.5.0
transformers==2.8.0

I'm running on GPUs, using export CUDA_VISIBLE_DEVICES=5,6,7 before running (I have 8 1080TIs on this server).

run_language_modeling.py --output_dir=models --model_type=bert --model_name_or_path=bert-base-uncased --do_train --train_data_file=Vol45.sample --mlm --save_steps=2000 --line_by_line --per_gpu_train_batch_size=8

Vol45.sample is a .txt with one doc per line

EDIT: It seems to work if I downgrade pytorch to 1.4

All 22 comments

I'm experiencing the same problem running transformers/examples/run_language_modeling.py with RoBERTa. It works fine with PyTorch 1.4.0, though.

Also happens with RoBERTa, but only in multi-GPU mode (only tested with DataParallel so far)

@Rizhiy, do you mind putting a code example? I can't reproduce on master by doing an inference through the model. Thanks.

@LysandreJik I will try to put one together, but it's a bit odd: so far I've only observed it when using a transformer model via Lightning in DataParallel mode.

Ah, got it:

import transformers
import torch

m = transformers.AutoModel.from_pretrained("roberta-base")
m.to("cuda:0")
# Replication across two GPUs is what triggers the failure
k = torch.nn.DataParallel(m, device_ids=[0, 1])
k(m.dummy_inputs["input_ids"])  # same failure via k.forward(...)

gives

StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "<snip>/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "<snip>/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/<snip>/lib/python3.8/site-packages/transformers/modeling_bert.py", line 707, in forward
    attention_mask, input_shape, self.device
  File "<snip>/lib/python3.8/site-packages/transformers/modeling_utils.py", line 113, in device
    return next(self.parameters()).device
StopIteration

Using torch 1.5.0 and something like yesterday's transformers master

The same for me with BERT.

transformers==2.3.0
torch==1.5.0

Hello everybody, is there an update on this? Today I managed to gather some more data to train a RoBERTa model from scratch. I had been running experiments on PyTorch 1.4, and I hit this bug today after updating to PyTorch 1.5.

Same problem here, running BERT.

torch==1.5.0
transformers==2.8.0

I'm running on GPUs, using export CUDA_VISIBLE_DEVICES=5,6,7 before running (I have 8 1080TIs on this server).

run_language_modeling.py --output_dir=models --model_type=bert --model_name_or_path=bert-base-uncased --do_train --train_data_file=Vol45.sample --mlm --save_steps=2000 --line_by_line --per_gpu_train_batch_size=8

Vol45.sample is a .txt with one doc per line

EDIT: It seems to work if I downgrade pytorch to 1.4

Same here.
This might have to do with the first issue listed under Known Issues in the PyTorch 1.5 changelog, i.e. the recent change to torch.nn.parallel.replicate.
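To see the replicate() change in isolation, here is a small sketch (an assumption-laden illustration: it needs torch 1.5.0 and at least one CUDA device); on 1.4.x the replica still reports its parameters, on 1.5.0 the list comes back empty:

import torch

m = torch.nn.Linear(2, 2).cuda()
replica = torch.nn.parallel.replicate(m, [0])[0]
# torch 1.4.x: [Parameter containing: tensor(...)]
# torch 1.5.0: [] -- the weights became plain attributes of the replica,
# so next(replica.parameters()) raises StopIteration
print(list(replica.parameters()))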

> Same problem here, running BERT. […] EDIT: It seems to work if I downgrade pytorch to 1.4

Thanks. It also works for me!

torch==1.4.0
transformers==2.8.0

The same issue: #4189

Just to scope this bug a little bit better, all of you are using torch.nn.DataParallel (not DistributedDataParallel or single-GPU), correct?

> Just to scope this bug a little bit better, all of you are using torch.nn.DataParallel (not DistributedDataParallel or single-GPU), correct?

Sure, please use the following code to reproduce the error:

import torch
import transformers

model = transformers.AutoModel.from_pretrained("bert-base-multilingual-cased")
# Wrapping before .cuda() is fine; the wrapper moves the inner module.
# At least two visible GPUs are needed to hit the replication path.
model = torch.nn.DataParallel(model)
model = model.cuda()
input_ids = torch.ones([16, 10], dtype=torch.long).cuda()
model(input_ids)  # StopIteration on torch 1.5.0

> Just to scope this bug a little bit better, all of you are using torch.nn.DataParallel (not DistributedDataParallel or single-GPU), correct?

I was using the run_language_modeling.py script, which AFAIK uses torch.nn.DataParallel.
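For reference, the multi-GPU branch in those example scripts follows the usual pattern (paraphrased from memory, so treat the exact variable names as approximate):

# Approximate sketch of the wrapping done by run_language_modeling.py;
# args.n_gpu is set by the script from the visible CUDA devices.
if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)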

This seems to be due to https://github.com/pytorch/pytorch/pull/33907

Still looking for the most correct fix on our side.
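One workaround pattern, sketched here as a general idea rather than the exact code that eventually landed, is to fall back to scanning plain tensor attributes when parameters() comes back empty, since the 1.5 replicas keep their weights as ordinary tensors in each submodule's __dict__:

import torch

def get_module_dtype(module: torch.nn.Module) -> torch.dtype:
    # Hypothetical helper, not transformers API.
    try:
        return next(module.parameters()).dtype
    except StopIteration:
        # DataParallel replicas on torch 1.5 hold their weights as plain
        # tensor attributes rather than registered Parameters.
        for submodule in module.modules():
            for value in vars(submodule).values():
                if torch.is_tensor(value):
                    return value.dtype
        raise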

Worked for me when I downgraded to torch==1.4.0.

Can you guys take a look at https://github.com/huggingface/transformers/issues/4657 and suggest what environment I should use? I've tried several with no luck.

Can you install the repo from source and try again? There have been some issues with PyTorch upstream that Julien addressed here: #4300. So you can try with the latest master branch.

Can confirm that installing from source (2.10) solves the issue.

Hello!

Just for the record, this seems to be solved with the latest release of transformers (3.0.1, with PyTorch 1.5.1 and CUDA 10.1). At least the provided MWE does not fail.

Best,

> Just for the record, this seems to be solved with the latest release of transformers (3.0.1, with PyTorch 1.5.1 and CUDA 10.1). At least the provided MWE does not fail.

As per the previous comments: this was probably already fixed in 2.10.

