Can't run forward in PyTorch 1.5.0, works fine in 1.4.0
Model I am using (Bert, XLNet ...): XLNet
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
Transformer + custom head + custom losses + differential learning rates; I don't think it matters.
The task I am working on is:
Custom news classification
Steps to reproduce the behavior:
File "transformers/modeling_xlnet.py", line 761, in forward
dtype_float = next(self.parameters()).dtype
StopIteration
Expected behavior: runs forward.
transformers version: 2.8.0
I'm experiencing the same problem running transformers/examples/run_language_modeling.py with RoBERTa. Works well with PyTorch 1.4.0, though.
Also happens with RoBERTa, but only in distributed mode (only tested with DataParallel for now)
@Rizhiy, do you mind putting together a code example? I can't reproduce on master by doing an inference through the model. Thanks.
@LysandreJik I will try to put one together, but it's a bit weird. So far I have only observed it when using a transformer model via Lightning in DataParallel mode.
Ah, got it
import transformers
import torch
# load a pretrained model and move it to the first GPU
m = transformers.AutoModel.from_pretrained("roberta-base")
m.to("cuda:0")
# wrap it in DataParallel across two GPUs and run a forward pass
k = torch.nn.DataParallel(m, device_ids=[0, 1])
k.forward(m.dummy_inputs['input_ids'])
gives
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "<snip>/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "<snip>/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/<snip>/lib/python3.8/site-packages/transformers/modeling_bert.py", line 707, in forward
attention_mask, input_shape, self.device
File "<snip>/lib/python3.8/site-packages/transformers/modeling_utils.py", line 113, in device
return next(self.parameters()).device
StopIteration
Using torch 1.5.0 and something like yesterday's transformers master
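For contrast, the same checkpoint runs fine for me without the DataParallel wrapper (a minimal sketch assuming a single visible GPU, not part of the original report), which matches the earlier note that plain inference does not reproduce the error:
import torch
import transformers
# single-device forward: replicate() is never called, so next(self.parameters()) still works
m = transformers.AutoModel.from_pretrained("roberta-base").to("cuda:0")
with torch.no_grad():
    out = m(m.dummy_inputs["input_ids"].to("cuda:0"))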
The same for me with BERT:
transformers - 2.3.0
pytorch - 1.5.0
Hello everybody, do we have an update on this? Today I managed to gather some more data to train a RoBERTa model from scratch. I have been running experiments in PyTorch 1.4, and I hit this bug today when I updated to PyTorch 1.5.
Same problem here, running BERT.
torch==1.5.0
transformers==2.8.0
I'm running on GPUs, using export CUDA_VISIBLE_DEVICES=5,6,7 before running (I have eight 1080 Tis on this server).
run_language_modeling.py --output_dir=models --model_type=bert --model_name_or_path=bert-base-uncased --do_train --train_data_file=Vol45.sample --mlm --save_steps=2000 --line_by_line --per_gpu_train_batch_size=8
Vol45.sample is a .txt with one doc per line.
EDIT: It seems to work if I downgrade pytorch to 1.4
Same here.
This might have to do with the first issue listed under Known Issues in the PyTorch 1.5 changelog, i.e. the recent change in torch.nn.parallel.replicate.
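To make the suspected cause concrete, here is a minimal sketch (a hypothetical Probe module, assuming at least two visible GPUs; not code from transformers) of the pattern that breaks under the 1.5.0 replicate() change, i.e. a forward() that calls next(self.parameters()):
import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # fine on the original module, but inside a DataParallel replica on
        # torch 1.5.0 the copied parameters are no longer reachable through
        # .parameters(), so next() raises StopIteration
        dtype = next(self.parameters()).dtype
        return self.linear(x.to(dtype))

model = Probe().to("cuda:0")
wrapped = nn.DataParallel(model, device_ids=[0, 1])
out = wrapped(torch.ones(8, 4))  # StopIteration on torch 1.5.0, works on 1.4.0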
Thanks. Downgrading also works for me:
torch==1.4.0
transformers==2.8.0
The same issue: #4189
Just to scope this bug a little bit better, all of you are using torch.nn.DataParallel (not DistributedDataParallel or single-GPU), correct?
Sure, please use the following code to reproduce the error:
import torch, transformers
model = transformers.AutoModel.from_pretrained("bert-base-multilingual-cased")
model = torch.nn.DataParallel(model)
model = model.cuda()
input = torch.ones([16, 10], dtype=torch.long)
input = input.cuda()
model(input)
I was using the run_language_modeling.py script, which AFAIK uses torch.nn.DataParallel.
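For reference, the examples' training setup wraps the model roughly like this when more than one GPU is visible (a paraphrased sketch, not the exact script code), which is exactly the code path that hits the replica issue:
import torch
import transformers
model = transformers.AutoModel.from_pretrained("bert-base-uncased").to("cuda:0")
# with several visible GPUs the model gets wrapped in DataParallel, so every
# forward goes through replicate() and raises StopIteration on torch 1.5.0
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)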
This seems to be due to https://github.com/pytorch/pytorch/pull/33907
Still looking for the most correct fix on our side.
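In the meantime, one possible interim workaround (just a sketch with a hypothetical maybe_parallelize helper, not the official fix) is to skip the DataParallel wrapper on the affected torch version, since single-GPU forwards are unaffected:
import torch
from packaging import version

def maybe_parallelize(model):
    # only wrap in DataParallel when several GPUs are visible and the installed
    # torch predates the 1.5.0 replicate() change that triggers StopIteration
    # with the transformers releases discussed above
    affected = version.parse(torch.__version__) >= version.parse("1.5.0")
    if torch.cuda.device_count() > 1 and not affected:
        return torch.nn.DataParallel(model)
    return model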
Worked for me when I downgraded to torch==1.4.0.
Can you guys take a look at https://github.com/huggingface/transformers/issues/4657 and suggest what environment I should use? I've tried several with no luck.
Can you install the repo from source and try again? There have been some issues with PyTorch upstream that Julien addressed here: #4300. So you can try with the latest master branch.
Can confirm that installing from source (2.10) solves the issue.
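For anyone unsure which combination they ended up with, a quick check of the installed versions (just a sketch summarising the reports in this thread) is:
import torch
import transformers
# combinations reported here:
#   torch 1.5.0 + transformers <= 2.8.0 -> StopIteration under DataParallel
#   torch 1.4.0 + transformers 2.8.0    -> works
#   transformers >= 2.10 (from source)  -> works with torch 1.5.0
print(torch.__version__, transformers.__version__)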
Hello!
Just for the record, this seems to be solved with the latest release of transformers (3.0.1, with PyTorch 1.5.1 and CUDA 10.1). At least the provided MWE does not fail.
Best,
As per the previous comments: this was probably already fixed in 2.10.