Model I am using (Bert, XLNet ...):
Language I am using the model on (English, Chinese ...):
The problem arises when using:
The task I am working on is:
Steps to reproduce the behavior:
File "./examples/text-classification/run_glue.py", line 246, in
main()
File "./examples/text-classification/run_glue.py", line 173, in main
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
tr_loss += self._training_step(model, inputs, optimizer)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
outputs = model(**inputs)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
return self.gather(outputs, self.output_device)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
return gather(outputs, output_device, dim=self.dim)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
TypeError: __init__() missing 1 required positional argument: 'logits'
Expected behavior: it should be able to run and finish training.
transformers version: 3.0.2
I faced the same error yesterday. Installing version 3.0.1 fixed the issue for me.
Downgrading one or two versions can work around this. However, I will leave this issue open so the maintainers know the bug exists in the newest version.
It appears that CircleCI doesn't run GPU tests (or at least not multi-GPU ones); all test_multigpu_data_parallel_forward sub-tests fail, e.g. tests/test_modeling_flaubert.py::FlaubertModelTest::test_multigpu_data_parallel_forward.
pytest --disable-warnings -n 1 tests/test_modeling_bert.py::BertModelTest::test_multigpu_data_parallel_forward
====================================================================== test session starts =======================================================================
platform linux -- Python 3.7.5, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /mnt/nvme1/code/huggingface/transformers-tests-1
plugins: hypothesis-5.5.4, filter-subpackage-0.1.1, arraydiff-0.3, flaky-3.6.1, ipynb-1.1.1.dev0, cov-2.10.0, astropy-header-0.1.2, forked-1.2.0, doctestplus-0.5.0, openfiles-0.4.0, remotedata-0.3.2, xdist-1.32.0
gw0 [1]
F [100%]
============================================================================ FAILURES ============================================================================
_______________________________________________________ BertModelTest.test_multigpu_data_parallel_forward ________________________________________________________
[gw0] linux -- Python 3.7.5 /home/stas/anaconda3/envs/main/bin/python
self = <tests.test_modeling_bert.BertModelTest testMethod=test_multigpu_data_parallel_forward>
    @require_multigpu
    def test_multigpu_data_parallel_forward(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        # some params shouldn't be scattered by nn.DataParallel
        # so just remove them if they are present.
        blacklist_non_batched_params = ["head_mask"]
        for k in blacklist_non_batched_params:
            inputs_dict.pop(k, None)
        # move input tensors to cuda:0
        for k, v in inputs_dict.items():
            if torch.is_tensor(v):
                inputs_dict[k] = v.to(0)
        for model_class in self.all_model_classes:
            model = model_class(config=config)
            model.to(0)
            model.eval()
            # Wrap model in nn.DataParallel
            model = torch.nn.DataParallel(model)
            with torch.no_grad():
>               _ = model(**self._prepare_for_class(inputs_dict, model_class))
tests/test_modeling_common.py:807:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/modules/module.py:550: in __call__
result = self.forward(*input, **kwargs)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py:156: in forward
return self.gather(outputs, self.output_device)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py:168: in gather
return gather(outputs, output_device, dim=self.dim)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py:68: in gather
res = gather_map(outputs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
outputs = [BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 1.0115e+00, 1.4145e+00, -5.7332e-01, ..., -4.6471e-01,
... 0.1111, -0.0592, -0.1177, 0.0074, -0.0155, -0.1015]],
device='cuda:1'), hidden_states=None, attentions=None)]
    def gather_map(outputs):
        out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None
        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))
>       return type(out)(map(gather_map, zip(*outputs)))
E       TypeError: __init__() missing 1 required positional argument: 'pooler_output'
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py:63: TypeError
==================================================================== short test summary info =====================================================================
FAILED tests/test_modeling_bert.py::BertModelTest::test_multigpu_data_parallel_forward - TypeError: __init__() missing 1 required positional argument: 'pooler_...
================================================================= 1 failed, 4 warnings in 5.44s ==================================================================
pytest --disable-warnings -n 1 tests/test_modeling_flaubert.py::FlaubertModelTest::test_multigpu_data_parallel_forward
====================================================================== test session starts =======================================================================
platform linux -- Python 3.7.5, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /mnt/nvme1/code/huggingface/transformers-tests-1
plugins: hypothesis-5.5.4, filter-subpackage-0.1.1, arraydiff-0.3, flaky-3.6.1, ipynb-1.1.1.dev0, cov-2.10.0, astropy-header-0.1.2, forked-1.2.0, doctestplus-0.5.0, openfiles-0.4.0, remotedata-0.3.2, xdist-1.32.0
gw0 [1]
F [100%]
============================================================================ FAILURES ============================================================================
_____________________________________________________ FlaubertModelTest.test_multigpu_data_parallel_forward ______________________________________________________
[gw0] linux -- Python 3.7.5 /home/stas/anaconda3/envs/main/bin/python
self = <tests.test_modeling_flaubert.FlaubertModelTest testMethod=test_multigpu_data_parallel_forward>
    @require_multigpu
    def test_multigpu_data_parallel_forward(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        # some params shouldn't be scattered by nn.DataParallel
        # so just remove them if they are present.
        blacklist_non_batched_params = ["head_mask"]
        for k in blacklist_non_batched_params:
            inputs_dict.pop(k, None)
        # move input tensors to cuda:0
        for k, v in inputs_dict.items():
            if torch.is_tensor(v):
                inputs_dict[k] = v.to(0)
        for model_class in self.all_model_classes:
            model = model_class(config=config)
            model.to(0)
            model.eval()
            # Wrap model in nn.DataParallel
            model = torch.nn.DataParallel(model)
            with torch.no_grad():
>               _ = model(**self._prepare_for_class(inputs_dict, model_class))
tests/test_modeling_common.py:807:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/modules/module.py:550: in __call__
result = self.forward(*input, **kwargs)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py:156: in forward
return self.gather(outputs, self.output_device)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py:168: in gather
return gather(outputs, output_device, dim=self.dim)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py:68: in gather
res = gather_map(outputs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
outputs = [MaskedLMOutput(loss=None, logits=tensor([[[-0.0008, 0.3751, -0.0050, ..., 0.0933, -0.1563, 0.0494],
[-0....0, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]]],
device='cuda:1'), hidden_states=None, attentions=None)]
    def gather_map(outputs):
        out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None
        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))
>       return type(out)(map(gather_map, zip(*outputs)))
E       TypeError: __init__() missing 1 required positional argument: 'logits'
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py:63: TypeError
==================================================================== short test summary info =====================================================================
FAILED tests/test_modeling_flaubert.py::FlaubertModelTest::test_multigpu_data_parallel_forward - TypeError: __init__() missing 1 required positional argument: ...
================================================================= 1 failed, 4 warnings in 5.54s =============================================================
Digging deeper, it appears that torch.nn.parallel.scatter_gather.gather can't gather outputs that are dataclasses: it receives a list of dataclass outputs and, when rebuilding the result, passes everything to the constructor as a single value, so the remaining required fields come up missing.
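To make the failure mode concrete, here is a stripped-down sketch (a toy stand-in, not the real transformers ModelOutput) of what happens when gather_map hits a dataclass: the constructor gets the whole lazy map object as its first positional argument, so every other required field is reported as missing.

import torch
from dataclasses import dataclass

# Toy stand-in for a model output; the real output classes are also
# dataclasses and behave like tuples when iterated.
@dataclass
class ToyOutput:
    last_hidden_state: torch.Tensor
    pooler_output: torch.Tensor

    def __iter__(self):  # make the toy tuple-like, as the real outputs are
        return iter((self.last_hidden_state, self.pooler_output))

# one output per GPU replica, as nn.DataParallel would collect them
outputs = [ToyOutput(torch.ones(2, 4), torch.ones(2)),
           ToyOutput(torch.ones(2, 4), torch.ones(2))]
out = outputs[0]

# gather_map's generic fallback is essentially type(out)(map(gather_map, zip(*outputs))).
# tuple(<iterable>) is fine with that, but a dataclass constructor receives the
# single map object as its first field and every other required field is missing:
try:
    type(out)(map(lambda grouped: grouped, zip(*outputs)))
except TypeError as e:
    print(e)  # __init__() missing 1 required positional argument: 'pooler_output'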
This pytorch hack fixes the problem for the failing tests. Swap the gather function for this one (including import):
# torch/nn/parallel/scatter_gather.py
import dataclasses

def gather(outputs, target_device, dim=0):
    r"""
    Gathers tensors from different GPUs on a specified device
      (-1 means the CPU).
    """
    def gather_map(outputs):
        out = outputs[0]
        if dataclasses.is_dataclass(out):
            outputs = [dataclasses.asdict(out) for out in outputs]
            out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None
        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))
        return type(out)(map(gather_map, zip(*outputs)))

    # Recursive function calls like this create reference cycles.
    # Setting the function to None clears the refcycle.
    try:
        res = gather_map(outputs)
    finally:
        gather_map = None
    return res
It converts the dataclass output into a dict and then it works - at least the tests pass; I haven't tried the OP's example.
What I added is:
import dataclasses
and
if dataclasses.is_dataclass(out):
    outputs = [dataclasses.asdict(out) for out in outputs]
    out = outputs[0]
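If you'd rather not edit the installed torch file, the same change can be applied as a runtime monkeypatch. A rough, untested sketch: it assumes the patched gather above is defined in your own module, which then also needs torch and Gather imported.

import torch
from torch.nn.parallel._functions import Gather  # used inside the patched gather_map
import torch.nn.parallel.scatter_gather as scatter_gather
import torch.nn.parallel.data_parallel as data_parallel

# DataParallel.gather() calls the module-level `gather` that data_parallel.py
# imported from scatter_gather, so both references need to be swapped.
scatter_gather.gather = gather  # `gather` = the patched function defined above
data_parallel.gather = gather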
I filed a bug report with pytorch: https://github.com/pytorch/pytorch/issues/41327
My pytorch tweak fixes the transformers tests, but when I try it on the OP's use case, it fails elsewhere:
export TASK_NAME=CoLA
export GLUE_DIR=/tmp/glue_data/
python ./examples/text-classification/run_glue.py --model_name_or_path bert-base-uncased --task_name $TASK_NAME --do_train --do_eval --data_dir $GLUE_DIR/$TASK_NAME --max_seq_length 128 --per_device_eval_batch_size=2 --per_device_train_batch_size=2 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir /tmp/$TASK_NAME/
...
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 98, in <listcomp>
outputs = [dataclasses.asdict(out) for out in outputs]
File "/home/stas/anaconda3/envs/main/lib/python3.7/dataclasses.py", line 1045, in asdict
return _asdict_inner(obj, dict_factory)
File "/home/stas/anaconda3/envs/main/lib/python3.7/dataclasses.py", line 1052, in _asdict_inner
value = _asdict_inner(getattr(obj, f.name), dict_factory)
File "/home/stas/anaconda3/envs/main/lib/python3.7/dataclasses.py", line 1086, in _asdict_inner
return copy.deepcopy(obj)
File "/home/stas/anaconda3/envs/main/lib/python3.7/copy.py", line 161, in deepcopy
y = copier(memo)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/tensor.py", line 44, in __deepcopy__
raise RuntimeError("Only Tensors created expl
RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment
So that conversion from dataclass to dict didn't work elsewhere. Needs more digging.
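My guess at the cause, based on the traceback: dataclasses.asdict() recurses with copy.deepcopy(), and tensors still attached to the autograd graph (the training loss/logits) refuse to be deep-copied. A shallow conversion that keeps the original tensor objects would avoid that path; a sketch of what the change could look like:

import dataclasses

def shallow_asdict(out):
    # keep the original tensors instead of deep-copying them like dataclasses.asdict() does
    return {f.name: getattr(out, f.name) for f in dataclasses.fields(out)}

# in the gather hack above, this would replace
#     outputs = [dataclasses.asdict(out) for out in outputs]
# with
#     outputs = [shallow_asdict(out) for out in outputs]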
@vanh17, until this is sorted out, you may choose to run on a single GPU, which I tested and it works.
You can accomplish that by adding to your command line:
env CUDA_VISIBLE_DEVICES=0 python ./examples/text-classification/run_glue.py ...
Change 0 to whichever GPU you want it to run on.
I think this is related to https://github.com/huggingface/transformers/pull/5685
When used in a nn.DataParallel setup, a model should be instantiated with return_tuple=True.
It would be nice to check whether there is a way for a model to know that it is part of a nn.DataParallel so it can set up this argument automatically. If someone wants to give it a look...
cc @sgugger
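One option, sketched below (hypothetical, not part of transformers): since a module can't easily tell from inside forward() that it is running as a DataParallel replica, a thin wrapper could inject the argument on every call instead.

import torch

class TupleReturningDataParallel(torch.nn.DataParallel):
    """Hypothetical wrapper that forces the wrapped model to return plain tuples."""

    def forward(self, *inputs, **kwargs):
        # return_tuple is the forward argument the transformers models exposed at the time
        kwargs["return_tuple"] = True
        return super().forward(*inputs, **kwargs)

# usage: model = TupleReturningDataParallel(model)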
I can look at this when I'm back next week. In the meantime, merging #5685 will fix the issue.
merging #5685 will fix the issue.
I verified that the run_glue.py on dual gpu work after this merge.
Is there a CircleCI config that supports dual-GPU tests?
edit: multigpu tests still fail as before. I forgot to back out the pytorch hack.
So, if with n_gpu > 1 it works without the outputs being wrapped in a model output dataclass, why do we ever need to return a dataclass rather than always returning a tuple, regardless of n_gpu's value? The same goes for @thomwolf's suggestion, just keyed on nn.DataParallel instead. https://github.com/huggingface/transformers/pull/5685 just moved the problem elsewhere, since it's no longer possible to rely on a model returning an output dataclass, and the behavior now differs depending on the hardware setup.
Always returning tuples requires the user to know which output is at which position (and that changes depending on the parameters you pass to the model), so having something self-documenting is a feature users have asked for for a long time.
I totally understand that, and it is great. But if a user codes against that API, relying on outputs being a dataclass, and their code is then run in a multi-GPU environment, it will break. Are we on the same page now?
I can see two solutions that lead to a consistent API:
1. getting pytorch to support not only dict outputs but also dataclasses in gather: https://github.com/pytorch/pytorch/issues/41327
2. re-encapsulating the tuple into the original output dataclass when it returns from pytorch to transformers, before it is passed back to the user (see the sketch below). There would be a small additional overhead, and we don't really have a place to hook in such a manipulation, so this is probably not feasible at the moment.
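For what it's worth, here is a hypothetical sketch of option 2 (not actual transformers or pytorch code): break each replica's dataclass output into a plain tuple of field values, let the stock gather handle those, and rebuild the dataclass on the way out. It only works if every field is a tensor or None, and it carries the overhead mentioned above.

import dataclasses
import torch

class RewrappingDataParallel(torch.nn.DataParallel):
    """Hypothetical sketch: gather dataclass outputs by round-tripping through tuples."""

    def gather(self, outputs, output_device):
        first = outputs[0]
        if dataclasses.is_dataclass(first):
            field_names = [f.name for f in dataclasses.fields(first)]
            # break each replica's output into a plain tuple the stock gather understands
            as_tuples = [tuple(getattr(o, name) for name in field_names) for o in outputs]
            gathered = super().gather(as_tuples, output_device)
            # re-encapsulate into the original output dataclass for the caller
            return type(first)(**dict(zip(field_names, gathered)))
        return super().gather(outputs, output_device)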
I updated my earlier comment: the multigpu tests still fail after @sgugger's commit, just as before, so only part of the problem has been worked around. I had forgotten to back out the proposed pytorch hack, so it looked like the commit worked, but it does not.
wrt the change in https://github.com/huggingface/transformers/pull/5685, wouldn't this be a better fit:
  # Our model outputs do not work with DataParallel, so forcing return tuple.
- if self.args.n_gpu > 1:
+ if isinstance(model, nn.DataParallel):
      inputs["return_tuple"] = True
as @thomwolf suggested. But perhaps in practice they cover the same cases.
I'm digging for where else this is needed to make the tests work.
OK, to make the common tests work, this is needed:
diff --git a/tests/test_modeling_common.py b/tests/test_modeling_common.py
index 0021f23c..683b7913 100644
--- a/tests/test_modeling_common.py
+++ b/tests/test_modeling_common.py
@@ -803,6 +803,7 @@ class ModelTesterMixin:
             # Wrap model in nn.DataParallel
             model = torch.nn.DataParallel(model)
+            inputs_dict["return_tuple"] = True
             with torch.no_grad():
                 _ = model(**self._prepare_for_class(inputs_dict, model_class))
yikes.
PR for both: https://github.com/huggingface/transformers/pull/5733
Let me know if you prefer a separate PR for each.
Also, why does the return_tuple param default to None and not False in most models, whereas in some it's False? It should probably be False everywhere, no?
The same applies to the output_hidden_states and output_attentions forward params: sometimes they default to None and other times to False. They should probably be False everywhere.
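For context, the None default is usually there so that the call-time argument can fall back to the model config; roughly this pattern (paraphrased, not an exact copy of any model):

class ToyModel:
    def __init__(self, config):
        self.config = config

    def forward(self, input_ids, output_attentions=None):
        # None means "use the config value"; an explicit True/False at call time overrides it.
        output_attentions = (
            output_attentions if output_attentions is not None else self.config.output_attentions
        )
        ...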
I think we can find a work-around for this in the meantime by allowing our output data classes to accept a list/tuple as input to the first argument and spread these over the other arguments in __post_init__. I'll try to make a PR on this.
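As a toy illustration of that idea (not the actual transformers implementation, and ignoring the detail that the gathered values arrive as a lazy map object rather than a list/tuple):

from dataclasses import dataclass, fields
from typing import Any, Optional

@dataclass
class ToyMaskedLMOutput:
    loss: Optional[Any] = None
    logits: Optional[Any] = None
    hidden_states: Optional[Any] = None

    def __post_init__(self):
        # if the first field was handed the whole gathered tuple,
        # spread it back over the individual fields
        first = fields(self)[0].name
        value = getattr(self, first)
        if isinstance(value, (list, tuple)):
            for f, v in zip(fields(self), value):
                setattr(self, f.name, v)

# ToyMaskedLMOutput((None, logits_tensor, None)).logits would then be logits_tensor

Note this also relies on every field having a default; otherwise __init__ still fails before __post_init__ runs.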
I think we can find a work-around for this in the meantime by allowing our output data classes to accept a list/tuple as input to the first argument and spread these over the other arguments in __post_init__. I'll try to make a PR on this.
For me, it is now working with this workaround (fine-tuning LMs). But should I be concerned about the reliability of the results?
should I be concerned about the reliability of the results?
If you're referring to the https://github.com/huggingface/transformers/pull/5685 commit, there is no reason to be concerned. There was no "functional" change per se; this is really about sorting out the API and trying to make it consistent.
I also ran into a similar problem when running the script from examples/question-answering on the master branch with two GPUs:
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
--per_gpu_eval_batch_size=16 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 320 \
--doc_stride 128 \
--output_dir $SQUAD_DIR/bert-base-uncased-squad_v1
The error looks like below:
File "run_squad.py", line 821, in <module>
main()
File "run_squad.py", line 764, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_squad.py", line 202, in train
outputs = model(**inputs)
File "/home/qqcao/work/transformers/.env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/qqcao/work/transformers/.env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
return self.gather(outputs, self.output_device)
File "/home/qqcao/work/transformers/.env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/qqcao/work/transformers/.env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/qqcao/work/transformers/.env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
TypeError: __init__() missing 2 required positional arguments: 'start_logits' and 'end_logits'
I have to roll back to version 3.0.0. Do you have any ETA for when this will get fixed? Thanks.
@csarron, this should fix it.
--- a/examples/question-answering/run_squad.py
+++ b/examples/question-answering/run_squad.py
@@ -199,6 +199,9 @@ def train(args, train_dataset, model, tokenizer):
{"langs": (torch.ones(batch[0].shape, dtype=torch.int64) * args.lang_id).to(args.device)}
)
+ if isinstance(model, torch.nn.DataParallel):
+ inputs["return_tuple"] = True
+
outputs = model(**inputs)
# model outputs are always tuple in transformers (see doc)
loss = outputs[0]
It appears that this will now need to be added everywhere before the model is invoked, and users will need to do the same in their own code if they intend to use DataParallel.
Surely there must be a better way. I suppose that when this neat dataclass feature was added it wasn't tested with nn.DataParallel. Perhaps it's best to back it out, get pytorch to support dataclasses in scatter/gather, and then put it back in, possibly with a monkeypatch for older pytorch versions. https://github.com/pytorch/pytorch/issues/41327
p.s. Note that the project's scripts/modules don't consistently import torch.nn as nn, so sometimes it's torch.nn.DataParallel, whereas other times nn.DataParallel.
Got the same problem here.
@sgugger came up with a transparent solution for this issue: https://github.com/huggingface/transformers/pull/5941
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.