Model I am using (Bert, XLNet ...):
Language I am using the model on (English, Chinese ...):
The problem arises when using:
The task I am working on is:
Steps to reproduce the behavior:
File "./examples/text-classification/run_glue.py", line 246, in
main()
File "./examples/text-classification/run_glue.py", line 173, in main
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
tr_loss += self._training_step(model, inputs, optimizer)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
outputs = model(**inputs)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
return self.gather(outputs, self.output_device)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
return gather(outputs, output_device, dim=self.dim)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/work/vnhh/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
TypeError: __init__() missing 1 required positional argument: 'logits'
Expected behavior: it should be able to run and finish training.
transformers version: 3.0.2
I faced the same error yesterday. Installing version 3.0.1 fixed the issue for me.
Downgrading one or two versions can work around this. However, I will leave this issue open so the maintainers know the bug exists in the newest version.
It appears that CircleCI doesn't run GPU tests (or at least not multi-GPU ones); all test_multigpu_data_parallel_forward sub-tests fail, e.g. tests/test_modeling_flaubert.py::FlaubertModelTest::test_multigpu_data_parallel_forward.
pytest --disable-warnings -n 1 tests/test_modeling_bert.py::BertModelTest::test_multigpu_data_parallel_forward
====================================================================== test session starts =======================================================================
platform linux -- Python 3.7.5, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /mnt/nvme1/code/huggingface/transformers-tests-1
plugins: hypothesis-5.5.4, filter-subpackage-0.1.1, arraydiff-0.3, flaky-3.6.1, ipynb-1.1.1.dev0, cov-2.10.0, astropy-header-0.1.2, forked-1.2.0, doctestplus-0.5.0, openfiles-0.4.0, remotedata-0.3.2, xdist-1.32.0
gw0 [1]
F [100%]
============================================================================ FAILURES ============================================================================
_______________________________________________________ BertModelTest.test_multigpu_data_parallel_forward ________________________________________________________
[gw0] linux -- Python 3.7.5 /home/stas/anaconda3/envs/main/bin/python
self = <tests.test_modeling_bert.BertModelTest testMethod=test_multigpu_data_parallel_forward>
    @require_multigpu
    def test_multigpu_data_parallel_forward(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        # some params shouldn't be scattered by nn.DataParallel
        # so just remove them if they are present.
        blacklist_non_batched_params = ["head_mask"]
        for k in blacklist_non_batched_params:
            inputs_dict.pop(k, None)
        # move input tensors to cuda:0
        for k, v in inputs_dict.items():
            if torch.is_tensor(v):
                inputs_dict[k] = v.to(0)
        for model_class in self.all_model_classes:
            model = model_class(config=config)
            model.to(0)
            model.eval()
            # Wrap model in nn.DataParallel
            model = torch.nn.DataParallel(model)
            with torch.no_grad():
>               _ = model(**self._prepare_for_class(inputs_dict, model_class))
tests/test_modeling_common.py:807:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/modules/module.py:550: in __call__
result = self.forward(*input, **kwargs)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py:156: in forward
return self.gather(outputs, self.output_device)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py:168: in gather
return gather(outputs, output_device, dim=self.dim)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py:68: in gather
res = gather_map(outputs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
outputs = [BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 1.0115e+00, 1.4145e+00, -5.7332e-01, ..., -4.6471e-01,
... 0.1111, -0.0592, -0.1177, 0.0074, -0.0155, -0.1015]],
device='cuda:1'), hidden_states=None, attentions=None)]
    def gather_map(outputs):
        out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None
        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))
>       return type(out)(map(gather_map, zip(*outputs)))
E       TypeError: __init__() missing 1 required positional argument: 'pooler_output'
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py:63: TypeError
==================================================================== short test summary info =====================================================================
FAILED tests/test_modeling_bert.py::BertModelTest::test_multigpu_data_parallel_forward - TypeError: __init__() missing 1 required positional argument: 'pooler_...
================================================================= 1 failed, 4 warnings in 5.44s ==================================================================
pytest --disable-warnings -n 1 tests/test_modeling_flaubert.py::FlaubertModelTest::test_multigpu_data_parallel_forward
====================================================================== test session starts =======================================================================
platform linux -- Python 3.7.5, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /mnt/nvme1/code/huggingface/transformers-tests-1
plugins: hypothesis-5.5.4, filter-subpackage-0.1.1, arraydiff-0.3, flaky-3.6.1, ipynb-1.1.1.dev0, cov-2.10.0, astropy-header-0.1.2, forked-1.2.0, doctestplus-0.5.0, openfiles-0.4.0, remotedata-0.3.2, xdist-1.32.0
gw0 [1]
F [100%]
============================================================================ FAILURES ============================================================================
_____________________________________________________ FlaubertModelTest.test_multigpu_data_parallel_forward ______________________________________________________
[gw0] linux -- Python 3.7.5 /home/stas/anaconda3/envs/main/bin/python
self = <tests.test_modeling_flaubert.FlaubertModelTest testMethod=test_multigpu_data_parallel_forward>
    @require_multigpu
    def test_multigpu_data_parallel_forward(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        # some params shouldn't be scattered by nn.DataParallel
        # so just remove them if they are present.
        blacklist_non_batched_params = ["head_mask"]
        for k in blacklist_non_batched_params:
            inputs_dict.pop(k, None)
        # move input tensors to cuda:0
        for k, v in inputs_dict.items():
            if torch.is_tensor(v):
                inputs_dict[k] = v.to(0)
        for model_class in self.all_model_classes:
            model = model_class(config=config)
            model.to(0)
            model.eval()
            # Wrap model in nn.DataParallel
            model = torch.nn.DataParallel(model)
            with torch.no_grad():
>               _ = model(**self._prepare_for_class(inputs_dict, model_class))
tests/test_modeling_common.py:807:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/modules/module.py:550: in __call__
result = self.forward(*input, **kwargs)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py:156: in forward
return self.gather(outputs, self.output_device)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py:168: in gather
return gather(outputs, output_device, dim=self.dim)
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py:68: in gather
res = gather_map(outputs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
outputs = [MaskedLMOutput(loss=None, logits=tensor([[[-0.0008, 0.3751, -0.0050, ..., 0.0933, -0.1563, 0.0494],
[-0....0, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]]],
device='cuda:1'), hidden_states=None, attentions=None)]
    def gather_map(outputs):
        out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None
        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))
>       return type(out)(map(gather_map, zip(*outputs)))
E       TypeError: __init__() missing 1 required positional argument: 'logits'
/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py:63: TypeError
==================================================================== short test summary info =====================================================================
FAILED tests/test_modeling_flaubert.py::FlaubertModelTest::test_multigpu_data_parallel_forward - TypeError: __init__() missing 1 required positional argument: ...
================================================================= 1 failed, 4 warnings in 5.54s =============================================================
Digging deeper, it appears that torch.nn.parallel.scatter_gather.gather can't gather outputs that are dataclasses: it receives a list of dataclass outputs and, when rebuilding the result, passes everything to the constructor as a single value, so the remaining required fields come up missing.
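To make the failure mode concrete, here is a stripped-down sketch (a toy stand-in, not the real transformers ModelOutput) of what happens when gather_map hits a dataclass: the constructor gets the whole lazy map object as its first positional argument, so every other required field is reported as missing.

import torch
from dataclasses import dataclass

# Toy stand-in for a model output; the real output classes are also
# dataclasses and behave like tuples when iterated.
@dataclass
class ToyOutput:
    last_hidden_state: torch.Tensor
    pooler_output: torch.Tensor

    def __iter__(self):  # make the toy tuple-like, as the real outputs are
        return iter((self.last_hidden_state, self.pooler_output))

# one output per GPU replica, as nn.DataParallel would collect them
outputs = [ToyOutput(torch.ones(2, 4), torch.ones(2)),
           ToyOutput(torch.ones(2, 4), torch.ones(2))]
out = outputs[0]

# gather_map's generic fallback is essentially type(out)(map(gather_map, zip(*outputs))).
# tuple(<iterable>) is fine with that, but a dataclass constructor receives the
# single map object as its first field and every other required field is missing:
try:
    type(out)(map(lambda grouped: grouped, zip(*outputs)))
except TypeError as e:
    print(e)  # __init__() missing 1 required positional argument: 'pooler_output'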
This pytorch hack fixes the problem for the failing tests. Swap the gather function for this one (including import):
# torch/nn/parallel/scatter_gather.py
import dataclasses

def gather(outputs, target_device, dim=0):
    r"""
    Gathers tensors from different GPUs on a specified device
      (-1 means the CPU).
    """
    def gather_map(outputs):
        out = outputs[0]
        if dataclasses.is_dataclass(out):
            outputs = [dataclasses.asdict(out) for out in outputs]
            out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None
        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))
        return type(out)(map(gather_map, zip(*outputs)))

    # Recursive function calls like this create reference cycles.
    # Setting the function to None clears the refcycle.
    try:
        res = gather_map(outputs)
    finally:
        gather_map = None
    return res
It converts the dataclass output into a dict and then it works - at least the tests pass; I haven't tried the OP's example.
What I added is:
import dataclasses
and
if dataclasses.is_dataclass(out):
    outputs = [dataclasses.asdict(out) for out in outputs]
    out = outputs[0]
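If you'd rather not edit the installed torch file, the same change can be applied as a runtime monkeypatch. A rough, untested sketch: it assumes the patched gather above is defined in your own module, which then also needs torch and Gather imported.

import torch
from torch.nn.parallel._functions import Gather  # used inside the patched gather_map
import torch.nn.parallel.scatter_gather as scatter_gather
import torch.nn.parallel.data_parallel as data_parallel

# DataParallel.gather() calls the module-level `gather` that data_parallel.py
# imported from scatter_gather, so both references need to be swapped.
scatter_gather.gather = gather  # `gather` = the patched function defined above
data_parallel.gather = gather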
I filed a bug report with pytorch: https://github.com/pytorch/pytorch/issues/41327
My pytorch tweak fixes the transformers tests, but when I try it on the OP's use case, it fails elsewhere:
export TASK_NAME=CoLA
export GLUE_DIR=/tmp/glue_data/
python ./examples/text-classification/run_glue.py --model_name_or_path bert-base-uncased --task_name $TASK_NAME --do_train --do_eval --data_dir $GLUE_DIR/$TASK_NAME --max_seq_length 128 --per_device_eval_batch_size=2 --per_device_train_batch_size=2 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir /tmp/$TASK_NAME/
...
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 98, in <listcomp>
outputs = [dataclasses.asdict(out) for out in outputs]
File "/home/stas/anaconda3/envs/main/lib/python3.7/dataclasses.py", line 1045, in asdict
return _asdict_inner(obj, dict_factory)
File "/home/stas/anaconda3/envs/main/lib/python3.7/dataclasses.py", line 1052, in _asdict_inner
value = _asdict_inner(getattr(obj, f.name), dict_factory)
File "/home/stas/anaconda3/envs/main/lib/python3.7/dataclasses.py", line 1086, in _asdict_inner
return copy.deepcopy(obj)
File "/home/stas/anaconda3/envs/main/lib/python3.7/copy.py", line 161, in deepcopy
y = copier(memo)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/tensor.py", line 44, in __deepcopy__
raise RuntimeError("Only Tensors created expl
RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment
So that conversion from dataclass to dict didn't work elsewhere. Needs more digging.
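My guess at the cause, based on the traceback: dataclasses.asdict() recurses with copy.deepcopy(), and tensors still attached to the autograd graph (the training loss/logits) refuse to be deep-copied. A shallow conversion that keeps the original tensor objects would avoid that path; a sketch of what the change could look like:

import dataclasses

def shallow_asdict(out):
    # keep the original tensors instead of deep-copying them like dataclasses.asdict() does
    return {f.name: getattr(out, f.name) for f in dataclasses.fields(out)}

# in the gather hack above, this would replace
#     outputs = [dataclasses.asdict(out) for out in outputs]
# with
#     outputs = [shallow_asdict(out) for out in outputs]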
@vanh17, until this is sorted out, you may choose to run on a single GPU, which I tested and it works.
You can accomplish that by adding to your command line:
env CUDA_VISIBLE_DEVICES=0 python ./examples/text-classification/run_glue.py ...
Change 0 to whichever GPU you want it to run on.
I think this is related to https://github.com/huggingface/transformers/pull/5685
When used in a nn.DataParallel setup, a model should be instantiated with return_tuple=True.
It would be nice to check whether there is a way for a model to know that it is part of a nn.DataParallel so it can set up this argument automatically. If someone wants to give it a look...
cc @sgugger
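One option, sketched below (hypothetical, not part of transformers): since a module can't easily tell from inside forward() that it is running as a DataParallel replica, a thin wrapper could inject the argument on every call instead.

import torch

class TupleReturningDataParallel(torch.nn.DataParallel):
    """Hypothetical wrapper that forces the wrapped model to return plain tuples."""

    def forward(self, *inputs, **kwargs):
        # return_tuple is the forward argument the transformers models exposed at the time
        kwargs["return_tuple"] = True
        return super().forward(*inputs, **kwargs)

# usage: model = TupleReturningDataParallel(model)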
I can look at this when I'm back next week. In the meantime, merging #5685 will fix the issue.
merging #5685 will fix the issue.
I verified that the run_glue.py on dual gpu work after this merge.
Is there a CircleCI config that supports dual-GPU tests?
edit: multigpu tests still fail as before. I forgot to back out the pytorch hack.
So, if with n_gpu > 1 it works without the outputs being wrapped in a model output dataclass, why do we ever need to return a dataclass rather than always returning a tuple, regardless of n_gpu's value? The same goes for @thomwolf's suggestion, just keyed on nn.DataParallel instead. https://github.com/huggingface/transformers/pull/5685 just moved the problem elsewhere, since it's no longer possible to rely on a model returning an output dataclass, and the behavior now differs depending on the hardware setup.
Always returning tuples requires the user to know which output is at which position (and that changes depending on the parameters you pass to the model), so having something self-documenting is a feature users have asked for for a long time.
I totally understand that, and it is great. But if a user codes against that API, relying on outputs being a dataclass, and their code is then run in a multi-GPU environment, it will break. Are we on the same page now?
I can see two solutions that lead to a consistent API:
1. getting pytorch to support not only dict outputs but also dataclasses in gather: https://github.com/pytorch/pytorch/issues/41327
2. re-encapsulating the tuple into the original output dataclass when it returns from pytorch to transformers, before it is passed back to the user (see the sketch below). There would be a small additional overhead, and we don't really have a place to hook in such a manipulation, so this is probably not feasible at the moment.
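For what it's worth, here is a hypothetical sketch of option 2 (not actual transformers or pytorch code): break each replica's dataclass output into a plain tuple of field values, let the stock gather handle those, and rebuild the dataclass on the way out. It only works if every field is a tensor or None, and it carries the overhead mentioned above.

import dataclasses
import torch

class RewrappingDataParallel(torch.nn.DataParallel):
    """Hypothetical sketch: gather dataclass outputs by round-tripping through tuples."""

    def gather(self, outputs, output_device):
        first = outputs[0]
        if dataclasses.is_dataclass(first):
            field_names = [f.name for f in dataclasses.fields(first)]
            # break each replica's output into a plain tuple the stock gather understands
            as_tuples = [tuple(getattr(o, name) for name in field_names) for o in outputs]
            gathered = super().gather(as_tuples, output_device)
            # re-encapsulate into the original output dataclass for the caller
            return type(first)(**dict(zip(field_names, gathered)))
        return super().gather(outputs, output_device)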
I updated my earlier comment: the multigpu tests still fail after @sgugger's commit, just as before, so only part of the problem has been worked around. I had forgotten to back out the proposed pytorch hack, so it looked like the commit worked, but it does not.
wrt the change in https://github.com/huggingface/transformers/pull/5685, wouldn't this be a better fit:
  # Our model outputs do not work with DataParallel, so forcing return tuple.
- if self.args.n_gpu > 1:
+ if isinstance(model, nn.DataParallel):
      inputs["return_tuple"] = True
as @thomwolf suggested. But perhaps in practice they cover the same cases.
I'm digging for where else this is needed to make the tests work.
OK, to make the common tests work, this is needed:
diff --git a/tests/test_modeling_common.py b/tests/test_modeling_common.py
index 0021f23c..683b7913 100644
--- a/tests/test_modeling_common.py
+++ b/tests/test_modeling_common.py
@@ -803,6 +803,7 @@ class ModelTesterMixin:
             # Wrap model in nn.DataParallel
             model = torch.nn.DataParallel(model)
+            inputs_dict["return_tuple"] = True
             with torch.no_grad():
                 _ = model(**self._prepare_for_class(inputs_dict, model_class))
yikes.
PR for both: https://github.com/huggingface/transformers/pull/5733
Let me know if you prefer a separate PR for each.
Also, why does the return_tuple param default to None and not False in most models, whereas in some it's False? It should probably be False everywhere, no?
The same applies to the output_hidden_states and output_attentions forward params: sometimes they default to None and other times to False. They should probably be False everywhere.
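For context, the None default is usually there so that the call-time argument can fall back to the model config; roughly this pattern (paraphrased, not an exact copy of any model):

class ToyModel:
    def __init__(self, config):
        self.config = config

    def forward(self, input_ids, output_attentions=None):
        # None means "use the config value"; an explicit True/False at call time overrides it.
        output_attentions = (
            output_attentions if output_attentions is not None else self.config.output_attentions
        )
        ...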
I think we can find a work-around for this in the meantime by allowing our output data classes to accept a list/tuple as input to the first argument and spread these over the other arguments in __post_init__. I'll try to make a PR on this.
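As a toy illustration of that idea (not the actual transformers implementation, and ignoring the detail that the gathered values arrive as a lazy map object rather than a list/tuple):

from dataclasses import dataclass, fields
from typing import Any, Optional

@dataclass
class ToyMaskedLMOutput:
    loss: Optional[Any] = None
    logits: Optional[Any] = None
    hidden_states: Optional[Any] = None

    def __post_init__(self):
        # if the first field was handed the whole gathered tuple,
        # spread it back over the individual fields
        first = fields(self)[0].name
        value = getattr(self, first)
        if isinstance(value, (list, tuple)):
            for f, v in zip(fields(self), value):
                setattr(self, f.name, v)

# ToyMaskedLMOutput((None, logits_tensor, None)).logits would then be logits_tensor

Note this also relies on every field having a default; otherwise __init__ still fails before __post_init__ runs.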
I think we can find a work-around for this in the meantime by allowing our output data classes to accept a list/tuple as input to the first argument and spread these over the other arguments in __post_init__. I'll try to make a PR on this.
For me, it is now working with this workaround (fine-tuning LMs). But should I be concerned about the reliability of the results?
should I be concerned about the reliability of the results?
If you're referring to the https://github.com/huggingface/transformers/pull/5685 commit, there is no reason to be concerned. There was no "functional" change per se; this is really about sorting out the API and trying to make it consistent.
I also ran into a similar problem when running the script from examples/question-answering on the master branch with two GPUs:
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
--per_gpu_eval_batch_size=16 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 320 \
--doc_stride 128 \
--output_dir $SQUAD_DIR/bert-base-uncased-squad_v1
The error looks like below:
File "run_squad.py", line 821, in <module>
main()
File "run_squad.py", line 764, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_squad.py", line 202, in train
outputs = model(**inputs)
File "/home/qqcao/work/transformers/.env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/qqcao/work/transformers/.env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
return self.gather(outputs, self.output_device)
File "/home/qqcao/work/transformers/.env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/qqcao/work/transformers/.env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/qqcao/work/transformers/.env/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
TypeError: __init__() missing 2 required positional arguments: 'start_logits' and 'end_logits'
I have to roll back to version 3.0.0. Do you have any ETA for when this will get fixed? Thanks.
@csarron, this should fix it.
--- a/examples/question-answering/run_squad.py
+++ b/examples/question-answering/run_squad.py
@@ -199,6 +199,9 @@ def train(args, train_dataset, model, tokenizer):
{"langs": (torch.ones(batch[0].shape, dtype=torch.int64) * args.lang_id).to(args.device)}
)
+ if isinstance(model, torch.nn.DataParallel):
+ inputs["return_tuple"] = True
+
outputs = model(**inputs)
# model outputs are always tuple in transformers (see doc)
loss = outputs[0]
It appears that this will now need to be added everywhere before the model is invoked, and users will need to do the same in their own code if they intend to use DataParallel.
Surely there must be a better way. I suppose that when this neat dataclass feature was added it wasn't tested with nn.DataParallel. Perhaps it's best to back it out, get pytorch to support dataclasses in scatter/gather, and then put it back in, possibly with a monkeypatch for older pytorch versions. https://github.com/pytorch/pytorch/issues/41327
p.s. Note that the project's scripts/modules don't consistently import torch.nn as nn, so sometimes it's torch.nn.DataParallel, whereas other times nn.DataParallel.
Got the same problem here.
@sgugger came up with a transparent solution for this issue: https://github.com/huggingface/transformers/pull/5941
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.