Transformers: PyTorch Lightning examples don't work on multiple GPUs with backend=dp

Created on 21 Apr 2020 · 28 comments · Source: huggingface/transformers

🐛 Bug

Information

Model I am using (Bert, XLNet ...): Bert

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [x] the official example scripts: run_pl.sh (run_pl_glue.py)

The task I am working on is:

  • [x] an official GLUE/SQuAD task: GLUE

To reproduce

Steps to reproduce the behavior:

  1. Run the run_pl.sh script on multiple GPUs (e.g. 8 GPUs)

Expected behavior

GLUE training should run successfully.

Environment info

  • transformers version: 2.8.0
  • Platform: Linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.4
  • Tensorflow version (GPU?):
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: DataParallel
Label: wontfix

Most helpful comment

> I found one more issue. If I use fast tokenizers with ddp as backend, I get the below error:

@leslyarun I am also facing a similar issue with the ddp backend (not exactly the same): github issue
My guess is that maybe there is an issue with the callback and saving objects with pickle. For now I will try to manually save checkpoints without using the callbacks.

All 28 comments

I get the below error:

Validation sanity check:   0%|                                                                                                                | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_pl_glue.py", line 186, in <module>
    trainer = generic_train(model, args)
  File "/home/jupyter/transformers/examples/transformer_base.py", line 307, in generic_train
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 701, in fit
    self.dp_train(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 540, in dp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 843, in run_pretrain_routine
    False)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 262, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 430, in evaluation_forward
    output = model(*args)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    assert all(map(lambda i: i.is_cuda, inputs))
AssertionError

@nateraw @williamFalcon

update to the latest lightning version?
0.7.4rc1

@williamFalcon It doesn't work with lightning versions 0.7.4rc1, 0.7.4rc2, or even 0.7.3 and 0.7.1

ok, can you share a colab here? happy to take a look

@williamFalcon Thanks. I'm running the code as per the instructions given in https://github.com/huggingface/transformers/tree/master/examples/glue
I didn't make any changes, I just ran the same official example script on multiple GPUs - https://github.com/huggingface/transformers/blob/master/examples/glue/run_pl.sh
It works on CPU and on a single GPU, but doesn't work on multiple GPUs.

It is a bit unclear what is going on in there: the bash script installs lightning but the python code doesn't seem to use it?

I am also facing this error, but with a different custom model. My code works properly on a single GPU; however, if I increase the number of GPUs to 2, it gives me the above error. I checked both PL 0.7.3 and 0.7.4rc3.

Update: interestingly, when I changed distributed_backend to ddp, it worked perfectly without any error. I think there is an issue with the dp distributed_backend.
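
For reference, that switch is a single Trainer argument; a minimal sketch, assuming Lightning 0.7/0.8 where the flag is still called distributed_backend:

```python
import pytorch_lightning as pl

# dp (DataParallel) triggers the gather error above; ddp (DistributedDataParallel)
# runs one process per GPU and sidesteps it.
trainer = pl.Trainer(
    gpus=2,
    distributed_backend="ddp",  # was "dp"
)
trainer.fit(model)  # `model` is your LightningModule instance
```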


run_pl.sh runs fine.

I ran without ANY changes to the file. Did you guys change anything in the file?

@williamFalcon I didn't change anything; I hope you ran it on multiple GPUs. The code seems to run fine with ddp, but not with dp, as mentioned by @mmiakashs.

When I debugged, I found that with dp (DataParallel) on 8 GPUs, 8 different losses are generated, and since training_step can't gather the 8 losses, it fails with an error like this:
TypeError: zip argument #1 must support iteration
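
For what it's worth, newer Lightning versions expect dp users to reduce those per-GPU outputs themselves in training_step_end (called training_end in 0.7.x); a minimal hedged sketch, not taken from the example script:

```python
import pytorch_lightning as pl


class GlueModule(pl.LightningModule):  # hypothetical stand-in for the example's module
    def training_step(self, batch, batch_idx):
        loss = self(**batch)[0]  # with dp this runs once per GPU replica
        return {"loss": loss}

    def training_step_end(self, outputs):
        # dp gathers one output per GPU, so "loss" arrives as a tensor of
        # shape [num_gpus]; reduce it to a scalar before the backward pass.
        return {"loss": outputs["loss"].mean()}
```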

Ummm, yeah not sure... It looks ok to me.


Try running dp on 2 GPUs? This test is on 2 GPUs

It looks like HF sets ddp as the backend, which is great because dp has a bunch of issues (this is a PyTorch problem, not a Lightning one). Both PyTorch and Lightning discourage using dp.

Just ran this with the default ddp and it works well (although the run_pl.sh script has a bunch of usability issues, e.g. I need the data in a different part of the cluster but the script doesn't handle that, so I had to run from that directory on the cluster. Ideally --data_dir would solve this, but it doesn't).

I can confirm that the issue occurs only when using multiple GPUs with dp as the backend. Using ddp solves it.

I found one more issue. If I use fast tokenizers with ddp as backend, I get the below error:

INFO:lightning:GPU available: True, used: True
INFO:lightning:CUDA_VISIBLE_DEVICES: [0,1]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/warnings.py:18: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "run_pl_glue.py", line 187, in <module>
    trainer = generic_train(model, args)
  File "/home/jupyter/transformers/examples/transformer_base.py", line 310, in generic_train
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 734, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle Tokenizer objects
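
One hedged workaround (not from this thread) is to keep the fast tokenizer out of the pickled state and rebuild it in each spawned worker process; a sketch, with the module and argument names made up:

```python
import pytorch_lightning as pl
from transformers import AutoTokenizer


class GlueModule(pl.LightningModule):  # hypothetical stand-in for the example's module
    def __init__(self, model_name_or_path):
        super().__init__()
        self.model_name_or_path = model_name_or_path
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

    def __getstate__(self):
        # ddp (spawn) pickles the whole module before starting worker processes;
        # the Rust-backed fast tokenizer cannot be pickled, so drop it here...
        state = self.__dict__.copy()
        state["tokenizer"] = None
        return state

    def __setstate__(self, state):
        # ...and rebuild it inside each spawned worker.
        self.__dict__.update(state)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, use_fast=True)
```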

> I found one more issue. If I use fast tokenizers with ddp as backend, I get the below error:

@leslyarun I am also facing a similar issue with the ddp backend (not exactly the same): github issue
My guess is that maybe there is an issue with the callback and saving objects with pickle. For now I will try to manually save checkpoints without using the callbacks.
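
One hedged way to do that manual save, bypassing the checkpoint callback entirely, is to write out the wrapped Hugging Face model yourself once training finishes. This assumes the LightningModule keeps the transformers model as self.model and the tokenizer as self.tokenizer, as the example's BaseTransformer appears to:

```python
import torch

# Run after trainer.fit(model) returns (on rank 0 if using ddp).
torch.save(model.model.state_dict(), "manual_checkpoint.pt")

# Or save in the Hugging Face format so from_pretrained() can reload it later:
model.model.save_pretrained("./manual_checkpoint")
model.tokenizer.save_pretrained("./manual_checkpoint")
```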

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@mmiakashs did that end up working?

> @mmiakashs did that end up working?

Currently I am using ddp_spawn mode and it is working fine.

@sshleifer Can confirm that A) the Lightning examples don't work at all with dp, and B) ddp does run, but needs significant editing.

For the examples I've looked at, it's not as simple as turning ddp on and everything being great. It seems whoever wrote the Lightning examples never tried multi-GPU. Happy to elaborate or share (though mine are not in great shape at the moment).

And ddp_spawn definitely does not work for me. Gives several spawn-based errors -- says my model is not compliant.

A) Don't know, but that sounds very likely. @williamFalcon told me "Don't use dp".

B) examples/seq2seq/finetune.py works on multiple GPUs with two caveats:
(a) versions need to be transformers=master, pl=0.8.1.
(b) you cannot pass --do_predict (pl.Trainer.test is broken for multi-GPU).

For the other two pl examples, ner and glue, I haven't tested multi-GPU, but they should be at least close to working because they inherit from the same BaseTransformer. Which of those were you trying to run / are you interested in running?

Thanks @sshleifer. We're fine using ddp for everything -- only need one version to work, not multiple ways to do the same thing. Also according to the docs, ddp is the only one that works with FP16 anyway (have not tested yet, will do soon).
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html
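
If FP16 is the goal, a minimal hedged sketch of the Trainer flags for that Lightning generation (0.7/0.8, where 16-bit training still goes through NVIDIA apex) would be:

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=8,
    distributed_backend="ddp",  # per the docs, the backend that works with 16-bit
    precision=16,               # requires apex to be installed in PL 0.7/0.8
    amp_level="O1",
)
```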

I'm working off of transformers from GitHub... so it should be a recent version. If that's not what you're saying, could you please be more specific?

We also don't necessarily "need" Lightning... but it would be great if it worked (with a single set of settings) for multi-GPU. As it is, it's great having reasonable out-of-the-box options for LR schedules, model synchronization, gradient accumulation, and all those other things I've grown tired of implementing for every project.

@moscow25 dp is NOT recommended by PyTorch

https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html

  1. The current base transformer class has a few issues which I've submitted a PR for.
  2. Please let me know what example you are using / what code I can look at to reproduce the issues.

> @sshleifer Can confirm that A) the Lightning examples don't work at all with dp, and B) ddp does run, but needs significant editing.
>
> For the examples I've looked at, it's not as simple as turning ddp on and everything being great. It seems whoever wrote the Lightning examples never tried multi-GPU. Happy to elaborate or share (though mine are not in great shape at the moment).
>
> And ddp_spawn definitely does not work for me. Gives several spawn-based errors -- says my model is not compliant.

ddp doesn't work for me and ddp_spawn gives a lot of errors. When using ddp, no error is shown but nothing starts on the GPU; the notebook cell just stays busy indefinitely. I am using the DistilBertTokenizer and DistilBertModel. Has anyone been able to run PyTorch Lightning on multiple GPUs with DistilBert?

I suspect that your issue is ddp+jupyter rather than DistilBert. Try running your command from the terminal.

> I suspect that your issue is ddp+jupyter rather than DistilBert. Try running your command from the terminal.

Why does running the code in a Jupyter notebook create a problem? I was able to run the Bert models like SequenceClassification in a Jupyter notebook on multiple GPUs without any problem, but I run into this multi-GPU problem when using PyTorch Lightning. It is nice to be able to use PyTorch Lightning given all the built-in options; it makes it easier to build models interactively in a Jupyter notebook.


Looks like ddp doesn't work in a Jupyter notebook, and transformers don't work with the dp backend of PyTorch Lightning either. So it looks like the only way to use PyTorch Lightning, multiple GPUs and transformers together is to run it as a Python script.
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html

> Jupyter Notebooks
> Unfortunately any ddp_ is not supported in jupyter notebooks. Please use dp for multiple GPUs. This is a known Jupyter issue. If you feel like taking a stab at adding this support, feel free to submit a PR!
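
If staying in a notebook is a hard requirement, one hedged way to pick a backend that at least launches is to fall back to dp only when running under Jupyter (checking for ipykernel is a heuristic of mine, not an official Lightning API):

```python
import sys
import pytorch_lightning as pl

# ddp/ddp_spawn need a plain `python script.py` launch; inside Jupyter the only
# multi-GPU backend Lightning offers is dp, with the limitations discussed above.
in_notebook = "ipykernel" in sys.modules  # heuristic, not an official check
trainer = pl.Trainer(
    gpus=2,
    distributed_backend="dp" if in_notebook else "ddp",
)
```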

I believe @nateraw is almost done updating the examples with the latest version of PL.

Can you share the model that does work with multiple GPUs in a Jupyter notebook?

I read somewhere in the PyTorch Lightning documentation about being careful when checkpointing models in DDP mode. I can't find that documentation now, but is there something I need to be careful about when checkpointing while running DDP on a single machine with 8 GPUs? It was something about the model getting split among multiple machines; I'm not sure that applies if DDP is used on a single machine.

nothing you have to worry about... we save the checkpoint correctly automatically
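
For anyone who still wants an explicit checkpoint callback under ddp, a minimal hedged sketch (argument names as in Lightning 0.7/0.8, where the path argument is called filepath):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Lightning writes the checkpoint from rank 0 only, so a single-machine
# 8-GPU ddp run still produces one checkpoint file.
checkpoint_callback = ModelCheckpoint(
    filepath="checkpoints/{epoch}-{val_loss:.2f}",
    monitor="val_loss",
    save_top_k=1,
)
trainer = pl.Trainer(
    gpus=8,
    distributed_backend="ddp",
    checkpoint_callback=checkpoint_callback,
)
```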

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
