Transformers: PyTorch Lightning examples don't work on multiple GPUs with backend=dp

Created on 21 Apr 2020 · 28 comments · Source: huggingface/transformers

🐛 Bug

Information

Model I am using (Bert, XLNet ...): Bert

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [x] the official example scripts: run_pl.sh (run_pl_glue.py)

The task I am working on is:

  • [x] an official GLUE/SQuAD task: GLUE

To reproduce

Steps to reproduce the behavior:

  1. Run the run_pl.sh script on multiple GPUs (e.g. 8 GPUs)

Expected behavior

GLUE training should run successfully.

Environment info

  • transformers version: 2.8.0
  • Platform: Linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.4
  • Tensorflow version (GPU?):
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: DataParallel
Label: wontfix

Most helpful comment

> I found one more issue. If I use fast tokenizers with ddp as backend, I get the below error:

@leslyarun I am also facing a similar issue with the ddp backend (not exactly the same): github issue
My guess is that maybe there is an issue with the callback and saving objects with pickle. For now I will try to manually save checkpoints without using the callbacks.

All 28 comments

I get the below error:

Validation sanity check:   0%|                                                                                                                | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_pl_glue.py", line 186, in <module>
    trainer = generic_train(model, args)
  File "/home/jupyter/transformers/examples/transformer_base.py", line 307, in generic_train
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 701, in fit
    self.dp_train(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 540, in dp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 843, in run_pretrain_routine
    False)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 262, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 430, in evaluation_forward
    output = model(*args)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    assert all(map(lambda i: i.is_cuda, inputs))
AssertionError

@nateraw @williamFalcon

update to the latest lightning version?
0.7.4rc1

@williamFalcon It doesn't work with lightning versions 0.7.4rc1, 0.7.4rc2, or even 0.7.3 and 0.7.1

ok, can you share a colab here? happy to take a look

@williamFalcon Thanks. I'm running the code as per the instructions given in https://github.com/huggingface/transformers/tree/master/examples/glue
I didn't make any changes, I just ran the same official example script on multiple GPUs - https://github.com/huggingface/transformers/blob/master/examples/glue/run_pl.sh
It works on CPU and on a single GPU, but doesn't work on multiple GPUs.

It is a bit unclear what is going on in there: the bash script installs lightning but the python code doesn't seem to use it?

I am also facing this error, but with a different custom model. My code works properly on a single GPU; however, if I increase the number of GPUs to 2, it gives me the above error. I checked both PL 0.7.3 and 0.7.4rc3.

Update: interestingly, when I changed distributed_backend to ddp, it worked perfectly without any error. I think there is an issue with the dp distributed_backend.
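
For reference, that switch is a single Trainer argument; a minimal sketch, assuming Lightning 0.7/0.8 where the flag is still called distributed_backend:

```python
import pytorch_lightning as pl

# dp (DataParallel) triggers the gather error above; ddp (DistributedDataParallel)
# runs one process per GPU and sidesteps it.
trainer = pl.Trainer(
    gpus=2,
    distributed_backend="ddp",  # was "dp"
)
trainer.fit(model)  # `model` is your LightningModule instance
```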


run_pl.sh runs fine.

I ran without ANY changes to the file. Did you guys change anything in the file?

@williamFalcon I didn't change anything; I hope you ran it on multiple GPUs. The code seems to run fine with ddp, but not with dp, as mentioned by @mmiakashs.

When I debugged, I found that with dp (DataParallel) on 8 GPUs, 8 different losses are generated, and since training_step can't gather the 8 losses, it fails with an error like this:
TypeError: zip argument #1 must support iteration
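
For what it's worth, newer Lightning versions expect dp users to reduce those per-GPU outputs themselves in training_step_end (called training_end in 0.7.x); a minimal hedged sketch, not taken from the example script:

```python
import pytorch_lightning as pl


class GlueModule(pl.LightningModule):  # hypothetical stand-in for the example's module
    def training_step(self, batch, batch_idx):
        loss = self(**batch)[0]  # with dp this runs once per GPU replica
        return {"loss": loss}

    def training_step_end(self, outputs):
        # dp gathers one output per GPU, so "loss" arrives as a tensor of
        # shape [num_gpus]; reduce it to a scalar before the backward pass.
        return {"loss": outputs["loss"].mean()}
```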

Ummm, yeah not sure... It looks ok to me.


Try running dp on 2 GPUs? This test is on 2 GPUs

It looks like HF sets ddp as the backend, which is great because dp has a bunch of issues (this is a PyTorch problem, not a Lightning one). Both PyTorch and Lightning discourage using dp.

Just ran this with the default ddp and it works well (although the run_pl.sh script has a bunch of usability issues, e.g. I need the data in a different part of the cluster but the script doesn't handle that, so I had to run from that directory on the cluster. Ideally --data_dir would solve this, but it doesn't).

I can confirm that the issue occurs only when using multiple GPUs with dp as the backend. Using ddp solves it.

I found one more issue. If I use fast tokenizers with ddp as backend, I get the below error:

INFO:lightning:GPU available: True, used: True
INFO:lightning:CUDA_VISIBLE_DEVICES: [0,1]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/warnings.py:18: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "run_pl_glue.py", line 187, in <module>
    trainer = generic_train(model, args)
  File "/home/jupyter/transformers/examples/transformer_base.py", line 310, in generic_train
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 734, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle Tokenizer objects
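
One hedged workaround (not from this thread) is to keep the fast tokenizer out of the pickled state and rebuild it in each spawned worker process; a sketch, with the module and argument names made up:

```python
import pytorch_lightning as pl
from transformers import AutoTokenizer


class GlueModule(pl.LightningModule):  # hypothetical stand-in for the example's module
    def __init__(self, model_name_or_path):
        super().__init__()
        self.model_name_or_path = model_name_or_path
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

    def __getstate__(self):
        # ddp (spawn) pickles the whole module before starting worker processes;
        # the Rust-backed fast tokenizer cannot be pickled, so drop it here...
        state = self.__dict__.copy()
        state["tokenizer"] = None
        return state

    def __setstate__(self, state):
        # ...and rebuild it inside each spawned worker.
        self.__dict__.update(state)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, use_fast=True)
```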

> I found one more issue. If I use fast tokenizers with ddp as backend, I get the below error:

@leslyarun I am also facing a similar issue with the ddp backend (not exactly the same): github issue
My guess is that maybe there is an issue with the callback and saving objects with pickle. For now I will try to manually save checkpoints without using the callbacks.
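
One hedged way to do that manual save, bypassing the checkpoint callback entirely, is to write out the wrapped Hugging Face model yourself once training finishes. This assumes the LightningModule keeps the transformers model as self.model and the tokenizer as self.tokenizer, as the example's BaseTransformer appears to:

```python
import torch

# Run after trainer.fit(model) returns (on rank 0 if using ddp).
torch.save(model.model.state_dict(), "manual_checkpoint.pt")

# Or save in the Hugging Face format so from_pretrained() can reload it later:
model.model.save_pretrained("./manual_checkpoint")
model.tokenizer.save_pretrained("./manual_checkpoint")
```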

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@mmiakashs did that end up working?

> @mmiakashs did that end up working?

Currently I am using ddp_spawn mode and it is working fine.

@sshleifer Can confirm that A) the Lightning examples don't work at all with dp, and B) ddp does run, but needs significant editing.

For the examples I've looked at, it's not as simple as turning ddp on and everything being great. It seems whoever wrote the Lightning examples never tried multi-GPU. Happy to elaborate or share (though mine are not in great shape at the moment).

And ddp_spawn definitely does not work for me. Gives several spawn-based errors -- says my model is not compliant.

A) Don't know, but that sounds very likely. @williamFalcon told me "Don't use dp".

B) examples/seq2seq/finetune.py works on multiple GPUs with two caveats:
(a) versions need to be transformers=master, pl=0.8.1.
(b) you cannot pass --do_predict (pl.Trainer.test is broken for multi-GPU).

For the other two pl examples, ner and glue, I haven't tested multi-GPU, but they should be at least close to working because they inherit from the same BaseTransformer. Which of those were you trying to run / are you interested in running?

Thanks @sshleifer. We're fine using ddp for everything -- only need one version to work, not multiple ways to do the same thing. Also according to the docs, ddp is the only one that works with FP16 anyway (have not tested yet, will do soon).
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html
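
If FP16 is the goal, a minimal hedged sketch of the Trainer flags for that Lightning generation (0.7/0.8, where 16-bit training still goes through NVIDIA apex) would be:

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=8,
    distributed_backend="ddp",  # per the docs, the backend that works with 16-bit
    precision=16,               # requires apex to be installed in PL 0.7/0.8
    amp_level="O1",
)
```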

I'm working off of transformers from GitHub... so it should be a recent version. If that's not what you're saying, could you please be more specific?

We also don't necessarily "need" Lightning... but it would be great if it worked (with a single set of settings) for multi-GPU. As it is, it's great having reasonable out-of-the-box options for LR schedules, model synchronization, gradient accumulation, and all those other things I've grown tired of implementing for every project.

@moscow25 dp is NOT recommended by PyTorch

https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html

  1. The current base transformer class has a few issues which I've submitted a PR for.
  2. Please let me know what example you are using / what code I can look at to reproduce the issues.

> @sshleifer Can confirm that A) the Lightning examples don't work at all with dp, and B) ddp does run, but needs significant editing.
>
> For the examples I've looked at, it's not as simple as turning ddp on and everything being great. It seems whoever wrote the Lightning examples never tried multi-GPU. Happy to elaborate or share (though mine are not in great shape at the moment).
>
> And ddp_spawn definitely does not work for me. Gives several spawn-based errors -- says my model is not compliant.

ddp doesn't work for me and ddp_spawn gives a lot of errors. When using ddp, no error is shown but nothing starts on the GPU; the notebook cell just stays busy indefinitely. I am using the DistilBertTokenizer and DistilBertModel. Has anyone been able to run PyTorch Lightning on multiple GPUs with DistilBert?

I suspect that your issue is ddp+jupyter rather than DistilBert. Try running your command from the terminal.

> I suspect that your issue is ddp+jupyter rather than DistilBert. Try running your command from the terminal.

Why does running the code in a Jupyter notebook create a problem? I was able to run the Bert models like SequenceClassification in a Jupyter notebook on multiple GPUs without any problem, but I run into this multi-GPU problem when using PyTorch Lightning. It is nice to be able to use PyTorch Lightning given all the built-in options; it makes it easier to build models interactively in a Jupyter notebook.


Looks like ddp doesn't work in a Jupyter notebook, and transformers don't work with the dp backend of PyTorch Lightning either. So it looks like the only way to use PyTorch Lightning, multiple GPUs and transformers together is to run it as a Python script.
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html

> Jupyter Notebooks
> Unfortunately any ddp_ is not supported in jupyter notebooks. Please use dp for multiple GPUs. This is a known Jupyter issue. If you feel like taking a stab at adding this support, feel free to submit a PR!
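
If staying in a notebook is a hard requirement, one hedged way to pick a backend that at least launches is to fall back to dp only when running under Jupyter (checking for ipykernel is a heuristic of mine, not an official Lightning API):

```python
import sys
import pytorch_lightning as pl

# ddp/ddp_spawn need a plain `python script.py` launch; inside Jupyter the only
# multi-GPU backend Lightning offers is dp, with the limitations discussed above.
in_notebook = "ipykernel" in sys.modules  # heuristic, not an official check
trainer = pl.Trainer(
    gpus=2,
    distributed_backend="dp" if in_notebook else "ddp",
)
```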

I believe @nateraw is almost done updating the examples with the latest version of PL.

Can you share the model that does work with multiple GPUs in a Jupyter notebook?

I read somewhere in the PyTorch Lightning documentation about being careful when checkpointing models in DDP mode. I can't find that documentation now, but is there something I need to be careful about when checkpointing while running DDP on a single machine with 8 GPUs? It was something about the model getting split among multiple machines; I'm not sure that applies if DDP is used on a single machine.

nothing you have to worry about... we save the checkpoint correctly automatically
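
For anyone who still wants an explicit checkpoint callback under ddp, a minimal hedged sketch (argument names as in Lightning 0.7/0.8, where the path argument is called filepath):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Lightning writes the checkpoint from rank 0 only, so a single-machine
# 8-GPU ddp run still produces one checkpoint file.
checkpoint_callback = ModelCheckpoint(
    filepath="checkpoints/{epoch}-{val_loss:.2f}",
    monitor="val_loss",
    save_top_k=1,
)
trainer = pl.Trainer(
    gpus=8,
    distributed_backend="ddp",
    checkpoint_callback=checkpoint_callback,
)
```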

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
