Model I am using (Bert, XLNet ...): Bert
Language I am using the model on (English, Chinese ...): English
The problem arises when using: the official example scripts (run_pl_glue.py, launched via examples/glue/run_pl.sh)
The task I am working on is: an official GLUE task
Steps to reproduce the behavior: run the official GLUE example script on multiple GPUs
Expected behavior: GLUE training should happen
transformers version: 2.8.0
I get the below error:
Validation sanity check: 0%| | 0/5 [00:00<?, ?it/s]Traceback (most recent call last):
File "run_pl_glue.py", line 186, in <module>
trainer = generic_train(model, args)
File "/home/jupyter/transformers/examples/transformer_base.py", line 307, in generic_train
trainer.fit(model)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 701, in fit
self.dp_train(model)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 540, in dp_train
self.run_pretrain_routine(model)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 843, in run_pretrain_routine
False)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 262, in _evaluate
output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 430, in evaluation_forward
output = model(*args)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
return self.gather(outputs, self.output_device)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
return gather(outputs, output_device, dim=self.dim)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
for k in out))
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
for k in out))
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
assert all(map(lambda i: i.is_cuda, inputs))
AssertionError
@nateraw @williamFalcon
update to the latest lightning version?
0.7.4rc1
@williamFalcon It doesn't work with lightning versions 0.7.4rc1, 0.7.4rc2, or even 0.7.3 and 0.7.1.
ok, can you share a colab here? happy to take a look
@williamFalcon Thanks. I'm running the code as per the instructions given in https://github.com/huggingface/transformers/tree/master/examples/glue
I didn't make any changes; I just ran the same official example script on multiple GPUs - https://github.com/huggingface/transformers/blob/master/examples/glue/run_pl.sh
It works on CPU and a single GPU, but doesn't work on multiple GPUs.
It is a bit unclear what is going on in there: the bash script installs lightning but the python code doesn't seem to use it?
I am also facing this error, but with a different custom model. My code works properly on a single GPU; however, if I increase the number of GPUs to 2, it gives me the above error. I checked both PL 0.7.3 and 0.7.4rc3.
Update: Interestingly, when I changed distributed_backend to ddp, it worked perfectly without any error. I think there is an issue with the dp distributed_backend.
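For reference, a minimal sketch of that backend switch, assuming the 0.7.x-era Trainer argument name distributed_backend (later releases renamed it):

```python
import pytorch_lightning as pl

# dp (DataParallel) is the backend that triggers the gather AssertionError above:
# trainer = pl.Trainer(gpus=2, distributed_backend="dp")

# ddp (DistributedDataParallel) avoids it:
trainer = pl.Trainer(gpus=2, distributed_backend="ddp")
```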

run_pl.sh runs fine.
I ran it without ANY changes to the file. Did you guys change anything in the file?
@williamFalcon I didn't change anything; I hope you ran it on multiple GPUs. The code seems to run fine with ddp, but not with dp, as mentioned by @mmiakashs.
When I debugged, I found that when using dp (DataParallel) with 8 GPUs, it generates 8 different losses, and since the training_step can't gather 8 losses, it fails with an error like this:
TypeError: zip argument #1 must support iteration
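A minimal sketch of the kind of per-GPU loss reduction dp needs, assuming a LightningModule and a PL version that exposes training_step_end (older 0.7.x releases used a training_end hook); this is illustrative only, not the fix used by the example script:

```python
import pytorch_lightning as pl

class GlueModule(pl.LightningModule):  # illustrative stand-in for the example's BaseTransformer
    def training_step(self, batch, batch_idx):
        loss = self(**batch)[0]  # under dp this runs once per GPU replica
        return {"loss": loss}

    def training_step_end(self, outputs):
        # dp hands back one loss per replica; reduce them to a single scalar
        # so the rest of the training loop sees what it expects.
        return {"loss": outputs["loss"].mean()}
```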
Ummm, yeah not sure... It looks ok to me.


Can you try running dp on 2 GPUs? This test was on 2 GPUs.
It looks like hf sets ddp as the backend, which is great because dp has a bunch of issues (this is a PyTorch problem, not a Lightning one). Both PyTorch and Lightning discourage using dp.
I just ran this with the default ddp and it works well (although the run_pl.sh script has a bunch of usability issues, e.g. I need the data in a different part of the cluster, but the script doesn't handle that, so I had to run from that directory on the cluster. Ideally --data_dir would solve this issue, but it doesn't).
I can confirm that the issue occurs only when using multiple GPUs with dp as the backend. Using ddp solves the issue.
I found one more issue. If I use fast tokenizers with ddp as the backend, I get the error below:
INFO:lightning:GPU available: True, used: True
INFO:lightning:CUDA_VISIBLE_DEVICES: [0,1]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/warnings.py:18: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
warnings.warn(*args, **kwargs)
Traceback (most recent call last):
File "run_pl_glue.py", line 187, in <module>
trainer = generic_train(model, args)
File "/home/jupyter/transformers/examples/transformer_base.py", line 310, in generic_train
trainer.fit(model)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 734, in fit
mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
process.start()
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/opt/conda/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle Tokenizer objects
@leslyarun I am also facing a similar issue with the ddp backend (not exactly the same): github issue
My guess is that there may be an issue with the callback and with saving objects via pickle. For now, I will try to manually save checkpoints without using the callbacks.
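A hedged sketch of two possible workarounds for the pickling error; neither is confirmed in this thread as the official fix, and bert-base-cased is just a placeholder model name:

```python
from transformers import AutoTokenizer

# Option 1: fall back to the slow (pure-Python) tokenizer, which pickles fine
# across mp.spawn (assuming your transformers version accepts use_fast):
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

# Option 2: don't keep the fast tokenizer as LightningModule state at all;
# rebuild it inside each spawned worker (e.g. in prepare_data/setup) so that
# nothing unpicklable has to cross the process boundary.
```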
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@mmiakashs did that end up working?
Currently, I am using ddp_spawn mode and it is working fine.
@sshleifer I can confirm that A) the Lightning examples don't work at all with dp, and B) ddp does run, but needs significant editing.
For the examples I've looked at, it's not as simple as turning ddp on and everything works. It seems whoever wrote the Lightning examples never tried multi-GPU. Happy to elaborate or share (though mine are not in great shape at the moment).
And ddp_spawn definitely does not work for me. It gives several spawn-based errors -- says my model is not compliant.
A) I don't know, but that sounds very likely. @williamFalcon told me "Don't use dp".
B) examples/seq2seq/finetune.py works in multigpu with two caveats:
(a) versions need to be transformers=master, pl=0.8.1.
(b) you cannot pass --do_predict. (pl.Trainer.test is broken for multi-gpu)
For the other two pl examples, ner and glue, I haven't tested multi-GPU, but they should be at least close to working because they inherit from the same BaseTransformer. Which one of those were you trying to run / are you interested in running?
Thanks @sshleifer. We're fine using ddp for everything -- we only need one version to work, not multiple ways to do the same thing. Also, according to the docs, ddp is the only one that works with FP16 anyway (haven't tested yet, will do soon).
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html
I'm working off of transformers from GitHub... so it should be a recent version. If that's not what you are saying, could you please be more specific?
We also don't necessarily "need" Lightning... but it would be great if it worked (with a single set of settings) for multi-GPU. As it is, it's great having reasonable out-of-the-box options for LR schedules, model synchronization, gradient accumulation, and all those other things I've grown tired of implementing for every project.
@moscow25 dp is NOT recommended by PyTorch
https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html

ddp doesn't work for me, and ddp_spawn gives a lot of errors. With ddp, no error is shown, but nothing starts on the GPU - the notebook cell just stays busy indefinitely. I am using the DistilBertTokenizer and DistilBertModel - has anyone been able to run pytorch lightning on multiple GPUs with DistilBert?
I suspect that your issue is ddp+jupyter rather than distillbert. Try running your command from the terminal.
Why does running the code in a Jupyter notebook create a problem? I was able to run the BertModels like SequenceClassification in a Jupyter notebook on multiple GPUs without any problem, but I run into this multi-GPU problem when using pytorch lightning. It is nice to be able to use Pytorch Lightning given all the built-in options; it makes it easier to build models interactively in a Jupyter notebook.
It looks like ddp doesn't work in a Jupyter notebook, and transformers don't work with the dp backend of pytorch lightning in a Jupyter notebook either. So it looks like the only way to use pytorch lightning, multiple GPUs, and transformers together is to run it as a python script.
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html
Jupyter Notebooks
Unfortunately any ddp_ is not supported in jupyter notebooks. Please use dp for multiple GPUs. This is a known Jupyter issue. If you feel like taking a stab at adding this support, feel free to submit a PR!
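To make the quoted docs concrete, a minimal sketch (the argument name assumes a 0.7/0.8-era Trainer; note that dp is exactly the backend this thread reports problems with):

```python
import pytorch_lightning as pl

# dp is the only multi-GPU backend usable from a notebook cell; any ddp_*
# backend has to be launched as a plain script from the terminal.
trainer = pl.Trainer(gpus=2, distributed_backend="dp")
```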
i believe @nateraw is almost done updating the examples with the latest version of PL.
Can you share the model that does work with multiple GPUs in a Jupyter notebook?
I read somewhere in the pytorch lightning documentation about being careful when checkpointing models in DDP mode - I can't find that documentation now, but is there something I need to be careful about when checkpointing while running DDP on a single machine with 8 GPUs? It was something about the model getting split among multiple machines - I'm not sure if that applies when DDP is used on a single machine.
nothing you have to worry about... we save the checkpoint correctly automatically
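For illustration, a sketch of an explicit checkpoint callback under single-machine DDP; the filepath/monitor values are assumptions and the argument names match the 0.7/0.8-era API (Lightning writes the checkpoint from rank 0 only, so no extra handling is needed):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(filepath="checkpoints", monitor="val_loss")
trainer = pl.Trainer(gpus=8, distributed_backend="ddp",
                     checkpoint_callback=checkpoint_callback)
```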
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.