Pytorch-lightning: Infinite hang when running `Trainer.test` after `Trainer.fit` with DDP

Created on 22 Sep 2020 · 8 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

If I run `Trainer.test` after running `Trainer.fit` with `distributed_backend='ddp'`, the program hangs.

To Reproduce

Steps to reproduce the behavior:

Run the following script

# main.py
import os
from argparse import ArgumentParser
from pl_examples.models.lightning_template import LightningTemplateModel
from pytorch_lightning import Trainer, seed_everything

seed_everything(234)


def main(args):
    model = LightningTemplateModel(**vars(args))
    trainer = Trainer.from_argparse_args(args)
    trainer.fit(model)     # if this is commented out then test will complete, otherwise it hangs
    trainer.test(model)


def run_cli():
    root_dir = os.path.dirname(os.path.realpath(__file__))
    parent_parser = ArgumentParser(add_help=False)
    parser = LightningTemplateModel.add_model_specific_args(parent_parser, root_dir)
    parser = Trainer.add_argparse_args(parser)
    parser.set_defaults(gpus=2)
    args = parser.parse_args()

    main(args)


if __name__ == '__main__':
    run_cli()

with command line arguments (assuming >= 2 GPUs)

python main.py --gpus 2 --hidden_dim 500 --max_epochs 1 --distributed_backend ddp

Running this script causes the program to hang during the test phase.

Expected behavior

I would expect Trainer.test to complete rather than hanging.

Environment

Output of collect_env_details.py:

* CUDA:
        - GPU:
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.1
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 0.9.1rc3
        - tqdm:              4.49.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.7.5
        - version:           #51~18.04.1-Ubuntu SMP Sat Sep 5 14:35:50 UTC 2020
  • PyTorch Version: 1.6.0
  • OS: Ubuntu 20.04
  • How you installed PyTorch: pip
  • Build command you used (if compiling from source):
  • Python version: 3.7.5
  • CUDA/cuDNN version: 7.6.5
  • GPU models and configuration: GeForce RTX 2080 Ti (x2)

List of all installed packages (output of pip freeze):

absl-py==0.10.0
cachetools==4.1.1
certifi==2020.6.20
chardet==3.0.4
decorator==4.4.2
fsspec==0.8.2
future==0.18.2
google-auth==1.21.2
google-auth-oauthlib==0.4.1
grpcio==1.32.0
idna==2.10
importlib-metadata==1.7.0
Markdown==3.2.2
networkx==2.5
numpy==1.19.1
oauthlib==3.1.0
packaging==20.4
Pillow==7.2.0
pkg-resources==0.0.0
protobuf==3.13.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==2.4.7
pytorch-lightning==0.9.1rc3
PyYAML==5.3.1
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
six==1.15.0
tensorboard==2.2.0
tensorboard-plugin-wit==1.7.0
torch==1.6.0
torchvision==0.7.0
tqdm==4.49.0
urllib3==1.25.10
Werkzeug==1.0.1
zipp==3.1.0

Additional context

If I comment out trainer.fit then everything works as expected.

I was able to pause execution during the hang while running in PyCharm. The following are the stack frames for the main thread, which is the only thread I could get to pause.

select, selectors.py:418
wait, connection.py:920
_poll, connection.py:414
poll, connection.py:257
get, queues.py:104
_worker_loop, worker.py:167
run, process.py:99
_bootstrap, process.py:297
_launch, popen_fork.py:74
__init__, popen_fork.py:20
_Popen, context.py:277
_Popen, context.py:223
start, process.py:112
__init__, dataloader.py:737
__iter__, dataloader.py:291
run_evaluation, trainer.py:437
run_test, trainer.py:489
train_or_test, base_backend.py:34
ddp_train, ddp_backend.py:243
train, ddp_backend.py:138
fit, trainer.py:324
wrapped_fn, states.py:48
__test_given_model, trainer.py:627
test, trainer.py:564
wrapped_fn, states.py:48
main, main.py:13
run_cli, main.py:24
<module>, main.py:28
Labels: working as intended · bug / fix · duplicate · help wanted

All 8 comments

Hi! Thanks for your contribution, great first issue!

Hi, in ddp mode you can only call trainer.fit / trainer.test once.
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel

There are cases in which it is not possible to use DDP. Examples are:

  • Jupyter Notebook, Google COLAB, Kaggle, etc.
  • You have a nested script without a root package
  • Your script needs to invoke .fit or .test multiple times

You need to switch to ddp_spawn or launch your .test in a separate script, e.g. as in the sketch below.
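For reference, a minimal sketch of the two workarounds (the attribute override, the separate test.py, and the checkpoint path are assumptions for illustration, not taken from the docs):

# Workaround 1: switch to ddp_spawn, which tolerates calling fit and test in the same process
args.distributed_backend = 'ddp_spawn'   # override before building the Trainer
trainer = Trainer.from_argparse_args(args)
trainer.fit(model)
trainer.test(model)

# Workaround 2: keep ddp for training, then run testing in a separate, hypothetical test.py
# that loads the checkpoint written by the fit run and tests it on a single GPU
from pl_examples.models.lightning_template import LightningTemplateModel
from pytorch_lightning import Trainer

model = LightningTemplateModel.load_from_checkpoint('path/to/checkpoint.ckpt')  # example path
trainer = Trainer(gpus=1)
trainer.test(model)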

Interesting. I read this, but to me it seemed to indicate you could call both fit and test as long as neither was called multiple times. Perhaps the documentation could be updated to make this clearer.

yeah granted, this may be a bit ambiguous.

Hi,
I found that DDP doesn't work when I run my Python script as a module (python -m).
Does this match the second limitation ("You have a nested script without a root package")?
I couldn't really understand that one, and I am not sure why DDP cannot work with modules.
When I run the same script not as a module, everything works fine.
I also tested this on a really simple MNIST example.

@AvivWn Because when you run the ddp script, it will call itself again under the hood n-1 times, where n is the number of GPUs you selected. So, for example, if we did this:

python train.py --gpus 4 --something-else --distributed_backend=ddp

this process will launch on gpu 0 and then call

python train.py --gpus "1,"  --something-else --distributed_backend=ddp
python train.py --gpus "2,"  --something-else --distributed_backend=ddp
python train.py --gpus "3,"  --something-else --distributed_backend=ddp

(simplified example)

So if we wanted to support launching the script as a module, we would have to detect the -m ... part of the command and re-append it to the subprocess calls.
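For what it's worth, a rough sketch of what that could look like (this is not Lightning's actual implementation; the __main__.__spec__ check and the helper name are assumptions):

import sys

def build_relaunch_command(extra_args):
    """Rebuild the command used to start this process, preserving a `python -m` launch."""
    import __main__
    spec = getattr(__main__, '__spec__', None)
    if spec is not None and spec.name:
        # started as `python -m some.package.module` -> relaunch the same module
        return [sys.executable, '-m', spec.name] + sys.argv[1:] + extra_args
    # started as `python path/to/script.py` -> relaunch the same file
    return [sys.executable, sys.argv[0]] + sys.argv[1:] + extra_args

# each child process for GPU k would then be started roughly like:
#   subprocess.Popen(build_relaunch_command(['--gpus', f'{k},']))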

@awaelchli Sure.
And would it be impossible to programmatically add the "-m" only if it is specified? It should be just a single if, right?
It is frustrating to learn that a module cannot be run with DDP after I have already built a full working module project.
Transforming it into a script won't be easy...

No, I would not assume it is impossible. If the -m option can be stripped from the argument list, it should not be too difficult.
Very sorry that this causes trouble for you, but it looks like this use case just wasn't considered when the ddp backend was added.
Let's open a feature request for this.

