Pytorch-lightning: Infinite hang when running `Trainer.test` after `Trainer.fit` with DDP

Created on 22 Sep 2020 · 8 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

If I run `Trainer.test` after running `Trainer.fit` with `distributed_backend='ddp'`, the program hangs.

To Reproduce

Steps to reproduce the behavior:

Run the following script

# main.py
import os
from argparse import ArgumentParser
from pl_examples.models.lightning_template import LightningTemplateModel
from pytorch_lightning import Trainer, seed_everything

seed_everything(234)


def main(args):
    model = LightningTemplateModel(**vars(args))
    trainer = Trainer.from_argparse_args(args)
    trainer.fit(model)     # if this is commented out then test will complete, otherwise it hangs
    trainer.test(model)


def run_cli():
    root_dir = os.path.dirname(os.path.realpath(__file__))
    parent_parser = ArgumentParser(add_help=False)
    parser = LightningTemplateModel.add_model_specific_args(parent_parser, root_dir)
    parser = Trainer.add_argparse_args(parser)
    parser.set_defaults(gpus=2)
    args = parser.parse_args()

    main(args)


if __name__ == '__main__':
    run_cli()

with command line arguments (assuming >= 2 GPUs)

python main.py --gpus 2 --hidden_dim 500 --max_epochs 1 --distributed_backend ddp

Running this script causes the program to hang during the test phase.

Expected behavior

I would expect Trainer.test to complete rather than hanging.

Environment

Output of collect_env_details.py:

* CUDA:
        - GPU:
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.1
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 0.9.1rc3
        - tqdm:              4.49.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.7.5
        - version:           #51~18.04.1-Ubuntu SMP Sat Sep 5 14:35:50 UTC 2020
  • PyTorch Version: 1.6.0
  • OS: Ubuntu 20.04
  • How you installed PyTorch: pip
  • Build command you used (if compiling from source):
  • Python version: 3.7.5
  • CUDA/cuDNN version: 7.6.5
  • GPU models and configuration: GeForce RTX 2080 Ti (x2)

List of all installed packages (output of pip freeze):

absl-py==0.10.0
cachetools==4.1.1
certifi==2020.6.20
chardet==3.0.4
decorator==4.4.2
fsspec==0.8.2
future==0.18.2
google-auth==1.21.2
google-auth-oauthlib==0.4.1
grpcio==1.32.0
idna==2.10
importlib-metadata==1.7.0
Markdown==3.2.2
networkx==2.5
numpy==1.19.1
oauthlib==3.1.0
packaging==20.4
Pillow==7.2.0
pkg-resources==0.0.0
protobuf==3.13.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==2.4.7
pytorch-lightning==0.9.1rc3
PyYAML==5.3.1
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
six==1.15.0
tensorboard==2.2.0
tensorboard-plugin-wit==1.7.0
torch==1.6.0
torchvision==0.7.0
tqdm==4.49.0
urllib3==1.25.10
Werkzeug==1.0.1
zipp==3.1.0

Additional context

If I comment out trainer.fit then everything works as expected.

I was able to pause execution during the hang while running in PyCharm. The following are the stack frames for the main thread, which is the only thread I could get to pause.

select, selectors.py:418
wait, connection.py:920
_poll, connection.py:414
poll, connection.py:257
get, queues.py:104
_worker_loop, worker.py:167
run, process.py:99
_bootstrap, process.py:297
_launch, popen_fork.py:74
__init__, popen_fork.py:20
_Popen, context.py:277
_Popen, context.py:223
start, process.py:112
__init__, dataloader.py:737
__iter__, dataloader.py:291
run_evaluation, trainer.py:437
run_test, trainer.py:489
train_or_test, base_backend.py:34
ddp_train, ddp_backend.py:243
train, ddp_backend.py:138
fit, trainer.py:324
wrapped_fn, states.py:48
__test_given_model, trainer.py:627
test, trainer.py:564
wrapped_fn, states.py:48
main, main.py:13
run_cli, main.py:24
<module>, main.py:28
Labels: working as intended · bug / fix · duplicate · help wanted

All 8 comments

Hi! Thanks for your contribution, great first issue!

Hi, in ddp mode you can only call trainer.fit / trainer.test once.
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel

There are cases in which it is not possible to use DDP. Examples are:

  • Jupyter Notebook, Google COLAB, Kaggle, etc.
  • You have a nested script without a root package
  • Your script needs to invoke .fit or .test multiple times

You need to switch to ddp_spawn or launch your .test in a separate script, e.g. as in the sketch below.
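For reference, a minimal sketch of the two workarounds (the attribute override, the separate test.py, and the checkpoint path are assumptions for illustration, not taken from the docs):

# Workaround 1: switch to ddp_spawn, which tolerates calling fit and test in the same process
args.distributed_backend = 'ddp_spawn'   # override before building the Trainer
trainer = Trainer.from_argparse_args(args)
trainer.fit(model)
trainer.test(model)

# Workaround 2: keep ddp for training, then run testing in a separate, hypothetical test.py
# that loads the checkpoint written by the fit run and tests it on a single GPU
from pl_examples.models.lightning_template import LightningTemplateModel
from pytorch_lightning import Trainer

model = LightningTemplateModel.load_from_checkpoint('path/to/checkpoint.ckpt')  # example path
trainer = Trainer(gpus=1)
trainer.test(model)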

Interesting. I read this, but to me it seemed to indicate you could call both fit and test as long as neither was called multiple times. Perhaps the documentation could be updated to make this clearer.

yeah granted, this may be a bit ambiguous.

Hi,
I found that DDP doesn't work when I run my Python script as a module (python -m).
Does this match the second limitation ("You have a nested script without a root package")?
I couldn't really understand that one, and I am not sure why DDP cannot work with modules.
When I run the same script not as a module, everything works fine.
I also tested this on a really simple MNIST example.

@AvivWn Because when you run the ddp script, it will call itself again under the hood n-1 times, where n is the number of GPUs you selected. So, for example, if we did this:

python train.py --gpus 4 --something-else --distributed_backend=ddp

this process will launch on gpu 0 and then call

python train.py --gpus "1,"  --something-else --distributed_backend=ddp
python train.py --gpus "2,"  --something-else --distributed_backend=ddp
python train.py --gpus "3,"  --something-else --distributed_backend=ddp

(simplified example)

So if we wanted to support launching the script as a module, we would have to detect the -m ... part of the command and re-append it to the subprocess calls.
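For what it's worth, a rough sketch of what that could look like (this is not Lightning's actual implementation; the __main__.__spec__ check and the helper name are assumptions):

import sys

def build_relaunch_command(extra_args):
    """Rebuild the command used to start this process, preserving a `python -m` launch."""
    import __main__
    spec = getattr(__main__, '__spec__', None)
    if spec is not None and spec.name:
        # started as `python -m some.package.module` -> relaunch the same module
        return [sys.executable, '-m', spec.name] + sys.argv[1:] + extra_args
    # started as `python path/to/script.py` -> relaunch the same file
    return [sys.executable, sys.argv[0]] + sys.argv[1:] + extra_args

# each child process for GPU k would then be started roughly like:
#   subprocess.Popen(build_relaunch_command(['--gpus', f'{k},']))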

@awaelchli Sure.
And would it be impossible to programmatically add the "-m" only if it is specified? It should be just a single if, right?
It is frustrating to learn that a module cannot be run with DDP after I have already built a full working module project.
Transforming it into a script won't be easy...

No, I would not assume it is impossible. If the -m option can be stripped from the argument list, it should not be too difficult.
Very sorry that this causes trouble for you, but it looks like this use case just wasn't considered when the ddp backend was added.
Let's open a feature request for this.

