Unable to start single node ddp training on 0.8.0
Was going to run the gpu_template, but... (#2235)
Both ways of running the template fail with the same error:
$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp_spawn
$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 860, in fit
    self.barrier('fit_prepare_data')
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1261, in barrier
    torch_distrib.barrier()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1484, in barrier
    _check_default_pg()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 187, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
can you post code to reproduce? just a minimal example that breaks
BTW, the GPU template is fixed...
done, let me post my env as well
ok wait... i think i see it. one sec
I just tested the merged changes with both ddp and ddp_spawn and got this again (two processes crash and their tracebacks come out interleaved on stderr; untangled below):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 891, in fit
    self.spawn_ddp_children(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 441, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 479, in ddp_train
    self.setup()
TypeError: setup() missing 1 required positional argument: 'stage'

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 907, in fit
    self.ddp_train(task, model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 479, in ddp_train
    self.setup()
TypeError: setup() missing 1 required positional argument: 'stage'
try again. that was a typo
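(Side note for anyone overriding the hook in their own module: since 0.8.0 the setup/teardown hooks receive a stage argument, 'fit' or 'test', so a custom override needs the matching signature. A minimal sketch; the class name is illustrative:)

import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def setup(self, stage):
        # called by the trainer with stage == 'fit' or stage == 'test'
        if stage == 'fit':
            pass  # e.g. prepare training/validation data here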
cheers, works now!
Still having the Default process group is not initialized issue when using trainer.test
I still have this bug as well. One temporary workaround is to create a new single-GPU trainer just for the test, like:
trainer = Trainer(gpus=1, deterministic=True, logger=logger)
trainer.model = model
trainer.test()
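Expanded into a self-contained sketch of the same workaround (MyLightningModel and the checkpoint path are placeholders for your own):

import pytorch_lightning as pl

# hypothetical names -- substitute your own LightningModule and checkpoint
model = MyLightningModel.load_from_checkpoint("path/to/checkpoint.ckpt")

# a fresh single-GPU trainer never touches the default process group
test_trainer = pl.Trainer(gpus=1, deterministic=True)
test_trainer.test(model)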
Right, I know it works on a single GPU. I have a large test set, though, and ideally want faster inference across multiple GPUs.
Can we re-open this issue? I am still having the Default process group is not initialized issue when I hit trainer.test() with ddp (with any number of gpus, even 1). I'm using the latest release from yesterday.
+1, doesn't look like the issue is resolved yet.
Having the same problem..... I also tried downgrading PL to an older version like 0.7.5 and running inference with that, but a model trained and saved with 0.8.x doesn't seem to be directly compatible with older versions.
Version 0.8.4: trained with ddp, got "Default process group is not initialized" when running trainer.test().
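For context, the failing pattern boils down to something like this (GPU count is illustrative, and MyLightningModel stands in for any LightningModule):

import pytorch_lightning as pl

model = MyLightningModel()  # placeholder for your own module

trainer = pl.Trainer(gpus=2, distributed_backend='ddp')
trainer.fit(model)  # training completes fine
trainer.test()      # AssertionError: Default process group is not initialized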
could you try master? this is fixed there
Just tried it, it works fine now! Thank you!
@williamFalcon Trying 0.8.5. Trained with ddp and tested with ddp, but got the following error message:
AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
Any idea?
Thanks!
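For what it's worth, that assertion is raised by PyTorch's DistributedDataParallel wrapper itself whenever no parameter of the wrapped module requires a gradient. A fully frozen model (e.g. everything set to requires_grad=False for inference) is one plausible trigger; a minimal sketch of the condition:

import torch

model = torch.nn.Linear(4, 2)

# freezing every parameter, e.g. for inference...
for p in model.parameters():
    p.requires_grad = False

# ...leaves DDP nothing to synchronize, so wrapping the module
#   torch.nn.parallel.DistributedDataParallel(model)
# fails with: AssertionError: DistributedDataParallel is not needed when
# a module doesn't have any parameter that requires a gradient.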