Pytorch-lightning: Single node DDP: "Default process group is not initialized"

Created on 19 Jun 2020 · 16 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

Unable to start single node ddp training on 0.8.0

To Reproduce

I was going to run the gpu_template, but... #2235.
Both methods of running the template result in the same error:

$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp_spawn
$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 860, in fit
    self.barrier('fit_prepare_data')
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1261, in barrier
    torch_distrib.barrier()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1484, in barrier
    _check_default_pg()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 187, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
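
For reference, the assertion comes straight from torch.distributed: barrier() requires the default process group, which only init_process_group() creates, and the fit() path in the traceback is evidently reaching the barrier before that setup happens. A minimal standalone sketch (single process, gloo backend, world_size=1; not from this issue) showing the same call once the group exists:

import os
import torch.distributed as dist

# torch.distributed collectives such as barrier() need the default process
# group created by init_process_group(); calling dist.barrier() without it
# reproduces "AssertionError: Default process group is not initialized".
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

dist.barrier()                 # fine once the default group is initialized
dist.destroy_process_group()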
Labels: bug / fix, help wanted

Most helpful comment

+1, doesn't look like the issue is resolved yet.

All 16 comments

Can you post code to reproduce? Just a minimal example that breaks.

BTW, the GPU template is fixed...

Done, let me post my env as well.

OK wait... I think I see it. One sec.

I just tested the merged changes with both ddp and ddp_spawn and got this again:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 891, in fit
    self.ddp_train(task, model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 479, in ddp_train
    self.setup()
TypeError: setup() missing 1 required positional argument: 'stage'

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 907, in fit
    self.spawn_ddp_children(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 441, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 479, in ddp_train
    self.setup()
TypeError: setup() missing 1 required positional argument: 'stage'

Try again. That was a typo.

Cheers, works now!

Still having the "Default process group is not initialized" issue when using trainer.test().

I still have this bug as well. One temporary solution is creating a new single-GPU trainer to do the test.

Like:

from pytorch_lightning import Trainer

# build a fresh single-GPU trainer just for testing
trainer = Trainer(gpus=1, deterministic=True, logger=logger)
trainer.model = model
trainer.test()
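
For what it's worth, a slightly more explicit variant of the same workaround (CoolSystem and best.ckpt are placeholders, not from this thread): reload the DDP-trained weights from a checkpoint and pass the model directly to a fresh single-GPU Trainer's test():

from pytorch_lightning import Trainer

# Placeholders: CoolSystem is your LightningModule, best.ckpt your saved checkpoint.
model = CoolSystem.load_from_checkpoint("best.ckpt")
test_trainer = Trainer(gpus=1, deterministic=True)
test_trainer.test(model)   # passing the model explicitly avoids relying on trainer.model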

Right, I know it works on a single GPU. I have a large test set and ideally want faster inference using multiple GPUs.

Can we re-open this issue? I am still having the "Default process group is not initialized" issue when I hit trainer.test() with ddp (with any number of GPUs, even 1). I'm using the latest release from yesterday.

+1, doesn't look like the issue is resolved yet.

Having the same problem. I also tried downgrading PL to an older version, like 0.7.5, and using that version to do the inference, but a model trained and saved with 0.8.x does not seem to be directly compatible with the older version.

Version 0.8.4, trained with ddp: got "Default process group is not initialized" when running trainer.test().

Could you try master? This is fixed there.
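
One way to grab it, assuming the default branch is still named master:

$ pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git@master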

Just tried it, it works fine now! Thank you!

@williamFalcon Trying 0.8.5

Trained with ddp and tested with ddp, but got the following error message:

AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.

Any idea?

Thanks!
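
For context, that assertion comes from torch.nn.parallel.DistributedDataParallel itself: on PyTorch builds of that era it fires when the wrapped module has no parameter with requires_grad=True, e.g. if the model was frozen before testing. A minimal standalone sketch (single process, gloo backend; not from this thread) that reproduces the message:

import os
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = nn.Linear(4, 4)
for p in model.parameters():
    p.requires_grad_(False)    # freeze everything, as an eval-only model might be

# On PyTorch builds of that era this raises:
# AssertionError: DistributedDataParallel is not needed when a module
# doesn't have any parameter that requires a gradient.
DistributedDataParallel(model)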
