Unable to start single node ddp training on 0.8.0
Was going to run the gpu_template, but... (#2235)
Both ways of running the template fail with the same error:
$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp_spawn
$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 860, in fit
    self.barrier('fit_prepare_data')
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1261, in barrier
    torch_distrib.barrier()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1484, in barrier
    _check_default_pg()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 187, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
can you post code to reproduce? just a minimal example that breaks
BTW, the GPU template is fixed...
done, let me post my env as well
ok wait... i think i see it. one sec
I just tested the merged changes with both ddp and ddp_spawn and got this again (two processes crash and their tracebacks come out interleaved on stderr; untangled below):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 891, in fit
    self.spawn_ddp_children(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 441, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 479, in ddp_train
    self.setup()
TypeError: setup() missing 1 required positional argument: 'stage'

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 907, in fit
    self.ddp_train(task, model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 479, in ddp_train
    self.setup()
TypeError: setup() missing 1 required positional argument: 'stage'
try again. that was a typo
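(Side note for anyone overriding the hook in their own module: since 0.8.0 the setup/teardown hooks receive a stage argument, 'fit' or 'test', so a custom override needs the matching signature. A minimal sketch; the class name is illustrative:)

import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def setup(self, stage):
        # called by the trainer with stage == 'fit' or stage == 'test'
        if stage == 'fit':
            pass  # e.g. prepare training/validation data here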
cheers, works now!
Still having the Default process group is not initialized issue when using trainer.test
I still have this bug as well. One temporary workaround is to create a new single-GPU trainer just for the test, like:
trainer = Trainer(gpus=1, deterministic=True, logger=logger)
trainer.model = model
trainer.test()
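Expanded into a self-contained sketch of the same workaround (MyLightningModel and the checkpoint path are placeholders for your own):

import pytorch_lightning as pl

# hypothetical names -- substitute your own LightningModule and checkpoint
model = MyLightningModel.load_from_checkpoint("path/to/checkpoint.ckpt")

# a fresh single-GPU trainer never touches the default process group
test_trainer = pl.Trainer(gpus=1, deterministic=True)
test_trainer.test(model)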
Right, I know it works on a single GPU. I have a large test set, though, and ideally want faster inference across multiple GPUs.
Can we re-open this issue? I am still having the Default process group is not initialized issue when I hit trainer.test() with ddp (with any number of gpus, even 1). I'm using the latest release from yesterday.
+1, doesn't look like the issue is resolved yet.
Having the same problem..... I also tried downgrading PL to an older version like 0.7.5 and running inference with that, but a model trained and saved with 0.8.x doesn't seem to be directly compatible with older versions.
Version 0.8.4: trained with ddp, got "Default process group is not initialized" when running trainer.test().
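For context, the failing pattern boils down to something like this (GPU count is illustrative, and MyLightningModel stands in for any LightningModule):

import pytorch_lightning as pl

model = MyLightningModel()  # placeholder for your own module

trainer = pl.Trainer(gpus=2, distributed_backend='ddp')
trainer.fit(model)  # training completes fine
trainer.test()      # AssertionError: Default process group is not initialized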
could you try master? this is fixed there
Just tried it, it works fine now! Thank you!
@williamFalcon Trying 0.8.5. Trained with ddp and tested with ddp, but got the following error message:
AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
Any idea?
Thanks!
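For what it's worth, that assertion is raised by PyTorch's DistributedDataParallel wrapper itself whenever no parameter of the wrapped module requires a gradient. A fully frozen model (e.g. everything set to requires_grad=False for inference) is one plausible trigger; a minimal sketch of the condition:

import torch

model = torch.nn.Linear(4, 2)

# freezing every parameter, e.g. for inference...
for p in model.parameters():
    p.requires_grad = False

# ...leaves DDP nothing to synchronize, so wrapping the module
#   torch.nn.parallel.DistributedDataParallel(model)
# fails with: AssertionError: DistributedDataParallel is not needed when
# a module doesn't have any parameter that requires a gradient.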