I'm trying to run PyTorch Lightning with DDP on 2 GPUs. Running with one GPU works fine, and running with fp16 or full precision makes no difference: the error is the same either way (see the stack trace at the end of the post). I also tried ddp2 and dp, but both of those fail with a different error.
I'm not sure what's going on. Let me know what I can do to diagnose.
I'm running my code on a cluster where each GPU is locked to one process (exclusive process compute mode). I'm using NCCL version 2.4.8.
I tried pytorch-lightning versions 0.9.0, 0.9.1rc4, and 0.10.0rc1; all of them result in the same error. I'm running PyTorch 1.6.
I expected training to start and run smoothly on both GPUs.
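For reference, the run boils down to roughly this (MyModel is a placeholder for my actual LightningModule; the relevant part is the Trainer configuration):

```python
import pytorch_lightning as pl

model = MyModel()  # placeholder for my actual LightningModule

trainer = pl.Trainer(
    gpus=2,
    distributed_backend="ddp",  # "ddp2" and "dp" fail with a different error
    precision=16,               # same error with the default 32-bit precision
)
trainer.fit(model)
```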
Stack trace and error:
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO:lightning:initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
INFO:lightning:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Using native 16bit precision.
INFO:lightning:Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
INFO:lightning:initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
lo-s4-039:21587:21587 [0] NCCL INFO Bootstrap : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21587:21587 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
lo-s4-039:21587:21587 [0] NCCL INFO NET/IB : No device found.
lo-s4-039:21587:21587 [0] NCCL INFO NET/Socket : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
NCCL version 2.4.8+cuda10.1
lo-s4-039:21614:21614 [1] NCCL INFO Bootstrap : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21614:21614 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
lo-s4-039:21614:21614 [1] NCCL INFO NET/IB : No device found.
lo-s4-039:21614:21614 [1] NCCL INFO NET/Socket : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21587:21646 [0] NCCL INFO Setting affinity for GPU 0 to 1fd001fd
lo-s4-039:21614:21647 [1] NCCL INFO Setting affinity for GPU 1 to 1fd001fd
lo-s4-039:21587:21646 [0] NCCL INFO Channel 00 : 0 1
lo-s4-039:21587:21646 [0] NCCL INFO Ring 00 : 0[1] -> 1[2] via P2P/IPC
lo-s4-039:21614:21647 [1] NCCL INFO Ring 00 : 1[2] -> 0[1] via P2P/IPC
lo-s4-039:21587:21646 [0] transport/p2p.cc:574 NCCL WARN failed to open CUDA IPC handle : 711 peer mapping resources exhausted
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:669 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:815 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:951 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO misc/group.cc:69 -> 1 [Async thread]
lo-s4-039:21614:21647 [1] transport/p2p.cc:574 NCCL WARN failed to open CUDA IPC handle : 711 peer mapping resources exhausted
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:669 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:815 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:951 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO misc/group.cc:69 -> 1 [Async thread]
Both ranks raise the same error:

Traceback (most recent call last):
  File "/cluster/home/user/tracking/tools/lightning.py", line 514, in <module>
    trainer.fit(model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 451, in fit
    results = self.accelerator_backend.train()
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 140, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 266, in ddp_train
    model = model.configure_ddp(model, device_ids)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 954, in configure_ddp
    model, device_ids=device_ids, find_unused_parameters=True
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
    self.broadcast_bucket_size)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
Hi! Thanks for your contribution, great first issue!
Mind upgrading to 1.0.2? And try to reproduce using this model -> https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py
@kekeblom how do you launch your script? How can I reproduce it with the bug report template?
> I'm running my code on a cluster where each gpu is locked to one process.

That should be fine; ddp launches one process per GPU.
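If you want to double-check that mapping, here is a small sketch (assuming Lightning's ddp sets LOCAL_RANK in the environment of each process it spawns, which it should in these versions; on_train_start is just a convenient hook to paste this into):

```python
import os
import torch

def on_train_start(self):
    # Paste into the LightningModule: under ddp, each of the two
    # processes prints a distinct pid, LOCAL_RANK, and CUDA device.
    print(f"pid={os.getpid()} "
          f"LOCAL_RANK={os.environ.get('LOCAL_RANK', '0')} "
          f"device={torch.cuda.current_device()}")
```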
Actually, I haven't seen this happen again, so I can't reproduce. It might have been the version, or it's somehow related to which GPU gets scheduled.
Maybe it's worth closing the issue. I'll let you know if I re-encounter the problem.
Speak of the devil. @awaelchli
Tried it on 1.0.3 with 2 GPUs. I modified bug_report_model.py to have gpus=2, distributed_backend="ddp" in the Trainer initializer arguments.
It seems to happen only on machines with GTX 1080 GPUs; machines with GTX 1080 Ti or RTX 2080 cards do not appear to suffer from this issue.
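Since it seems to depend on the card, I wonder whether the P2P/IPC path is what breaks (the NCCL warning above says "failed to open CUDA IPC handle : 711 peer mapping resources exhausted"). A small diagnostic sketch, using the standard torch.cuda peer-access query, that one could run on the 1080 machine vs. the others:

```python
import torch

# Print whether each GPU pair reports peer (P2P) access capability.
n = torch.cuda.device_count()
for a in range(n):
    for b in range(n):
        if a != b:
            print(f"peer access {a} -> {b}: "
                  f"{torch.cuda.can_device_access_peer(a, b)}")
```

If P2P/IPC is indeed the culprit, setting NCCL's documented NCCL_P2P_DISABLE=1 environment variable before launching forces NCCL off the P2P path and would be a cheap way to test the theory.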
Here is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:0D:00.0 Off |                  N/A |
| 28%   29C    P8     5W / 180W |      0MiB /  8119MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:0F:00.0 Off |                  N/A |
| 28%   31C    P8     5W / 180W |      0MiB /  8119MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
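Note the Compute M. column: both cards are in E. Process (exclusive process) mode, which matches the "each gpu is locked to one process" setup I mentioned earlier. A quick way to query that mode from a job, as a sketch (compute_mode is one of nvidia-smi's documented --query-gpu fields):

```python
import subprocess

# Print index, name, and compute mode for every visible GPU.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,name,compute_mode", "--format=csv"],
).decode())
```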
Here is the stack trace (both ranks print the same one):

Traceback (most recent call last):
  File "/cluster/home/user/bug_report.py", line 135, in <module>
    run_test()
  File "/cluster/home/user/bug_report.py", line 130, in run_test
    trainer.fit(model, train_data, val_data)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit
    results = self.accelerator_backend.train()
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 272, in ddp_train
    model = self.configure_ddp(model, device_ids)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 290, in configure_ddp
    model, device_ids=device_ids, find_unused_parameters=True
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
    self.broadcast_bucket_size)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8