Pytorch-lightning: NCCL error when using ddp with 2 gpus

Created on 5 Oct 2020  ·  5 Comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I am trying to run PyTorch Lightning with ddp on 2 GPUs. Running with a single GPU works fine. Using fp16 or not makes no difference; the error is the same either way (see the stack trace at the end of the post). I also tried ddp2 and dp, but both of those fail with a different error.

To Reproduce

Not sure. Let me know what I can do to diagnose.

I'm running my code on a cluster where each GPU is locked to one process. I'm using NCCL version 2.4.8.

I tried pytorch-lightning versions 0.9.0, 0.9.1rc4, and 0.10.0rc1; all of them result in the same error. I'm running PyTorch 1.6.
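
For concreteness, here is a minimal sketch of roughly how the training run is launched. Everything in it is a placeholder except the Trainer flags, which match the description above (2 GPUs, ddp, native fp16); my actual model and data are more involved.

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data -- the real model is not relevant to the error.
class PlaceholderModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

def train_dataloader():
    x = torch.randn(64, 32)
    y = torch.randint(0, 2, (64,))
    return DataLoader(TensorDataset(x, y), batch_size=8)

if __name__ == "__main__":
    model = PlaceholderModel()
    # These Trainer arguments are the ones described above: 2 GPUs, ddp, native 16-bit.
    trainer = pl.Trainer(gpus=2, distributed_backend="ddp", precision=16, max_epochs=1)
    trainer.fit(model, train_dataloader())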

Expected behavior

I expected training to start and run smoothly on both GPUs.

Environment

  • CUDA:
    - GPU:
    - GeForce GTX 1080
    - GeForce GTX 1080
    - available: True
    - version: 10.1
  • Packages:
    - numpy: 1.18.1
    - pyTorch_debug: False
    - pyTorch_version: 1.6.0
    - pytorch-lightning: 0.10.0rc1
    - tqdm: 4.46.1
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor:
    - python: 3.7.7
    - version: #1 SMP Tue May 12 16:57:42 UTC 2020

Additional context

Stack trace and error:

initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO:lightning:initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
INFO:lightning:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Using native 16bit precision.
INFO:lightning:Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
INFO:lightning:initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
lo-s4-039:21587:21587 [0] NCCL INFO Bootstrap : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21587:21587 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
lo-s4-039:21587:21587 [0] NCCL INFO NET/IB : No device found.
lo-s4-039:21587:21587 [0] NCCL INFO NET/Socket : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
NCCL version 2.4.8+cuda10.1
lo-s4-039:21614:21614 [1] NCCL INFO Bootstrap : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21614:21614 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
lo-s4-039:21614:21614 [1] NCCL INFO NET/IB : No device found.
lo-s4-039:21614:21614 [1] NCCL INFO NET/Socket : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21587:21646 [0] NCCL INFO Setting affinity for GPU 0 to 1fd001fd
lo-s4-039:21614:21647 [1] NCCL INFO Setting affinity for GPU 1 to 1fd001fd
lo-s4-039:21587:21646 [0] NCCL INFO Channel 00 :    0   1
lo-s4-039:21587:21646 [0] NCCL INFO Ring 00 : 0[1] -> 1[2] via P2P/IPC
lo-s4-039:21614:21647 [1] NCCL INFO Ring 00 : 1[2] -> 0[1] via P2P/IPC

lo-s4-039:21587:21646 [0] transport/p2p.cc:574 NCCL WARN failed to open CUDA IPC handle : 711 peer mapping resources exhausted
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:669 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:815 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:951 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO misc/group.cc:69 -> 1 [Async thread]

lo-s4-039:21614:21647 [1] transport/p2p.cc:574 NCCL WARN failed to open CUDA IPC handle : 711 peer mapping resources exhausted
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:669 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:815 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:951 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO misc/group.cc:69 -> 1 [Async thread]
Traceback (most recent call last):
  File "tools/lightning.py", line 514, in <module>
Traceback (most recent call last):
  File "/cluster/home/user/tracking/tools/lightning.py", line 514, in <module>
    trainer.fit(model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 451, in fit
    results = self.accelerator_backend.train()
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 140, in train
    trainer.fit(model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 451, in fit
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 266, in ddp_train
    model = model.configure_ddp(model, device_ids)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 954, in configure_ddp
    results = self.accelerator_backend.train()
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 140, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 266, in ddp_train
    model, device_ids=device_ids, find_unused_parameters=True
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
    model = model.configure_ddp(model, device_ids)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 954, in configure_ddp
    self.broadcast_bucket_size)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    model, device_ids=device_ids, find_unused_parameters=True
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
    self.broadcast_bucket_size)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
Labels: DDP, Priority: P0, bug / fix

All 5 comments

Hi! Thanks for your contribution, great first issue!

Mind upgrading to 1.0.2? And try to reproduce using this model: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py

@kekeblom how do you launch your script? How do I reproduce it with the bug report template?

I'm running my code on a cluster where each gpu is locked to one process.

That should be fine; ddp launches one process per GPU.

Actually, I haven't seen this happen again, so I can't reproduce it. It might have been the version, or it might somehow be related to the GPU that gets scheduled.

Maybe it's worth closing the issue. I'll let you know if I re-encounter the problem.

Speak of the devil. @awaelchli

Tried it on 1.0.3 with 2 GPUs. I modified bug_report_model.py to pass gpus=2, distributed_backend="ddp" in the Trainer initializer arguments.
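
In other words, the only change to the example was adding two arguments to the Trainer; sketch below, with a placeholder standing in for whatever else the template already passes.

from pytorch_lightning import Trainer

# Only gpus=2 and distributed_backend="ddp" were added; max_epochs=1 is a
# placeholder for the template's existing arguments, which are left unchanged.
trainer = Trainer(
    max_epochs=1,
    gpus=2,
    distributed_backend="ddp",
)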

It seems to happen only on machines with GTX 1080 GPUs. Machines with GTX 1080 Ti or RTX 2080 GPUs do not appear to suffer from this issue.

Here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:0D:00.0 Off |                  N/A |
| 28%   29C    P8     5W / 180W |      0MiB /  8119MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:0F:00.0 Off |                  N/A |
| 28%   31C    P8     5W / 180W |      0MiB /  8119MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here is the stack trace:

Traceback (most recent call last):                                                                                                                            
  File "/cluster/home/user/bug_report.py", line 135, in <module>                                                                                              
Traceback (most recent call last):                                                                                                                            
  File "/cluster/home/user/bug_report.py", line 135, in <module>                                                                                              
    run_test()                                                                                                                                                
  File "/cluster/home/user/bug_report.py", line 130, in run_test                                                                                              
    run_test()                                                                                                                                                
  File "/cluster/home/user/bug_report.py", line 130, in run_test                                                                                              
    trainer.fit(model, train_data, val_data)                                                                                                                  
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit                          
    trainer.fit(model, train_data, val_data)                                                                                                                  
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit                          
    results = self.accelerator_backend.train()                                                                                                                
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train           
    results = self.accelerator_backend.train()                                                                                                                
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train           
    results = self.ddp_train(process_idx=self.task_idx, model=model)                                                                                          
    results = self.ddp_train(process_idx=self.task_idx, model=model)                                                                                          
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 272, in ddp_train       
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 272, in ddp_train       
    model = self.configure_ddp(model, device_ids)                                                                                                             
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 290, in configure_ddp   
    model = self.configure_ddp(model, device_ids)                                                                                                             
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 290, in configure_ddp   
    model, device_ids=device_ids, find_unused_parameters=True                                                                                                 
    model, device_ids=device_ids, find_unused_parameters=True                                                                                                 
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__                         
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__                         
    self.broadcast_bucket_size)                                                                                                                               
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced 
    self.broadcast_bucket_size)                                                                                                                               
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced 
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)                                                                                       
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)                                                                                       
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
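
Given that nvidia-smi reports both GPUs in "E. Process" (exclusive process) compute mode and the NCCL warning fails while opening a CUDA IPC handle with "peer mapping resources exhausted", a quick check of peer-to-peer access between the two devices might help narrow this down. This is only a diagnostic sketch, not a confirmed fix.

import torch

# Diagnostic: does CUDA report peer access between GPU 0 and GPU 1 on this node?
print("device count:", torch.cuda.device_count())
print("0 -> 1 peer access:", torch.cuda.can_device_access_peer(0, 1))
print("1 -> 0 peer access:", torch.cuda.can_device_access_peer(1, 0))

# If peer access is reported but opening the CUDA IPC handle still fails (as in the
# log above), one workaround worth trying is to disable NCCL's P2P transport so it
# falls back to shared memory / sockets, e.g.:
#   NCCL_P2P_DISABLE=1 python bug_report.py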