environment:
Ubuntu 16.04
PyTorch 1.0.0.dev20181104
CUDA 9.0
zyserver01@gds:/data/maskrcnn-benchmark$ python3.5 -m torch.distributed.launch --nproc_per_node=$NGPUS ./tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1
Traceback (most recent call last):
File "./tools/train_net.py", line 173, in
main()
File "./tools/train_net.py", line 142, in main
backend="nccl", init_method="env://"
File "/home/zyserver01/.local/lib/python3.5/site-packages/torch/distributed/deprecated/__init__.py", line 101, in init_process_group
group_name, rank)
RuntimeError: the distributed NCCL backend is not available; try to recompile the THD package with CUDA and NCCL 2+ support at /pytorch/torch/lib/THD/process_group/General.cpp:20
+++++++++++++++++++++++++++++++++++++
I installed PyTorch without conda, following https://pytorch.org/get-started/locally/ :
pip install numpy torchvision_nightly
pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu90/torch_nightly.html
Finally, I tried installing PyTorch 1.0 with conda. The install succeeded, but the error above still occurred.
I checked NCCL and it is installed correctly.
I don't know how to solve this, please help.
Training works normally with one GPU, but the error occurs with 2 GPUs.
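As a quick sanity check (a minimal sketch; the `nccl_status` helper name is my own, not part of PyTorch), you can ask the PyTorch build itself whether its distributed package was compiled with NCCL support, independently of the system-wide NCCL install:

```python
import torch
import torch.distributed as dist

def nccl_status():
    """Report whether this PyTorch build can use the NCCL backend."""
    return {
        "cuda_available": torch.cuda.is_available(),
        "distributed_available": dist.is_available(),
        # is_nccl_available() is False when the wheel was built without NCCL,
        # even if NCCL is correctly installed on the system.
        "nccl_available": dist.is_available() and dist.is_nccl_available(),
    }

print(nccl_status())
```

If `nccl_available` is False here, the problem is in the PyTorch binary, not in your NCCL installation.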
Did you try creating a new conda environment and installing everything following the instructions in INSTALL.md?
I've been dealing with this problem for two days; it's actually an issue with the pytorch-nightly builds since 1.0.0.dev20181102. Some recent changes in pytorch broke the torch.nn.parallel.deprecated module when using the NCCL backend. Since this is already a deprecated module, I didn't dig further into which exact commit caused it.
@ds-gong You may use a pytorch-nightly build earlier than 1.0.0.dev20181102 (e.g. 1.0.0.dev20181029) for now. @fmassa In the meantime, since https://github.com/pytorch/pytorch/pull/13248 is merged, do you recommend using torch.nn.parallel.DistributedDataParallel directly?
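For reference, a minimal sketch of using the non-deprecated torch.nn.parallel.DistributedDataParallel directly. This runs a single process on CPU with the gloo backend purely for illustration; in real multi-GPU training, torch.distributed.launch sets RANK/WORLD_SIZE for each process and you would pass backend="nccl". The toy Linear model and the `run_step` helper are stand-ins, not part of this repo:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run_step():
    """One forward pass through a DDP-wrapped toy model (single process, CPU)."""
    # torch.distributed.launch normally provides these; set them for a
    # self-contained single-process run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    if not dist.is_initialized():
        dist.init_process_group(backend="gloo", rank=0, world_size=1)

    model = torch.nn.Linear(4, 2)   # stand-in for the real Mask R-CNN model
    ddp_model = DDP(model)          # gradients are all-reduced across ranks
    return ddp_model(torch.randn(3, 4))

print(run_step().shape)  # torch.Size([3, 2])
```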
@JiamingSuen thanks for the info, we should fix the build with distributed.deprecated.
About moving to the new c10d backend for distributed, this can be a possibility but I haven't tried using it yet, so I'm not sure if it works in all the cases / doesn't deadlock.
I'm busy this week with other things so I won't have time to test out the c10d backend, but let me ping @teng-li and @pietern so that they are aware that torch.nn.parallel.deprecated is not working properly with NCCL for the nightlies.
Thanks, I will try an earlier version.
Indeed, using an earlier version solved it; training works normally now.
@ds-gong pytorch master should be fixed
@ds-gong please help me, how exactly do you install that previous pytorch version?
@KorovkoAlexander In a conda env, rebuild PyTorch from source after checking out the 2018-10-29 version:
git checkout 74ac86d2fedde7fb55cc8feca000b7a3af1c20db
https://github.com/pytorch/pytorch/tree/74ac86d2fedde7fb55cc8feca000b7a3af1c20db
@KorovkoAlexander conda install -f pytorch-nightly==1.0.0.dev20181029 -c pytorch
For those that face this issue:
please try updating PyTorch to a nightly version from today or tomorrow.
This issue should have been fixed by https://github.com/pytorch/pytorch/pull/13653
Thanks @teng-li !
Hi,
I faced a similar issue while using PyTorch 1.6.
Traceback (most recent call last):
File "tools/run_net.py", line 42, in
main()
File "tools/run_net.py", line 23, in main
launch_job(cfg=cfg, init_method=args.init_method, func=train)
File "/home/sonu/src/SlowFast/slowfast/utils/misc.py", line 285, in launch_job
daemon=daemon,
File "/home/sonu/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/sonu/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/sonu/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
Process 5 terminated with the following error:
Traceback (most recent call last):
File "/home/sonu/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/sonu/src/SlowFast/slowfast/utils/multiprocessing.py", line 47, in run
raise e
File "/home/sonu/src/SlowFast/slowfast/utils/multiprocessing.py", line 44, in run
rank=rank,
File "/home/sonu/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 432, in init_process_group
timeout=timeout)
File "/home/sonu/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 508, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in
How do I solve this? I thought the issue was fixed in torch 1.5. I am using 1.6 - was the fix applied only to 1.5?
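Note that on PyTorch 1.6 this error usually means the installed wheel itself (e.g. a CPU-only or Windows build) was compiled without NCCL rather than a regression of the old bug. As a hedged workaround sketch (the `pick_backend` helper is hypothetical, not a library function), you can select NCCL only when the installed build actually provides it and fall back to the slower but universally available gloo backend:

```python
import torch
import torch.distributed as dist

def pick_backend():
    """Return "nccl" when this build and machine support it, else "gloo"."""
    if torch.cuda.is_available() and dist.is_nccl_available():
        return "nccl"
    # Builds without NCCL raise "Distributed package doesn't have NCCL built in"
    # when backend="nccl" is requested, so fall back to gloo.
    return "gloo"

print(pick_backend())
```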