When I trained with 2 GPUs like this:
python -m torch.distributed.launch --nproc_per_node=2 train_net.py --config-file "/home/amax/yang/maskrcnn-benchmark/configs/cityscapes/e2e_mask_rcnn_R_50_FPN_1x_cocostyle.yaml"
I got this error:
Traceback (most recent call last):
File "/home/amax/anaconda3/envs/Yang/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/amax/anaconda3/envs/Yang/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/amax/anaconda3/envs/Yang/lib/python3.6/site-packages/torch/distributed/launch.py", line 238, in
main()
File "/home/amax/anaconda3/envs/Yang/lib/python3.6/site-packages/torch/distributed/launch.py", line 234, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/home/amax/anaconda3/envs/Yang/bin/python', '-u', 'train_net.py', '--local_rank=0', '--config-file', '/home/amax/yang/maskrcnn-benchmark/configs/cityscapes/e2e_mask_rcnn_R_50_FPN_1x_cocostyle.yaml']' returned non-zero exit status 1.
Could anyone tell me what happened?
Can you try running without -m torch.distributed.launch --nproc_per_node=2 to get a more meaningful error message?
I tried again and found that the error occurs when I add a new fc layer to the model:
def __init__(self, cfg):
    super(GeneralizedRCNN, self).__init__()
    self.backbone = build_backbone(cfg)
    self.rpn = build_rpn(cfg, self.backbone.out_channels)
    self.roi_heads = build_roi_heads(cfg, self.backbone.out_channels)
    self.fc1 = nn.Linear(256, 16)  # new layer, never used in forward()
The error will not occur when the code is like this:
def __init__(self, cfg):
    super(GeneralizedRCNN, self).__init__()
    self.backbone = build_backbone(cfg)
    self.rpn = build_rpn(cfg, self.backbone.out_channels)
    self.roi_heads = build_roi_heads(cfg, self.backbone.out_channels)
This seems very strange to me. Why can't I add a new linear layer to the model? And when I do add a new linear layer, some strange numbers appear like this:

The error occurs only when I train with two GPUs and add new linear layers. It does not occur when I train with one GPU, and it also does not occur when I do not add new linear layers.
Looks like this problem is the same as https://github.com/facebookresearch/maskrcnn-benchmark/issues/318 and https://github.com/pytorch/pytorch/issues/13273
If you add a new layer without using it in the forward pass, that layer will not receive any gradient during backpropagation. The single-GPU version of PyTorch supports this, but the distributed version currently does not.
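To illustrate why the distributed wrapper complains, here is a minimal single-process sketch (a toy model, not maskrcnn-benchmark code): a layer that is registered in `__init__` but never called in `forward()` ends up with no gradient after `backward()`. `DistributedDataParallel`'s gradient reducer expects every registered parameter to produce a gradient it can synchronize across GPUs, so the unused `fc1` breaks the reduction. (Newer PyTorch releases let you pass `find_unused_parameters=True` to `DistributedDataParallel` to tolerate such layers.)

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # registered but never called below

    def forward(self, x):
        return self.used(x)  # self.unused takes no part in the computation

model = ToyModel()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# The used layer received a gradient; the unused one did not.
print(model.used.weight.grad is not None)  # True
print(model.unused.weight.grad is None)    # True
```

In single-GPU training the `None` gradient is simply skipped by the optimizer, which is why the problem only shows up under `torch.distributed.launch`.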
Thank you for your reply! @chengyangfu
I will try it again.
Thanks @chengyangfu for finding the reason!
@2678918253 I'm closing this issue, but let us know if you have further problems