Maskrcnn-benchmark: subprocess.CalledProcessError

Created on 24 Feb 2019  ·  5 Comments  ·  Source: facebookresearch/maskrcnn-benchmark

โ“ Questions and Help

When I train on 2 GPUs like this:
python -m torch.distributed.launch --nproc_per_node=2 train_net.py --config-file "/home/amax/yang/maskrcnn-benchmark/configs/cityscapes/e2e_mask_rcnn_R_50_FPN_1x_cocostyle.yaml"
I get the following error:
Traceback (most recent call last):
  File "/home/amax/anaconda3/envs/Yang/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/amax/anaconda3/envs/Yang/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/amax/anaconda3/envs/Yang/lib/python3.6/site-packages/torch/distributed/launch.py", line 238, in <module>
    main()
  File "/home/amax/anaconda3/envs/Yang/lib/python3.6/site-packages/torch/distributed/launch.py", line 234, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/amax/anaconda3/envs/Yang/bin/python', '-u', 'train_net.py', '--local_rank=0', '--config-file', '/home/amax/yang/maskrcnn-benchmark/configs/cityscapes/e2e_mask_rcnn_R_50_FPN_1x_cocostyle.yaml']' returned non-zero exit status 1.
Could anyone tell me what happened?

duplicate

All 5 comments

Can you try running without -m torch.distributed.launch --nproc_per_node=2 to get a more meaningful error message?
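The reason for this request is that the exception raised by the launcher carries only the child's command line and exit status, not the child's own traceback. A minimal sketch of that behavior using only the standard library follows; the failing child script is hypothetical, and its stderr is captured here purely to show where the real cause sits.

```python
import subprocess
import sys

# Hypothetical child script that fails the way a broken training script might.
child_code = "raise RuntimeError('real cause lives here')"

try:
    # check=True raises CalledProcessError when the child exits non-zero.
    subprocess.run(
        [sys.executable, "-c", child_code],
        check=True,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
except subprocess.CalledProcessError as err:
    exit_status = err.returncode   # all the exception message itself reports
    child_stderr = err.stderr      # the child's own traceback
    print("parent sees exit status:", exit_status)
    print("child traceback ends with:", child_stderr.strip().splitlines()[-1])
```

Running the training script directly removes the parent process from the picture, so the child's traceback is the only one printed.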

I tried again and found that the error occurs when I add a new fc layer to the model:

def __init__(self, cfg):
    super(GeneralizedRCNN, self).__init__()
    self.backbone = build_backbone(cfg)
    self.rpn = build_rpn(cfg, self.backbone.out_channels)
    self.roi_heads = build_roi_heads(cfg, self.backbone.out_channels)
    # Newly added layer
    self.fc1 = nn.Linear(256, 16)

The error does not occur when the code looks like this:

def __init__(self, cfg):
    super(GeneralizedRCNN, self).__init__()
    self.backbone = build_backbone(cfg)
    self.rpn = build_rpn(cfg, self.backbone.out_channels)
    self.roi_heads = build_roi_heads(cfg, self.backbone.out_channels)

This seems very strange to me. Why can't I add a new linear layer to the model? Also, when I add the new linear layer, some strange numbers appear, like this:
[image: yang]
The error occurs only when I train on two GPUs and add new linear layers. With one GPU the error does not occur, and without the new linear layers it does not occur either.

This looks like the same problem as https://github.com/facebookresearch/maskrcnn-benchmark/issues/318 and https://github.com/pytorch/pytorch/issues/13273

If you add a new layer but never use it in the forward pass, that layer receives no gradient during backpropagation. Single-GPU PyTorch tolerates this, but the distributed version currently does not.
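To check whether a model has such unused layers before launching a distributed run, one option is to do a single backward pass on one process and list the parameters whose gradient is still `None`. The sketch below assumes a toy model: `TinyModel` and `unused_parameter_names` are hypothetical names standing in for the real `GeneralizedRCNN`, and only the `fc1` layer mirrors the code posted above.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for GeneralizedRCNN, not the real model."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)    # called in forward()
        self.fc1 = nn.Linear(256, 16)  # registered but never called

    def forward(self, x):
        return self.used(x)


def unused_parameter_names(model, loss):
    """Backpropagate `loss` and return names of parameters with no gradient."""
    loss.backward()
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


model = TinyModel()
loss = model(torch.randn(3, 4)).sum()
unused = unused_parameter_names(model, loss)
print(unused)  # ['fc1.weight', 'fc1.bias']
```

Any name in the list belongs to a layer that should either be used in `forward()` or removed before multi-GPU training. Later PyTorch releases also accept a `find_unused_parameters` argument on `torch.nn.parallel.DistributedDataParallel`; whether it is available depends on the installed version.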

Thank you for your reply! @chengyangfu
I will try it again.

Thanks @chengyangfu for finding the reason!

@2678918253 I'm closing this issue, but let us know if you have further problems
