When I trained with 2 GPUs like this:
python -m torch.distributed.launch --nproc_per_node=2 train_net.py --config-file "/home/amax/yang/maskrcnn-benchmark/configs/cityscapes/e2e_mask_rcnn_R_50_FPN_1x_cocostyle.yaml"
I got this error:
Traceback (most recent call last):
File "/home/amax/anaconda3/envs/Yang/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/amax/anaconda3/envs/Yang/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/amax/anaconda3/envs/Yang/lib/python3.6/site-packages/torch/distributed/launch.py", line 238, in
main()
File "/home/amax/anaconda3/envs/Yang/lib/python3.6/site-packages/torch/distributed/launch.py", line 234, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/home/amax/anaconda3/envs/Yang/bin/python', '-u', 'train_net.py', '--local_rank=0', '--config-file', '/home/amax/yang/maskrcnn-benchmark/configs/cityscapes/e2e_mask_rcnn_R_50_FPN_1x_cocostyle.yaml']' returned non-zero exit status 1.
Could anyone tell me what happened?
Can you try running without -m torch.distributed.launch --nproc_per_node=2 to get a more meaningful error message?
I tried again and found that the error occurs when I add a new fc layer to the model:
def __init__(self, cfg):
    super(GeneralizedRCNN, self).__init__()
    self.backbone = build_backbone(cfg)
    self.rpn = build_rpn(cfg, self.backbone.out_channels)
    self.roi_heads = build_roi_heads(cfg, self.backbone.out_channels)
    self.fc1 = nn.Linear(256, 16)  # new layer, never used in forward()
The error will not occur when the code is like this:
def __init__(self, cfg):
    super(GeneralizedRCNN, self).__init__()
    self.backbone = build_backbone(cfg)
    self.rpn = build_rpn(cfg, self.backbone.out_channels)
    self.roi_heads = build_roi_heads(cfg, self.backbone.out_channels)
This seems very strange to me. Why can't I add a new linear layer to the model? And when I do add a new linear layer, some strange numbers appear like this:

The error occurs only when I train with two GPUs and add new linear layers. It does not occur when I train with one GPU, and it also does not occur when I do not add new linear layers.
Looks like this problem is the same as https://github.com/facebookresearch/maskrcnn-benchmark/issues/318 and https://github.com/pytorch/pytorch/issues/13273
If you add a new layer without using it in the forward pass, that layer will not receive any gradient during backpropagation. The single-GPU version of PyTorch supports this, but the distributed version currently does not.
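To illustrate why the distributed wrapper complains, here is a minimal single-process sketch (a toy model, not maskrcnn-benchmark code): a layer that is registered in `__init__` but never called in `forward()` ends up with no gradient after `backward()`. `DistributedDataParallel`'s gradient reducer expects every registered parameter to produce a gradient it can synchronize across GPUs, so the unused `fc1` breaks the reduction. (Newer PyTorch releases let you pass `find_unused_parameters=True` to `DistributedDataParallel` to tolerate such layers.)

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # registered but never called below

    def forward(self, x):
        return self.used(x)  # self.unused takes no part in the computation

model = ToyModel()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# The used layer received a gradient; the unused one did not.
print(model.used.weight.grad is not None)  # True
print(model.unused.weight.grad is None)    # True
```

In single-GPU training the `None` gradient is simply skipped by the optimizer, which is why the problem only shows up under `torch.distributed.launch`.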
Thank you for your reply! @chengyangfu
I will try it again.
Thanks @chengyangfu for finding the reason!
@2678918253 I'm closing this issue, but let us know if you have further problems