Maskrcnn-benchmark: aspect ratio grouping error

Created on 29 Oct 2018  ·  21 Comments  ·  Source: facebookresearch/maskrcnn-benchmark

❓ Questions and Help

I added a new loss and it works fine if I use a single GPU.
However, it fails at "losses.backward()" if I use multiple GPUs. The error seems related to "torch.distributed".
The error output is below:

File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/home/maskrcnn_benchmark/engine/trainer.py", line 77, in do_train
    losses.backward()
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.5/dist-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/deprecated/distributed.py", line 342, in reduction_fn_nccl
    group=self.nccl_reduction_group_id)
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/deprecated/__init__.py", line 317, in all_reduce_multigpu
    return torch._C._dist_all_reduce_multigpu(tensor_list, op, group)
Labels: bug, contributions welcome

Most helpful comment

Oh, there might be indeed a problem with the GroupedBatchSampler.
As a quick workaround, I'd recommend setting the ASPECT_RATIO_GROUPING to False in the config.
I'll need to dig a bit further to identify in which contexts the issue you are facing arises.

All 21 comments

Hi,

It is difficult to tell where the problem might be without a bit more information.

A few questions:

  • is your loss written in Python using only PyTorch operations?
  • do you create a new tensor inside your loss, and if so, do you make sure you set its device properly?
  • does your loss involve double-backwards?
  • is it a loss (like MSE-loss) or a new FasterRCNNLossComputation class (or alike) that you wrote?

@fmassa Thank you very much for your quick response!
The new loss is a "F.cross_entropy" for another mask prediction branch. I create a new target tensor for the loss and set its device as project_masks_on_boxes() in maskrcnn_benchmark/modeling/roi_heads/mask_head/loss.py does.

Do you also handle the case where there are no masks present in the batch?

If you have an early return from the losses and you don't backpropagate through the whole model, you might face deadlocks (or perhaps errors in the newest version, I don't know).
That's why I have parts like the following in the code: https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/maskrcnn_benchmark/modeling/roi_heads/mask_head/loss.py#L124-L125

This means that the loss needs to be linked to the whole model, even if it is zero.

I also use if mask_targets.numel() == 0: to return the loss early. I can run it stably with a single GPU, so I guess the problem is related to "torch.distributed". Maybe I should register the new loss function or modify some of the distributed code?

How do you return the loss early?
It should be something like

return mask_logits.sum() * 0

instead of

return torch.tensor(0, requires_grad=True, device=device)
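For context on why the distinction matters, here is a minimal single-process sketch of the two forms (variable names are illustrative, not from the repo):

```python
import torch

# Toy stand-ins: `w` plays the role of a model parameter, `mask_logits`
# an output computed from it.
w = torch.randn(3, requires_grad=True)
mask_logits = w * 2

# Connected zero: the loss keeps its autograd history, so backward()
# reaches w and (under DDP) the gradient all-reduce hooks still fire.
(mask_logits.sum() * 0).backward()
grad_connected = w.grad.clone()  # a tensor of zeros, but it exists

# Disconnected zero: a fresh tensor has no history, so backward()
# never touches w -- under DDP, other ranks then wait forever on the
# all-reduce for that parameter.
w.grad = None
torch.tensor(0.0, requires_grad=True).backward()
grad_disconnected = w.grad  # still None
```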

Yes, I use the code like return mask_logits.sum() * 0.

It is difficult to say what else can be the problem without seeing the code.
If it works on single GPU, but fails on multi-GPU, the possibilities that I can think of are the following:

  • you selectively return one loss or the other depending on the batch: you need to return all losses, even if one is zero, with the approach I mentioned to you before. So if one branch is not used, you still need to make its loss be mask_logits.sum() * 0 or something like that.

Can you share the modifications that you made? It would be easier to help you in that case.

Also note that what I mentioned is true everywhere in the model.
So if somewhere else in your model you have an early return (for example in the box_heads.py), you should make sure that everything gets linked by the model (via a forward / backward), or else you might face deadlocks in NCCL.
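To make that concrete, here is a minimal sketch of a two-key loss dict in the style described above (the function and key names are my own, not the repo's):

```python
import torch
import torch.nn.functional as F

def mask_branch_losses(mask_logits, mask_targets, char_logits, char_targets):
    # Hypothetical two-branch loss dict. The point: every key is returned
    # on every rank, and an "empty" branch still yields a loss that is
    # connected to its logits, keeping backward() identical across GPUs.
    losses = {}
    if mask_targets.numel() == 0:
        losses["loss_mask"] = mask_logits.sum() * 0  # zero, but graph-connected
    else:
        losses["loss_mask"] = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    if char_targets.numel() == 0:
        losses["loss_char_mask"] = char_logits.sum() * 0
    else:
        losses["loss_char_mask"] = F.cross_entropy(char_logits, char_targets)
    return losses

# Even with no targets at all, the summed loss backpropagates to the logits:
logits = torch.randn(2, 4, requires_grad=True)
out = mask_branch_losses(logits, torch.empty(0), logits, torch.empty(0, dtype=torch.long))
sum(out.values()).backward()
```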

The related modified files are here:
https://github.com/MhLiao/debug/blob/master/loss.py
https://github.com/MhLiao/debug/blob/master/mask_head.py

In loss.py, I added a cross_entropy loss function and kept the steps almost the same as for the original loss.
In mask_head.py, I return a dict with two keys instead of the original one-key dict.

@MhLiao can you try changing this line with

if mask_targets.numel() == 0 or char_mask_targets.numel() == 0:

and let me know?

There was a mistake in that line. I have corrected it, but the error is the same. I do not hit deadlocks.

I noticed another error at the top of the error logs, which may be the actual cause of this problem.
There may be something wrong in the data sampler when I use multiple GPUs.

Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 56, in do_train
    for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 614, in __next__
    indices = next(self.sample_iter)  # may raise StopIteration
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark/data/samplers/iteration_based_batch_sampler.py", line 24, in __iter__
    for batch in self.batch_sampler:
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark/data/samplers/grouped_batch_sampler.py", line 107, in __iter__
    batches = self._prepare_batches()
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark/data/samplers/grouped_batch_sampler.py", line 79, in _prepare_batches
    first_element_of_batch = [t[0].item() for t in merged]
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark/data/samplers/grouped_batch_sampler.py", line 79, in <listcomp>
    first_element_of_batch = [t[0].item() for t in merged]
IndexError: index 0 is out of bounds for dimension 0 with size 0
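The failing line can be reproduced in isolation; this sketch (my own reconstruction, not the sampler's actual code path) shows how a single empty tensor in merged triggers exactly this IndexError:

```python
import torch

# One aspect-ratio group came back empty on this GPU's shard, so `merged`
# contains a zero-length tensor and t[0] has nothing to index.
merged = (torch.tensor([9]), torch.tensor([], dtype=torch.int64))
try:
    first_element_of_batch = [t[0].item() for t in merged]
except IndexError as err:
    caught = str(err)
print(caught)  # index 0 is out of bounds for dimension 0 with size 0
```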

Oh, there might be indeed a problem with the GroupedBatchSampler.
As a quick workaround, I'd recommend setting the ASPECT_RATIO_GROUPING to False in the config.
I'll need to dig a bit further to identify in which contexts the issue you are facing arises.
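In config form, the workaround looks like this (if I remember the layout of defaults.py correctly, the key lives under DATALOADER):

```yaml
# Hypothetical experiment config snippet; check defaults.py for the exact key path.
DATALOADER:
  ASPECT_RATIO_GROUPING: False
```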

That's right. When I set ASPECT_RATIO_GROUPING to False, everything is OK.
I printed the value of merged in this line,
but I cannot find any differences between using a single GPU and using multiple GPUs.
Multiple GPUs:

(tensor([9]), tensor([187]), tensor([63]), tensor([48]), tensor([159]), tensor([172]), tensor([176]), tensor([75]), tensor([221]), tensor([131]), tensor([56]), tensor([191]), tensor([99]), tensor([46]), tensor([80]), tensor([124]), tensor([161]), tensor([184]), tensor([166]), tensor([141]), tensor([155]), tensor([175]), tensor([214]), tensor([89]), tensor([93]), tensor([144]), tensor([64]), tensor([69]), tensor([174]))
(tensor([109]), tensor([200]), tensor([211]), tensor([189]), tensor([17]), tensor([59]), tensor([104]), tensor([31]), tensor([180]), tensor([137]), tensor([51]), tensor([5]), tensor([183]), tensor([44]), tensor([60]), tensor([138]), tensor([158]), tensor([15]), tensor([185]), tensor([30]), tensor([142]), tensor([204]), tensor([216]), tensor([206]), tensor([190]), tensor([165]), tensor([164]), tensor([24]), tensor([111]))
(tensor([122]), tensor([121]), tensor([209]), tensor([133]), tensor([162]), tensor([81]), tensor([227]), tensor([128]), tensor([57]), tensor([68]), tensor([218]), tensor([169]), tensor([21]), tensor([149]), tensor([47]), tensor([156]), tensor([8]), tensor([148]), tensor([18]), tensor([207]), tensor([62]), tensor([210]), tensor([73]), tensor([12]), tensor([192]), tensor([103]), tensor([96]), tensor([107]), tensor([152]))
(tensor([123]), tensor([130]), tensor([113]), tensor([153]), tensor([32]), tensor([181]), tensor([170]), tensor([222]), tensor([7]), tensor([115]), tensor([91]), tensor([61]), tensor([199]), tensor([43]), tensor([22]), tensor([19]), tensor([26]), tensor([145]), tensor([49]), tensor([127]), tensor([88]), tensor([28]), tensor([53]), tensor([208]), tensor([114]), tensor([100]), tensor([194]), tensor([215]), tensor([39]))
(tensor([114]), tensor([100]), tensor([194]), tensor([151]), tensor([92]), tensor([224]), tensor([219]), tensor([182]), tensor([116]), tensor([72]), tensor([87]), tensor([71]), tensor([90]), tensor([52]), tensor([117]), tensor([27]), tensor([157]), tensor([45]), tensor([97]), tensor([112]), tensor([220]), tensor([140]), tensor([84]), tensor([193]), tensor([173]), tensor([78]), tensor([34]), tensor([226]), tensor([79]), tensor([], dtype=torch.int64))
(tensor([177]), tensor([106]), tensor([14]), tensor([203]), tensor([83]), tensor([205]), tensor([74]), tensor([129]), tensor([86]), tensor([38]), tensor([225]), tensor([201]), tensor([147]), tensor([120]), tensor([101]), tensor([217]), tensor([20]), tensor([160]), tensor([23]), tensor([29]), tensor([6]), tensor([65]), tensor([212]), tensor([171]), tensor([198]), tensor([40]), tensor([10]), tensor([94]), tensor([126]))
(tensor([146]), tensor([167]), tensor([95]), tensor([2]), tensor([36]), tensor([3]), tensor([35]), tensor([119]), tensor([42]), tensor([41]), tensor([1]), tensor([82]), tensor([228]), tensor([143]), tensor([196]), tensor([50]), tensor([33]), tensor([195]), tensor([202]), tensor([54]), tensor([150]), tensor([58]), tensor([0]), tensor([16]), tensor([135]), tensor([125]), tensor([188]), tensor([163]), tensor([108]))
(tensor([197]), tensor([37]), tensor([178]), tensor([118]), tensor([98]), tensor([4]), tensor([67]), tensor([136]), tensor([132]), tensor([168]), tensor([186]), tensor([77]), tensor([13]), tensor([223]), tensor([11]), tensor([134]), tensor([66]), tensor([179]), tensor([55]), tensor([70]), tensor([154]), tensor([102]), tensor([213]), tensor([110]), tensor([76]), tensor([139]), tensor([105]), tensor([25]), tensor([85]))

Single GPU:

(tensor([67]), tensor([104]), tensor([44]), tensor([59]), tensor([190]), tensor([187]), tensor([12]), tensor([65]), tensor(
[2]), tensor([26]), tensor([92]), tensor([221]), tensor([198]), tensor([34]), tensor([32]), tensor([61]), tensor([71]), ten
sor([156]), tensor([131]), tensor([178]), tensor([49]), tensor([121]), tensor([136]), tensor([188]), tensor([135]), tensor(
[123]), tensor([64]), tensor([179]), tensor([142]), tensor([83]), tensor([79]), tensor([109]), tensor([127]), tensor([48]),
 tensor([11]), tensor([163]), tensor([118]), tensor([52]), tensor([66]), tensor([170]), tensor([84]), tensor([63]), tensor(
[186]), tensor([87]), tensor([96]), tensor([207]), tensor([195]), tensor([191]), tensor([103]), tensor([211]), tensor([101]
), tensor([138]), tensor([75]), tensor([114]), tensor([20]), tensor([201]), tensor([143]), tensor([141]), tensor([177]), te
nsor([76]), tensor([95]), tensor([113]), tensor([112]), tensor([51]), tensor([23]), tensor([46]), tensor([157]), tensor([19
6]), tensor([228]), tensor([199]), tensor([153]), tensor([145]), tensor([205]), tensor([159]), tensor([45]), tensor([9]), t
ensor([224]), tensor([4]), tensor([144]), tensor([100]), tensor([81]), tensor([214]), tensor([154]), tensor([173]), tensor(
[150]), tensor([7]), tensor([91]), tensor([42]), tensor([184]), tensor([164]), tensor([213]), tensor([62]), tensor([115]),
tensor([53]), tensor([148]), tensor([18]), tensor([110]), tensor([133]), tensor([89]), tensor([47]), tensor([158]), tensor(
[200]), tensor([217]), tensor([220]), tensor([194]), tensor([5]), tensor([175]), tensor([226]), tensor([28]), tensor([222])
, tensor([19]), tensor([29]), tensor([146]), tensor([82]), tensor([204]), tensor([60]), tensor([15]), tensor([165]), tensor
([192]), tensor([223]), tensor([202]), tensor([90]), tensor([203]), tensor([225]), tensor([68]), tensor([216]), tensor([30]
), tensor([149]), tensor([209]), tensor([210]), tensor([77]), tensor([6]), tensor([193]), tensor([116]), tensor([78]), tens
or([122]), tensor([147]), tensor([168]), tensor([180]), tensor([160]), tensor([128]), tensor([72]), tensor([93]), tensor([2
2]), tensor([55]), tensor([139]), tensor([13]), tensor([182]), tensor([212]), tensor([73]), tensor([10]), tensor([130]), te
nsor([137]), tensor([98]), tensor([183]), tensor([86]), tensor([125]), tensor([151]), tensor([169]), tensor([197]), tensor(
[107]), tensor([172]), tensor([161]), tensor([124]), tensor([102]), tensor([41]), tensor([185]), tensor([132]), tensor([140
]), tensor([35]), tensor([57]), tensor([166]), tensor([181]), tensor([40]), tensor([50]), tensor([88]), tensor([227]), tens
or([74]), tensor([58]), tensor([97]), tensor([208]), tensor([56]), tensor([176]), tensor([36]), tensor([206]), tensor([171]
), tensor([33]), tensor([117]), tensor([105]), tensor([155]), tensor([17]), tensor([219]), tensor([54]), tensor([70]), tens
or([21]), tensor([16]), tensor([43]), tensor([129]), tensor([119]), tensor([167]), tensor([0]), tensor([80]), tensor([120])
, tensor([38]), tensor([1]), tensor([189]), tensor([218]), tensor([106]), tensor([99]), tensor([27]), tensor([162]), tensor
([37]), tensor([3]), tensor([8]), tensor([134]), tensor([31]), tensor([14]), tensor([152]), tensor([111]), tensor([25]), te
nsor([85]), tensor([69]), tensor([24]), tensor([39]), tensor([174]), tensor([108]), tensor([215]), tensor([126]), tensor([9
4]))

I don't know exactly where the issue comes from, but during multi-GPU training we mask the indices so that each GPU sees a different subset of the data.
Maybe there is an edge case there that I'm not taking into account. I'll need to investigate further.

I have the same issue.

If you manage to isolate the problem in a minimal example, it would be very helpful, as for now I don't know where to start looking.

Is there any update to this?

Edit: when I run with multi-GPU and leave aspect_grouping on, it shows the following error:

/data/samplers/grouped_batch_sampler.py", line 79, in _prepare_batches
    first_element_of_batch = [t[0].item() for t in merged]
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark/data/samplers/grouped_batch_sampler.py", line 79, in <listcomp>
    first_element_of_batch = [t[0].item() for t in merged]
IndexError: index 0 is out of bounds for dimension 0 with size 0

I am running two experiments (the first with a single GPU, the second with a single GPU and aspect_grouping off) and so far (17000 iterations) no error has been encountered.

In addition, the ASPECT_RATIO_GROUPING parameter is defined in the file maskrcnn-benchmark/maskrcnn_benchmark/config/defaults.py.

I ran into the same issue when training on my custom dataset with 2 GPUs. On rank 0 the value of merged is normal, but on rank 1 there is an empty tensor in merged:

rank: 0, type(merged): <class 'tuple'>, len(merged): 125
rank: 1, type(merged): <class 'tuple'>, len(merged): 126
(tensor([228, 496]), tensor([355, 383]), tensor([465, 116]), tensor([169, 150]), tensor([324, 212]), tensor([394,   2]), tensor([238,  84]), tensor([471, 245]), tensor([411, 125]), tensor([231, 128]), tensor([316, 184]), tensor([88, 79]), tensor([267, 482]), tensor([ 90, 336]), tensor([414, 137]), tensor([29, 33]), tensor([40, 97]), tensor([ 77, 200]), tensor([58, 96]), tensor([190, 287]), tensor([165, 356]), tensor([ 98, 172]), tensor([282, 454]), tensor([ 39, 342]), tensor([149, 152]), tensor([492,  56]), tensor([175, 138]), tensor([345, 257]), tensor([358, 403]), tensor([189, 106]), tensor([ 75, 307]), tensor([ 80, 359]), tensor([338, 205]), tensor([181, 129]), tensor([301, 221]), tensor([ 13, 232]), tensor([182, 313]), tensor([ 83, 340]), tensor([ 15, 333]), tensor([350, 397]), tensor([490, 208]), tensor([  1, 216]), tensor([230, 368]), tensor([357,  62]), tensor([199, 151]), tensor([335, 332]), tensor([ 67, 135]), tensor([239, 253]), tensor([102, 132]), tensor([277, 323]), tensor([ 99, 275]), tensor([163, 286]), tensor([265, 447]), tensor([276, 448]), tensor([153, 249]), tensor([ 52, 193]), tensor([ 66, 421]), tensor([73, 18]), tensor([270, 177]), tensor([ 54, 269]), tensor([429, 296]), tensor([360, 422]), tensor([327, 481]), tensor([449, 386]), tensor([486, 180]), tensor([406, 312]), tensor([134, 387]), tensor([211, 480]), tensor([46, 68]), tensor([ 35, 235]), tensor([72, 53]), tensor([ 71, 101]), tensor([244, 161]), tensor([ 48, 466]), tensor([ 23, 168]), tensor([154, 197]), tensor([464, 436]), tensor([120, 372]), tensor([63, 85]), tensor([ 31, 272]), tensor([279, 110]), tensor([179, 317]), tensor([370, 404]), tensor([380, 401]), tensor([ 87, 437]), tensor([413, 477]), tensor([155, 311]), tensor([  8, 443]), tensor([469, 218]), tensor([405, 415]), tensor([241, 251]), tensor([ 17, 305]), tensor([183, 364]), tensor([104, 304]), tensor([331, 322]), tensor([113, 111]), tensor([ 60, 130]), tensor([297, 157]), tensor([474, 487]), tensor([407, 426]), tensor([227,   
9]), tensor([363, 434]), tensor([460, 424]), tensor([431, 337]), tensor([281, 159]), tensor([32,  7]), tensor([475, 488]), tensor([ 55, 295]), tensor([220, 293]), tensor([146, 451]), tensor([385,   6]), tensor([224,  34]), tensor([348, 167]), tensor([395, 427]), tensor([366, 278]), tensor([141, 484]), tensor([369, 213]), tensor([410, 377]), tensor([463,  19]), tensor([351, 346]), tensor([362,  24]), tensor([103,  81]), tensor([352, 491]), tensor([145, 318]), tensor([59]))
(tensor([349, 396]), tensor([25, 82]), tensor([391, 248]), tensor([115, 450]), tensor([440, 124]), tensor([156, 389]), tensor([334, 166]), tensor([259, 271]), tensor([107, 176]), tensor([126,  49]), tensor([ 89, 143]), tensor([420, 388]), tensor([258, 384]), tensor([ 61, 341]), tensor([185, 247]), tensor([419, 290]), tensor([428, 162]), tensor([198, 382]), tensor([472, 347]), tensor([ 94, 192]), tensor([326, 237]), tensor([289, 148]), tensor([444, 459]), tensor([303, 409]), tensor([343, 374]), tensor([456, 204]), tensor([ 37, 376]), tensor([393,   0]), tensor([91, 65]), tensor([164, 186]), tensor([261, 329]), tensor([441, 268]), tensor([ 78, 108]), tensor([252, 430]), tensor([320, 105]), tensor([274, 207]), tensor([206, 226]), tensor([461, 173]), tensor([ 30, 242]), tensor([ 76, 122]), tensor([256, 412]), tensor([273, 294]), tensor([209, 196]), tensor([321, 123]), tensor([ 64, 119]), tensor([ 44, 371]), tensor([489, 435]), tensor([285, 147]), tensor([392,  27]), tensor([ 14, 300]), tensor([375, 240]), tensor([280,  36]), tensor([92, 12]), tensor([353, 446]), tensor([402,  22]), tensor([478, 442]), tensor([158, 479]), tensor([263, 339]), tensor([308, 390]), tensor([325, 373]), tensor([314,  41]), tensor([188, 117]), tensor([109, 400]), tensor([142, 178]), tensor([418, 191]), tensor([458, 476]), tensor([445, 423]), tensor([365, 260]), tensor([470, 136]), tensor([399, 233]), tensor([ 69, 398]), tensor([319, 222]), tensor([194, 379]), tensor([250, 495]), tensor([133, 202]), tensor([225, 298]), tensor([195, 234]), tensor([170, 330]), tensor([416, 433]), tensor([361,  10]), tensor([284, 378]), tensor([  3, 219]), tensor([467,  26]), tensor([ 93, 439]), tensor([100, 174]), tensor([462, 288]), tensor([243, 160]), tensor([ 86, 140]), tensor([ 74, 215]), tensor([283, 408]), tensor([171,  16]), tensor([302, 187]), tensor([309,  43]), tensor([112,  21]), tensor([344, 494]), tensor([485, 310]), tensor([291, 457]), tensor([381, 328]), tensor([432, 425]), tensor([417, 264]), 
tensor([266,  51]), tensor([114,  11]), tensor([  5, 255]), tensor([473,  70]), tensor([236,  50]), tensor([121,  95]), tensor([367, 455]), tensor([229, 306]), tensor([299,  42]), tensor([292, 139]), tensor([223,  45]), tensor([214, 493]), tensor([354, 203]), tensor([246, 210]), tensor([468, 217]), tensor([452, 118]), tensor([127, 144]), tensor([47, 20]), tensor([  4, 453]), tensor([38, 28]), tensor([ 57, 315]), tensor([438, 254]), tensor([201, 262]), tensor([483, 131]), tensor([228]), tensor([], dtype=torch.int64))

On a similar dataset with multi-GPU training, I didn't hit this issue. It is weird.
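A possible guard for that line, sketched here against toy data only (the numel() filter is my own suggestion, not a tested patch to grouped_batch_sampler.py):

```python
import torch

# Skip empty groups before taking each group's first element, so a
# zero-length tensor on one rank no longer raises IndexError.
merged = (torch.tensor([228]), torch.tensor([], dtype=torch.int64))
first_element_of_batch = [t[0].item() for t in merged if t.numel() > 0]
print(first_element_of_batch)  # [228]
```

Note that silently dropping a group changes the batching and could itself desynchronize ranks; this only avoids the crash, it is not necessarily the correct fix.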

Setting ASPECT_RATIO_GROUPING to False in config.yml seems to fix this issue.

Awesome, that solved it.
