Maskrcnn-benchmark: What does the BatchCollator actually do?

Created on 12 Jul 2019 · 13 comments · Source: facebookresearch/maskrcnn-benchmark

โ“ Questions and Help

Hello fmassa.
Thank you for the awesome code.
I have a question about the dataloader when training with multi gpu.
Usually images have different numbers of ground-truth boxes, but within a minibatch the shape of each sample should be the same.
How does the code deal with this situation?
If I have two images with 2 and 3 ground-truth boxes respectively, does it align them to the sample with the minimum number of boxes so that every sample has the same shape?
Thanks !

question

All 13 comments

If it actually does what I conjecture, does that have a negative influence on performance?

No, it returns the boxes as a list of tensors. And it pads the images with zeros so that they have the same shape.

No, it returns the boxes as a list of tensors. And it pads the images with zeros so that they have the same shape.

@fmassa Thanks for your reply. Could you please give me a hint on where this manipulation happens? I searched quite a lot and still cannot figure it out... From the BatchCollator, I found:

def __call__(self, batch):
    transposed_batch = list(zip(*batch))
    images = to_image_list(transposed_batch[0], self.size_divisible)
    targets = transposed_batch[1]
    img_ids = transposed_batch[2]
    return images, targets, img_ids

But I still cannot find how the padding is accomplished.
Thanks.

Maybe I did not convey it clearly. By "samples with the same shape" I meant that every sample has the same number of boxes, not that the images have the same shape.

The padding of the images is done in to_image_list.

The boxes do not need to be padded, because we only pad on the bottom-right corner.
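For reference, here is a simplified sketch of the padding logic in to_image_list (the actual implementation lives in the repo's image_list.py and returns an ImageList object; this stripped-down version only illustrates the zero padding and the size_divisible rounding):

import math
import torch

def to_image_list_sketch(tensors, size_divisible=0):
    # tensors: list of CxHxW image tensors, possibly with different H and W
    max_size = list(max(s) for s in zip(*[img.shape for img in tensors]))
    if size_divisible > 0:
        # round H and W up to a multiple of size_divisible (e.g. 32 for FPN)
        max_size[1] = int(math.ceil(max_size[1] / size_divisible) * size_divisible)
        max_size[2] = int(math.ceil(max_size[2] / size_divisible) * size_divisible)

    batch_shape = (len(tensors),) + tuple(max_size)
    batched = tensors[0].new_full(batch_shape, 0)  # zero-filled batch tensor
    for img, pad_img in zip(tensors, batched):
        # copy each image into the top-left corner; the remaining zeros
        # form the padding on the bottom-right
        pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)

    image_sizes = [img.shape[-2:] for img in tensors]  # original (H, W) per image
    return batched, image_sizes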

Thank you.

No, it returns the boxes as a list of tensors. And it pads the images with zeros so that they have the same shape.

@fmassa Hi. As you said, it returns the boxes as a list of tensors. Does that mean that the length of the BoxList, i.e., the number of boxes per image, does not need to be the same across images?
Thanks.

Maybe I've realized what mistake I made...
This repo uses distributed data parallel for multi-GPU training, while I used to train with multiple GPUs on a single machine using data parallel.
Besides the usual benefits of distributed data parallel over data parallel, when one uses a single GPU per process it can do even more.
As I said, many other repos handle varying numbers of boxes by padding and indexing; that constraint comes from data parallel, which invokes scatter to distribute the inputs to each GPU, and scatter only works well with regularly shaped tensors.
By using distributed data parallel with one GPU per process, one avoids the scatter and can therefore support a variable number of boxes per image.
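As an illustration, a minimal one-process-per-GPU setup might look like the sketch below (build_model is a placeholder for your model constructor, and launching via torch.distributed.launch is assumed; this is not the repo's exact training script):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# one process per GPU: each process only ever sees its own device,
# so inputs never have to be scattered into equally shaped chunks
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)

model = build_model().cuda()  # build_model is hypothetical
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[args.local_rank], output_device=args.local_rank
)
# the dataloader can now return targets as plain Python lists of BoxLists
# with a different number of boxes per image, since nothing needs scatter()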

Exactly, using Distributed Data Parallel makes things faster and easier to implement.

@fmassa
Hi. Sorry to bother you again. I have another question, about the optimizer.
After reading the official tutorial on distributed training, I found that it includes a step that averages the gradients across all the nodes.

for epoch in range(10):
    epoch_loss = 0.0
    for data, target in train_set:
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        epoch_loss += loss.item()
        loss.backward()
        average_gradients(model)
        optimizer.step()

The line average_gradients(model) all-reduces the gradients before the optimizer steps.
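For context, in that tutorial average_gradients is roughly the following (a sketch from memory, assuming the process group is already initialized):

import torch.distributed as dist

def average_gradients(model):
    # manually all-reduce every parameter gradient and divide by the world size
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size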
However, in the code of maskrcnn_benchmark, I could only find the code below:


        images = images.to(device)
        targets = [target.to(device) for target in targets]

        loss_dict = model(images, targets)

        losses = sum(loss for loss in loss_dict.values())

        # reduce losses over all GPUs for logging purposes
        loss_dict_reduced = reduce_loss_dict(loss_dict)
        losses_reduced = sum(loss for loss in loss_dict_reduced.values())
        meters.update(loss=losses_reduced, **loss_dict_reduced)

        optimizer.zero_grad()
        # Note: If mixed precision is not used, this ends up doing nothing
        # Otherwise apply loss scaling for mixed-precision recipe
        with amp.scale_loss(losses, optimizer) as scaled_losses:
            scaled_losses.backward()
        optimizer.step()

It seems that the gradient averaging operation is missing...
Or is it implemented somewhere?
Thanks.

This is implemented inside DistributedDataParallel, after the backward, and is done automatically.

This is implemented inside DistributedDataParallel, after the backward, and is done automatically.

@fmassa Thank you.
Now I can find the related code in DistributedDataParallel. I see that it sets the flag require_backward_grad_sync.
However, since it is an nn.Module, how can it determine when the backward pass is done, and where is the code that actually performs the gradient averaging? Or is this procedure implemented in the C++ backend?
Thanks.

OK, it seems the operation is done by a hook in the c10 backend.

  // Check if this was the final gradient for this bucket.
  if (--replica.pending == 0) {
    // Prescale bucket contents to turn the global sum into the global average.
    replica.contents.div_(process_group_->getSize());
    // Kick off reduction if all replicas for this bucket are ready.
    if (--bucket.pending == 0) {
      mark_bucket_ready(bucket_index.bucket_index);
    }
  }
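Conceptually, the mechanism works as if an autograd hook were registered on every parameter, firing once that parameter's gradient is ready during backward. A toy Python sketch of the idea (this is not the real DDP implementation, which buckets gradients and overlaps the all-reduce with the backward pass):

import torch.distributed as dist

def attach_grad_averaging_hooks(model):
    world_size = dist.get_world_size()

    def make_hook():
        def hook(grad):
            # called during backward as soon as this parameter's gradient exists
            dist.all_reduce(grad, op=dist.ReduceOp.SUM)
            return grad / world_size
        return hook

    for param in model.parameters():
        if param.requires_grad:
            param.register_hook(make_hook())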