Maskrcnn-benchmark: What does the BatchCollator actually do?

Created on 12 Jul 2019 · 13 comments · Source: facebookresearch/maskrcnn-benchmark

โ“ Questions and Help

Hello fmassa.
Thank you for the awesome code.
I have a question about the dataloader when training with multi gpu.
Usually images have different numbers of ground-truth boxes, but within a minibatch the shape of each sample should be the same.
How does the code deal with this situation?
If I have two images with 2 and 3 ground-truth boxes respectively, does it align them to the sample with the minimum number of boxes so that every sample has the same shape?
Thanks !

question

All 13 comments

If it actually does what I conjecture, does that have a negative influence on performance?

No, it returns the boxes as a list of tensors. And it pads the images with zeros so that they have the same shape.

No, it returns the boxes as a list of tensors. And it pads the images with zeros so that they have the same shape.

@fmassa Thanks for your reply. Could you please give me a hint on where this manipulation happens? I searched quite a lot and still cannot figure it out... From the BatchCollator, I found:

def __call__(self, batch):
    transposed_batch = list(zip(*batch))
    images = to_image_list(transposed_batch[0], self.size_divisible)
    targets = transposed_batch[1]
    img_ids = transposed_batch[2]
    return images, targets, img_ids

But I still cannot find how the padding is accomplished.
Thanks.

Maybe I did not convey it clearly. By "samples with the same shape" I meant that every sample has the same number of boxes, not that the images have the same shape.

The padding of the images is done in to_image_list.

The boxes do not need to be padded, because we only pad on the bottom-right corner.
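For reference, here is a simplified sketch of the padding logic in to_image_list (the actual implementation lives in the repo's image_list.py and returns an ImageList object; this stripped-down version only illustrates the zero padding and the size_divisible rounding):

import math
import torch

def to_image_list_sketch(tensors, size_divisible=0):
    # tensors: list of CxHxW image tensors, possibly with different H and W
    max_size = list(max(s) for s in zip(*[img.shape for img in tensors]))
    if size_divisible > 0:
        # round H and W up to a multiple of size_divisible (e.g. 32 for FPN)
        max_size[1] = int(math.ceil(max_size[1] / size_divisible) * size_divisible)
        max_size[2] = int(math.ceil(max_size[2] / size_divisible) * size_divisible)

    batch_shape = (len(tensors),) + tuple(max_size)
    batched = tensors[0].new_full(batch_shape, 0)  # zero-filled batch tensor
    for img, pad_img in zip(tensors, batched):
        # copy each image into the top-left corner; the remaining zeros
        # form the padding on the bottom-right
        pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)

    image_sizes = [img.shape[-2:] for img in tensors]  # original (H, W) per image
    return batched, image_sizes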

Thank you.

No, it returns the boxes as a list of tensors. And it pads the images with zeros so that they have the same shape.

@fmassa Hi. As you said, it returns the boxes as a list of tensors. Does that mean that the length of the BoxList, i.e., the number of boxes per image, does not need to be the same across images?
Thanks.

Maybe I've realized what mistake I made...
This repo uses distributed data parallel for multi-GPU training, while I used to train with multiple GPUs on a single machine using data parallel.
Besides the usual benefits of distributed data parallel over data parallel, when one uses a single GPU per process it can do even more.
As I said, many other repos handle varying numbers of boxes by padding and indexing; that constraint comes from data parallel, which invokes scatter to distribute the inputs to each GPU, and scatter only works well with regularly shaped tensors.
By using distributed data parallel with one GPU per process, one avoids the scatter and can therefore support a variable number of boxes per image.
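As an illustration, a minimal one-process-per-GPU setup might look like the sketch below (build_model is a placeholder for your model constructor, and launching via torch.distributed.launch is assumed; this is not the repo's exact training script):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# one process per GPU: each process only ever sees its own device,
# so inputs never have to be scattered into equally shaped chunks
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)

model = build_model().cuda()  # build_model is hypothetical
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[args.local_rank], output_device=args.local_rank
)
# the dataloader can now return targets as plain Python lists of BoxLists
# with a different number of boxes per image, since nothing needs scatter()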

Exactly, using Distributed Data Parallel makes things faster and easier to implement.

@fmassa
Hi. Sorry to bother you again. I have another question, about the optimizer.
After reading the official tutorial on distributed training, I found that it includes a step that averages the gradients across all the nodes.

for epoch in range(10):
    epoch_loss = 0.0
    for data, target in train_set:
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        epoch_loss += loss.item()
        loss.backward()
        average_gradients(model)
        optimizer.step()

The line average_gradients(model) all-reduces the gradients before the optimizer steps.
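For context, in that tutorial average_gradients is roughly the following (a sketch from memory, assuming the process group is already initialized):

import torch.distributed as dist

def average_gradients(model):
    # manually all-reduce every parameter gradient and divide by the world size
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size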
However, in the code of maskrcnn_benchmark, I could only find the code below:


        images = images.to(device)
        targets = [target.to(device) for target in targets]

        loss_dict = model(images, targets)

        losses = sum(loss for loss in loss_dict.values())

        # reduce losses over all GPUs for logging purposes
        loss_dict_reduced = reduce_loss_dict(loss_dict)
        losses_reduced = sum(loss for loss in loss_dict_reduced.values())
        meters.update(loss=losses_reduced, **loss_dict_reduced)

        optimizer.zero_grad()
        # Note: If mixed precision is not used, this ends up doing nothing
        # Otherwise apply loss scaling for mixed-precision recipe
        with amp.scale_loss(losses, optimizer) as scaled_losses:
            scaled_losses.backward()
        optimizer.step()

It seems that the gradient averaging operation is missing...
Or is it implemented somewhere?
Thanks.

This is implemented inside DistributedDataParallel, after the backward, and is done automatically.

This is implemented inside DistributedDataParallel, after the backward, and is done automatically.

@fmassa Thank you.
Now I can find the related code in DistributedDataParallel. I see that it sets the flag require_backward_grad_sync.
However, since it is an nn.Module, how can it determine when the backward pass is done, and where is the code that actually performs the gradient averaging? Or is this procedure implemented in the C++ backend?
Thanks.

OK, it seems the operation is done by a hook in the c10 backend.

  // Check if this was the final gradient for this bucket.
  if (--replica.pending == 0) {
    // Prescale bucket contents to turn the global sum into the global average.
    replica.contents.div_(process_group_->getSize());
    // Kick off reduction if all replicas for this bucket are ready.
    if (--bucket.pending == 0) {
      mark_bucket_ready(bucket_index.bucket_index);
    }
  }
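Conceptually, the mechanism works as if an autograd hook were registered on every parameter, firing once that parameter's gradient is ready during backward. A toy Python sketch of the idea (this is not the real DDP implementation, which buckets gradients and overlaps the all-reduce with the backward pass):

import torch.distributed as dist

def attach_grad_averaging_hooks(model):
    world_size = dist.get_world_size()

    def make_hook():
        def hook(grad):
            # called during backward as soon as this parameter's gradient exists
            dist.all_reduce(grad, op=dist.ReduceOp.SUM)
            return grad / world_size
        return hook

    for param in model.parameters():
        if param.requires_grad:
            param.register_hook(make_hook())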