Is there any recommendation for training Faster R-CNN starting from just the pretrained backbone? I'm using the VOC 2007 dataset and I'm able to do transfer learning starting from:
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(in_features, num_classes=21)
Using the COCO-pretrained 'fasterrcnn_resnet50_fpn' I'm able to obtain an mAP of 79% on the VOC 2007 test set. Problems arise when I try to train from scratch using only the pretrained backbone:
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(in_features, num_classes=21)
I have been trying to train this model for weeks but the highest mAP I got was 63% (again on the test set).
Now, I know that training from scratch is harder, but I really would like to know how to set the training parameters to obtain a decent accuracy. In the future I may want to change the backbone, and chances are that I will not be able to find a pretrained Faster R-CNN on which I can do transfer learning.
I haven't tried training on Pascal from scratch.
But here is a tip: in Detectron, we generally train for a fixed number of iterations (in this case, 90000).
This corresponds to roughly 13 epochs for COCO, but for Pascal it represents many more epochs given that the training set size is smaller.
You might want to take that into account when setting the number of iterations / lr scheduler steps (i.e., train for 15x more epochs or so)
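To make that conversion concrete, here is a rough back-of-the-envelope sketch; the batch size and dataset counts are assumptions based on the usual Detectron defaults, not numbers from this thread:

# Convert a fixed iteration budget into an equivalent number of epochs.
def iterations_to_epochs(iterations, images_per_batch, dataset_size):
    return iterations * images_per_batch / dataset_size

# Detectron's 90k-iteration schedule, batch of 16, ~118k COCO training images:
print(iterations_to_epochs(90000, 16, 118287))  # ~12-13 epochs on COCO
# The same budget on the much smaller VOC 2007 trainval set (5011 images)
# corresponds to far more epochs, so epoch counts and lr-scheduler steps
# need to be scaled up accordingly when training on Pascal:
print(iterations_to_epochs(90000, 16, 5011))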
Given that this doesn't seem to be an issue with the current implementation, I'm closing the issue, but feel free to comment on this if you have further questions
I tried to train for over 200 epochs; the loss keeps decreasing (down to 0.01) but the mAP on the test and validation sets decreases over time, from 63% at the 20th epoch to 56% at the 200th. It just overfits the training set. I'm out of ideas, it is like it is missing something really important (e.g. augmentation).
@lpuglia have you tried using maskrcnn-benchmark, and if yes, how much accuracy did you get?
A word of caution: the evaluation code for Pascal 2007 in maskrcnn-benchmark is not necessarily 100% accurate, and might give higher numbers than the original VOCdevkit.
@fmassa I have been using the python eval from a clone of Girshick repository:
https://github.com/jwyang/faster-rcnn.pytorch/blob/master/lib/datasets/voc_eval.py
which apparently is the same as:
https://github.com/facebookresearch/Detectron/blob/master/detectron/datasets/voc_eval.py
I know that it is not zero-diff accurate compared to the original MATLAB implementation, but 63% is much lower than 79%. Also, it is quite improbable that I'm doing something wrong there, since it works great when I do transfer learning and not so much when I train from scratch.
The problem extends to other backbones as well: if I train from scratch with VGG16 as backbone, I never get better than 59%. This is weird since I can easily get 69% in 10 epochs using this code base.
As per my understanding, the main difference between pytorch and this code base is that they have a different starting point for the backbone; in the README you can read:
NOTE. We compare the pretrained models from Pytorch and Caffe, and surprisingly find Caffe pretrained models have slightly better performance than Pytorch pretrained. We would suggest to use Caffe pretrained models from the above link to reproduce our results.
If you want to use pytorch pre-trained models, please remember to transpose images from BGR to RGB, and also use the same data transformer (minus mean and normalize) as used in pretrained model.
This may be an explanation for the low accuracy in pytorch. Now I'm wondering what the exact procedure used to obtain the weights of the pytorch backbones is: are they trained on ImageNet? What augmentation and normalization do they use?
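(For reference, the torchvision classification backbones are trained on ImageNet and expect the standard torchvision preprocessing rather than the Caffe-style one mentioned above; a minimal snippet of that normalization:)

# Standard ImageNet normalization used by torchvision's pretrained models
# (RGB inputs scaled to [0, 1]); the Caffe models instead expect BGR inputs
# with per-channel mean subtraction and no division by std.
from torchvision import transforms
imagenet_normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                          std=[0.229, 0.224, 0.225])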
@lpuglia some more questions:
maskrcnn-benchmark has exactly the same implementation as Detectron, including the backbone from Caffe2 (apart from the evaluation code for Pascal). It matches Detectron very closely on several experiments, while the implementation in torchvision was simplified (though it basically matches maskrcnn-benchmark on COCO). If the results using maskrcnn-benchmark for Pascal are better than with the implementation in torchvision, it would be great to let me know, so that I can understand which factor is the main one for the difference.
@fmassa
backbone = torchvision.models.vgg16(pretrained=True).features
backbone.out_channels = 512
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0], output_size=7,
sampling_ratio=2)
model = torchvision.models.detection.faster_rcnn.FasterRCNN(backbone,
num_classes, rpn_anchor_generator=anchor_generator,
box_roi_pool=roi_pooler)
This model trained from scratch using PASCAL 07 never reaches 60% accuracy.
Can you tell me which one in particular?
I'm gonna train maskrcnn-benchmark and see what accuracy I get
Oh, if you are not using FPN and are training on Pascal, then I might know (one of) the issues.
In the RPN, we used to discard anchors that go out of the boundaries of the image.
This was apparently important for the Pascal dataset when the model doesn't have FPN. But for COCO with models having FPN, this was totally unnecessary, and for the sake of simplifying things, I just removed it.
Can you try adding the box_is_inside_image function from https://github.com/fmassa/maskrcnn-benchmark/commit/071f1e793e98e4abc69de933cac910e95bec8196#diff-b99b4d8eb481e525b2f9900f40def679L135
and add the following lines
# discard anchors that go out of the boundaries of the image
inds_inside = box_is_inside_image(anchors_per_image, image_size)
labels_per_image[~inds_inside] = -1
just before https://github.com/pytorch/vision/blob/bbd363ca2713fb68e1e190206578e600a87baf90/torchvision/models/detection/rpn.py#L289-L291
and let me know the results? I might want to add back those lines in the current version of the code
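For anyone following along, a minimal sketch of what that helper might look like (the exact function is in the linked commit; here image_size is assumed to be the (height, width) pair from the transformed images):

# Mark anchors that lie (almost) entirely inside the image; with
# straddle_thresh=0 only anchors fully inside the boundaries are kept.
def box_is_inside_image(boxes, image_size, straddle_thresh=0):
    height, width = image_size
    inds_inside = (
        (boxes[..., 0] >= -straddle_thresh)
        & (boxes[..., 1] >= -straddle_thresh)
        & (boxes[..., 2] < width + straddle_thresh)
        & (boxes[..., 3] < height + straddle_thresh)
    )
    return inds_inside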
@fmassa how do I get the image_size there? I was checking the history of the file but that function doesn't ever seem to be used (except in the boxlist_is_inside_image version).
@lpuglia the image sizes can be obtained from images, as it carries this information https://github.com/pytorch/vision/blob/bbd363ca2713fb68e1e190206578e600a87baf90/torchvision/models/detection/rpn.py#L411
images.image_sizes
@fmassa Just to be clear, I passed that argument to the function assign_targets_to_anchors and then I used it in the loop like:
for anchors_per_image, targets_per_image, image_size in zip(anchors, targets, image_sizes):
But the accuracy doesn't increase, no better than 60%.
@lpuglia so you added the following lines in the code and it didn't help?
# discard anchors that go out of the boundaries of the image
inds_inside = box_is_inside_image(anchors_per_image, image_size)
labels_per_image[~inds_inside] = -1
@fmassa indeed, I can't see any difference. Since I'm already writing, I can add that I tried with other backbones (like mobilenet_v2) with no success.
@lpuglia can you share your code on github? I won't have time to have a closer look before August, but having the code you are using might help me identify the issue
@fmassa Thank you, I will. Do you think that with maskrcnn-benchmark I will get the correct mAP?
@lpuglia maskrcnn-benchmark follows exactly all the implementation details of Detectron, so it should reproduce whatever they have in Detectron.
Hey~ @lpuglia @fmassa
I encountered the same problem. What I want to do is use this implementation to reproduce the result reported in the Faster R-CNN paper, using a VGG16 backbone trained on VOC2007.
I tried several times and got results no better than 60% either.
Then I found out this implementation is different from the original implementation.
So I thought maybe that's the problem. Next I changed the code a little bit:
vgg16 = torchvision.models.vgg16(pretrained=False)
state_dict = torch.load('vgg16_caffe.pth')
vgg16.load_state_dict({k: v for k, v in state_dict.items() if k in vgg16.state_dict()})
backbone = vgg16.features[:-1] # not using last maxpooling layer
# freeze top4 conv
for layer in backbone[:10]:
    for p in layer.parameters():
        p.requires_grad = False
class BoxHead(nn.Module):
    """
    Box head for the VGG16 Faster R-CNN backbone.
    Weights are loaded from vgg16_caffe.pth.
    Replaces TwoMLPHead with the VGG classifier layers.
    """
    def __init__(self, vgg16):
        super(BoxHead, self).__init__()
        classifier = vgg16.classifier
        classifier = list(classifier)
        del classifier[6]  # drop the final 1000-way ImageNet classifier
        del classifier[5]  # drop dropout
        del classifier[2]  # drop dropout
        self.classifier = nn.Sequential(*classifier)

    def forward(self, x):
        x = x.flatten(start_dim=1)
        x = self.classifier(x)
        return x
box_predictor = FastRCNNPredictor(4096, 21)  # in_features changed from 1024 to 4096 for the VGG head
# initial params, following
# https://github.com/chenyuntc/simple-faster-rcnn-pytorch/blob/master/model/faster_rcnn_vgg16.py#L109
nn.init.normal_(box_predictor.cls_score.weight, std=0.01)
nn.init.constant_(box_predictor.cls_score.bias, 0)
nn.init.normal_(box_predictor.bbox_pred.weight, std=0.001)
nn.init.constant_(box_predictor.bbox_pred.bias, 0)
anchor_generator = AnchorGenerator(sizes=((128, 256, 512),),
aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
output_size=7,
sampling_ratio=-1)
I made these changes and thought they would give me a competitive result before training...
Truth is, I was too naive.
The best result I got is 61.1%, lower than the reported 69.9%.
My code is here: https://github.com/hktxt/Faster-RCNN
Most of it is copied from https://github.com/pytorch/vision/blob/master/torchvision/models/detection.
Set up the dataset correctly and just run trainX.py.
Oh, wait a bit.
I've just realized something.
If you are not using FPN, the resnet and the heads should be completely different (the heads are much bigger than what the FPN-based models use).
There is most probably an issue with your backbone / head model definitions that is the root cause of the problem.
I have not added the C4 backbone because it is generally much larger and slower than the FPN-based version, while working worse.
If you want to reproduce results using the C4 backbone, for now it might just be simpler to use the implementation in maskrcnn-benchmark.
@fmassa I'm not sure what you mean, but this does not explain why transfer learning works much better than training from scratch, even if I leave it to train for 200 epochs (which was the original question of the topic).
@lpuglia yes, it explains it.
In fact, the head for the standard Faster R-CNN is the whole layer4 from resnet, or all of the classifier from VGG16, which are already pre-trained for classification.
The heads from FPN-based models, in contrast, are initialized from scratch and consist of only two MLP layers.
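For context, the FPN box head in torchvision is roughly the following (TwoMLPHead in faster_rcnn.py): two fully-connected layers initialized from scratch, in contrast to the large pretrained resnet layer4 / VGG classifier heads.

import torch.nn as nn
import torch.nn.functional as F

class TwoMLPHead(nn.Module):
    # Standard box head for FPN-based models: two randomly-initialized
    # fully-connected layers on top of the pooled RoI features.
    def __init__(self, in_channels, representation_size):
        super(TwoMLPHead, self).__init__()
        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)

    def forward(self, x):
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))
        return x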
@fmassa maybe you are missing a thing:
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(in_features, num_classes=21)
this model HAS FPN, the only thing missing is the weight initialization given by pretrained=True. How is it possible that training from scratch on PASCAL gives worse results compared to transfer learning?
We are mixing 2 topics in this issue:
1. box_head doesn't support non-FPN backbones (VGG16 accuracy lower than 60%).
2. My initial question was about training from scratch vs. transfer learning.
@lpuglia ok, got it.
Training from scratch giving worse performance is expected.
Indeed, the pre-trained models were trained on COCO, which has many more images than Pascal VOC, and the classes in Pascal are a subset of the classes in COCO.
BTW, most of the top performing methods on Pascal first pre-train on COCO and then fine-tune on Pascal.
@fmassa I found out what my main problem was: I was using the val set for validation only. However, to get good results on PASCAL VOC 2007 you are supposed to train on the whole trainval set. Also, thanks to @hktxt's comment I got 66% accuracy training from scratch (just 3% less than expected). If anyone is interested, here are the highlights:
vgg = torchvision.models.vgg16(pretrained=True)
backbone = vgg.features[:-1]
for layer in backbone[:10]:
    for p in layer.parameters():
        p.requires_grad = False
backbone.out_channels = 512
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                                output_size=7,
                                                sampling_ratio=2)

class BoxHead(nn.Module):
    def __init__(self, vgg):
        super(BoxHead, self).__init__()
        self.classifier = nn.Sequential(*list(vgg.classifier._modules.values())[:-1])

    def forward(self, x):
        x = x.flatten(start_dim=1)
        x = self.classifier(x)
        return x

box_head = BoxHead(vgg)
model = torchvision.models.detection.faster_rcnn.FasterRCNN(
    backbone, #num_classes,
    rpn_anchor_generator = anchor_generator,
    box_roi_pool = roi_pooler,
    box_head = box_head,
    box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(4096, num_classes=21))
dataset = VOCDetection(img_folder=root, year='2007', image_set='trainval', transforms=transforms)
The only augmentation I used was RandomHorizontalFlip.
--epochs 40
--lr-steps 30
--momentum 0.9
--lr-gamma 0.1
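Those flags map, roughly, onto the optimizer and lr scheduler of the torchvision detection reference scripts; a sketch with assumed values for what was not listed above (the base lr of 0.005 and weight_decay of 0.0005 are assumptions, and model is the FasterRCNN built above):

import torch

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
# lr is divided by 10 (gamma) at epoch 30 out of 40:
lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)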
@lpuglia awesome, thanks for letting me know!
Also, did you add the visibility checks for the anchors in the RPN, as I mentioned in https://github.com/pytorch/vision/issues/1116#issuecomment-512731349 ?
This could maybe give the few remaining points left
@fmassa It was enabled the whole time, I don't know how much it influenced the training. I'm gonna repeat the test commenting it out and let you know (my guess is that it doesn't change much).
Oh, that's good news.
I'll try your training strategy. Seems like most parts are the same, except the AnchorGenerator and the dropout in the VGG backbone.
@hktxt what dropout are you referring to in particular? (By the way, I'm using a batch size of 4.)
@fmassa removing the visibility check decreases the accuracy from 66 to 64%
@lpuglia thanks for the info! Very helpful! I might include the visibility check again, as this gives a few more points on Pascal without FPN
The BoxHead: if you print the net you'll see the dropout layer; however, it was dropped in mine.
bs=4? I use bs=1 for training, got ~60%... no improvements...
@hktxt did you try adding the visibility checks? How many images do you feed during the training?
@fmassa I'm still working on it and I can see that using the Caffe pretrained model gives another 1% on the final accuracy. I will try to close the gap more. Did you remove anything else besides the visibility check?
@lpuglia there are some other minor changes. Here is what I remember now.
We are using l1_loss in the RPN https://github.com/pytorch/vision/blob/2287c8f2dc9dcad955318cc022cabe4d53051f65/torchvision/models/detection/rpn.py#L368-L372
while in Detectron we use smooth_l1_loss with a beta parameter of 1 / 9.
This can affect the results negatively for AP@50, but improves results for higher thresholds.
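For reference, a beta-parameterized smooth L1 loss along the lines of the maskrcnn-benchmark one looks roughly like this (treat it as an illustration rather than the exact code):

import torch

def smooth_l1_loss(input, target, beta=1.0 / 9, size_average=True):
    # Quadratic below beta, linear above it (Huber-style loss).
    n = torch.abs(input - target)
    cond = n < beta
    loss = torch.where(cond, 0.5 * n ** 2 / beta, n - 0.5 * beta)
    if size_average:
        return loss.mean()
    return loss.sum()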
I removed some custom inits in the heads in https://github.com/fmassa/maskrcnn-benchmark/commit/3e0e12a652331eeff87777e9a3be81a939817141
This didn't change performance at all in my experiments on COCO, but maybe this could change something for Pascal? Not sure
I use the trainval set, which contains 5011 images, for training, and the test set, which contains 4952 images.
I also used RandomHorizontalFlip(0.5) for data augmentation.
@fmassa I tried them both; the first actually decreases the accuracy for some reason, the second makes no difference. I will train from scratch on COCO and then use transfer learning to see if I can get 70% on Pascal. Thanks for the help!
@hktxt my advice is to make sure to have the visibility checks enabled and use the following class for conversion:
class ConvertVOCtoCOCO(object):
    CLASSES = (
        "__background__", "aeroplane", "bicycle",
        "bird", "boat", "bottle", "bus", "car",
        "cat", "chair", "cow", "diningtable", "dog",
        "horse", "motorbike", "person", "pottedplant",
        "sheep", "sofa", "train", "tvmonitor",
    )

    def __call__(self, image, target):
        # return image, target
        anno = target['annotations']
        filename = anno["filename"].split('.')[0]
        h, w = anno['size']['height'], anno['size']['width']
        boxes = []
        classes = []
        objects = anno['object']
        if not isinstance(objects, list):
            objects = [objects]
        for obj in objects:
            bbox = obj['bndbox']
            bbox = [int(bbox[n]) - 1 for n in ['xmin', 'ymin', 'xmax', 'ymax']]
            boxes.append(bbox)
            classes.append(self.CLASSES.index(obj['name']))
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        classes = torch.as_tensor(classes)
        image_id = anno['filename'][:-4]
        image_id = torch.as_tensor([int(image_id)])
        target = {}
        target["boxes"] = boxes
        target["labels"] = classes
        target['name'] = image_id  # filename converted to an int tensor
        return image, target
Also (I don't know if this is useful yet), make sure to have a 10022-image dataset by flipping all the images. This is different from random flipping because you make sure that every image is shown to the network twice, in different orientations, per epoch. If you use this strategy you will need just 15 epochs to train the network. Here is my code:
class VOCDetection_flip(torchvision.datasets.VOCDetection):
    def __init__(self, img_folder, year, image_set, transforms):
        super(VOCDetection_flip, self).__init__(img_folder, year, image_set)
        self._transforms = transforms

    def __getitem__(self, idx):
        real_idx = idx//2
        img, target = super(VOCDetection_flip, self).__getitem__(real_idx)
        target = dict(image_id=real_idx, annotations=target['annotation'])
        if self._transforms is not None:
            img, target = self._transforms(img, target)
        # img = img[[2, 1, 0],:]
        if (idx % 2) == 0:
            height, width = img.shape[-2:]
            img = img.flip(-1)
            bbox = target["boxes"]
            bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
            target["boxes"] = bbox
        return img, target

    def __len__(self):
        return 2*len(self.images)
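A quick usage note on the class above (root and transforms are placeholders here): consecutive indices 2k and 2k+1 map to the same underlying image, shown once flipped and once as-is, so the effective dataset doubles.

dataset = VOCDetection_flip(img_folder=root, year='2007', image_set='trainval',
                            transforms=transforms)
print(len(dataset))  # 2 * 5011 = 10022 for VOC 2007 trainval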
@hktxt FYI I can easily get 72% mAP using the example provided in the FasterRCNN source code, with mobilenet_v2 as the backbone:
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
backbone.out_channels = 1280
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
output_size=7,
sampling_ratio=2)
model = torchvision.models.detection.faster_rcnn.FasterRCNN(backbone,
num_classes=21,
rpn_anchor_generator=anchor_generator,
box_roi_pool=roi_pooler)
No need to modify the BoxHead.
Hey guys @fmassa @lpuglia, thanks for the great discussion. I followed and modified some of your code to train for object detection on Pascal VOC 2007. I also followed and borrowed code from the PyTorch tutorial on transfer learning for object detection on the Penn-Fudan dataset, especially to evaluate the model. However, I didn't get any good mAP results with VGG as the backbone; in fact it was showing 0 mAP. I will attach the code below. Could you comment if anything is missing there? Thanks.
import os
import numpy as np
import torch
from PIL import Image
import torch.nn as nn
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor, AnchorGenerator
from engine import train_one_epoch, evaluate
import utils
import transforms as T
from torchvision.datasets import VOCDetection
from tqdm import tqdm
from torch.utils.tensorboard import SummaryWriter
#%%
class PrepareInstance(object):
    CLASSES = (
        "__background__ ",
        "aeroplane",
        "bicycle",
        "bird",
        "boat",
        "bottle",
        "bus",
        "car",
        "cat",
        "chair",
        "cow",
        "diningtable",
        "dog",
        "horse",
        "motorbike",
        "person",
        "pottedplant",
        "sheep",
        "sofa",
        "train",
        "tvmonitor",
    )

    def __call__(self, image, target):
        anno = target['annotation']
        h, w = anno['size']['height'], anno['size']['width']
        boxes = []
        classes = []
        area = []
        iscrowd = []
        objects = anno['object']
        if not isinstance(objects, list):
            objects = [objects]
        for obj in objects:
            bbox = obj['bndbox']
            bbox = [int(bbox[n]) - 1 for n in ['xmin', 'ymin', 'xmax', 'ymax']]
            boxes.append(bbox)
            classes.append(self.CLASSES.index(obj['name']))
            iscrowd.append(int(obj['difficult']))
            area.append((bbox[2] - bbox[0]) * (bbox[3] - bbox[1]))
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        classes = torch.as_tensor(classes)
        area = torch.as_tensor(area)
        iscrowd = torch.as_tensor(iscrowd)
        image_id = anno['filename'][0:6]
        image_id = torch.as_tensor([int(image_id)])
        target = {}
        target["boxes"] = boxes
        target["labels"] = classes
        target["image_id"] = image_id
        # for conversion to coco api
        target["area"] = area
        target["iscrowd"] = iscrowd
        return image, target
class VOCDetection_flip(VOCDetection):
    def __init__(self, img_folder, year, image_set, transforms):
        super().__init__(img_folder, year, image_set)
        self._transforms = transforms

    def __getitem__(self, idx):
        real_idx = idx//2
        img, target = super(VOCDetection_flip, self).__getitem__(real_idx)
        target = dict(image_id=real_idx, annotations=target['annotation'])
        if self._transforms is not None:
            img, target = self._transforms(img, target)
        # img = img[[2, 1, 0],:]
        if (idx % 2) == 0:
            height, width = img.shape[-2:]
            img = img.flip(-1)
            bbox = target["boxes"]
            bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
            target["boxes"] = bbox
        return img, target

    def __len__(self):
        return 2*len(self.images)
def get_voc(root, image_set, transforms):
    t = [PrepareInstance()]
    if transforms is not None:
        t.append(transforms)
    transforms = T.Compose(t)
    dataset = VOCDetection(root, '2007', image_set, transforms=transforms, download=False)
    return dataset

def get_transform(istrain=False):
    transforms = []
    transforms.append(T.ToTensor())
    if istrain:
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)
class BoxHead(nn.Module):
    def __init__(self, vgg):
        super(BoxHead, self).__init__()
        self.classifier = nn.Sequential(*list(vgg.classifier._modules.values())[:-1])
        self.in_features = 4096  # feature size out of the mlp

    def forward(self, x):
        x = x.flatten(start_dim=1)
        x = self.classifier(x)
        return x
#def get_model_FRCNN(num_classes):
#
# # modified from this issue page https://github.com/pytorch/vision/issues/1116
# vgg = torchvision.models.vgg16(pretrained=True)
# backbone = vgg.features[:-1]
# for layer in backbone[:10]:
# for p in layer.parameters():
# p.requires_grad = False
# backbone.out_channels = 512
# anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
# aspect_ratios=((0.5, 1.0, 2.0),))
# roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
# output_size=7,
# sampling_ratio=2)
#
# box_head = BoxHead(vgg)
# in_features = box_head.in_features
#
# model = torchvision.models.detection.faster_rcnn.FasterRCNN(
# backbone, #num_classes,
# rpn_anchor_generator = anchor_generator,
# box_roi_pool = roi_pooler,
# box_head = box_head,
# box_predictor = FastRCNNPredictor(in_features, num_classes))
#
# return model
def get_model_FRCNN(num_classes):
    backbone = torchvision.models.mobilenet_v2(pretrained=True).features
    backbone.out_channels = 1280
    anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                       aspect_ratios=((0.5, 1.0, 2.0),))
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                                    output_size=7,
                                                    sampling_ratio=2)
    model = torchvision.models.detection.faster_rcnn.FasterRCNN(backbone,
                                                                num_classes,
                                                                rpn_anchor_generator=anchor_generator,
                                                                box_roi_pool=roi_pooler)
    return model
#%%
if __name__ == "__main__":
    # train on the GPU or on the CPU, if a GPU is not available
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    num_classes = 21  # 20 classes + background for VOC
    dataset = get_voc('.', 'trainval', transforms=get_transform(istrain=False))
    dataset_test = get_voc('.', 'test', transforms=get_transform(istrain=False))

    # define training and validation data loaders
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=4, shuffle=True, num_workers=4,
        collate_fn=utils.collate_fn)
    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=8, shuffle=False, num_workers=4,
        collate_fn=utils.collate_fn)
    print('data prepared, train data: {}'.format(len(dataset)))
    print('data prepared, test data: {}'.format(len(dataset_test)))

    #%%
    # get the model using our helper function
    model = get_model_FRCNN(num_classes)
    # move model to the right device
    model.to(device)

    # construct an optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005,
                                momentum=0.9, weight_decay=0.0005)
    # and a learning rate scheduler
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                   step_size=30,
                                                   gamma=0.1)
    # let's train it for 40 epochs
    num_epochs = 40

    # setup log data writer
    if not os.path.exists('log'):
        os.makedirs('log')
    writer = SummaryWriter(log_dir='log')

    #%%
    iters_per_epoch = int(len(data_loader) / data_loader.batch_size)
    for epoch in range(num_epochs):
        loss_epoch = {}
        loss_name = ['loss_classifier', 'loss_box_reg', 'loss_objectness', 'loss_rpn_box_reg']
        for ii, (images, targets) in tqdm(enumerate(data_loader), total=len(data_loader)):
            model.train()
            optimizer.zero_grad()
            images = list(image.to(device) for image in images)
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            # training
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
            losses.backward()
            optimizer.step()
            lr_scheduler.step()
            info = {}
            for name in loss_dict:
                info[name] = loss_dict[name].item()
            writer.add_scalars("losses", info, epoch * iters_per_epoch + ii)
        if (epoch + 1) % 1 == 0:
            # evaluate on the test dataset
            evaluate(model, data_loader_test, device=device)
    writer.close()
@lpuglia I think we should add one example with Pascal VOC somewhere. If you could send an initial PR, I could look into improving it and merging it in torchvision.
Great, I think that would be good. Also, I noticed the default dataset downloader for VOC doesn't have the test split. But the test split (images and labels) has been released by the community already. I don't know why it was not included in the downloads, though.
@fmassa this is a good idea; very often beginners do not have the compute power to crunch all the data in COCO, and VOC is much easier to start with.
@fmassa here is the pull request:
https://github.com/pytorch/vision/pull/1216
it should work out of the box.
Thanks @lpuglia for the PR!
I'll have a closer look to the PR (and get it merged) once I'm back from holidays
@lpuglia
Hello friends, I have spent two weeks on torchvision.fasterrcnn_resnet; unfortunately, I still have not been able to complete the training. Can you provide some training code for me? Thank you very much! My email is [email protected].
@AFutureD Here is the pull request code:
https://github.com/lpuglia/torchvision_voc
It uses ResNet as the default backbone.
I have a new annotated dataset and have used TensorFlow for Faster R-CNN transfer learning, and it works well, but I want to migrate to PyTorch. This thread has me worried it isn't quite out of the box yet? Am I wrong, and if so, is there some tutorial/treatment specifically on Faster R-CNN transfer learning?
Sorry to add noise to this thread; I am very new to PyTorch and really want to use it for my application, as I'm tired of having to explain to people what a session is, and I am ready to move to something Pythonic. :) I have followed the basic tutorials on transfer learning/torchvision at the web site and really love it.
@lpuglia Thanks for your code. :)