Mask_rcnn: Class imbalance for RPN

Created on 10 Nov 2017 · 5 Comments · Source: matterport/Mask_RCNN

Hi,

For my master's thesis I am working on text detection in images and video frames. I implemented a modified version of Faster R-CNN/Mask R-CNN, very similar to your implementation but tailored for text detection.

On average there are ~54 positive text anchors per image in my data, which results in a mini batch of ~54 positive and ~200 negative examples per image (I use mini batches of size 254 per image). The problem I encounter is that my network overfits to predicting only negatives (because, on average, there are many more negative than positive examples). A few simple solutions would be to 1) use a smaller mini-batch size (for example 112), 2) remove from the dataset all images that don't have enough positive examples, or 3) use a weighted loss function.

I inspected your code very carefully, but (as far as I can tell) this doesn't seem to be an issue in your implementation. Was this positive/negative class imbalance also a problem for you, and if so, how did you solve it?

Thanks!

Maurits

All 5 comments

Using balanced mini batches solved the problem!

I know what you're referring to. The Mask R-CNN paper suggests an ROI mini batch of 512. However, when training on COCO, most images contain only a few objects, so the RPN generates, on average, about 40 positive proposals. That leaves around 470 negative ROIs, which is far from the 1:3 positive-to-negative ratio suggested in the paper.

Originally, I set the ROI mini-batch size to 128: I sample ~40 positive ROIs and fill the remaining ~80 slots with negative ROIs. This solves the problem, but it is dataset dependent. If you train on a different dataset with fewer or more objects per image, you need to adjust the ROI mini-batch size accordingly.

I just pushed an update that improves on that: the mini-batch size is now allowed to differ per image. If an image has only 10 positive ROIs, I pick only 20 negative ROIs. This keeps positives at a third of the batch regardless of the density of objects in the images, and removes the need to adjust the setting for each dataset.
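
A minimal NumPy sketch of that per-image sampling, assuming you already have index arrays of the positive and negative ROI candidates; the function name and the `positive_fraction` argument are illustrative (the repo's config exposes the fraction as `ROI_POSITIVE_RATIO`), not the actual code:

```python
import numpy as np

def sample_rois_per_image(positive_idx, negative_idx,
                          positive_fraction=0.33, rng=None):
    """Keep all positives and sample just enough negatives so that
    positives make up `positive_fraction` of the mini batch.
    E.g. 10 positives with positive_fraction=0.33 -> 20 negatives."""
    rng = rng or np.random.default_rng()
    num_pos = len(positive_idx)
    num_neg = int(round(num_pos * (1.0 - positive_fraction) / positive_fraction))
    num_neg = min(num_neg, len(negative_idx))  # can't sample more than exist
    keep_neg = rng.choice(negative_idx, size=num_neg, replace=False)
    return np.concatenate([positive_idx, keep_neg])
```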

The question I still don't have an answer to is why the paper suggests 512 ROIs, and how they handle images that have few positive ROIs. One possibility is that they use a higher NMS threshold in the RPN during training. The default threshold is 0.7, but if you raise it to, say, 0.9 (or remove the NMS step in the RPN completely) you'll get more positive proposals.
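
For reference, that threshold is the `RPN_NMS_THRESHOLD` setting in this repo's `config.py`, so the experiment above amounts to overriding it in a config subclass (the class name and `NAME` value here are just examples):

```python
from mrcnn.config import Config  # older checkouts: `from config import Config`

class MoreProposalsConfig(Config):
    """Example config tweak; class name is illustrative."""
    NAME = "more_proposals"
    # Default is 0.7. A higher threshold suppresses fewer overlapping
    # RPN proposals, so more positive ROIs survive into training.
    RPN_NMS_THRESHOLD = 0.9
```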

Thanks for your answer.

That is an interesting approach. Does it work well? The gradients you use to update the weights will be based on a different number of samples in every batch. Does that work well with a fixed learning rate? With only a few examples in a batch, the gradients will be very noisy, right?

To prevent overfitting on negative examples, I used a weighted loss function: I scale up the loss of the positive examples by the ratio (batch size / 2) / (number of positive examples), and do the same for the negative examples. This way I can keep the batch size constant, and the network can no longer reduce the loss by overfitting to the negative examples.
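
A minimal TensorFlow sketch of that weighting scheme, assuming a flat tensor of binary anchor labels (1 = positive, 0 = negative) for the sampled batch; the function and variable names are mine, not from either codebase:

```python
import tensorflow as tf

def class_balanced_loss(labels, logits):
    """Binary cross-entropy where each class is scaled by
    (batch_size / 2) / (number of examples of that class), so
    positives and negatives contribute equally to the total loss."""
    labels = tf.cast(labels, tf.float32)
    batch_size = tf.cast(tf.size(labels), tf.float32)
    num_pos = tf.reduce_sum(labels)
    num_neg = batch_size - num_pos
    # tf.maximum guards against batches with no examples of one class.
    w_pos = (batch_size / 2.0) / tf.maximum(num_pos, 1.0)
    w_neg = (batch_size / 2.0) / tf.maximum(num_neg, 1.0)
    weights = labels * w_pos + (1.0 - labels) * w_neg
    # Element-wise cross-entropy, then a weighted mean over the batch.
    ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    return tf.reduce_sum(weights * ce) / batch_size
```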

I still do not understand why they use a 1:3 positive-to-negative ratio in the paper. It is still really easy to overfit on predicting only negatives: that already gives an accuracy of 3/4, which is better than the 1/2 you would get from random guessing.

Maurits

> With only a few examples in a batch, the gradients will be very noisy, right?

You're right, but most images would have around 40 positive ROIs (and therefore ~80 negative ROIs). Even an image with a single object would still get a total of 20 or more ROIs.

> I still do not understand why they use a 1:3 positive-to-negative ratio in the paper.

My guess is that they experimented with different ratios and found that this one worked best. It would be interesting to compare it against your weighted-loss approach and see which gets better results.

Hello @MBleeker,
I'm facing the same class-imbalance problem. Could you please show me how you used the weighted loss function? I am suffering from overfitting.
Thank you in advance.
