Maskrcnn-benchmark: Evaluation results vary for the same saved weights.

Created on 1 Jul 2019 · 6 comments · Source: facebookresearch/maskrcnn-benchmark

โ“ Questions and Help

I have a question about evaluating my model.
I ran the command below several times and it returned different mAP results each time.
python -m torch.distributed.launch --nproc_per_node=1 tools/test_net.py --config-file "stand_file/e2e_faster_rcnn_R_50_FPN_1x.yaml" TEST.IMS_PER_BATCH 16

I would like to know why this happened.

question


This probably happens because when you batch different images together, you have different paddings, which slightly affects the output of the model (i.e., the predictions).

In your case, you are using a batch size of 16, and by default we shuffle the images during evaluation https://github.com/facebookresearch/maskrcnn-benchmark/blob/55796a04ea770029a80cf5933cc5c3f3f6fa59cf/maskrcnn_benchmark/data/build.py#L126
so every run sees different batches of images, and thus different paddings and different results.

Try removing the shuffling or setting the batch size to 1 (which is the most robust solution anyway).
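To see why batching matters at all: the data loader pads every image in a batch with zeros to a common size (the per-batch maximum, rounded up to a multiple of the size divisibility), so the border an image receives depends on which images it happens to be batched with. A minimal sketch of that padding logic (illustrative only, loosely mimicking the library's `to_image_list`; the function name is invented here):

```python
import math

def padded_batch_shape(sizes, size_divisibility=32):
    """Return the (H, W) every image in the batch is zero-padded to:
    the per-batch max, rounded up to a multiple of size_divisibility."""
    max_h = max(h for h, w in sizes)
    max_w = max(w for h, w in sizes)
    round_up = lambda x: int(math.ceil(x / size_divisibility)) * size_divisibility
    return (round_up(max_h), round_up(max_w))

# The same 100x150 image gets a different zero border depending on
# which images share its batch:
print(padded_batch_shape([(100, 150)]))             # (128, 160)
print(padded_batch_shape([(100, 150), (130, 90)]))  # (160, 160)
```

With a batch size of 1 every image is padded the same way on every run, which is why that setting removes the variance.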

I will try it later and give you feedback. Thanks for your reply!

Hello, I tried both solutions but the results are still different...

Hi @xiaohai12, how many GPUs did you use when running the evaluation? If you used only one GPU, then is_distributed was passed as False according to the following pieces of code, which would never enable shuffling regardless of your TEST.IMS_PER_BATCH.
https://github.com/facebookresearch/maskrcnn-benchmark/blob/55796a04ea770029a80cf5933cc5c3f3f6fa59cf/tools/test_net.py#L50-L51
https://github.com/facebookresearch/maskrcnn-benchmark/blob/55796a04ea770029a80cf5933cc5c3f3f6fa59cf/tools/test_net.py#L96
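The referenced lines boil down to roughly this (a simplified paraphrase of the evaluation-time shuffle decision, not the library's exact code; `evaluation_shuffle` is a name invented here):

```python
def evaluation_shuffle(is_distributed: bool) -> bool:
    """At test time, shuffling is only enabled when running distributed
    (so the DistributedSampler can split the dataset across processes);
    a single-process run never shuffles, whatever the batch size is."""
    return True if is_distributed else False

# With one GPU, is_distributed is False, so batches keep dataset order:
print(evaluation_shuffle(False))  # False
print(evaluation_shuffle(True))   # True
```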

However, using different values of TEST.IMS_PER_BATCH does result in different mAPs, as explained by @fmassa; this is caused by padding.

For your information, here are my results from running the evaluation of e2e_faster_rcnn_R_50_FPN_1x (the model weights file is downloaded from model id: 6358793 in MODEL_ZOO) on 2 GPUs several times:

  1. one image on each GPU (TEST.IMS_PER_BATCH = 2), whether shuffle = True or shuffle = False

    Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.368
    Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.586
    Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.396
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.211
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.397
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.481
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.307
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.483
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.507
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.321
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.542
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.634
    
    AP, AP50, AP75, APs, APm, APl
    0.367747, 0.586073, 0.395546, 0.210593, 0.397355, 0.480963
    
  2. two images on each GPU (TEST.IMS_PER_BATCH = 4) with shuffle = True (by default)

    Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.368
    Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.586
    Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.396
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.211
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.398
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.481
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.307
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.483
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.507
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.321
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.543
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.634
    
    AP, AP50, AP75, APs, APm, APl
    0.367857, 0.586037, 0.395842, 0.210618, 0.398223, 0.480817
    
  3. two images on each GPU (TEST.IMS_PER_BATCH = 4) with shuffle = False (manually)

    Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.368
    Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.586
    Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.396
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.211
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.398
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.481
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.307
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.482
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.506
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.321
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.542
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.634
    
    AP, AP50, AP75, APs, APm, APl
    0.367695, 0.586173, 0.395646, 0.210543, 0.397500, 0.481209
    

As you can see, shuffling images or not (while keeping the same batch size) does result in different mAPs. However, I always get the same results no matter how many times I run the evaluation with shuffle = True on 2 GPUs (the second results). This is because a fixed random seed (0) is set before sampling batches, according to the following code:
https://github.com/facebookresearch/maskrcnn-benchmark/blob/55796a04ea770029a80cf5933cc5c3f3f6fa59cf/maskrcnn_benchmark/data/samplers/distributed.py#L43-L47
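The determinism from that fixed seed is easy to reproduce in isolation: an RNG seeded with the same value produces the same permutation on every run. A minimal sketch of the same idea (using Python's `random` module rather than the sampler's actual torch generator; `seeded_permutation` is a name invented here):

```python
import random

def seeded_permutation(n, seed=0):
    """Same idea as the DistributedSampler: seed the RNG with a fixed
    value before shuffling, so every run produces the same order."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return indices

# Two independent "runs" with the same seed agree exactly:
print(seeded_permutation(10) == seeded_permutation(10))  # True
```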

Hi @fmassa, I'm curious why we shuffle images during testing when using multiple GPUs. It seems that running the evaluation without shuffling also gives reproducible mAPs (the first results, batch size 1 on each GPU).

No need to shuffle during testing, it is a historical artifact.

Thanks for your response. I found that I had added a new transform and forgot to disable it during testing. Now it works well.
