Detectron2: [baseline] [reproducibility] Problem to reproduce the baseline in Model Zoo

Created on 7 Jan 2020 · 5 Comments · Source: facebookresearch/detectron2

Hello,
I tried to reproduce the Faster-RCNN baseline using R50-FPN_1x. However, the box AP is around 4-5 points lower than the 37.9 reported in the Model Zoo. I would really appreciate it if anyone could give me some insight into what might have gone wrong ^ ^
My result:

COCO Evaluation results for bbox:
|   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 33.551 | 53.341 | 35.969 | 18.661 | 36.469 | 43.063 |
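
To put an exact number on the drop, the gap can be computed from the two figures quoted above. This is only a minimal sketch: `ap_gap` is a name made up for illustration, and 37.9 is the Model Zoo box AP as cited in this post.

```python
def ap_gap(reference_ap, measured_ap):
    """Return how many AP points the measured score falls short of the reference."""
    return round(reference_ap - measured_ap, 3)


# Model Zoo box AP (as quoted in this post) vs. the box AP of the run above.
print(ap_gap(37.9, 33.551))  # 4.349
```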

Instructions To Reproduce the Issue:

  1. what changes I made (git diff)
    The code version I used is e74a00c of Dec 26, 2019.
    No changes have been made except for minor ones needed to run the code on AzureML.

  2. what exact command I ran:

python tools/train_net.py --num-gpus 4 \
    --config-file configs/COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml
  3. what I observed (including the full logs):
    Final result after 90,000 iterations:
[01/05 18:28:18 d2.evaluation.coco_evaluation]: Evaluation results for bbox: 
|   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 33.551 | 53.341 | 35.969 | 18.661 | 36.469 | 43.063 |

I also compared the training loss of my AzureML experiment (using 4 K80 GPUs) with the official metrics.
[Plots: total_loss, loss_cls, loss_reg]

The full log is here: loss-4gpu.log

What I tried to understand why

  1. Investigation on influence of GPU numbers
    At first I thought the batch size had changed when I changed num-gpus from 8 to 4. However, the full config in the log shows the same batch size (IMS_PER_BATCH: 16).
    Since the only remaining difference is the number of GPUs (maybe I am wrong), I re-ran the same experiment with different numbers of GPUs for 10k iterations and compared the results with the official metrics.
    [Plot: total_loss_2gpu_4gpu_8gpu_10k]
    The loss curves show that my problem is independent of the number of GPUs.

  2. Investigation on other baselines
    I also tried another baseline: mask_rcnn_R_50_FPN_1x from the COCO Instance Segmentation Baselines with Mask R-CNN. A similar performance drop happened again, around 4 AP points below the reference (38.6 box AP and 35.2 mask AP).

My result:

COCO Evaluation results for bbox:
|   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 34.483 | 54.024 | 37.521 | 19.586 | 36.835 | 44.503 |

COCO Evaluation results for segm:
|   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 31.698 | 51.398 | 33.640 | 14.372 | 33.611 | 45.871 |

My log: loss-4gpu-mrcnn.txt
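
On the batch-size point in item 1: in detectron2, `SOLVER.IMS_PER_BATCH` is the total number of images per iteration across all GPUs, so changing `--num-gpus` only changes how many images each GPU processes. The arithmetic can be sketched as follows; `images_per_gpu` is a hypothetical helper written for this illustration, not a detectron2 function.

```python
def images_per_gpu(ims_per_batch, num_gpus):
    """Split a total batch size evenly across GPUs (hypothetical helper)."""
    if ims_per_batch % num_gpus != 0:
        raise ValueError("total batch size must be divisible by the number of GPUs")
    return ims_per_batch // num_gpus


# IMS_PER_BATCH: 16 is the value reported in the training log.
for n in (2, 4, 8):
    print(n, "GPUs ->", images_per_gpu(16, n), "images per GPU")
```

With the total fixed at 16, the effective batch size is the same for 2, 4, or 8 GPUs, which is consistent with the matching loss curves.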

Environment:

(py36) root@e07abda472cc:/ai-detectron2# python -m detectron2.utils.collect_env
------------------------  -------------------------------------------------------------------
sys.platform              linux
Python                    3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
Numpy                     1.15.0
Detectron2 Compiler       GCC 5.4
Detectron2 CUDA Compiler  10.1
DETECTRON2_ENV_MODULE     <not set>
PyTorch                   1.3.1
PyTorch Debug Build       False
torchvision               0.4.2
CUDA available            True
GPU 0,1                   Tesla K80
CUDA_HOME                 /usr/local/cuda
NVCC                      Cuda compilation tools, release 10.1, V10.1.243
Pillow                    5.2.0
cv2                       4.1.0
------------------------  -------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 


All 5 comments

A bug introduced on Dec 19 affects accuracy. It was fixed in fd14855ad6c36b2881d6199cad59831473cb1a33 on the same day, but after your commit.

I reran all the R50-FPN-1x baselines on Dec 31 and they were reproduced.

We retrain our models regularly, but certainly not as frequently as every commit, so bugs can sometimes slip in, but they will eventually be found. Let us know if you still have trouble reproducing the results using the latest code.

Sorry for the confusion and I'll add a section in docs to keep track of historical bugs.

Great! Thanks for your quick response! @ppwwyyxx

Yes, using the latest code (commit 5e2a6f) works for me! Thanks again for your help @ppwwyyxx
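
For anyone debugging a similar gap, one way to check whether a local checkout already contains the fix commit named in this thread is `git merge-base --is-ancestor`. The following is a minimal sketch, assuming `git` is on the PATH and the caller supplies the checkout path; `contains_commit` and its `run` parameter are names invented here for illustration.

```python
import subprocess

# Fix commit named in this thread.
FIX_COMMIT = "fd14855ad6c36b2881d6199cad59831473cb1a33"


def contains_commit(repo_dir, commit, ref="HEAD", run=subprocess.run):
    """Return True if `commit` is an ancestor of `ref` in the repo at `repo_dir`."""
    # `git merge-base --is-ancestor A B` exits with 0 if A is an ancestor of B
    # and with 1 if it is not; any other status signals an error.
    result = run(["git", "merge-base", "--is-ancestor", commit, ref], cwd=repo_dir)
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    raise RuntimeError(f"git exited with status {result.returncode}")
```

Calling `contains_commit(path_to_checkout, FIX_COMMIT)` returns `False` when the checkout predates the fix, in which case updating to a newer commit is the first thing to try.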
