Vision: Accuracy regression on MobileNetV2

Created on 25 Jul 2019 · 9 comments · Source: pytorch/vision

Reported by @andravin in https://github.com/pytorch/vision/pull/818#issuecomment-509337263

With PyTorch 1.1 and torchvision 0.3, we are able to reach 71.878 top1 accuracy on ImageNet for MobileNetV2.
The training command is the following:

```bash
cd references/classification

python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py \
    --model mobilenet_v2 --epochs 300 --lr 0.045 --wd 0.00004 \
    --lr-step-size 1 --lr-gamma 0.98
```

with best accuracy at epoch 285.
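For reference, a minimal sketch of how the released weights can be checked against the reported top-1 (the ImageNet path is a placeholder, and this is not the reference evaluation script itself):

```python
import torch
import torchvision
from torchvision import transforms

# Standard ImageNet eval preprocessing used by the torchvision models.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
val_set = torchvision.datasets.ImageFolder("/path/to/imagenet/val", preprocess)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=256, num_workers=8)

model = torchvision.models.mobilenet_v2(pretrained=True).cuda().eval()

correct = total = 0
with torch.no_grad():
    for images, targets in val_loader:
        preds = model(images.cuda()).argmax(dim=1).cpu()
        correct += (preds == targets).sum().item()
        total += targets.numel()
print(f"top1: {100.0 * correct / total:.3f}")
```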

@andravin tried running the same code with a more recent version of PyTorch and torchvision and got 71.536 (@andravin, do you maybe have the specific versions?), which is too large a drop to be just random variation.

Investigate (and fix) the cause of this.

I have looked into a few related changes (in torchvision), but didn't find anything particularly suspicious.

Note: it takes ~35h to train the model on an 8-GPU machine.

Labels: bug, help wanted, models, reference scripts, classification


All 9 comments

Here are the software versions used: https://github.com/pytorch/vision/pull/818#issuecomment-508428115

```python
>>> torch.cuda.nccl.version()
2406
>>> torch.version.cuda
'10.1.168'
>>> torch.backends.cudnn.version()
7601
>>> torch.__version__
'1.2.0a0+ffa15d2'
```
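(Aside: PyTorch ships a helper that gathers all of these, plus OS/GPU/driver details, in one report; the same output is available from a shell via `python -m torch.utils.collect_env`.)

```python
from torch.utils.collect_env import get_pretty_env_info

# Prints PyTorch, CUDA, cuDNN, NCCL, OS, and GPU details in one block.
print(get_pretty_env_info())
```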

I would think it enough to try to reproduce the error at HEAD. If it works now, then either the error was fixed, or I did something wrong.

@andravin @fmassa I dug into this a little by running some training jobs from scratch on 8 V100s, using different combinations of pytorch and torchvision versions.

The results were as follows:

| pytorch | torchvision | Best top1 acc | Epoch |
|---------|-------------|--------|-------|
| master | master | 71.806 | 292 |
| master | 0.3 | 71.638 | 279 |
| 1.1 | master | 71.764 | 300 |
| 1.1 | 0.3 | 71.676 | 289 |
| 1.1 | 0.3 | 71.674 | 278 |
| 1.1 | 0.3 | 71.692 | 284 |
| 1.1 | 0.3 | 71.512 | 281 |
| 1.1 | 0.3 | 71.828 | 300 |
| 1.1 | 0.3 | 71.584 | 295 |
| 1.1 | 0.3 | 71.874 | 298 |

There are a few points to note here. First, the run I did with pytorch master and torchvision master was able to attain 71.806 top1 accuracy.

Next, I tried running a lot of pytorch 1.1 and torchvision 0.3 runs for 300 epochs each. Most of the time, these were not able to attain numbers close to the advertised 71.878, but some of the runs came close at 71.828 and 71.874. This suggests that there is a lot of variance during training that is probably due to different random initializations and non-determinism.
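For anyone trying to narrow this down further, here is a sketch of the usual knobs for reducing run-to-run variance; this is standard PyTorch practice rather than anything the reference script necessarily does, and some GPU ops remain non-deterministic regardless:

```python
import random
import numpy as np
import torch

def seed_everything(seed=0):
    # Fix the Python, NumPy, and PyTorch RNGs so that weight init and
    # data shuffling repeat across runs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels and disable autotuning; this
    # trades some speed for repeatability.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```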

Finally, I took a look through the PyTorch commit history from 1.1 to master while waiting for the jobs to finish. No commits related to the ops run in MobileNetV2 jumped out to me as suspicious, but it's possible that I missed some more subtle changes.

Here were the nccl/cuda/cudnn versions I used:

```python
>>> torch.cuda.nccl.version()
2402
>>> torch.version.cuda
'10.0.130'
>>> torch.backends.cudnn.version()
7501
```

Closing based on @zou3519's conclusion: the gap seems to come down to variance (±0.2%) rather than any other factor. He also verified that master actually converges given a good initialization.

Thanks a lot for the investigation @zou3519 !

It might be a good idea to document the expected accuracy. @zou3519's 7 experiments on pytorch 1.1 and torchvision 0.3 have a mean and standard deviation of 71.691 +/- 0.127.
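(These numbers come straight from the seven 1.1/0.3 rows in the table above; a quick check:)

```python
import statistics

# The seven pytorch 1.1 / torchvision 0.3 runs from @zou3519's table.
accs = [71.676, 71.674, 71.692, 71.512, 71.828, 71.584, 71.874]
print(f"{statistics.mean(accs):.3f} +/- {statistics.stdev(accs):.3f}")
# -> 71.691 +/- 0.127
```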

@andravin this is a very good point. Unfortunately we only have point estimates instead of distributions for these numbers. That is the case for most papers to date as well, but there is some work proposing different ways of reporting metrics for evaluating families of models, e.g., https://arxiv.org/abs/1905.13214

@fmassa yeah, I think it is good that you currently report the ImageNet accuracy for the pretrained weights here: https://pytorch.org/docs/stable/torchvision/models.html

I was making the point that the user does not know what accuracy to expect if they train the model from scratch.

But apparently there is no documentation about how any of the models were trained. So the user really has no way to reproduce your results.

My advice would be to have a separate page for each model that documents the hyperparameters used for training (i.e., the exact train.py command line used; hopefully that program was used for all the models!). Additionally, it would be great to know the mean accuracy and variance.

I would think that pytorch developers also need this information for regression testing.
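To illustrate (hypothetical code, not an existing test in the repo): once a mean and standard deviation are documented, a regression test could simply flag runs that land outside the expected band.

```python
# Hypothetical regression check using the numbers reported in this thread.
EXPECTED_TOP1 = 71.691  # documented mean for mobilenet_v2
STD_TOP1 = 0.127        # documented std over repeated runs

def check_accuracy(measured_top1, n_sigma=3.0):
    tolerance = n_sigma * STD_TOP1
    if abs(measured_top1 - EXPECTED_TOP1) > tolerance:
        raise AssertionError(
            f"top1 {measured_top1:.3f} outside "
            f"{EXPECTED_TOP1:.3f} +/- {tolerance:.3f}"
        )

# The "regression" that started this issue is only ~1.2 sigma below the
# mean, so it passes:
check_accuracy(71.536)
```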

@andravin

> My advice would be to have a separate page for each model that documents the hyperparameters used for training (i.e., the exact train.py command line used; hopefully that program was used for all the models!). Additionally, it would be great to know the mean accuracy and variance.

Totally, you know what, I'll be putting up a README now with the hyperparameters that I used to train the models that we have in the modelzoo. Thanks!

It would also be good to know the training time and hardware spec (e.g., 8x V100).
