Does anyone remember how exactly we arrived at the channel means and stds we use for preprocessing?
```python
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
```
I think the first mention of the preprocessing in this repo is in #39. In that issue @soumith points to https://github.com/pytorch/examples/tree/master/imagenet for reference. If you look at the history of main.py, the values were first introduced in commit pytorch/examples@27e2a46c1d1505324032b1d94fc6ce24d5b67e97. Unfortunately, it contains no explanation, hence my question.
Specifically, I'm seeking answers to the following questions:
- How were the final values obtained? Were they rounded, floored, or even ceiled?
- Which images were used? Only the training set of ImageNet, or additionally the images of the validation set?

I've tested some combinations and will post my results here.
| Configuration | mean | std |
| --- | --- | --- |
| train set only, no resizing / cropping | [0.4803, 0.4569, 0.4083] | [0.2806, 0.2736, 0.2877] |
| train set only, resize to 256 and center crop to 224 | [0.4845, 0.4541, 0.4025] | [0.2724, 0.2637, 0.2761] |
| train set only, center crop to 224 | [0.4701, 0.4340, 0.3832] | [0.2845, 0.2733, 0.2805] |
While the means match fairly well, the stds differ significantly.
You need to go deeper ;)
https://github.com/facebook/fb.resnet.torch/blob/master/datasets/imagenet.lua
```lua
-- Computed from random subset of ImageNet training images
local meanstd = {
   mean = { 0.485, 0.456, 0.406 },
   std = { 0.229, 0.224, 0.225 },
}
```
For my project I need to know the covariances between the channels. Since they are not part of the current implementation, my hope was that I could calculate them myself if I knew which images and which processing were used. Unfortunately,

> random subset

gives me little hope that I'll be able to do that. I suppose no one remembers how this random subset was selected?
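For reference, the channel means and the full channel covariance matrix only require accumulating first- and second-order statistics in a single pass. A minimal sketch (the dataset path and the resize/crop pipeline are assumptions, since the original preprocessing is exactly what's in question here):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumed preprocessing; the original pipeline is unknown.
dataset = datasets.ImageFolder(
    "/path/to/imagenet/train",  # hypothetical path
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),  # scales pixel values to [0, 1]
    ]),
)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

n = 0
channel_sum = torch.zeros(3, dtype=torch.float64)
outer_sum = torch.zeros(3, 3, dtype=torch.float64)

for images, _ in loader:
    # Flatten to (3, num_pixels) so every pixel is one observation.
    pixels = images.permute(1, 0, 2, 3).reshape(3, -1).double()
    n += pixels.shape[1]
    channel_sum += pixels.sum(dim=1)
    outer_sum += pixels @ pixels.T

mean = channel_sum / n
cov = outer_sum / n - torch.outer(mean, mean)  # biased estimator
std = cov.diagonal().sqrt()
```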
Should we investigate this further? I'm a little anxious that we simply use this normalization for all our models without being able to reproduce it.
@colesbury do you have more information to clarify the mean / std we use for ImageNet?
afaik we calculated the mean / std by running one pass over the training set of ImageNet
that being said, i see that std is not matching. possibly a bug of the past or some detail that we completely forgot about :-/
Can we put a batch normalization layer before the input so that the mean/std are computed automatically at training time?
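A rough sketch of that idea (not something torchvision does; `resnet50` is just an example backbone): a `BatchNorm2d` over the three input channels would track the normalization statistics during training instead of hard-coding them.

```python
import torch
from torch import nn
from torchvision import models

# Sketch of the suggestion: prepend BatchNorm2d(3) so the per-channel
# normalization is learned/tracked (running mean and variance) during
# training rather than fixed to hard-coded constants.
model = nn.Sequential(
    nn.BatchNorm2d(3),    # normalizes the 3 RGB input channels
    models.resnet50(),    # any downstream model works here
)

images = torch.rand(8, 3, 224, 224)  # unnormalized [0, 1] inputs
output = model(images)
```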
@apple2373 We are currently implementing the transforms for tensors in order to be able to use them within a model (see #1375). Whether we want to include them within the models is AFAIK still up for discussion (see #782).
@fmassa @soumith
Any update on this? Do we investigate further or keep it as is?
@pmeier I don't know if we will ever be able to get back those numbers, given that they seem to have been computed on a randomly-sampled part of the dataset.
If we really want to see if this has any impact, we would need to run multiple end-to-end trainings with the new mean/std and see if they bring any noticeable improvement.
I don't think we would get a significant improvement (or decline) in performance. I just think we shouldn't use numbers that are not reproducible. A change like this is of course a lot of work, BC-breaking, etc., but we don't know what the future brings. Maybe this becomes significant in the future, and then it's even harder to correct.
> I don't think we would get a significant improvement (or decline) in performance. I just think we shouldn't use numbers that are not reproducible.
I agree. But given the scale of how things would break with such a change, I think we should just live with it for now, and maybe document somewhere the findings you have shown here.
It's been almost four years, so I don't remember, but I probably just used the mean / std from the previous Lua ImageNet training script linked above. It uses the average standard deviation of an individual image's channels instead of an estimate of the standard deviation across the entire dataset.
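That difference is easy to see on toy data. A sketch of the two estimators (illustrative random tensors, not ImageNet):

```python
import torch

images = torch.rand(1000, 3, 224, 224)  # stand-in for an image subset

# (a) What the Lua script did: compute each image's per-channel std,
#     then average those stds over the dataset.
per_image_std = images.reshape(1000, 3, -1).std(dim=2).mean(dim=0)

# (b) What the table above measures: one std over all pixels of all
#     images pooled together.
pooled_std = images.permute(1, 0, 2, 3).reshape(3, -1).std(dim=1)

# (a) <= (b) in general: averaging within-image stds discards the
# variance of the per-image means (law of total variance), which is
# consistent with 0.229 < 0.2806 etc. in the measurements above.
print(per_image_std, pooled_std)
```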
I don't think we should change the mean/std, nor do I see any reproducibility issue. The scientific result here is the neural network, not the mean/std values, especially since the exact choice does not matter as long as they approximately whiten the input.
> A change like this is of course a lot of work, BC-breaking, etc., but we don't know what the future brings.
These numbers have become a standard for most neural networks created so far, so it's not just a lot of work: one would need to retrain hundreds of neural networks (approx. 2 GPU-weeks each for a model like ResNet-50) and create pull requests for all the pretrainedmodels / DPN / Wide ResNet / etc. repos all over GitHub, just to adjust the normalizing std by 0.05. What future could justify this?
Following the discussion we had here, I agree with @colesbury's and @nizhib's points above.
@pmeier would you like to send a PR adding a summary of the discussion we had here, including @colesbury's comment on how those numbers were obtained?
My schedule is full for the next few weeks, so this will take some time.
Maybe the reason the stds don't match is that the computation was originally run with unbiased=False?
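For context, torch.std's `unbiased` flag switches between the n-1 and n denominators:

```python
import torch

x = torch.rand(10)
print(x.std(unbiased=True))   # divides by n - 1 (the default)
print(x.std(unbiased=False))  # divides by n
```

Note, though, that over millions of pixels the two differ only by a factor of sqrt(n / (n - 1)), so this alone probably can't account for a gap of ~0.05.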
@Stannislav in #1965 I've managed to get pretty close to the original numbers.