Models: preprocessing for VGG input

Created on 8 Oct 2016 · 17 comments · Source: tensorflow/models

Hi, I want to use the pretrained VGG for another task.
In other VGG code I have seen, there is an RGB-to-BGR transform, but I don't see this transform anywhere in models/slim/preprocessing/vgg_preprocessing.py.

Should there be an RGB-to-BGR conversion in the preprocessing?
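
For context, the mean subtraction in vgg_preprocessing.py is roughly the following; the constants are the per-channel ImageNet means in RGB order, and there is no channel swap anywhere in the file. This is only a simplified sketch of that code:

    # Simplified sketch of slim's VGG mean subtraction (RGB order, no BGR swap)
    import tensorflow as tf

    _R_MEAN, _G_MEAN, _B_MEAN = 123.68, 116.78, 103.94

    def mean_image_subtraction(image, means=(_R_MEAN, _G_MEAN, _B_MEAN)):
        # image: float32 H x W x 3 tensor with values in [0, 255], RGB channel order
        channels = tf.split(axis=2, num_or_size_splits=3, value=image)
        for i in range(3):
            channels[i] -= means[i]
        return tf.concat(axis=2, values=channels)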

awaiting model gardener

All 17 comments

@ruotianluo We primarily use github issues to track bugs and feature requests. This is a question better suited for StackOverflow, which we also monitor. Please ask it there and tag it with the tensorflow tag. Thanks!

@tatatodd I feel like it could be a bug, but I'm not sure.

@tatatodd Can you tell me whether the VGG checkpoint you provide was converted from the original Caffe model? If it was converted, did you change the channel order of the first conv layer's weights?

@ruotianluo I see.

@nathansilberman might be able to answer your question about whether the channels are rgb or bgr.

I am also interested in the answer.

I compared the fc7 and fc8 features of the same image between slim and Caffe. The results are very different.

I am also interested in an answer.

@ruotianluo Maybe they are different because of dropout. Did you disable it?

The missing explanation in the source code can certainly be considered a bug. There should be a specification of the expected input data.

So can anybody please clarify whether the input to vgg_16 or vgg_19 should be

  • (i) in float32 [0, 1] scale or [0, 255] scale (I suppose 255)
  • (ii) with VGG mean pixels subtracted (I think so)
  • (iii) in RGB order or BGR order? (It is BGR)

There is no documentation or specification on slim models and pre-trained checkpoints ...

FYI: For my experiments,

    import tensorflow as tf
    from nets.vgg import vgg_16   # from the slim directory of this repo (path may vary)

    inputs = tf.placeholder(tf.float32, (None, 224, 224, 3), name='inputs')
    # Scale [0, 1] inputs up to [0, 255] and split into the three colour channels
    r, g, b = tf.split(axis=3, num_or_size_splits=3, value=inputs * 255.0)
    VGG_MEAN = [103.939, 116.779, 123.68]   # B, G, R means from the original Caffe model
    # Reassemble in BGR order with the per-channel means subtracted
    bgr = tf.concat(values=[b - VGG_MEAN[0], g - VGG_MEAN[1], r - VGG_MEAN[2]], axis=3)
    fc8, endpoints = vgg_16(bgr, is_training=False)

did work.
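
A minimal usage sketch for the snippet above, assuming the pretrained vgg_16.ckpt from the slim model zoo has been downloaded and the graph above has been built; the image array here is only a stand-in:

    # Hypothetical usage: restore the slim checkpoint and run one [0, 1] RGB image through it
    import numpy as np

    with tf.Session() as sess:
        tf.train.Saver().restore(sess, 'vgg_16.ckpt')           # path to the downloaded checkpoint
        image = np.zeros((1, 224, 224, 3), dtype=np.float32)    # replace with a real image in [0, 1]
        logits = sess.run(fc8, feed_dict={inputs: image})       # shape (1, 1000): ImageNet logits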

Hi, in my tests I use vgg16 and vgg19 from slim and I do the following:

        import tensorflow as tf

        # filename: a string tensor (or Python string) with the path to a JPEG image
        # Read the image from file
        image_string = tf.read_file(filename)
        image_decoded = tf.image.decode_jpeg(image_string, channels=3)   # decoded image is in RGB order
        image = tf.cast(image_decoded, tf.float32)

        # Isotropic rescaling so that the smallest side becomes 256
        smallest_side = 256.0
        height, width = tf.shape(image)[0], tf.shape(image)[1]
        height = tf.to_float(height)
        width = tf.to_float(width)
        scale = tf.cond(tf.greater(height, width),
                        lambda: smallest_side / width,
                        lambda: smallest_side / height)
        new_height = tf.to_int32(height * scale)
        new_width = tf.to_int32(width * scale)
        image = tf.image.resize_images(image, [new_height, new_width])

        VGG_MEAN = [123.68, 116.78, 103.94]   # R, G, B means on ImageNet

        # Random 224x224 crop and per-channel mean subtraction (no channel swap)
        image = tf.random_crop(image, [224, 224, 3])
        means = tf.reshape(tf.constant(VGG_MEAN), [1, 1, 3])
        image = image - means

The input images are in the range [0, 255], read in RGB order, with the ImageNet RGB means subtracted. My results are fine, so I think this preprocessing is correct.
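
For comparison, slim ships its own VGG preprocessing that performs the same aspect-preserving resize, crop, and RGB mean subtraction; a rough sketch of calling it, assuming the slim directory is on the Python path and the module layout matches this repo:

        # Rough equivalent using slim's own preprocessing module
        from preprocessing import vgg_preprocessing

        # image: an RGB image tensor with values in [0, 255]; is_training=True gives a random
        # crop with random flip, is_training=False a central crop after the resize
        processed = vgg_preprocessing.preprocess_image(image, 224, 224, is_training=True)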

If you want the whole code take a look at this great gist: https://gist.github.com/omoindrot/dedc857cdc0e680dfb1be99762990c9c

Hope it helps!

@simo23, thanks for your example. I still have one question: why should we subtract the training mean and not the testing mean? For example, if my test image has a mean value much lower than 103.94, though still in the [0, 255] range, should I still use the ImageNet training mean? I also notice you rescale the image so that its smallest side is 256 and then crop to 224. My input size is 128; can I pad it with zeros to 224, or is it better to rescale it from 128 to 256 and then crop to 224?

Hi, I'm sure there are people with a much better answer than me, but I'll give it a try:

  • The standard VGG mean was obtained by averaging over the whole ImageNet training set; subtracting it zero-centres the input distribution. This is usually done because it has been shown to improve accuracy and help the learning process.

  • We do not subtract the testing mean because we are not supposed to know it; you should rely only on the training data that you have.

  • We do not subtract each image's own mean because, if we subtract the official VGG mean, the features extracted by the network are the "standard" VGG ones, which are known to achieve a certain accuracy. If you subtract each image's own mean instead, the extracted features will not be the "standard" ones, so your results will probably differ. As a practical example: if you have two images, one totally red and one totally blue, subtracting each image's own mean makes them identical after preprocessing, and we do not want that (see the toy illustration after this list). So we subtract not the single image's mean but a mean that comes from a distribution representing the whole dataset.
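
A toy numerical illustration of that last point (plain NumPy; the 2x2 images are just hypothetical stand-ins):

    # Per-image (per-channel) mean subtraction erases the difference between the two images,
    # while subtracting the fixed dataset mean preserves it.
    import numpy as np

    red  = np.zeros((2, 2, 3)); red[..., 0]  = 255.0   # an all-red image
    blue = np.zeros((2, 2, 3)); blue[..., 2] = 255.0   # an all-blue image

    print(np.allclose(red - red.mean((0, 1)), blue - blue.mean((0, 1))))   # True: both become all zeros

    vgg_mean = np.array([123.68, 116.78, 103.94])
    print(np.allclose(red - vgg_mean, blue - vgg_mean))                    # False: colours still differ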

For your other, more practical questions, I haven't experimented with anything like that before, so I can only guess:

  • If your input size is 128 you can still feed the image in. You will obtain a 4x4 feature map after the last conv5 block and its pooling layer (see the rough size check after this list), and you can try to feed this into a fully connected layer and see what happens. If your model complains about the input size, you need to adjust the number of connections of the first fully connected layer. You could also skip the last pooling layer, obtain an 8x8 feature map, and feed that into the fully connected layer instead.

  • Zero-padding the image to 224 should not improve your performance, as the actual data is still 128x128.

  • You can resize the image to 256 by zooming it and then take a random 224x224 crop. The problem is that you are modifying the image a lot, so whether this is acceptable really depends on the application.
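
A quick check of the feature-map sizes mentioned in the first point (plain Python, just the pooling arithmetic for VGG's five 2x2 poolings):

    # VGG-16/19 halve the spatial resolution at each of their 5 max-pooling layers.
    size = 128
    for _ in range(5):
        size //= 2
    print(size)   # 4 -> a 4x4 feature map after pool5 (8x8 if the last pooling is skipped)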

You have to try both :)

Hope it helps!

This question is better asked on StackOverflow since it is not a bug or feature request. There is also a larger community that reads questions there. Thanks!

It is RGB. Please read the comments in this file: https://github.com/tensorflow/models/blob/master/research/slim/datasets/build_imagenet_data.py
This should be included in the Slim documentation.

I think it's a bug in the README, because you can't use a pretrained model in Slim without information about how its inputs were preprocessed during pretraining.

