incubator-mxnet: Training ResNet on ImageNet

Created on 9 Aug 2016 · 17 comments · Source: apache/incubator-mxnet

I have trained ResNet-50 on ImageNet; the top-1 error is 25.41% and the top-5 error is 7.93%, which is not very good. Has anyone reproduced it yet? Hope somebody can release the model for studying other tasks~
(My code is at https://github.com/tornadomeet/ResNet; it supports both ImageNet and CIFAR training at different depths.)

Most helpful comment

@mli @antinucleon @winstywang @taoari @piiswrong I have reproduced the ResNet-50 result using @mli's .rec files with MXNet's io. All the code, models, and training logs can be found at https://github.com/tornadomeet/ResNet
Thanks all for the discussion.

All 17 comments

It may be because the image compression used by the Python conversion script is not good... please consider downloading the pre-converted data at http://data.dmlc.ml/mxnet/private/ilsvrc12/, *_256_q90.rec

Please shoot me an email at [email protected] for the username and password.

@mli done

@tornadomeet Min has successfully reproduced ResNet-50/101/152/200, even in a distributed environment. For ResNet-200 he matched the reported performance exactly, trained in 36 hours on multiple machines. We will reproduce it together with you.

great!!

@mli I have just downloaded train_256_q90.rec. It's sad that I cannot retrain the model directly, because my synset.txt comes from the ILSVRC2015 map_clsloc.txt, which is different from yours (the ordering widely used in Caffe).
So I need to train the model from scratch.
Waiting for the pre-trained model from @mavenlin.
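
A minimal sketch of reconciling the two orderings (both file names below are placeholders, not from this thread): since the two synset files list the same 1000 WordNet ids in different orders, a permutation of label indices maps one onto the other.

# A hedged sketch, assuming each synset file lists one WordNet id
# (e.g. "n01440764 ...") per line, with the line order defining the
# label index; both file names are hypothetical.
def load_wnids(path):
    with open(path) as f:
        return [line.split()[0] for line in f if line.strip()]

mine  = load_wnids('synset_ilsvrc2015.txt')  # map_clsloc.txt ordering
caffe = load_wnids('synset_caffe.txt')       # ordering used by the .rec
to_caffe = {wnid: i for i, wnid in enumerate(caffe)}
remap = [to_caffe[wnid] for wnid in mine]    # my label index -> caffe label index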

Min won't provide a pretrained model. We plan to train from scratch ourselves and check where the issue is.

@antinucleon OK, I'll continue trying to reproduce it and look for the problem.

@antinucleon @mli I found a potential problem with mx.io.ImageRecordIter: when rand_crop = True is set, the cropped area of the same image seems to be identical in every epoch (I wish I'm wrong..).

If this is a bug, it will affect the ResNet result: our training .rec images are 480 px, and if each epoch only ever sees the same area of a given image, the scale augmentation doesn't work at all.

We can reproduce it like this:

    import mxnet as mx
    import numpy as np
    import cv2

    # note: this snippet never advances the iterator with train.next(),
    # which turns out to be the real issue -- see the reply below
    train = mx.io.ImageRecordIter(...)  # set rand_crop=True, data_shape smaller than the images in the .rec
    for i in range(10):
        X = train.getdata().asnumpy()[i]
        X = np.swapaxes(X, 0, 2)  # CHW -> WHC
        X = np.swapaxes(X, 0, 1)  # WHC -> HWC
        X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
        cv2.imwrite("{}_result.jpg".format(i), X)
        print "image_{}--->{}".format(i, train.getlabel().asnumpy()[i])
    train.reset()
    for i in range(10):
        X = train.getdata().asnumpy()[i]
        X = np.swapaxes(X, 0, 2)
        X = np.swapaxes(X, 0, 1)
        X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
        cv2.imwrite("{}_reset_result.jpg".format(i), X)
        print "reset_image_{}--->{}".format(i, train.getlabel().asnumpy()[i])
    train.reset()
    for i in range(10):
        X = train.getdata().asnumpy()[i]
        X = np.swapaxes(X, 0, 2)
        X = np.swapaxes(X, 0, 1)
        X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
        cv2.imwrite("{}_reset2_result.jpg".format(i), X)
        print "reset2_image_{}--->{}".format(i, train.getlabel().asnumpy()[i])

Then the corresponding images come out identical across epochs.
The same phenomenon may happen with random mirroring.

But when I debugged into https://github.com/dmlc/mxnet/blob/master/src/io/image_aug_default.cc#L241 and printed out y and x, I found that they do change after each epoch.
Could anybody verify this?

I just tested rand_mirror and it shows the same problem, so every random parameter in mx.io.ImageRecordIter() may suffer from the same thing.
@winstywang @piiswrong

@tornadomeet I will be on a trip from tomorrow. I think this is a really good catch. I guess this is exactly why we cannot reproduce the results. I will have one of my interns follow up on this issue.

You didn't call train.next(), so it's always the same batch of data...
The code below gives the correct result:

import mxnet as mx
import numpy as np
import cv2

train = mx.io.ImageRecordIter(
        path_imgrec = '/archive/imagenet/train.rec',
        data_shape  = (3, 224, 224),
        batch_size  = 10,
        rand_crop   = True,
        rand_mirror = True)  # rand_crop=True, data_shape smaller than the images in the .rec

train.reset()
# train.next() advances the iterator, so each call draws a freshly augmented batch
data = train.next().data[0].asnumpy()
for i in range(10):
    X = np.swapaxes(data[i], 0, 2)  # CHW -> WHC
    X = np.swapaxes(X, 0, 1)        # WHC -> HWC
    X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
    cv2.imwrite("{}_result.jpg".format(i), X)
    print "image_{}--->{}".format(i, train.getlabel().asnumpy()[i])
train.reset()
data = train.next().data[0].asnumpy()
for i in range(10):
    X = np.swapaxes(data[i], 0, 2)
    X = np.swapaxes(X, 0, 1)
    X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
    cv2.imwrite("{}_reset_result.jpg".format(i), X)
    print "reset_image_{}--->{}".format(i, train.getlabel().asnumpy()[i])
train.reset()
data = train.next().data[0].asnumpy()
for i in range(10):
    X = np.swapaxes(data[i], 0, 2)
    X = np.swapaxes(X, 0, 1)
    X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
    cv2.imwrite("{}_reset2_result.jpg".format(i), X)
    print "reset2_image_{}--->{}".format(i, train.getlabel().asnumpy()[i])

@piiswrong @winstywang Oh, I made a mistake, so this should be no problem.

Another small potential problem is that every time we start the program, the random sequence is the same. It may not matter much, but if we change the lr manually and resume training, then the augmentation randomness is identical on each run.
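
A minimal sketch of one workaround, assuming mx.random.seed() behaves as expected in this MXNet build: derive the seed from wall-clock time so that each restart draws a different random sequence.

import time
import mxnet as mx
import numpy as np

# A hedged sketch: seed from the wall clock so that resuming training
# (e.g. after manually lowering the lr) does not replay exactly the same
# augmentation randomness. Whether mx.random.seed() reaches the C++
# augmenter RNGs may depend on the MXNet version, so treat this as an
# assumption to verify.
seed = int(time.time())
mx.random.seed(seed)
np.random.seed(seed)
print "training with seed {}".format(seed)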

Another thing is that He et al. shuffle the dataset for each epoch. I have only tested this on the CIFAR-10 dataset with ResNet-56:

[figure: CIFAR-10 ResNet-56 accuracy, with per-epoch shuffling (93.16%) vs. without (92.17%)]

Note that 93.16% is about 1% higher than 92.17%. In the paper, the reference performance is 93.03% (93.16% > 93.03% > 92.17%). I wonder whether this also applies to the ImageNet dataset.

P.S.: Shuffling the whole dataset is expensive, so I am using the following RandomSkipResizeIter:

import mxnet as mx
import numpy as np

class RandomSkipResizeIter(mx.io.DataIter):
    """Resize a DataIter to a given number of batches per epoch,
    randomly skipping samples so that each epoch sees a different
    random subset of the data. May produce an incomplete batch in
    the middle of an epoch due to padding from the internal iterator.

    Parameters
    ----------
    data_iter : DataIter
        Internal data iterator.
    size : int
        Number of batches per epoch to resize to.
    skip_ratio : float
        Probability of skipping each sample; 0 means no skipping.
    reset_internal : bool
        Whether to reset the internal iterator on reset().
    """

    def __init__(self, data_iter, size, skip_ratio=0.5, reset_internal=False):
        super(RandomSkipResizeIter, self).__init__()
        self.data_iter = data_iter
        self.size = size
        self.reset_internal = reset_internal
        self.cur = 0
        self.current_batch = None
        self.prev_batch = None
        self.skip_ratio = skip_ratio

        self.provide_data = data_iter.provide_data
        self.provide_label = data_iter.provide_label
        self.batch_size = data_iter.batch_size

    def reset(self):
        self.cur = 0
        if self.reset_internal:
            self.data_iter.reset()

    def __get_next(self):
        try:
            return self.data_iter.next()
        except StopIteration:
            self.data_iter.reset()
            return self.data_iter.next()

    def iter_next(self):
        if self.cur == self.size:
            return False

        data, label = [], []
        if self.current_batch is None:
            # very first call: lazily allocate the output buffers
            batch = self.__get_next()
            self.current_batch = mx.io.DataBatch(data=[mx.nd.empty(batch.data[0].shape)], label=[mx.nd.empty(batch.label[0].shape)])
            keep = np.random.rand(self.batch_size) > self.skip_ratio
            batch_data = batch.data[0].asnumpy()
            batch_label = batch.label[0].asnumpy()
            data.extend(batch_data[keep])
            label.extend(batch_label[keep])
        elif self.prev_batch is not None:
            # start from the samples left over from the previous call
            batch_data, batch_label = self.prev_batch
            data.extend(batch_data)
            label.extend(batch_label)

        while len(data) < self.batch_size:
            batch = self.__get_next()
            keep = np.random.rand(self.batch_size) > self.skip_ratio
            batch_data = batch.data[0].asnumpy()
            batch_label = batch.label[0].asnumpy()
            data.extend(batch_data[keep])
            label.extend(batch_label[keep])

        if len(data) > self.batch_size:
            self.prev_batch = data[self.batch_size:], label[self.batch_size:]
        else:
            self.prev_batch = None
        self.current_batch.data[0][:] = np.asarray(data[:self.batch_size])
        self.current_batch.label[0][:] = np.asarray(label[:self.batch_size])

        self.cur += 1
        return True

    def getdata(self):
        return self.current_batch.data

    def getlabel(self):
        return self.current_batch.label

    def getindex(self):
        return self.current_batch.index

    def getpad(self):
        return self.current_batch.pad
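
For clarity, a minimal usage sketch (the .rec path, batch size, and epoch size below are placeholders, not values from this thread): with skip_ratio=0.5 each sample is kept with probability 0.5, so successive epochs see different random subsets of the data, approximating per-epoch shuffling without rewriting the .rec file.

inner = mx.io.ImageRecordIter(
        path_imgrec = 'train.rec',   # hypothetical path
        data_shape  = (3, 224, 224),
        batch_size  = 128,
        rand_crop   = True,
        rand_mirror = True)
train = RandomSkipResizeIter(inner, size=5000, skip_ratio=0.5)

for epoch in range(2):
    nbatch = 0
    while train.iter_next():
        X = train.getdata()[0]   # (128, 3, 224, 224) NDArray
        nbatch += 1
    train.reset()
    print "epoch {}: {} batches".format(epoch, nbatch)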

CIFAR-10 is no problem; I reproduced ResNet-164 with a top-1 accuracy of 94.54%.

@mli @antinucleon @winstywang @taoari @piiswrong I have reproduced the ResNet-50 result using @mli's .rec files with MXNet's io. All the code, models, and training logs can be found at https://github.com/tornadomeet/ResNet
Thanks all for the discussion.

@taoari Would you open a PR for RandomSkipResizeIter? I think it will be useful when training ImageNet models.

I have added your RandomSkipResizeIter to the ResNet training code, and I'll verify it in ImageNet training~
