I have trained ResNet-50 on ImageNet; the top-1 error is 25.41% and the top-5 error is 7.93%, which is not very good. Has anyone reproduced it yet? I hope somebody can release the model for studying other tasks~
(My code is at https://github.com/tornadomeet/ResNet; it supports both ImageNet and CIFAR training with different depths.)
It may be that the image compression settings used by the Python conversion are not good... please consider downloading the converted data at http://data.dmlc.ml/mxnet/private/ilsvrc12/, the *_256_q90.rec files.
Please shoot me an email at [email protected] for the username and password.
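(If it helps, here is a minimal sketch for spot-checking a downloaded .rec with mx.recordio; the local file name is just an assumption, and the exact recordio API should be checked against your MXNet version.)
# Rough sketch (hypothetical path): decode a few records and print the stored
# image sizes, to confirm the shorter side is 256 and the JPEG quality looks OK.
import mxnet as mx

record = mx.recordio.MXRecordIO('train_256_q90.rec', 'r')  # hypothetical local path
for i in range(5):
    item = record.read()
    header, img = mx.recordio.unpack_img(item)  # img is a decoded BGR numpy array
    print "record {}: label={}, shape={}".format(i, header.label, img.shape)
record.close()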
@mli done
@tornadomeet Min has successfully reproduced ResNet-50/101/152/200, even in a distributed environment. For ResNet-200 he reported the exact performance, trained in 36 hours on multiple machines. We will reproduce it together with you.
great!!
@mli I have just downloaded train_256_q90.rec. Sadly, I cannot retrain my model directly, because my synset.txt comes from the ILSVRC2015 map_clsloc.txt, which is different from yours, the one widely used in Caffe.
So I need to train the model from scratch.
I'll wait for the pre-trained model from @mavenlin.
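(In principle, the two synset files only permute the class indices, so one could build an index remap instead of retraining; a rough sketch with hypothetical file names, each holding one WordNet id per line:)
# Hypothetical file names; not from this thread.
def load_synsets(path):
    with open(path) as f:
        return [line.strip().split()[0] for line in f if line.strip()]

caffe_synsets = load_synsets('synset_caffe.txt')        # ordering used by the .rec labels
my_synsets = load_synsets('synset_ilsvrc2015.txt')      # ordering used by my model

# map a label index from the .rec's ordering to my model's ordering
caffe_to_mine = [my_synsets.index(wnid) for wnid in caffe_synsets]
That said, the labels are baked into the .rec at packing time, so the remap would still have to be applied in a custom iterator or when interpreting predictions; training from scratch is simpler.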
Min won't provide a pretrained model. We plan to train from scratch ourselves and check where the issue is.
@antinucleon OK, I'll continue trying to reproduce it and try to find the problem.
@antinucleon @mli I found a potential problem with mx.io.ImageRecordIter: when rand_crop = True is set, the cropped area of the same image is the same in every epoch (I wish I'm wrong...).
If this is a bug, it will affect the ResNet result, because our training .rec is 480px, so the same image is only ever seen at the same crop in each epoch and the scale augmentation doesn't work at all.
We can reproduce it like this:
import mxnet as mx
import numpy as np
import cv2

train = mx.io.ImageRecordIter(...)  # rand_crop=True, data_shape smaller than the images in the .rec

for i in range(10):
    X = train.getdata().asnumpy()[i]
    X = np.swapaxes(X, 0, 2)   # CHW -> WHC
    X = np.swapaxes(X, 0, 1)   # WHC -> HWC
    X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
    cv2.imwrite("{}_result.jpg".format(i), X)
    print "image_{}--->{}".format(i, train.getlabel().asnumpy()[i])

train.reset()
for i in range(10):
    X = train.getdata().asnumpy()[i]
    X = np.swapaxes(X, 0, 2)
    X = np.swapaxes(X, 0, 1)
    X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
    cv2.imwrite("{}_reset_result.jpg".format(i), X)
    print "reset_image_{}--->{}".format(i, train.getlabel().asnumpy()[i])

train.reset()
for i in range(10):
    X = train.getdata().asnumpy()[i]
    X = np.swapaxes(X, 0, 2)
    X = np.swapaxes(X, 0, 1)
    X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
    cv2.imwrite("{}_reset2_result.jpg".format(i), X)
    print "reset2_image_{}--->{}".format(i, train.getlabel().asnumpy()[i])
Then the corresponding images are the same.
The same phenomenon may happen with random mirror.
But when I debugged into https://github.com/dmlc/mxnet/blob/master/src/io/image_aug_default.cc#L241 and printed out y and x, I found that they should change after each epoch.
Could anybody verify this?
I just tested rand_mirror and it has the same problem, so all random parameters in mx.io.ImageRecordIter() may suffer from the same thing.
@winstywang @piiswrong
@tornadomeet I will be on a trip from tomorrow. I think this is a really good catch. I guess this is exactly the reason why we cannot reproduce the results. I will have one of my interns follow up on this issue.
you didn't call train.next() so it's always the same batch of data...
The code below gives the correct result:
import mxnet as mx
import numpy as np
import cv2

train = mx.io.ImageRecordIter(
    path_imgrec = '/archive/imagenet/train.rec',
    data_shape = (3, 224, 224),
    batch_size = 10,
    rand_crop = True,
    rand_mirror = True)  # rand_crop=True, data_shape smaller than the images in the .rec

train.reset()
data = train.next().data[0].asnumpy()
for i in range(10):
    X = data[i]
    X = np.swapaxes(X, 0, 2)
    X = np.swapaxes(X, 0, 1)
    X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
    cv2.imwrite("{}_result.jpg".format(i), X)
    print "image_{}--->{}".format(i, train.getlabel().asnumpy()[i])

train.reset()
data = train.next().data[0].asnumpy()
for i in range(10):
    X = data[i]
    X = np.swapaxes(X, 0, 2)
    X = np.swapaxes(X, 0, 1)
    X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
    cv2.imwrite("{}_reset_result.jpg".format(i), X)
    print "reset_image_{}--->{}".format(i, train.getlabel().asnumpy()[i])

train.reset()
data = train.next().data[0].asnumpy()
for i in range(10):
    X = data[i]
    X = np.swapaxes(X, 0, 2)
    X = np.swapaxes(X, 0, 1)
    X = cv2.cvtColor(X, cv2.COLOR_RGB2BGR)
    cv2.imwrite("{}_reset2_result.jpg".format(i), X)
    print "reset2_image_{}--->{}".format(i, train.getlabel().asnumpy()[i])
@piiswrong @winstywang Oh, I made a mistake, so this should not be a problem.
Another small potential problem is that every time we start our program, the random results are the same. It may not matter much, but if we change the lr manually and resume training, the randomness is identical on each restart.
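(One rough way around this is to derive the seed from something that differs between runs; whether ImageRecordIter honours a seed argument should be double-checked against the MXNet version in use.)
# Sketch only: vary the random seed per run so resumed training does not replay
# exactly the same augmentation sequence. mx.random.seed() and the 'seed'
# argument of ImageRecordIter are assumed to be available in this MXNet version.
import time
import mxnet as mx

run_seed = int(time.time()) % 100000
mx.random.seed(run_seed)                     # seeds mxnet's global RNG
train = mx.io.ImageRecordIter(
    path_imgrec = '/archive/imagenet/train.rec',
    data_shape = (3, 224, 224),
    batch_size = 10,
    rand_crop = True,
    rand_mirror = True,
    seed = run_seed)                         # assumed per-iterator seed argument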
Another thing is that He et al. shuffle the dataset for each epoch. I have only tested this on the CIFAR-10 dataset with ResNet-56.

Note that 93.16% (with per-epoch shuffling) is about 1% higher than 92.17% (without). In the paper, the reference performance is 93.03% (93.16% > 93.03% > 92.17%). I am wondering whether this will also apply to the ImageNet dataset.
P.S.: Shuffling the whole dataset is expensive, so I am using the following RandomSkipResizeIter:
import mxnet as mx
import numpy as np

class RandomSkipResizeIter(mx.io.DataIter):
    """Resize a DataIter to a given number of batches per epoch,
    randomly skipping samples of the internal iterator so each epoch
    sees a different subset/ordering of the data.

    Parameters
    ----------
    data_iter : DataIter
        Internal data iterator.
    size : int
        Number of batches per epoch to resize to.
    skip_ratio : float
        Probability of skipping each sample; 0 means no skipping.
    reset_internal : bool
        Whether to reset the internal iterator on reset().
    """
    def __init__(self, data_iter, size, skip_ratio=0.5, reset_internal=False):
        super(RandomSkipResizeIter, self).__init__()
        self.data_iter = data_iter
        self.size = size
        self.reset_internal = reset_internal
        self.cur = 0
        self.current_batch = None
        self.prev_batch = None
        self.skip_ratio = skip_ratio
        self.provide_data = data_iter.provide_data
        self.provide_label = data_iter.provide_label
        self.batch_size = data_iter.batch_size

    def reset(self):
        self.cur = 0
        if self.reset_internal:
            self.data_iter.reset()

    def __get_next(self):
        try:
            return self.data_iter.next()
        except StopIteration:
            self.data_iter.reset()
            return self.data_iter.next()

    def iter_next(self):
        if self.cur == self.size:
            return False
        data, label = [], []
        if self.current_batch is None:
            # very first batch: allocate the output buffers
            batch = self.__get_next()
            self.current_batch = mx.io.DataBatch(
                data=[mx.nd.empty(batch.data[0].shape)],
                label=[mx.nd.empty(batch.label[0].shape)])
            keep = np.random.rand(self.batch_size) > self.skip_ratio
            batch_data = batch.data[0].asnumpy()
            batch_label = batch.label[0].asnumpy()
            data.extend(batch_data[keep])
            label.extend(batch_label[keep])
        elif self.prev_batch is not None:
            # start from the leftover samples of the previous call
            batch_data, batch_label = self.prev_batch
            data.extend(batch_data)
            label.extend(batch_label)
        while len(data) < self.batch_size:
            batch = self.__get_next()
            keep = np.random.rand(self.batch_size) > self.skip_ratio
            batch_data = batch.data[0].asnumpy()
            batch_label = batch.label[0].asnumpy()
            data.extend(batch_data[keep])
            label.extend(batch_label[keep])
        if len(data) > self.batch_size:
            self.prev_batch = data[self.batch_size:], label[self.batch_size:]
        else:
            self.prev_batch = None
        self.current_batch.data[0][:] = np.asarray(data[:self.batch_size])
        self.current_batch.label[0][:] = np.asarray(label[:self.batch_size])
        self.cur += 1
        return True

    def getdata(self):
        return self.current_batch.data

    def getlabel(self):
        return self.current_batch.label

    def getindex(self):
        return self.current_batch.index

    def getpad(self):
        return self.current_batch.pad
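(For reference, a rough usage sketch; the path and sizes below are placeholders, not from this thread.)
# Wrap the record iterator so each "epoch" of `size` batches draws a randomly
# skipped subset of the underlying data.
base_iter = mx.io.ImageRecordIter(
    path_imgrec = '/archive/imagenet/train.rec',   # placeholder path
    data_shape = (3, 224, 224),
    batch_size = 256,
    rand_crop = True,
    rand_mirror = True)

batches_per_epoch = 1281167 // 256    # one epoch's worth of output batches for ImageNet-1k
train = RandomSkipResizeIter(base_iter, size=batches_per_epoch, skip_ratio=0.5)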
CIFAR-10 is no problem; I reproduced ResNet-164 with a top-1 accuracy of 94.54%.
@mli @antinucleon @winstywang @taoari @piiswrong I have reproduced the ResNet-50 result using @mli's .rec with MXNet's io. All the code, models, and training logs can be found at https://github.com/tornadomeet/ResNet.
Thanks everyone for the discussion.
@taoari Would you open a PR for RandomSkipResizeIter? I think it will be useful when training ImageNet models.
I have added your RandomSkipResizeIter to my ResNet training code, and I'll verify it in ImageNet training~