hi,
I have a dataset that consists not of images but of simple vectors, like [1 5 8 0 9]. The training data is about 750 GB and contains millions of vectors.
I know from the tutorials that NDArrayIter can be used for my own dataset: https://github.com/dmlc/mxnet/tree/master/example/image-classification#use-your-own-datasets. However, the tutorial also says it is meant for "small datasets which can be easily loaded into memory".
How can I fit this dataset into mxnet?
What is the format of your data? I think you can try NDArrayIter with http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.memmap.html
@piiswrong, thanks for the reply.
Each sample of my data is a 1D vector, like [1 5 8 0 9].
I just saw that there is another similar open issue, https://github.com/dmlc/mxnet/issues/791. Has the PR mentioned in it been merged?
Currently you can use the CSV iterator. I am working on an SFrame iterator (https://github.com/dato-code/SFrame), which will be available soon.
I see, will try this out. Thanks.
NDArrayIter does work on numpy memory maps. I am using it with multiple inputs, each ~250G in size.
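To illustrate the memory-map approach mentioned above, here is a minimal sketch that uses only numpy. It assumes the vectors are stored as a raw binary file of float32 values; the file name, the sizes, and the helper `iter_batches` are hypothetical and chosen for the demo. Whether a given mxnet version reads a memmap lazily when it is handed to NDArrayIter is not checked here, so the sketch only shows the batch-wise reading itself.

```python
import os
import tempfile

import numpy as np


def iter_batches(path, num_samples, num_features, batch_size, dtype=np.float32):
    """Yield mini-batches from a raw binary file without loading the whole file."""
    # mode="r" maps the file read-only; nothing is read until it is sliced.
    data = np.memmap(path, dtype=dtype, mode="r", shape=(num_samples, num_features))
    for start in range(0, num_samples, batch_size):
        # np.array copies just this slice into ordinary memory.
        yield np.array(data[start:start + batch_size])


# Demo on a small temporary file (assumed layout: row-major float32).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "train-data.bin")
num_samples, num_features = 1000, 5

writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(num_samples, num_features))
writer[:] = np.arange(num_samples * num_features, dtype=np.float32).reshape(
    num_samples, num_features
)
writer.flush()
del writer  # close the writable mapping

batches = list(iter_batches(path, num_samples, num_features, batch_size=128))
print(len(batches), batches[0].shape, batches[-1].shape)
```

Only one batch is resident in memory at a time, so the same loop applies to a file far larger than RAM, provided the shape and dtype passed to `np.memmap` match how the file was written.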
@jmschrei Thanks, I will try it later.
Hi @antinucleon, I tried your suggestion to use the CSV iterator for the input data.
I prepared two CSV files, data.csv and label.csv, for the data and the labels respectively.
When I try to feed the CSV iterator to the model, it keeps saying "Data CSV's row is smaller than the number of rows in label_csv". What is this problem, and how can I fix it?
Each row in data.csv is a 1D vector.
Each row in label.csv is 0 or 1.
The two CSV files have the same number of rows.
1. Please make sure there is no header in the CSV.
import mxnet as mx

batch_size = 128
num_fea = 5  # number of features per sample
data_train = mx.io.CSVIter(data_csv="./train-data.csv", data_shape=(num_fea,),
                           label_csv="./train-stytole.csv", label_shape=(2,),
                           batch_size=batch_size)
The label CSV looks like this:
0,1\n
1,0\n
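As a way to check the points above before training, the following sketch writes a data file and a label file with no header and confirms that both have the same number of rows. It uses only the Python standard library; the file names, the random values, and the helpers `write_csv` and `count_rows` are hypothetical, and the two-column label layout follows the example above.

```python
import csv
import os
import random
import tempfile


def write_csv(path, rows):
    """Write rows as plain comma-separated lines with no header."""
    with open(path, "w", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(rows)


def count_rows(path):
    """Count non-empty rows in a CSV file."""
    with open(path, newline="") as f:
        return sum(1 for row in csv.reader(f) if row)


random.seed(0)
num_samples, num_fea = 1000, 5

# One feature vector per row, e.g. 1,5,8,0,9
data_rows = [[random.randint(0, 9) for _ in range(num_fea)] for _ in range(num_samples)]
# One two-column label per row, e.g. 0,1 or 1,0
label_rows = [[0, 1] if random.random() < 0.5 else [1, 0] for _ in range(num_samples)]

out_dir = tempfile.mkdtemp()
data_path = os.path.join(out_dir, "train-data.csv")
label_path = os.path.join(out_dir, "train-label.csv")
write_csv(data_path, data_rows)
write_csv(label_path, label_rows)

n_data, n_label = count_rows(data_path), count_rows(label_path)
if n_data != n_label:
    raise ValueError("row mismatch: %d data rows vs %d label rows" % (n_data, n_label))
print(n_data, n_label)
```

Running `count_rows` on existing files is a quick way to rule out a stray header or a missing final line as the cause of a row-count error.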
@antinucleon @piiswrong
I think I just solved this issue by adding one additional row to the label file, which means the number of rows in the label file is larger than in the training data file.
Is it a bug?
I think HDF5 data iter support should be useful. Though introducing an extra dependency on libhdf5 is not desired, we can implement it in the front-end side as most languages have mature 3rd party HDF5 support.
@mazeLinx What did you mean by "adding one additional row to the label file"?