Incubator-mxnet: Large data set for mxnet (no image format)

Created on 28 Dec 2015 · 11 Comments · Source: apache/incubator-mxnet

Hi,
I have a data set that consists not of images but of simple vectors, like [1 5 8 0 9]. The training data is about 750G in size and contains millions of vectors.

I know from the tutorials that NDArrayIter can be used for my own dataset: https://github.com/dmlc/mxnet/tree/master/example/image-classification#use-your-own-datasets. However, they also say it is meant for "small datasets which can be easily loaded into memory".

How can I feed this data set to mxnet?

mazeLinx

Most helpful comment

I think HDF5 data iter support would be useful. Although introducing an extra dependency on libhdf5 is not desirable, we can implement it on the front-end side, as most languages have mature third-party HDF5 support.

pluskid on 28 Dec 2015

👍9

All 11 comments

What is the format of your data? I think you can try NDArrayIter with http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.memmap.html

piiswrong on 28 Dec 2015

@piiswrong, thanks for the reply.

Each sample of my data is a 1D vector, like [1 5 8 0 9].

I just saw that there is another similar open issue, https://github.com/dmlc/mxnet/issues/791. Has the PR mentioned in it been merged?

mazeLinx on 28 Dec 2015

Currently you can use the CSV iterator. I am working on an SFrame iter (https://github.com/dato-code/SFrame), which will be available soon.

antinucleon on 28 Dec 2015

I see, will try this out. Thanks.

mazeLinx on 28 Dec 2015

NDArrayIter does work on numpy memory maps. I am using it with multiple inputs, each ~250G in size.

jmschrei on 28 Dec 2015
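Editor's sketch of the memory-map approach discussed above, in case a concrete starting point helps. It uses only NumPy. The file name, the toy sizes, and the `iter_batches` helper are hypothetical and not part of MXNet. The thread reports that such an array can be passed to NDArrayIter; whether the iterator keeps the data on disk or copies it into memory may depend on the MXNet version, so that step appears only as a comment.

```python
import os
import tempfile

import numpy as np

num_samples, num_fea = 1000, 5  # toy sizes; the real data set is far larger

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "train-data.dat")  # hypothetical file name

# Write phase: create the on-disk array once and fill it.
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(num_samples, num_fea))
writer[:] = np.arange(num_samples * num_fea, dtype=np.float32).reshape(num_samples, num_fea)
writer.flush()
del writer

# Read phase: open read-only; data is paged in lazily as slices are accessed.
data = np.memmap(path, dtype=np.float32, mode="r", shape=(num_samples, num_fea))


def iter_batches(array, batch_size):
    """Yield consecutive batches, each copied into an ordinary in-memory array."""
    for start in range(0, array.shape[0], batch_size):
        yield np.array(array[start:start + batch_size])


batches = list(iter_batches(data, 128))
print(len(batches), batches[0].shape, batches[-1].shape)

# Assumption based on this thread: the memmap can be handed to MXNet directly, e.g.
#   train_iter = mx.io.NDArrayIter(data, label, batch_size=128)
```

Only the rows of the current batch are copied into memory at a time, so the full file never has to fit in RAM in this sketch.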

@jmschrei Thanks, I will try it later.

Hi @antinucleon, I tried your suggestion to use the CSV iterator for the input data.
I prepared two CSV files, data.csv and label.csv, for the data and the labels respectively.

When I feed the CSV iterator to the model, it keeps saying "Data CSV's row is smaller than the number of rows in label_csv". What does this error mean, and how can I fix it?

Each row in data.csv is a 1D vector.
Each row in label.csv is 0 or 1.
Both CSV files have the same number of rows.

mazeLinx on 28 Dec 2015
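Editor's sketch: the row counts and column widths described above can be checked with the standard library before the files are handed to CSVIter. The helper names `csv_shape` and `check_pair` and the toy files are hypothetical. The check only verifies that the two files are consistent with each other and with the expected shapes; it does not explain or reproduce the CSVIter error itself.

```python
import csv
import os
import tempfile


def csv_shape(path):
    """Return (row_count, set of column counts) for a header-less CSV, skipping blank lines."""
    rows = 0
    widths = set()
    with open(path, newline="") as f:
        for record in csv.reader(f):
            if not record:  # csv.reader yields [] for a blank line
                continue
            rows += 1
            widths.add(len(record))
    return rows, widths


def check_pair(data_csv, label_csv, num_fea, label_width):
    """Return a list of human-readable problems; an empty list means the files look consistent."""
    problems = []
    data_rows, data_widths = csv_shape(data_csv)
    label_rows, label_widths = csv_shape(label_csv)
    if data_rows != label_rows:
        problems.append("row count mismatch: data=%d, label=%d" % (data_rows, label_rows))
    if data_widths != {num_fea}:
        problems.append("data column counts %s, expected %d" % (sorted(data_widths), num_fea))
    if label_widths != {label_width}:
        problems.append("label column counts %s, expected %d" % (sorted(label_widths), label_width))
    return problems


# Toy files standing in for data.csv and label.csv.
tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "data.csv")
label_path = os.path.join(tmp_dir, "label.csv")
short_label_path = os.path.join(tmp_dir, "label-short.csv")

with open(data_path, "w", newline="") as f:
    csv.writer(f).writerows([[1, 5, 8, 0, 9], [2, 4, 6, 8, 0], [3, 3, 3, 3, 3]])
with open(label_path, "w", newline="") as f:
    csv.writer(f).writerows([[0], [1], [0]])
with open(short_label_path, "w", newline="") as f:
    csv.writer(f).writerows([[0], [1]])

print(check_pair(data_path, label_path, num_fea=5, label_width=1))
print(check_pair(data_path, short_label_path, num_fea=5, label_width=1))
```

The `num_fea` and `label_width` arguments are meant to mirror the `data_shape` and `label_shape` values passed to the iterator.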

1. Please make sure there is no header in the CSV.

Sample:

import mxnet as mx

batch_size = 128
num_fea = 5  # number of features per sample (columns in the data CSV)
data_train = mx.io.CSVIter(data_csv="./train-data.csv", data_shape=(num_fea,),
                           label_csv="./train-stytole.csv", label_shape=(2,),
                           batch_size=batch_size)

The label CSV looks like this (two columns per row, matching label_shape=(2,)):

0,1\n
1,0\n

antinucleon on 28 Dec 2015

Thanks for the reply, @antinucleon!

Yes, there is no header in the CSV, and I did it the same way as in the sample you provided.

I have uploaded my files. Could you take a look? They contain only 10 rows, just to test the functionality.

Label file: ll.txt
Training data file: tt.txt

mazeLinx on 28 Dec 2015

@antinucleon @piiswrong

I think I just solved this issue by adding one additional row to the label file, which means the number of rows in the label file is now larger than in the training data file.

Is it a bug?

mazeLinx on 28 Dec 2015

@mazeLinx What did you mean by "adding one additional row to the label file"?