Mlpack: The format of the sparse instances in the input file

Created on 29 May 2016 · 4 comments · Source: mlpack/mlpack

Hi,

Since the training data in many datasets is organized as a sparse matrix, what format should sparse training instances take in the input files? For example, if one training instance has 1000 dimensions and 100 of them are nonzero, is the instance written in the input file as 900 zeros and 100 nonzero values? The sparse instances in the datasets published on the LibSVM website are stored in the format "dimension_id : value".

Best wishes,

Yawei

Labels: fixed, bug report

All 4 comments

Hi Yawei,

Unfortunately, mlpack currently has no support for loading sparse matrices from disk. Because of this, the command-line programs only load dense data.

So if you want to use sparse data specifically, I think the best approach is to write a C++ program using arma::sp_mat. One complication is that Armadillo's support for loading sparse matrices is not well documented. You can load a coordinate list of the form

1 2 10.3
3 1 5.2
3 2 1.3

which represents a matrix with three nonzero elements (each line holds a row index, a column index, and a value). You can load it with

arma::sp_mat m;
m.load("file.txt", arma::coord_ascii);

and then you can use m in mlpack methods. I wish this were documented in the Armadillo docs, but currently it is not.
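Since the question mentions LibSVM-style files, here is a minimal sketch of how one such line could be turned into "row col value" triplets for a coordinate list file. It is only an illustration under stated assumptions, not part of mlpack or Armadillo: the names Triplet, ParseLibsvmLine, and ToCoordLine are hypothetical, the indices are assumed to be zero-based in the coordinate list, features are assumed to go in rows and instances in columns (mlpack's usual column-per-point layout), and the leading label is simply skipped.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One nonzero entry of a sparse matrix in coordinate-list form.
// (Hypothetical helper type, not an mlpack or Armadillo type.)
struct Triplet
{
  std::size_t row;   // feature index, assumed zero-based
  std::size_t col;   // instance index, assumed zero-based
  double value;
};

// Parse one LibSVM-style line ("label id:value id:value ...") describing the
// instance that should end up in column `pointIndex`. LibSVM feature ids start
// at 1, so they are shifted down by one. The label is read and discarded.
// std::stoul / std::stod throw if a token is not numeric.
std::vector<Triplet> ParseLibsvmLine(const std::string& line,
                                     const std::size_t pointIndex)
{
  std::vector<Triplet> entries;
  std::istringstream in(line);
  std::string token;
  in >> token; // Skip the label.

  while (in >> token)
  {
    const std::size_t colon = token.find(':');
    if (colon == std::string::npos)
      continue; // Not an "id:value" pair; ignore it.

    const std::size_t featureId = std::stoul(token.substr(0, colon));
    const double value = std::stod(token.substr(colon + 1));
    if (featureId == 0 || value == 0.0)
      continue; // Skip invalid ids and explicit zeros.

    entries.push_back({featureId - 1, pointIndex, value});
  }

  return entries;
}

// Format one entry as a "row col value" line (default stream precision).
std::string ToCoordLine(const Triplet& t)
{
  std::ostringstream out;
  out << t.row << ' ' << t.col << ' ' << t.value;
  return out.str();
}
```

Writing the result of ToCoordLine for every entry of every instance, one per line, would give a file in the coordinate list layout described above. Whether the labels should be stored separately, and whether the index conventions match your data, are things to check before relying on this.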

I hope this is helpful... let me know if I can clarify anything.

Hi Ryan,

Thanks for your answer! It really helps me understand how to use mlpack. Could I (or you) add your suggestion to the mlpack documentation?

I updated the documentation in e36eec5; it's online at
http://mlpack.org/docs/mlpack-git/doxygen.php?doc=formatdoc.html

Let me know what you think and whether anything can be clarified. I'll mark this as resolved since I've updated the documentation, but let me know if there is anything else to be done.

Thanks for pointing this out!

Ryan

Hi Ryan,

I have read the updated documentation. I think it is clear and understandable. Thanks for your time. Nice work!

Yawei
