Hi,
Since the input training data is often organized as the sparse matrix in many datasets, what is the format of such those sparse training instances presented in the input files? For example, if one training instance consists of 1000 dimensions, and 100 of those dimensions are not zero, then is the training instance presented in the input file by using 900 zeros and 100 non-zeros? The same sparse instances in the datasets which are published on the LibSVM website are organized as the format of "dimension_id : value".
Best wishes,
Yawei
Hi Yawei,
mlpack unfortunately doesn't have any current support for loading sparse matrices from disk. In addition, because of this, the command-line programs only load dense data.
So if you want to use sparse data specifically, I think the best way is to write a C++ program using arma::sp_mat. But to make it harder... Armadillo does not have good documentation for their support for loading sparse matrices. You can load a coordinate list of the form
1 2 10.3
3 1 5.2
3 2 1.3
and this represents a matrix with three nonzero elements. You can load it using the function
arma::sp_mat m;
m.load("file.txt", arma::coord_ascii);
and then you can use that in mlpack methods. I wish that this was documented in the Armadillo docs but currently it is not.
I hope this is helpful... let me know if I can clarify anything.
Hi Ryan,
Thanks for your answer! It really helps me understand how to use MLPACK. Could I (or you) add your suggestion into the doc of MLPACK?
I updated the documentation in e36eec5; it's online at
http://mlpack.org/docs/mlpack-git/doxygen.php?doc=formatdoc.html
Let me know what you think, if anything can be clarified. I'll mark this as resolved since I've updated the documentation, but let me know if there is anything else to be done.
Thanks for pointing this out!
Ryan
Hi Ryan,
I have read the update documentation. I think it is clear and understandable. Thanks for your time. Nice work!!!!
Yawei
Most helpful comment
Hi Yawei,
mlpack unfortunately doesn't have any current support for loading sparse matrices from disk. In addition, because of this, the command-line programs only load dense data.
So if you want to use sparse data specifically, I think the best way is to write a C++ program using
arma::sp_mat. But to make it harder... Armadillo does not have good documentation for their support for loading sparse matrices. You can load a coordinate list of the formand this represents a matrix with three nonzero elements. You can load it using the function
and then you can use that in mlpack methods. I wish that this was documented in the Armadillo docs but currently it is not.
I hope this is helpful... let me know if I can clarify anything.