Hello,
This is related to this Stack Overflow question.
In brief, the XGBoost documentation on creating a DMatrix in CSR / CSC format is not very helpful:
To load sparse matrix in CSR/CSC format is a little complicated,
the usage is like : suppose a sparse matrix : 1 0 2 0 4 0 0 3 3 1 2 0
It is not clear what the "suppose" above means unless you are already familiar with sparse matrix formats. This request is to improve the documentation. For example, it is unclear how this maps to the A, IA, JA vectors of the standard Yale format.
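For readers unfamiliar with the format, here is a minimal sketch of how a dense matrix maps to the three CSR arrays. It assumes the twelve numbers quoted from the documentation form a 3x4 matrix read row by row (the layout is not recoverable from the quote itself), and the class name `DenseToCsr` is hypothetical. In Yale terms, `data` plays the role of A, `rowHeaders` of IA, and `colIndex` of JA.

```java
import java.util.Arrays;

// Hypothetical helper, not part of XGBoost: converts a dense matrix to CSR arrays.
final class DenseToCsr {
    final long[] rowHeaders; // IA: rowHeaders[i] is where row i starts in data/colIndex
    final int[] colIndex;    // JA: column of each stored value
    final float[] data;      // A: the non-zero values, row by row

    DenseToCsr(float[][] dense) {
        // First pass: count non-zero entries so the arrays can be sized.
        int nnz = 0;
        for (float[] row : dense) {
            for (float v : row) {
                if (v != 0f) nnz++;
            }
        }
        rowHeaders = new long[dense.length + 1];
        colIndex = new int[nnz];
        data = new float[nnz];

        // Second pass: record each non-zero value and its column.
        int k = 0;
        for (int r = 0; r < dense.length; r++) {
            rowHeaders[r] = k;
            for (int c = 0; c < dense[r].length; c++) {
                if (dense[r][c] != 0f) {
                    colIndex[k] = c;
                    data[k] = dense[r][c];
                    k++;
                }
            }
        }
        rowHeaders[dense.length] = k; // final entry is the total number of non-zeros
    }

    public static void main(String[] args) {
        // Assumed 3x4 layout of the matrix quoted from the documentation.
        float[][] dense = {
            {1, 0, 2, 0},
            {4, 0, 0, 3},
            {3, 1, 2, 0}
        };
        DenseToCsr csr = new DenseToCsr(dense);
        System.out.println(Arrays.toString(csr.rowHeaders)); // [0, 2, 4, 7]
        System.out.println(Arrays.toString(csr.colIndex));   // [0, 2, 0, 3, 0, 1, 2]
        System.out.println(Arrays.toString(csr.data));       // [1.0, 2.0, 4.0, 3.0, 3.0, 1.0, 2.0]
    }
}
```

Note that `rowHeaders` always has one more entry than the matrix has rows, so a one-row matrix needs two entries rather than one.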
I tried loading the first row of the Agaricus test data set (in libSVM format):
0 1:1 9:1 19:1 21:1 24:1 34:1 36:1 39:1 42:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 117:1 122:1
I created a DMatrix as follows:
DMatrix dMatrix = new DMatrix(new long[] {0},
new int[] {1, 9, 19, 21, 24, 34, 36, 39, 42, 53, 56, 65, 69, 77, 86, 88, 92, 95, 102, 106, 117, 122},
new float[] {1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f},
DMatrix.SparseType.CSC, 1);
But this does not give a sensible answer: it should be about 4.941254E-4, but it gives 0.5. Could someone please explain how to construct a simple one-row DMatrix using existing column index -> float value pairs as shown above?
This also does not work:
DMatrix dMatrix = new DMatrix(new long[] {0, 22},
new int[] {1, 9, 19, 21, 24, 34, 36, 39, 42, 53, 56, 65, 69, 77, 86, 88, 92, 95, 102, 106, 117, 122},
new float[] {1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f},
DMatrix.SparseType.CSR, 1);
Which errors out with
Check failed: mat.info.num_col <= num_col (123 vs. 1) num_col=1 vs 123
Stack trace returned 5 entries:
[bt] (0) 0 libxgboost4j8770323223057210570.dylib 0x0000000126e05f29 _ZN4dmlc15LogMessageFatalD2Ev + 41
[bt] (1) 1 libxgboost4j8770323223057210570.dylib 0x0000000126e07f5d XGDMatrixCreateFromCSREx + 957
[bt] (2) 2 libxgboost4j8770323223057210570.dylib 0x0000000126e03baa Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromCSREx + 170
[bt] (3) 3 ??? 0x000000010fe20a88 0x0 + 4561439368
[bt] (4) 4 ??? 0x000000010fe07da0 0x0 + 4561337760
Also I have observed that if I remove the shapeParam, it works. But that's the deprecated API. Now it comes down to finding out what the shapeParam is doing.
In the end I set the shapeParam to the number of columns, so 123 in this case (the largest column index is 122, which matches the num_col check in the error above). I would still like to keep this issue open, as this is one place where the documentation creates an obstacle and could be much clearer. Thanks!
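As a sketch of the approach that ended up working, the following parses one libSVM line into the arrays for a one-row CSR matrix and derives the column count from the largest index. The class name `LibSvmRowToCsr` is hypothetical, and taking the column count from a single row is an assumption made for this example; a model trained on wider data would presumably need its full feature count instead. The `DMatrix` call appears only in a comment, mirroring the constructor used earlier in this thread, so that the block runs without the XGBoost jar.

```java
import java.util.Arrays;

// Hypothetical helper, not part of XGBoost: parses one libSVM line into CSR arrays.
final class LibSvmRowToCsr {
    final float label;
    final long[] rowHeaders; // two entries for a single row: {0, number of non-zeros}
    final int[] colIndex;
    final float[] data;
    final int numCols;       // largest column index + 1 (assumption: derived from this row only)

    LibSvmRowToCsr(String line) {
        String[] tokens = line.trim().split("\\s+");
        label = Float.parseFloat(tokens[0]);

        int nnz = tokens.length - 1;
        colIndex = new int[nnz];
        data = new float[nnz];
        int maxCol = -1;
        for (int i = 0; i < nnz; i++) {
            // Each remaining token has the form "index:value".
            String[] pair = tokens[i + 1].split(":");
            colIndex[i] = Integer.parseInt(pair[0]);
            data[i] = Float.parseFloat(pair[1]);
            maxCol = Math.max(maxCol, colIndex[i]);
        }
        rowHeaders = new long[] {0, nnz};
        numCols = maxCol + 1;
    }

    public static void main(String[] args) {
        String line = "0 1:1 9:1 19:1 21:1 24:1 34:1 36:1 39:1 42:1 53:1 56:1 65:1 "
            + "69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 117:1 122:1";
        LibSvmRowToCsr row = new LibSvmRowToCsr(line);
        System.out.println(Arrays.toString(row.rowHeaders)); // [0, 22]
        System.out.println(row.numCols);                     // 123

        // With xgboost4j on the classpath, the arrays would be passed as in this thread:
        // DMatrix dMatrix = new DMatrix(row.rowHeaders, row.colIndex, row.data,
        //                               DMatrix.SparseType.CSR, row.numCols);
    }
}
```

For CSR input, the last argument is compared against the number of columns, which is why passing 1 triggered the `num_col` check above.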
@mobiusinversion Would you like to help us clarify the XGBoost documentation? I agree that the documentation should have a section describing the sparse data format. I'd like to refer to other tutorials on sparse format and see how they explain CSR and CSC. Do you have a reference to "the standard Yale format"?
Hi @hcho3, thank you for the reply and for your interest in a section describing the sparse matrix format. I think that would be great as well. One reference I have found helpful for cross-referencing with XGBoost is the Wikipedia entry on the CSR and CSC Yale formats. Reading that helped me decode the current XGBoost documentation on the sparse matrix format. Another thing that might help is an example or a test in the BasicWalkThrough test that shows the construction of an in-memory DMatrix using the non-deprecated constructor. The documentation was a little confusing because of the plain-text sparse matrix: "To load sparse matrix in CSR/CSC format is a little complicated, the usage is like : suppose a sparse matrix : 1 0 2 0 4 0 0 3 3 1 2 0".
Consolidating to #3439. This issue should be re-opened if someone decides to actively work on writing the document.