Hi,
I'm having some problems with CSR sparse matrices. I train the model on a dataset created by sklearn's TfidfVectorizer, then use the same vectorizer to transform the test dataset.
During prediction, the following error occurs:
<ipython-input-48-bb14ec4ec14c> in <module>()
2
3 for cat, classifier in tqdm(classifiers.items()):
----> 4 pred[cat] = classifier.predict_proba(words)
/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/sklearn.py in predict_proba(self, data, output_margin, ntree_limit)
475 class_probs = self.booster().predict(test_dmatrix,
476 output_margin=output_margin,
--> 477 ntree_limit=ntree_limit)
478 if self.objective == "multi:softprob":
479 return class_probs
/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf)
937 option_mask |= 0x02
938
--> 939 self._validate_features(data)
940
941 length = ctypes.c_ulong()
/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in _validate_features(self, data)
1177
1178 raise ValueError(msg.format(self.feature_names,
-> 1179 data.feature_names))
1180
1181 def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', ...
f38732', 'f38733', 'f38734', 'f38735', 'f38736', 'f38737', 'f38738', 'f38739'] []
expected f4057, f36350, f1683, f1914, f33121, f16637, f21443, f10995, f36221, f24340, f15968, f7863, f38732, ...
f19897, f33500, f37792, f30259, f20094, f27943, f5788, f14369, f9074 in input data
The dimensions are the same at training and prediction time, and the second list appears to be a permutation of the first (the ... symbol is mine; the log is very long).
It may be related to #1238
You need to convert the sparse matrix to a dense array, like this:
xg_train = xgb.DMatrix(X_train.toarray(), label=y_train)
xg_test = xgb.DMatrix(X_test.toarray())
@SpLin12 No, that would be ridiculous. Converting a sparse array to be dense is not an intended fix nor is it possible in the majority of sparse feature spaces.
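To put a rough number on why densifying is often impractical, here is a back-of-the-envelope sketch. The feature count (38740) is taken from the error message above; the row count and the 8-byte float64 element size are assumptions for illustration only, and `dense_bytes` is a hypothetical helper, not part of any library.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed to hold an n_rows x n_cols matrix densely.

    bytes_per_value=8 assumes float64 values (an assumption, not
    something stated in the thread).
    """
    return n_rows * n_cols * bytes_per_value


# 38740 features comes from the error message (f0 .. f38739).
# 100_000 rows is a made-up figure, used only for illustration.
n_features = 38_740
n_rows = 100_000

total = dense_bytes(n_rows, n_features)
print(f"{total / 1e9:.1f} GB")  # about 31 GB for the dense array alone
```

A sparse matrix only stores its nonzero entries, so the same data usually takes a small fraction of that.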
It's definitely #1238. The code works fine with CSC matrices (although performance is ~20% worse).
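For anyone wanting to try the CSC workaround, a minimal sketch of the conversion follows. It uses only scipy and a tiny made-up matrix in place of the real TF-IDF output; `to_csc` is a hypothetical helper name. Whether this avoids the error on a given xgboost version is not something this sketch verifies.

```python
import numpy as np
from scipy import sparse


def to_csc(matrix):
    """Return `matrix` in CSC format; shape and values are unchanged."""
    return sparse.csc_matrix(matrix)


# Tiny stand-in for the vectorizer output (the real one is much larger).
# Note the last column is entirely zero; the shape is still preserved.
words_csr = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0],
                                        [0.3, 0.0, 0.0]]))
words_csc = to_csc(words_csr)

print(words_csc.format, words_csc.shape)
```

The converted matrix would then be passed to `classifier.predict_proba` in place of the CSR one. Training on the same format as well keeps the two code paths consistent.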
Passing a dataframe solved the issue for me.
@nazirmubbashir could you clarify?
@jpbm I used pandas dataframes and passed them directly, without converting to arrays.