Xgboost: feature_names mismatch on sparse matrices

Created on 5 Aug 2016 · 6 Comments · Source: dmlc/xgboost

Hi,

I'm having some problems with CSR sparse matrices. I train the model on a dataset created by sklearn's TfidfVectorizer, then use the same vectorizer to transform the test dataset.
During prediction, the following error occurs:

<ipython-input-48-bb14ec4ec14c> in <module>()
      2 
      3 for cat, classifier in tqdm(classifiers.items()):
----> 4     pred[cat] = classifier.predict_proba(words)

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/sklearn.py in predict_proba(self, data, output_margin, ntree_limit)
    475         class_probs = self.booster().predict(test_dmatrix,
    476                                              output_margin=output_margin,
--> 477                                              ntree_limit=ntree_limit)
    478         if self.objective == "multi:softprob":
    479             return class_probs

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf)
    937             option_mask |= 0x02
    938 
--> 939         self._validate_features(data)
    940 
    941         length = ctypes.c_ulong()

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in _validate_features(self, data)
   1177 
   1178                 raise ValueError(msg.format(self.feature_names,
-> 1179                                             data.feature_names))
   1180 
   1181     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):


ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', ...
f38732', 'f38733', 'f38734', 'f38735', 'f38736', 'f38737', 'f38738', 'f38739'] []
expected f4057, f36350, f1683, f1914, f33121, f16637, f21443, f10995, f36221, f24340, f15968, f7863, f38732, ...
f19897, f33500, f37792, f30259, f20094, f27943, f5788, f14369, f9074 in input data

The dimensions are the same at training and prediction time, and the second list appears to be a permutation of the first (the ... symbol is mine; the log is very long).

It may be related to #1238
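
One way to read the message above: the traceback formats it from `self.feature_names` and `data.feature_names`, so the two lists after `feature_names mismatch:` would be the model's names followed by the input's, and the second prints as `[]`. The names after `expected` come out in no particular order, which is what you would see if they were the elements of a set difference rather than a reordered copy of the input's columns. The sketch below uses a hypothetical helper (`diff_feature_names` is not part of xgboost) to make the same kind of comparison on toy lists. It illustrates that reading only and is not the library's implementation.

```python
def diff_feature_names(expected, actual):
    """Compare the names a model expects with the names the input exposes.

    Hypothetical helper for illustration; not part of xgboost.
    """
    expected_set, actual_set = set(expected), set(actual)
    return {
        # Names the model was trained on but the input does not provide.
        "missing_from_input": sorted(expected_set - actual_set),
        # Names the input provides but the model never saw.
        "unexpected_in_input": sorted(actual_set - expected_set),
        # True only when both sides hold the same names in a different order.
        "same_names_different_order": (
            expected_set == actual_set and list(expected) != list(actual)
        ),
    }


# Toy data (assumption): the model saw four columns, the input exposes none.
trained = ["f0", "f1", "f2", "f3"]
report = diff_feature_names(trained, [])
print(report)
```

If `same_names_different_order` were true, the problem would be column ordering. An empty input list points instead at how the test matrix was turned into a DMatrix.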

inexxt

👍1

All 6 comments

You need to convert the sparse matrix to a dense array, like this:

import xgboost as xgb

xg_train = xgb.DMatrix(X_train.toarray(), label=y_train)
xg_test = xgb.DMatrix(X_test.toarray())

bahshetsian on 7 Aug 2016

👎3

@SpLin12 No, that would be ridiculous. Converting a sparse array to be dense is not an intended fix nor is it possible in the majority of sparse feature spaces.

dmcgarry on 1 Sep 2016

👍9
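
To put rough numbers on why densifying is often not an option: the error message above lists features f0 through f38739, i.e. 38,740 columns. The row count is not given in the thread, so the 100,000 rows and the 100 non-zero terms per document used below are assumptions chosen purely for illustration. The helper names are hypothetical as well.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value=4):
    """Memory for a dense float32 array of the given shape, in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30


def csr_size_gib(n_rows, nnz, bytes_per_value=4, bytes_per_index=4):
    """Approximate memory for a CSR matrix: values, column indices, row pointers."""
    total = nnz * (bytes_per_value + bytes_per_index) + (n_rows + 1) * bytes_per_index
    return total / 2**30


n_cols = 38740          # f0 .. f38739, taken from the error message
n_rows = 100000         # assumption: not stated in the thread
nnz_per_row = 100       # assumption: typical document touches few terms

dense = dense_size_gib(n_rows, n_cols)
sparse_est = csr_size_gib(n_rows, n_rows * nnz_per_row)
print("dense: %.2f GiB, CSR: %.3f GiB" % (dense, sparse_est))
```

Under these assumptions the dense array needs roughly 14 GiB while the CSR form stays well under 0.1 GiB, so `.toarray()` only works when the data is small.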

It's definitely #1238; the code works fine with CSC matrices (although the performance is ~20% worse).

inexxt on 2 Sep 2016
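
The CSC workaround mentioned above can be sketched with SciPy's format conversion. This is a minimal sketch on toy data standing in for the TF-IDF output. It only shows the conversion itself and leaves out the xgboost calls so that it runs without a trained model.

```python
import numpy as np
from scipy import sparse

# Toy stand-in (assumption) for the vectorizer output: 3 documents, 5 terms.
X_csr = sparse.csr_matrix(
    np.array([[0.0, 0.5, 0.0, 0.0, 0.0],
              [0.3, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.7, 0.0, 0.0]])
)

# Same values and shape, stored column by column instead of row by row.
X_csc = X_csr.tocsc()

print(X_csc.format, X_csc.shape)

# The converted matrix would then be passed where the CSR one was used,
# e.g. classifier.predict_proba(X_csc) in the loop from the traceback.
```

The conversion copies the data once, so it costs some time and memory up front, separate from the ~20% slowdown reported above.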

Passing a dataframe solved the issue for me.