Xgboost: feature_names mismatch on sparse matrices

Created on 5 Aug 2016 · 6 Comments · Source: dmlc/xgboost

Hi,

I'm having some problems with CSR sparse matrices. I train the model on a dataset created by sklearn's TfidfVectorizer, then use the same vectorizer to transform the test dataset.
During prediction, the following error occurs:

<ipython-input-48-bb14ec4ec14c> in <module>()
      2 
      3 for cat, classifier in tqdm(classifiers.items()):
----> 4     pred[cat] = classifier.predict_proba(words)

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/sklearn.py in predict_proba(self, data, output_margin, ntree_limit)
    475         class_probs = self.booster().predict(test_dmatrix,
    476                                              output_margin=output_margin,
--> 477                                              ntree_limit=ntree_limit)
    478         if self.objective == "multi:softprob":
    479             return class_probs

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf)
    937             option_mask |= 0x02
    938 
--> 939         self._validate_features(data)
    940 
    941         length = ctypes.c_ulong()

/home/inexxt/anaconda3/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in _validate_features(self, data)
   1177 
   1178                 raise ValueError(msg.format(self.feature_names,
-> 1179                                             data.feature_names))
   1180 
   1181     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):


ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', ...
f38732', 'f38733', 'f38734', 'f38735', 'f38736', 'f38737', 'f38738', 'f38739'] []
expected f4057, f36350, f1683, f1914, f33121, f16637, f21443, f10995, f36221, f24340, f15968, f7863, f38732, ...
f19897, f33500, f37792, f30259, f20094, f27943, f5788, f14369, f9074 in input data

The dimensions are the same at training and prediction time, and the second list appears to be a permutation of the first (the "..." is mine; the log is very long).
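A quick way to double-check the "same dimensions" claim is to compare the column counts of the two matrices and list the columns that have no stored values in the test matrix. The sketch below uses toy matrices in place of the TF-IDF output; the helper name `compare_feature_spaces` is made up for illustration, and treating empty test columns as something worth ruling out is an assumption, not a confirmed cause of the error.

```python
import numpy as np
from scipy import sparse


def compare_feature_spaces(X_train, X_test):
    """Compare column counts and find columns with no stored values in X_test."""
    X_train = sparse.csr_matrix(X_train)
    X_test = sparse.csr_matrix(X_test)
    # getnnz(axis=0) returns the number of stored entries in each column.
    test_nnz_per_col = X_test.getnnz(axis=0)
    return {
        "train_cols": X_train.shape[1],
        "test_cols": X_test.shape[1],
        "same_width": X_train.shape[1] == X_test.shape[1],
        "empty_in_test": np.flatnonzero(test_nnz_per_col == 0),
    }


# Toy stand-ins for vectorizer.fit_transform(train) and vectorizer.transform(test).
X_train = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0, 0.5],
                                      [0.0, 3.0, 0.0, 1.0]]))
X_test = sparse.csr_matrix(np.array([[0.0, 4.0, 0.0, 0.0],
                                     [5.0, 0.0, 0.0, 0.0]]))

report = compare_feature_spaces(X_train, X_test)
print(report)
```

If `same_width` is true but the error persists, the shapes reported by SciPy are not the whole story, which is consistent with what the reporter describes.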

It may be related to #1238

All 6 comments

You need to convert the sparse matrix to a dense array, like this:

xg_train = xgb.DMatrix(X_train.toarray(), label=y_train)
xg_test = xgb.DMatrix(X_test.toarray())

@SpLin12 No, that would be ridiculous. Converting a sparse array to be dense is not an intended fix nor is it possible in the majority of sparse feature spaces.
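To put rough numbers on why densifying is often impractical, here is a back-of-the-envelope estimate. The feature count (38,740) comes from the traceback above; the row count and density are hypothetical values chosen only for illustration, and the CSR estimate assumes 8-byte values and 4-byte indices.

```python
N_FEATURES = 38_740   # from the traceback: features f0 .. f38739
N_ROWS = 100_000      # hypothetical number of documents
DENSITY = 0.001       # hypothetical fraction of non-zero entries


def dense_bytes(n_rows, n_cols, itemsize=8):
    """Memory for a dense float64 array of the given shape."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, n_cols, density, itemsize=8, index_size=4):
    """Approximate memory for a CSR matrix: data + column indices + row pointers."""
    nnz = round(n_rows * n_cols * density)
    return nnz * (itemsize + index_size) + (n_rows + 1) * index_size


dense = dense_bytes(N_ROWS, N_FEATURES)
sparse_est = csr_bytes(N_ROWS, N_FEATURES, DENSITY)

print(f"dense:  {dense / 2**30:.1f} GiB")
print(f"sparse: {sparse_est / 2**20:.1f} MiB")
```

Under these assumptions the dense array needs tens of gigabytes while the CSR representation stays in the tens of megabytes, so `.toarray()` only works for small inputs.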

It's definitely #1238. The code works fine with CSC matrices (although performance is ~20% worse).
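A minimal sketch of that workaround, assuming the matrices come out of the vectorizer in CSR format: convert them with SciPy's `.tocsc()` before handing them to the model. The toy matrix stands in for the real TF-IDF data, and the commented-out call uses the `classifier` object from the original report, which is not defined here.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for vectorizer.transform(test_docs), which returns CSR.
X_csr = sparse.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                    [2.0, 0.0, 0.0]]))

# Same values and shape, stored column-wise instead of row-wise.
X_csc = X_csr.tocsc()

print(X_csc.format, X_csc.shape)

# The conversion does not change any values.
n_differences = (X_csr != X_csc).nnz

# Then train and predict on the CSC matrices, e.g.:
# pred = classifier.predict_proba(X_csc)
```

The conversion keeps the data sparse, so it avoids the memory cost of `.toarray()`; the reported slowdown is the trade-off.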

Passing a DataFrame solved the issue for me.

@nazirmubbashir could you clarify?

@jpbm I used pandas dataframes and passed them directly, without converting to arrays.
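The comment does not say how the DataFrames were built. One possible reading, sketched below, is to wrap the train and test matrices in DataFrames that share explicit column names, so both sides carry identical feature names. `pd.DataFrame.sparse.from_spmatrix` needs pandas 0.25 or later; whether a given XGBoost version accepts sparse-backed DataFrames is an assumption to verify, and the `f0`, `f1`, ... column names are chosen here only to mirror the default names in the traceback.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for the TF-IDF train and test matrices.
X_train = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0, 0.5],
                                      [0.0, 3.0, 0.0, 1.0]]))
X_test = sparse.csr_matrix(np.array([[0.0, 4.0, 0.0, 0.0],
                                     [5.0, 0.0, 0.0, 0.0]]))

# Use one shared list of column names for both frames.
columns = [f"f{i}" for i in range(X_train.shape[1])]

train_df = pd.DataFrame.sparse.from_spmatrix(X_train, columns=columns)
test_df = pd.DataFrame.sparse.from_spmatrix(X_test, columns=columns)

print(list(test_df.columns))

# Fit and predict with the frames instead of the raw matrices, e.g.:
# classifier.fit(train_df, y_train)
# pred = classifier.predict_proba(test_df)
```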
