When using the Python package and creating a DMatrix from a single row whose last column is zero, with feature_names provided, this raises a ValueError:
>>> import xgboost
>>> col_names = ('a', 'b', 'c')
# this is ok
>>> dtest1 = xgboost.DMatrix([1, 0, 3], feature_names=col_names)
# this fails
>>> dtest2 = xgboost.DMatrix([1, 2, 0], feature_names=col_names)
ValueError: feature_names must have the same length as data
Truncating the last column doesn't help, as that just causes a column mismatch when used with the model:
>>> dtrain = xgboost.DMatrix([1, 2, 3], label=[1], feature_names=col_names)
>>> model = xgboost.train({"method": "xgboost", "objective": "binary:logistic", }, dtrain, 1)
>>> model.predict(dtest1)
array([ 0.5], dtype=float32)
>>> model.predict(xgboost.DMatrix([1, 2], feature_names=('a', 'b')))
ValueError: feature_names mismatch: ['a', 'b', 'c'] ['a', 'b']
I looked into it a little; it is caused here.
The number of columns can't be inferred correctly if the right-most column is filled with zeros. It also means an all-zero right-most column is meaningless as a feature.
It could be fixed by passing the matrix width as an argument (this also affects the Java and R packages), but I'm not sure it's worth doing.
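To see the inference problem concretely, here is a minimal sketch using scipy.sparse, which performs the same kind of max-index width inference (the triplet data here is made up for illustration):

import numpy as np
import scipy.sparse as sp

# From a dense array the width is explicit, so nothing is lost:
print(sp.csr_matrix(np.array([[1.0, 2.0, 0.0]])).shape)  # (1, 3)

# From (values, (rows, cols)) triplets without an explicit shape, the width
# is inferred as max(column index) + 1, so a trailing all-zero column
# silently disappears:
print(sp.coo_matrix(([1.0, 2.0], ([0, 0], [0, 1]))).shape)  # (1, 2)

# Passing the width explicitly (the fix suggested above) preserves it:
print(sp.coo_matrix(([1.0, 2.0], ([0, 0], [0, 1])), shape=(1, 3)).shape)  # (1, 3)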
@tqchen what do you think about this issue? For me it's pretty critical, and it's worth changing the API as @sinhrks proposed.
A workaround is to insert nonzero elements at the right:
import numpy as np
import scipy.sparse as sp

# append an all-ones column so the right-most column is never all-zero
X = sp.hstack((X, sp.csr_matrix(np.ones((X.shape[0], 1)))))
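Note that the dummy column has to be appended consistently at both training and prediction time, with a matching name appended to feature_names. A hypothetical sketch, where X_train, X_test, and col_names are assumed placeholders:

import numpy as np
import scipy.sparse as sp
import xgboost

def pad_right(X):
    # append an all-ones column so the right-most column is never all-zero
    return sp.hstack((X, sp.csr_matrix(np.ones((X.shape[0], 1)))))

names = list(col_names) + ['dummy']  # 'dummy' is a made-up feature name
dtrain = xgboost.DMatrix(pad_right(X_train), feature_names=names)
dtest = xgboost.DMatrix(pad_right(X_test), feature_names=names)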
This is a big issue for me as well. For a quick fix, could we use @luoq's suggestion above, but build it into XGBoost's checks? That is, when XGBoost detects this issue (ValueError: feature_names mismatch ... expected f64, f63 in input data), it internally adds the nonzero elements at the right?
For me, this is a critical feature of using sparse matrices in production. When I train a model, obviously I want to train it on all relevant features in the training set. However, when I'm making a prediction, there's a very good chance that whatever new data I'm making a prediction on is _not_ going to have all those same features. Particularly if a lot of those features are sparse.
Make the production data frame have the same number of columns, with the same names and in the same order, to solve the feature name mismatch issue.
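In pandas that alignment can be done with reindex; a minimal sketch with made-up columns:

import pandas as pd

train_cols = ['a', 'b', 'c']                   # column order used at training time
prod = pd.DataFrame({'b': [2.0], 'a': [1.0]})  # missing 'c', wrong order

# reindex adds any missing columns (filled with 0) and enforces the order
prod = prod.reindex(columns=train_cols, fill_value=0)
print(list(prod.columns))  # ['a', 'b', 'c']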
I think this issue can be closed, because this problem no longer happens in the latest version of xgboost.
I got the following error when I tried running the code below:
test_preds = clf.predict(xgtest, ntree_limit=clf.best_iteration)
What is the cause of this error?
@alexeygrigorev: This occurs for me too with the latest version (0.6a2).
@deepakkm007 However, there is a workaround that works for me: use a pandas DataFrame instead of scipy CSR matrices.
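A sketch of that workaround, where X_csr and col_names are assumed placeholders (note that toarray() densifies the matrix, which may be expensive for large sparse data):

import pandas as pd
import xgboost

df = pd.DataFrame(X_csr.toarray(), columns=col_names)
dtest = xgboost.DMatrix(df)  # column count and feature names come from the frame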
Can confirm having the same problem with csr_matrix.
Edit: a very nasty but seemingly working solution (if you are absolutely desperate and don't want to revert to a previous version, like me) is to brute-force comment out the raise below in the code, if you are using Python:
if my_missing:
    msg += ('\ntraining data did not have the following fields: ' +
            ', '.join(str(s) for s in my_missing))
# raise ValueError(msg.format(self.feature_names, data.feature_names))
https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L1192
But it seems weird to me that the validation fails on the Python side, while the native code seems to handle it quite well.
I guess your version 0.6a2 refers to the package available from PyPI (https://pypi.python.org/pypi/xgboost/), which is dated Aug 9th. The fix was submitted on Sep 23rd in #1606. Please check out the latest code from GitHub; PyPI packages are sometimes not updated for a long time.
@khotilov thanks for reaching out. I can confirm that after a clean install from the latest GitHub repo, the problem is now solved.