When using the Python package and creating a DMatrix from a single row whose last column is zero, with feature_names provided, this raises a ValueError:
>>> import xgboost
>>> col_names = ('a', 'b', 'c')
# this is ok
>>> dtest1 = xgboost.DMatrix([1, 0, 3], feature_names=col_names)
# this fails
>>> dtest2 = xgboost.DMatrix([1, 2, 0], feature_names=col_names)
ValueError: feature_names must have the same length as data
Truncating the last column doesn't help, as that just causes a column mismatch when used with the model:
>>> dtrain = xgboost.DMatrix([1, 2, 3], label=[1], feature_names=col_names)
>>> model = xgboost.train({"method": "xgboost", "objective": "binary:logistic", }, dtrain, 1)
>>> model.predict(dtest1)
array([ 0.5], dtype=float32)
>>> model.predict(xgboost.DMatrix([1, 2], feature_names=('a', 'b')))
ValueError: feature_names mismatch: ['a', 'b', 'c'] ['a', 'b']
I looked into it a little; it is caused here.
The number of columns can't be inferred correctly if the right-most column is filled with zeros. It also means an all-zero right-most column is meaningless as a feature.
It could be fixed by passing the matrix width as an argument (this also affects the Java and R packages), but I'm not sure it's worth doing.
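To see the inference problem concretely, here is a minimal sketch using scipy.sparse, which performs the same kind of max-index width inference (the triplet data here is made up for illustration):

import numpy as np
import scipy.sparse as sp

# From a dense array the width is explicit, so nothing is lost:
print(sp.csr_matrix(np.array([[1.0, 2.0, 0.0]])).shape)  # (1, 3)

# From (values, (rows, cols)) triplets without an explicit shape, the width
# is inferred as max(column index) + 1, so a trailing all-zero column
# silently disappears:
print(sp.coo_matrix(([1.0, 2.0], ([0, 0], [0, 1]))).shape)  # (1, 2)

# Passing the width explicitly (the fix suggested above) preserves it:
print(sp.coo_matrix(([1.0, 2.0], ([0, 0], [0, 1])), shape=(1, 3)).shape)  # (1, 3)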
@tqchen what do you think about this issue? For me it's pretty critical, and it's worth changing the API as @sinhrks proposed.
A workaround is to insert nonzero elements at the right:
import numpy as np
import scipy.sparse as sp

# append an all-ones column so the right-most column is never all-zero
X = sp.hstack((X, sp.csr_matrix(np.ones((X.shape[0], 1)))))
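Note that the dummy column has to be appended consistently at both training and prediction time, with a matching name appended to feature_names. A hypothetical sketch, where X_train, X_test, and col_names are assumed placeholders:

import numpy as np
import scipy.sparse as sp
import xgboost

def pad_right(X):
    # append an all-ones column so the right-most column is never all-zero
    return sp.hstack((X, sp.csr_matrix(np.ones((X.shape[0], 1)))))

names = list(col_names) + ['dummy']  # 'dummy' is a made-up feature name
dtrain = xgboost.DMatrix(pad_right(X_train), feature_names=names)
dtest = xgboost.DMatrix(pad_right(X_test), feature_names=names)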
This is a big issue for me as well. For a quick fix, could we use @luoq's suggestion above, but build it into XGBoost's checks? That is, when XGBoost detects this issue (ValueError: feature_names mismatch ... expected f64, f63 in input data), it internally adds the nonzero elements at the right?
For me, this is a critical feature of using sparse matrices in production. When I train a model, obviously I want to train it on all relevant features in the training set. However, when I'm making a prediction, there's a very good chance that whatever new data I'm making a prediction on is _not_ going to have all those same features. Particularly if a lot of those features are sparse.
Make the production data frame have the same number of columns, with the same names and in the same order, to solve the feature name mismatch issue.
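In pandas that alignment can be done with reindex; a minimal sketch with made-up columns:

import pandas as pd

train_cols = ['a', 'b', 'c']                   # column order used at training time
prod = pd.DataFrame({'b': [2.0], 'a': [1.0]})  # missing 'c', wrong order

# reindex adds any missing columns (filled with 0) and enforces the order
prod = prod.reindex(columns=train_cols, fill_value=0)
print(list(prod.columns))  # ['a', 'b', 'c']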
I think this issue can be closed, because this problem no longer happens in the latest version of xgboost.
I got the following error when I tried running the code below:
test_preds = clf.predict(xgtest, ntree_limit=clf.best_iteration)
What is the cause of this error?
@alexeygrigorev: This occurs for me too with the latest version (0.6a2).
@deepakkm007 However, there is a workaround that works for me: use a pandas DataFrame instead of scipy CSR matrices.
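A sketch of that workaround, where X_csr and col_names are assumed placeholders (note that toarray() densifies the matrix, which may be expensive for large sparse data):

import pandas as pd
import xgboost

df = pd.DataFrame(X_csr.toarray(), columns=col_names)
dtest = xgboost.DMatrix(df)  # column count and feature names come from the frame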
Can confirm having the same problem with csr_matrix.
Edit: a very nasty but seemingly working solution (if you are absolutely desperate and don't want to revert to a previous version, like me) is to brute-force comment out the raise below in the code, if you are using Python:
if my_missing:
    msg += ('\ntraining data did not have the following fields: ' +
            ', '.join(str(s) for s in my_missing))
# raise ValueError(msg.format(self.feature_names, data.feature_names))
https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L1192
But it seems weird to me that the validation fails on the Python side, while the native code seems to handle it quite well.
I guess your version 0.6a2 refers to the package available from PyPI (https://pypi.python.org/pypi/xgboost/), which is dated Aug 9th. The fix was submitted on Sep 23rd in #1606. Please check out the latest code from GitHub; PyPI packages are sometimes not updated for a long time.
@khotilov thanks for reaching out. I can confirm that after a clean install from the latest GitHub repo, the problem is now solved.