Xgboost: Booster gblinear - feature importance is NaN

Created on 3 Oct 2018 · 22 comments · Source: dmlc/xgboost

Hi,

I'm currently testing an XGBClassifier with a booster gblinear.

XGBClassifier(base_score=0.4, booster='gblinear',
                                 learning_rate=0.1,
                                 min_child_weight=1, n_estimators=40,
                                 n_jobs=1, objective='binary:logistic', random_state=0,
                                 reg_alpha=0.001, reg_lambda=0.01, scale_pos_weight=0.5,
                                 silent=True)

The results are pretty decent compared to gbtree and the model predictions work fine. However, when I try to get clf.feature_importances_, the output is NaN for each feature.

What I tried:

booster = clf.get_booster()
print(booster.get_dump())
['bias:\n0.0198594\nweight:\n-0.244868\n-0.0706597\n-0.278962\n0.361617\n0.0015507\n0.0610324\n0.0900545\n-0.304605\n-0.00666418\n0.000360869\n-0.444427\n0.00217489\n-0.00460279\n-0.12992\n0.0314143\n-0.0242503\n-0.0554258\n-0.0125138\n-0.293126\n9.8356e-05\n-0.038388\n0.0535105\n0.00518177\n0.00366128\n1.06038e-06\n0.00921716\n0.0589181\n-0.0629616\n0.000288531\n-2.10018e-06\n1.57876e-05\n0.000442199\n0.0667466\n-0.0599928\n1.8569e-05\n-0.00974454\n0.0180388\n0.0253814\n0.10383\n-0.0865306\n-0.095947\n0\n0\n-0.0425289\n0.000278106\n0.00921256\n-0.0456559\n-0.192751\n0\n0\n-0.308957\n0\n-0.179057\n0.136655\n0.00010597\n8.40341e-06\n0.00739328\n0.00135821\n']

But:

print(booster.get_fscore())
output: {}

If you have any clue, it would be really great!

Thanks a lot,
Flo

All 22 comments

Is #3261 related?

It doesn't seem related, as my coefficients look good and my predictions are working (but I could be wrong).

Edit: How is feature_importance calculated when gblinear is used? I imagine the weights are used?

Do you have example data?

I can reproduce my error with the iris dataset:

from xgboost import XGBClassifier
from sklearn import datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data  # use all four features
y = iris.target


Boost_linear = XGBClassifier(base_score=0.4, booster='gblinear',
                                 learning_rate=0.1,
                                 n_estimators=40,
                                 n_jobs=1, objective='reg:logistic', random_state=0,
                                 reg_alpha=0.001, reg_lambda=0.01, scale_pos_weight=0.5,
                                 silent=True)

Boost_linear.fit(X, y)
print (Boost_linear.predict_proba(X[0:5,:]))
print(Boost_linear.feature_importances_)
booster = Boost_linear.get_booster()
print(booster.get_dump())
print(booster.get_fscore())

Output:

[[ 0.71836758  0.18413587  0.09749654]
 [ 0.67803079  0.20802365  0.11394557]
 [ 0.6990937   0.19522834  0.10567797]
 [ 0.67483324  0.20943993  0.11572678]
 [ 0.7246449   0.18021189  0.09514326]]
[ nan  nan  nan  nan]
['bias:\n0.631748\n-0.00504506\n-0.475901\nweight:\n0.0415705\n0.000766932\n-0.0270381\n0.263995\n-0.0792167\n-0.135947\n-0.283108\n0.0528165\n0.113391\n-1.02109\n0.0515633\n0.504346\n']
{}

As you can see, predictions & weights seem good but I can't get feature importance.

Thanks! This should help with debugging.

I looked at the Python code and it looks like GBLinear doesn't support feature importances. The concept of feature importance is specific to decision trees, as the definition specifically refers to "splits":
https://github.com/dmlc/xgboost/blob/2405c59352b05379d9e01ca10bea1d68b2f840e6/python-package/xgboost/core.py#L1392-L1398

A check should be added to throw an exception when get_fscore is called on a GBLinear object.
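Something along these lines, as a hypothetical sketch of the guard (not the actual xgboost source):

def safe_get_fscore(booster, booster_type):
    # Hypothetical guard: refuse split-based importance for a linear booster.
    # `booster_type` is assumed to be the string the model was configured with
    # ('gbtree', 'dart' or 'gblinear').
    if booster_type == 'gblinear':
        raise ValueError("Feature importance is not defined for Booster type gblinear")
    return booster.get_fscore()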

@RAMitchell What do you think?

I'm not sure if there is a standard way to compute feature importance for linear models. I guess someone could implement something but throwing an exception seems like a good solution for now to manage expectations.
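For illustration, one ad-hoc option would be to treat the normalized absolute weights as importances; a quick sketch, not an official xgboost API:

import numpy as np

def linear_importance(coef):
    # Ad-hoc importance for a linear booster: normalized absolute weights.
    # coef has shape (n_features,) or (n_classes, n_features).
    coef = np.abs(np.atleast_2d(coef))
    per_feature = coef.sum(axis=0)          # aggregate over classes
    return per_feature / per_feature.sum()  # normalize to sum to 1

# e.g. with dummy weights for 3 classes x 4 features:
print(linear_importance(np.random.randn(3, 4)))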

Hey guys,

Thanks, it seems legit that feature_importance only works with trees.

However, it seems this creates an error for RFE & RFECV (from the sklearn package).

You can add this to the previous code using the iris dataset:

rfecv = RFECV(estimator=clf, step=1, cv=10,
              scoring=my_auc, n_jobs=n_jobs)
rfecv.fit(feat, y_true)

Error:

  File "/home/v8.py", line 510, in run_feat_selection
    rfecv.fit(feat, y_true)
  File "/home/datascience/venv1/local/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 134, in fit
    return self._fit(X, y)
  File "/home/datascience/venv1/local/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 189, in _fit
    ranks = np.argsort(safe_sqr(coefs))
  File "/home/datascience/venv1/local/lib/python2.7/site-packages/sklearn/utils/__init__.py", line 361, in safe_sqr
    X = check_array(X, accept_sparse=['csr', 'csc', 'coo'], ensure_2d=False)
  File "/home/datascience/venv1/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/home/datascience/venv1/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I looked at the RFE code: basically, when feature_importances_ is not available it falls back to the coefficients. However, XGBoost's gblinear output with NaNs as feature_importances_ creates an error.

Edit: @hcho3 should I open a new issue for this particular problem?

@Ravisik Exposing coef_ as a property should solve the problem.

Re-opening this issue, to remind myself that coef_ should be added.
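In the meantime, the weights can be pulled out of the text dump shown earlier; a rough stop-gap sketch, not the eventual coef_ implementation:

import numpy as np

def gblinear_weights(booster):
    # Rough stop-gap: parse the raw weights out of the gblinear text dump,
    # which is formatted as 'bias:\n<values>\nweight:\n<values>' (see above).
    # Returns the flat weight vector; how to reshape it for multi-class
    # models is left to the caller.
    dump = booster.get_dump()[0]
    weight_block = dump.split('weight:')[1].strip()
    return np.array([float(w) for w in weight_block.split('\n')])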

Linear coefficients are returned as feature importance in the R interface (assuming that the user has standardized the inputs). There are a number of ways to assess the relative importance of features in linear regression (e.g., see https://www.jstatsoft.org/article/view/v017i01/v17i01.pdf ), and standardized betas are one of them: not the best, but usually good enough.
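For reference, a standardized beta is just the raw coefficient rescaled by the spread of its feature; a small sketch under the usual assumptions:

import numpy as np

def standardized_betas(coef, X):
    # Scale each raw coefficient by the standard deviation of its feature so
    # coefficients measured in different units become comparable. For a
    # regression target you would additionally divide by the std of y.
    return np.asarray(coef) * np.asarray(X).std(axis=0)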

@khotilov Interesting. We can certainly use linear coefficients as feature importances, but scikit-learn makes distinctions between coefficients (coef_) and feature importances (feature_importances_), so I think we should follow that convention as well.

@Ravisik #3855 adds a coef_ property so that RFECV will work.

Hey,

Thanks a lot!
I'm new to GitHub; to use this update, do I have to install XGBoost 0.8 again?

Thanks a lot mate!

The fix will be available in version 0.81.

Hey !

I tested the new version 0.81 with RFECV:

from xgboost import XGBClassifier
from sklearn import datasets
from sklearn.feature_selection import RFECV

# import some data to play with
iris = datasets.load_iris()
X = iris.data  # use all four features
y = iris.target

Boost_linear = XGBClassifier(base_score=0.4, booster='gblinear',
                                 learning_rate=0.1,
                                 n_estimators=40,
                                 n_jobs=1, objective='reg:logistic', random_state=0,
                                 reg_alpha=0.001, reg_lambda=0.01, scale_pos_weight=0.5,
                                 silent=True)
rfecv = RFECV(estimator=Boost_linear, step=1, cv=10)
rfecv.fit(X, y)

However I got an error that comes from rfe.py.
Output:
AttributeError: 'list' object has no attribute 'ndim'

The relevant lines in rfe.py are:

# Get ranks
if coefs.ndim > 1:
    ranks = np.argsort(safe_sqr(coefs).sum(axis=0))
else:
    ranks = np.argsort(safe_sqr(coefs))

It seems that the coefs from XGBoost have a type problem?

@Ravisik It looks like the type needs to be np.array. Preparing a pull request now.
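For reference, this is the failure mode and the kind of cast involved (illustrative only, not the actual patch):

import numpy as np

coefs = [0.04, 0.0008, -0.027, 0.26]   # a plain Python list has no .ndim,
# coefs.ndim                           # which is what rfe.py trips over
coefs = np.asarray(coefs)              # the kind of cast coef_ needs to apply
print(coefs.ndim, coefs.shape)         # -> 1 (4,)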

@Ravisik #3873 should fix it. I added a test this time to make sure it works.

Yeah it is working ;)

Hi,
Related to this issue, I was trying to plot the feature importances of an XGBClassifier instance using gblinear as the booster.
The plot_importance function fails with the following error:

ValueError: Feature importance is not defined for Booster type gblinear

I am currently solving this issue with the following code:

model_coef = model.coef_
feature_importances = dict(zip(feature_names, model_coef))
feature_importances = {k: abs(v / sum(model_coef)) for k, v in feature_importances.items()}

I personally think that, now that there is a sort of importance for the gblinear booster, xgboost should at least refer to it, or implement generation of the importance plot.
This discussion is the only one regarding this problem and it would be useful to have a reference in the documentation.

@muscionig It would be certainly reasonable to use coefficients for plotting feature importances. Can you submit a pull request to update the plotting function?

I will do that.
The only thing I am wondering about is whether to plot the coefficients as they are, or the absolute values of the normalized coefficients as is currently done in the plot_importance method.
I think the first option should probably be considered: from a linear booster, I would expect both negative and positive coefficients.
Thank you
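A quick sketch of that first option (plotting the signed coefficients directly; feature_names below is a placeholder):

import numpy as np
import matplotlib.pyplot as plt

def plot_linear_coefficients(coef, feature_names):
    # Plot signed gblinear coefficients as a horizontal bar chart, keeping
    # the sign so positive and negative effects stay distinguishable.
    coef = np.asarray(coef).ravel()
    order = np.argsort(coef)
    plt.barh(np.asarray(feature_names)[order], coef[order])
    plt.axvline(0, color='grey', linewidth=0.8)
    plt.xlabel('coefficient')
    plt.tight_layout()
    plt.show()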
