Hi,
I'm using scikit-learn automatic feature selection together with a trained XGBoost model.
I set up a threshold to interrupt the feature reduction process when accuracy falls below it.
I think everything is fine in the loop, but when I use SelectFromModel.transform()
I receive the following error:
Traceback (most recent call last):
  File "boost.py", line 581, in <module>
    s_train_x = selection.transform(train_x)
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/feature_selection/_base.py", line 77, in transform
    mask = self.get_support()
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/feature_selection/_base.py", line 46, in get_support
    mask = self._get_support_mask()
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/feature_selection/_from_model.py", line 178, in _get_support_mask
    scores = _get_feature_importances(estimator, self.norm_order)
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/feature_selection/_from_model.py", line 18, in _get_feature_importances
    coef_ = getattr(estimator, "coef_", None)
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/xgboost/sklearn.py", line 716, in coef_
    coef = np.array(json.loads(b.get_dump(dump_format='json')[0])['weight'])
KeyError: 'weight'
I'm using the latest xgboost 1.0.2 with scikit-learn 0.22, and the code I wrote is below. It's part of a bigger script, so some variables are defined earlier, but the KeyError should not depend on that.
report = []
prev_t = -1
scores = np.sort(model.feature_importances_)
indices = np.argsort(model.feature_importances_)
misc.msg('Feature selection (threshold = {})...'.format(autosel))
iterator = tqdm(scores)
for i, t in enumerate(iterator):
    if -1 < prev_t == t:
        continue
    prev_t = t
    selection = SelectFromModel(model, threshold=t, prefit=True)
    try:
        s_train_x = selection.transform(train_x)
    except ValueError:
        misc.msg('Incompatible number of features!', 'err')
        sys.exit(1)
    kwargs = {'tree_method': 'hist' if not gpu else 'gpu_hist',
              'grow_policy': 'lossguide' if useloss else 'depthwise'} \
        if not exact else {}
    s_model = xgb.XGBClassifier(objective=model.objective, n_jobs=-1, n_estimators=model.n_estimators,
                                max_depth=model.max_depth, learning_rate=model.learning_rate,
                                subsample=model.subsample, colsample_bytree=model.colsample_bytree,
                                min_child_weight=model.min_child_weight, gamma=model.gamma,
                                reg_alpha=model.reg_alpha, reg_lambda=model.reg_lambda,
                                max_delta_step=model.max_delta_step, random_state=model.random_state,
                                scale_pos_weight=model.scale_pos_weight, **kwargs)
    try:
        s_model.fit(s_train_x, train_y)
    except KeyboardInterrupt:
        misc.msg('Feature selection interrupted', 'warn')
        sys.exit(0)
    s_test_x = selection.transform(test_x)
    s_pred_y = s_model.predict(s_test_x)
    s_accuracy = accuracy_score(test_y, s_pred_y)
    subset = str(list(reversed(indices[i:]))).replace(',', ';')
    report.append([t, s_train_x.shape[1], s_accuracy, subset])
    if s_accuracy < args.autosel:
        iterator.close()
        misc.msg('Accuracy below threshold ({:.6f})'.format(s_accuracy), 'warn')
        misc.msg('Feature subset: {}'.format(conv.values2ranges(indices[i:])))
        break
    gc.collect()
Can anyone reproduce this behaviour?
Many thanks in advance!
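A note on reading the traceback above: the `KeyError` is raised inside the `coef_` property of the XGBoost wrapper, while scikit-learn reaches it through `getattr(estimator, "coef_", None)`. The default passed to `getattr` is only used when an `AttributeError` is raised, so any other exception from a property propagates to the caller. The sketch below illustrates just that mechanism; `FakeModel`, `NoCoefModel` and `probe` are hypothetical names made up for the illustration and are not XGBoost or scikit-learn code.

```python
class FakeModel:
    """Hypothetical stand-in for an estimator; NOT actual XGBoost code."""

    @property
    def coef_(self):
        dump = {"nodeid": 0}   # pretend model dump that has no 'weight' entry
        return dump["weight"]  # raises KeyError, not AttributeError


class NoCoefModel:
    """Hypothetical estimator-like object with no coef_ attribute at all."""


def probe(estimator):
    """Mimic the getattr call seen in the traceback and report what happens."""
    try:
        # The default (None) is returned only if AttributeError is raised.
        return getattr(estimator, "coef_", None)
    except KeyError as exc:
        return "KeyError: {}".format(exc)


print(probe(NoCoefModel()))  # attribute missing: the default is returned
print(probe(FakeModel()))    # property raises KeyError: it is not swallowed
```

In other words, an estimator without `coef_` is handled gracefully, but a `coef_` property that fails with a different exception type is not.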
Hi, could you please post a more complete script that I can run?
Sure, I will post it here this afternoon, so you can take a look at it.
Thanks!
This is a minimal test.py:
from h5py import File
from joblib import load
from sklearn.feature_selection import SelectFromModel

if __name__ == '__main__':
    h5 = File('dataset.h5', 'r')
    data = h5['data'][:]
    model = load('model.mdl')
    selection = SelectFromModel(model, threshold=0.95, prefit=True).transform(data)
This is the corresponding requirements.txt:
h5py==2.10.0
joblib==0.14.1
numpy==1.18.4
scikit-learn==0.23.0
scipy==1.4.1
six==1.14.0
threadpoolctl==2.0.0
xgboost==1.0.2
Here are the dataset and model to be unzipped in the same folder as the script. The model is an xgb.XGBClassifier previously trained on the same data with the standard fit() function.
You can reproduce the reported problem with python test.py.
@GuidoBartoli Hi dude. I had the same problem with xgboost==1.0.0. Upgrading to the recent 1.1.0 fixed it.
The issue is fixed in #5505 and the example script runs fine on XGBoost 1.1.0.
Perfect, many thanks!
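For anyone who cannot upgrade right away, one possible workaround is to build the selection mask by hand from `feature_importances_` (which the loop in the original post already reads without error) instead of going through `SelectFromModel`. This is only a sketch under that assumption: `select_by_importance` is a hypothetical helper, and the toy arrays stand in for `train_x` and `model.feature_importances_`. It follows the documented `SelectFromModel` rule for a numeric threshold, which keeps features whose importance is greater than or equal to the threshold.

```python
import numpy as np


def select_by_importance(X, importances, threshold):
    """Hypothetical helper: keep the columns of X with importance >= threshold."""
    importances = np.asarray(importances)
    mask = importances >= threshold
    if not mask.any():
        raise ValueError("No feature meets the threshold")
    return np.asarray(X)[:, mask], mask


# Toy data standing in for train_x and model.feature_importances_.
X = np.arange(12).reshape(3, 4)
importances = [0.05, 0.40, 0.15, 0.40]

X_sel, mask = select_by_importance(X, importances, threshold=0.15)
print(X_sel.shape)  # 3 rows, and only the columns that passed the threshold
print(mask)
```

The same mask can then be applied to the test set (`test_x[:, mask]`) so that training and evaluation use identical columns.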