Operating System: Linux Mint 18.1 (Serena)
CPU: AMD A10-7300, x86_64
Python version: 3.5.2
lightgbm.__version__: '2.1.1'
LightGBM is not using all available features
print(train.shape, test.shape, y.shape)
(307493, 439) (48744, 439) (307493,)
As seen above, there are 439 features in the train data, but LightGBM uses only 428 of them, as shown in the log message:
[LightGBM] [Info] Number of positive: 24823, number of negative: 282670
[LightGBM] [Info] Total Bins 57256
[LightGBM] [Info] Number of data: 307493, number of used features: 428
cat_feature_names = ['A', 'B', ...]  # around 40 categorical features
params = {
    'task': 'train',
    'num_leaves': 32,
    'min_data_in_leaf': 420,
    'application': 'binary',
    'boosting': 'gbdt',
    'metric': 'auc',
    'learning_rate': 0.01,
    'min_child_weight': 18,
    'lambda_l1': 1.5,
    'lambda_l2': 1,
    'num_threads': 3,
}
import lightgbm as lgb

dataset = lgb.Dataset(train, y)
model = lgb.train(params, dataset, verbose_eval=1000,
                  categorical_feature=cat_feature_names)
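To see which features the trained booster actually uses, one rough check (assuming train is a pandas DataFrame, so the model keeps its column names) is to count the splits per feature; features with zero splits were never used, which is a superset of the features LightGBM disabled at Dataset construction:

# Rough diagnostic: features the booster never split on.
split_counts = model.feature_importance(importance_type='split')
unused = [name for name, n_splits in zip(model.feature_name(), split_counts)
          if n_splits == 0]
print('%d features were never split on' % len(unused))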
feature_fraction defaults to 1.0, so why is it not using all the features?
The number of features used seems to be inversely related to the size of min_data_in_leaf.
If I reduce it all the way down to 1, it uses 438 features (1 fewer than the actual count). Anything greater than 1 and I lose more features, but it never uses all of them.
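One way to anticipate which columns are affected at a given min_data_in_leaf is to count, per column, how many rows differ from the most frequent value; a column with fewer such rows than min_data_in_leaf cannot host a split that leaves enough data on both sides. This is only an approximation of LightGBM's bin-level filtering, and it again assumes train is a pandas DataFrame:

min_data_in_leaf = 420  # same value as in params above

def minority_count(col):
    # rows that differ from the column's most frequent value (NaN included)
    top_freq = col.value_counts(dropna=False).iloc[0]
    return len(col) - top_freq

likely_filtered = [c for c in train.columns
                   if minority_count(train[c]) < min_data_in_leaf]
print('%d columns look unsplittable at min_data_in_leaf=%d'
      % (len(likely_filtered), min_data_in_leaf))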
@quakig LightGBM automatically disables features that cannot be split on, such as features where almost all values are zero (or identical). min_data_in_leaf controls this filtering.
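You can see this in isolation with a minimal, self-contained sketch on synthetic data (illustrative only; the exact cutoff LightGBM applies is derived from min_data_in_leaf on a sampled histogram, so treat the reported count as approximate):

import numpy as np
import lightgbm as lgb

rng = np.random.RandomState(0)
X = np.column_stack([
    rng.randn(1000),                    # well-spread feature: kept
    np.zeros(1000),                     # constant feature: always disabled
    np.r_[np.ones(3), np.zeros(997)],   # only 3 non-zero rows: filtered
])                                      # whenever min_data_in_leaf > 3
y = rng.randint(0, 2, 1000)

ds = lgb.Dataset(X, y)
booster = lgb.train({'objective': 'binary', 'min_data_in_leaf': 20},
                    ds, num_boost_round=1)
# The construction log should report "number of used features: 1".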