Lightgbm: Different sequence of feature cause different result when using LightGBM?

Created on 31 Mar 2018 · 4 comments · Source: microsoft/LightGBM

When I train a model using LightGBM, as follows:

import lightgbm as lgb

xgtrain = lgb.Dataset(dtrain[predictors].values, label=dtrain[target].values,
                      feature_name=predictors,
                      categorical_feature=categorical_features)
xgvalid = lgb.Dataset(dvalid[predictors].values, label=dvalid[target].values,
                      feature_name=predictors,
                      categorical_feature=categorical_features)

evals_results = {}

bst1 = lgb.train(lgb_params,
                 xgtrain,
                 valid_sets=[xgtrain, xgvalid],
                 valid_names=['train', 'valid'],
                 evals_result=evals_results,
                 num_boost_round=num_boost_round,
                 early_stopping_rounds=early_stopping_rounds,
                 verbose_eval=10,
                 feval=feval)

n_estimators = bst1.best_iteration
print("\nModel Report")
print("n_estimators : ", n_estimators)
print(metrics+":", evals_results['valid'][metrics][n_estimators-1])

I ran the code twice; everything is the same except the order of the predictors:

(1) the first time,

predictors = ['context_page_id', 'item_city_id', 'item_collected_level', 'item_price_level', 'item_pv_level', 'item_sales_level', 'shop_review_num_level', 'shop_review_positive_rate', 'shop_score_delivery', 'shop_score_description', 'shop_score_service', 'shop_star_level', 'user_age_level', 'user_gender_id', 'user_occupation_id', 'user_star_level', 'category_1', 'category_2', 'min', 'hour', 'day', 'week', 'buy_item', 'buy_shop', 'buy_brand', 'browse_total', 'buy_total', 'browse_buy_rate', 'item_browse', 'item_buy', 'item_browse_buy_rate', 'shop_browse', 'shop_buy', 'shop_browse_buy_rate', 'hour_bin_1', 'hour_bin_2', 'hour_bin_3', 'is_new_user_0', 'is_new_user_1', 'is_new_item_0', 'is_new_item_1', 'is_new_shop_0', 'is_new_shop_1', 'is_new_brand_0', 'is_new_brand_1']

(2) the second time,

predictors = ['browse_buy_rate', 'browse_total', 'buy_brand', 'buy_item', 'buy_shop', 'buy_total', 'category_1', 'category_2', 'context_page_id', 'day', 'hour', 'item_browse', 'item_browse_buy_rate', 'item_buy', 'item_city_id', 'item_collected_level', 'item_price_level', 'item_pv_level', 'item_sales_level', 'min', 'shop_browse', 'shop_browse_buy_rate', 'shop_buy', 'shop_review_num_level', 'shop_review_positive_rate', 'shop_score_delivery', 'shop_score_description', 'shop_score_service', 'shop_star_level', 'user_age_level', 'user_gender_id', 'user_occupation_id', 'user_star_level', 'week', 'hour_bin_1', 'hour_bin_2', 'hour_bin_3', 'is_new_user_0', 'is_new_user_1', 'is_new_item_0', 'is_new_item_1', 'is_new_shop_0', 'is_new_shop_1', 'is_new_brand_0', 'is_new_brand_1']

Only the order changed, but the results differ:

(1) the first time:

...
[1030]  train's binary_logloss: 0.0781902   valid's binary_logloss: 0.0821433
Early stopping, best iteration is:
[837]   train's binary_logloss: 0.0799938   valid's binary_logloss: 0.0820824

Model Report
n_estimators :  837
binary_logloss: 0.08208239967439723

(2) the second time:

...
[930]   train's binary_logloss: 0.0792041   valid's binary_logloss: 0.0821642
Early stopping, best iteration is:
[738]   train's binary_logloss: 0.0810454   valid's binary_logloss: 0.0821186

Model Report
n_estimators :  738
binary_logloss: 0.08211859038553634

Can anyone explain it? Thank you very much.


All 4 comments

This is by design: feature order can affect accuracy.
The reason is that when choosing a feature to split a tree node, if two features have the same split gain, the feature with the smaller index (id) is chosen.

Thank you for your detailed answers.

@w-zm check out SHAP. I use it exclusively for feature importance. Its author shows in his research paper why gain/split importances are not always consistent.

@bbennett36 ok, thanks a million.
