Hello!
I have applied several simple feature-matrix transformations to check the stability of training.
Transformations (see the code below): adding a constant (shared or per-feature), multiplying by a non-zero constant (shared or per-feature), and permuting rows or columns.
I have tried to minimise randomisation in the training process by setting feature_fraction and bagging_fraction to one. According to the textbook gradient boosting trees algorithm, these transformations should not change the final result. However, the predictions on the original and transformed data match only after column permutation.
Can you please explain why LightGBM fitting can vary under linear data transformations?
Operating System:
RedHat
CPU/GPU model:
CPU
C++/Python/R version:
Python
LightGBM version or commit hash:
2.2.3
import lightgbm as lgb
import numpy as np

def add_constant(x_train, x_test, common=True):
    # Shift all features by one shared constant, or each feature by its own constant.
    if common:
        rand_constant = np.random.randint(-1000, 1000)
    else:
        rand_constant = np.random.randint(
            -1000, 1000, size=(1, x_train.shape[1]))
    return x_train + rand_constant, x_test + rand_constant

def mult_constant(x_train, x_test, common=True):
    # Scale all features by one shared non-zero constant, or each feature by its own.
    if common:
        rand_constant = np.random.randint(-1000, 1000)
        if rand_constant == 0:
            rand_constant = 1
    else:
        rand_constant = np.random.randint(
            -1000, 1000, size=(1, x_train.shape[1]))
        rand_constant[rand_constant == 0] = 1
    return x_train * rand_constant, x_test * rand_constant

def permute(x_train, x_test, data=True):
    # Shuffle rows of the training set (data=True) or columns of both sets (data=False).
    if data:
        indices = np.random.permutation(np.arange(x_train.shape[0]))
        return x_train[indices].copy(), x_test.copy()
    else:
        indices = np.random.permutation(np.arange(x_train.shape[1]))
        return x_train[:, indices].copy(), x_test[:, indices].copy()

def original(x_train, x_test):
    return x_train.copy(), x_test.copy()

transformations = [
    original,
    add_constant,
    lambda x, y: add_constant(x, y, False),
    mult_constant,
    lambda x, y: mult_constant(x, y, False),
    permute,
    lambda x, y: permute(x, y, False),
]

n_dots = 10000
n_features = 300
x_train = np.random.randn(n_dots, n_features)
x_test = np.random.randn(n_dots, n_features)
y_train = np.random.randn(n_dots)
y_test = np.random.randn(n_dots)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression_l2',
    'num_leaves': 32,
    'learning_rate': 0.2,
    'feature_fraction': 1.00,
    'bagging_fraction': 1.00,
    'bagging_freq': 1,
    'verbose': 0,
    'num_threads': 20,
    'min_data_in_leaf': 10,
    'use_missing': False,
    'seed': 0
}

total_gbms = []
total_preds = []
for trans in transformations:
    print(trans.__name__)
    x_tr_, x_te_ = trans(x_train, x_test)
    lgb_train = lgb.Dataset(
        x_tr_,
        label=y_train,
        free_raw_data=True,
    )
    lgb_test = lgb.Dataset(
        x_te_, label=y_test, reference=lgb_train, free_raw_data=True)
    evals_result = {}
    gbm = lgb.train(
        params,
        lgb_train,
        num_boost_round=100,
        valid_sets=[lgb_train, lgb_test],
        evals_result=evals_result,
        verbose_eval=False)
    total_gbms.append(gbm)
    total_preds.append((gbm.predict(x_tr_), gbm.predict(x_te_)))
    del lgb_train, lgb_test

# Compare predictions of each transformed run against the original run.
for i in range(1, len(total_preds)):
    print(np.std(total_preds[0][0] - total_preds[i][0]))
The first thing I would check is whether you've set num_boost_round high enough to achieve convergence in all cases. I'm not actually that familiar with LightGBM yet (I've mostly been using CatBoost), but 100 doesn't seem like many iterations. If lgb.train() has early stopping enabled but in some (or all) scenarios it's still hitting the limit of 100, that would be a red flag.
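For reference, here is a rough sketch of how early stopping could be wired into the training call above, assuming the LightGBM 2.x Python API; the specific values of early_stopping_rounds and num_boost_round are just illustrative:

evals_result = {}
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=1000,             # generous upper bound on iterations
    valid_sets=[lgb_train, lgb_test],
    valid_names=['train', 'valid'],
    early_stopping_rounds=20,         # stop once the validation metric stalls for 20 rounds
    evals_result=evals_result,
    verbose_eval=False)
print('best iteration:', gbm.best_iteration)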
The problem is that you should get exactly the same results when a constant is added to the feature vector, regardless of num_boost_round. At each boosting step the exact GBDT algorithm looks for the split that gives the minimal variance, so if the mutual order of the feature values is unchanged, the chosen split stays the same from the very first iteration. So my question is: what kind of heuristic does LightGBM use that can produce such a difference?
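To make that argument concrete, here is a minimal sketch of an exact variance-reduction split search (not LightGBM's histogram-based search) showing that shifting a feature by a constant changes the threshold value but selects the same partition of samples:

import numpy as np

def best_split(x, y):
    # Exhaustive split search on one feature: pick the split position that
    # minimises the summed variance of the two child nodes.
    order = np.argsort(x)
    y_sorted = y[order]
    best_score, best_pos = None, None
    for i in range(1, len(x)):
        left, right = y_sorted[:i], y_sorted[i:]
        score = left.var() * len(left) + right.var() * len(right)
        if best_score is None or score < best_score:
            best_score, best_pos = score, i   # i samples go to the left child
    return best_pos

rng = np.random.RandomState(0)
x = rng.randn(200)
y = rng.randn(200)
# The same samples end up on each side of the best split before and after the shift.
assert best_split(x, y) == best_split(x + 123.0, y)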
@qashqay654
More specifically, because data are often sparse (many zero values), LightGBM always treats the zero feature value as its own bin. The binning algorithm therefore processes the negative values first, then the zeros, and then the positive values. Adding or multiplying by a constant changes the value ranges and results in different bin boundaries.
In short, to test add_constant you should ensure the original values are all positive (or all negative) and that the constants have the same sign as the original values.
For mult_constant, you should ensure the constants are positive.
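Here is a toy illustration of why the dedicated zero bin matters; this is not LightGBM's actual binning code, only a sketch that assumes negatives, zeros, and positives are binned separately. Shifting the data across zero changes which sign group each value belongs to, so the same samples end up grouped into different bins:

import numpy as np

def zero_aware_bins(values, bins_per_side=4):
    # Toy scheme in the spirit described above: zero would get its own bin,
    # and the negative and positive values are each split into equal-count bins.
    edges = []
    for part in (np.sort(values[values < 0]), np.sort(values[values > 0])):
        if part.size:
            qs = np.linspace(0, 1, bins_per_side + 1)[1:-1]
            edges.extend(np.quantile(part, qs))
    return np.round(sorted(edges), 3)

rng = np.random.RandomState(0)
x = rng.randn(1000)
# With x, the ~500 negative and ~500 positive values are binned separately.
print(zero_aware_bins(x))
# With x + 500, all 1000 values are positive and share one set of bins,
# so the samples are partitioned differently.
print(zero_aware_bins(x + 500.0))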
@guolinke
Hi, thanks.
Following your suggestions, I have made a few changes to the reproducible example, as follows:
def reverse(x_train, x_test):
    return 1/x_train, 1/x_test
Indeed, the predictions after the add_constant and mult_constant transformations now match the original predictions. However, the reverse function changes the original predictions. May I ask why this happens?
In my view, the reverse transformation only reverses the ordering of the values in X, which should not change the split points of the features. Also, since all the values are positive, reversing the order of the values should not affect the binning.
Edited Reproducible Examples
import lightgbm as lgb
import numpy as np

def add_constant(x_train, x_test, common=True):
    # Shift all features by one shared non-negative constant, or each feature by its own.
    if common:
        rand_constant = np.random.randint(0, 1000)
    else:
        rand_constant = np.random.randint(
            0, 1000, size=(1, x_train.shape[1]))
    return x_train + rand_constant, x_test + rand_constant

def mult_constant(x_train, x_test, common=True):
    # Scale all features by one shared positive constant, or each feature by its own.
    if common:
        rand_constant = np.random.randint(0, 1000)
        if rand_constant == 0:
            rand_constant = 1
    else:
        rand_constant = np.random.randint(
            0, 1000, size=(1, x_train.shape[1]))
        rand_constant[rand_constant == 0] = 1
    return x_train * rand_constant, x_test * rand_constant

def permute(x_train, x_test, data=True):
    # Shuffle rows of the training set (data=True) or columns of both sets (data=False).
    if data:
        indices = np.random.permutation(np.arange(x_train.shape[0]))
        return x_train[indices].copy(), x_test.copy()
    else:
        indices = np.random.permutation(np.arange(x_train.shape[1]))
        return x_train[:, indices].copy(), x_test[:, indices].copy()

def reverse(x_train, x_test):  # newly added
    return 1/x_train, 1/x_test

def original(x_train, x_test):
    return x_train.copy(), x_test.copy()

transformations = [
    original,
    add_constant,
    lambda x, y: add_constant(x, y, False),
    mult_constant,
    lambda x, y: mult_constant(x, y, False),
    reverse,
]

n_dots = 10000
n_features = 300
x_train = np.absolute(np.random.randn(n_dots, n_features))
x_test = np.absolute(np.random.randn(n_dots, n_features))
y_train = np.random.randn(n_dots)
y_test = np.random.randn(n_dots)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression_l2',
    'num_leaves': 32,
    'learning_rate': 0.2,
    'feature_fraction': 1.00,
    'bagging_fraction': 1.00,
    'bagging_freq': 1,
    'verbose': 0,
    'num_threads': 20,
    'min_data_in_leaf': 10,
    'use_missing': False,
    'seed': 0
}

total_gbms = []
total_preds = []
for trans in transformations:
    print(trans.__name__)
    x_tr_, x_te_ = trans(x_train, x_test)
    lgb_train = lgb.Dataset(
        x_tr_,
        label=y_train,
        free_raw_data=True,
    )
    lgb_test = lgb.Dataset(
        x_te_, label=y_test, reference=lgb_train, free_raw_data=True)
    evals_result = {}
    gbm = lgb.train(
        params,
        lgb_train,
        num_boost_round=100,
        valid_sets=[lgb_train, lgb_test],
        evals_result=evals_result,
        verbose_eval=False)
    total_gbms.append(gbm)
    total_preds.append((gbm.predict(x_tr_), gbm.predict(x_te_)))
    del lgb_train, lgb_test

# Compare predictions of each transformed run against the original run.
for i in range(1, len(total_preds)):
    print(np.std(total_preds[0][0] - total_preds[i][0]))
@Qiuyan918
The binning algorithm in LightGBM is a greedy equal-density solution that processes values from the smallest to the largest.
Therefore, when you reverse the values, it produces different binning results, which leads to different split points.
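For intuition, here is a toy sketch of a greedy equal-density binner (again, not LightGBM's actual implementation) that sweeps distinct values from smallest to largest and closes a bin once it is full. Because the sweep direction is fixed, applying 1/x can group the same samples differently:

import numpy as np

def greedy_equal_density_bins(values, n_bins=2):
    # Walk distinct values in ascending order; close a bin once it holds
    # roughly len(values) / n_bins samples.
    vals, counts = np.unique(values, return_counts=True)
    target = len(values) / n_bins
    edges, acc = [], 0
    for v, c in zip(vals, counts):
        acc += c
        if acc >= target and len(edges) < n_bins - 1:
            edges.append(v)   # upper boundary of the bin just closed
            acc = 0
    return edges

x = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 4.0])
print(greedy_equal_density_bins(x))        # boundary at 2.0: bins {1,1,2,2} and {3,4}
print(greedy_equal_density_bins(1.0 / x))  # boundary at 0.5, i.e. x >= 2: bins {2,2,3,4} and {1,1}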
Hi @guolinke
Is binning applied per feature, or across all features per observation? I.e., if I have 200 features, will there be 200 different binning schemes?
@vishalbajaj2000 the binning is column-wise and independent among different features.