Thanks a lot for making lightgbm available, I am very impressed by the performance!
I am trying to use the built-in categorical features support, but am getting consistently much better result using one-hot encoding instead. I did not expect this from the various comments on the implementation, but after reviewing the related docs & issues I'm not sure where I am going wrong.
As a simple regression example with two categorical variables:
X1= np.repeat(np.arange(10), 1000)
X2= np.repeat(np.arange(10), 1000)
np.random.shuffle(X2)
y = (X1 + np.random.randn(10000)) * (X2 + np.random.randn(10000))
data = pd.DataFrame({'y': y, 'X1': X1, 'X2': X2});
running
lgb_params = {'learning_rate' : 0.1,
'boosting' : 'dart',
'objective' : 'regression',
'metric' : 'rmse',
'feature_fraction' : 0.9,
'bagging_fraction' : 0.75,
'num_leaves' : 31,
'bagging_freq' : 1,
'min_data_per_leaf': 250}
lgb_train = lgb.Dataset(data=data[['X1', 'X2']], label=data.y, categorical_feature=['X1', 'X2'])
cv = lgb.cv(lgb_params,
lgb_train,
num_boost_round=100,
early_stopping_rounds=15,
stratified=False,
verbose_eval=50)
pd.DataFrame(cv).min()
yields:
[50] cv_agg's rmse: 12.4452 + 0.0824018
Out[265]:
rmse-mean 12.063244
rmse-stdv 0.058267
but
lgb_train = lgb.Dataset(data=pd.get_dummies(data, columns=['X1', 'X2']), label=data.y)
cv = lgb.cv(lgb_params,
lgb_train,
num_boost_round=100,
early_stopping_rounds=25,
stratified=False,
verbose_eval=50)
pd.DataFrame(cv).min()
yields:
[50] cv_agg's rmse: 9.08683 + 0.860496
rmse-mean 8.666307
rmse-stdv 0.385004
There is no free lunch. A algorithm cannot work in all situations. However, I test it in many real world datasets, most of the new results are better.
In your case, the categories are small, as a result, the one hot can work well.
As for new categorical feature algorithm, can you provide the training error? it can be overfitting.
Fair enough. I did create the mock example because I came across this when trying to use a variable with 3,000 distinct categories and found the same behavior. The data is from the kaggle mercari competition: https://www.kaggle.com/c/mercari-price-suggestion-challenge/data.
I'm using the brand_name to predict the log price:
brand_freq = train_df.brand_name.value_counts()
rare_brands = brand_freq[brand_freq < 15].index.tolist()
train_df.loc[train_df.brand_name.isin(rare_brands), 'brand_name'] == 'Other'
print(train_df.brand_name.nunique())
train_df.brand_name.fillna('Unknown', inplace=True)
train = lgb.Dataset(
data=LabelEncoder().fit_transform(train_df.brand_name).reshape(-1, 1),
label=np.log1p(train_df.price),
categorical_feature=[0]
)
lgb_params = {'learning_rate' : 0.05,
'boosting' : 'dart',
'metric' : 'rmse',
'feature_fraction' : 0.75,
'bagging_fraction' : 0.75,
'max_depth': 10,
'num_leaves' : 31,
'objective' : 'regression',
'bagging_freq' : 1,
'min_data_per_leaf': 250}
cv = lgb.cv(lgb_params,
num_boost_round=100,
train_set=train,
stratified=False,
verbose_eval=1,
early_stopping_rounds=25)
which results in:
[1] cv_agg's rmse: 0.748823 + 0.00217222
[2] cv_agg's rmse: 3.07093 + 0.00207028
[3] cv_agg's rmse: 6.00335 + 0.00256017
[4] cv_agg's rmse: 8.96602 + 0.00298171
[5] cv_agg's rmse: 11.9364 + 0.00338412
[6] cv_agg's rmse: 14.91 + 0.00377867
[7] cv_agg's rmse: 17.8851 + 0.00416925
[8] cv_agg's rmse: 20.8611 + 0.00455755
[9] cv_agg's rmse: 23.8376 + 0.00494443
[10] cv_agg's rmse: 26.8145 + 0.00533036
[11] cv_agg's rmse: 29.7917 + 0.00571563
[12] cv_agg's rmse: 32.7691 + 0.00610041
[13] cv_agg's rmse: 35.7466 + 0.00648482
[14] cv_agg's rmse: 38.7242 + 0.00686896
[15] cv_agg's rmse: 41.702 + 0.00725288
[16] cv_agg's rmse: 44.6797 + 0.00763663
[17] cv_agg's rmse: 47.6576 + 0.00802023
[18] cv_agg's rmse: 50.6355 + 0.00840372
[19] cv_agg's rmse: 53.6134 + 0.00878711
[20] cv_agg's rmse: 56.5913 + 0.00917041
[21] cv_agg's rmse: 59.5693 + 0.00955365
[22] cv_agg's rmse: 62.5473 + 0.00993683
[23] cv_agg's rmse: 65.5254 + 0.01032
[24] cv_agg's rmse: 68.5034 + 0.010703
[25] cv_agg's rmse: 71.4815 + 0.0110861
[26] cv_agg's rmse: 74.4595 + 0.0114691
I find the same using the lightgbm test function referenced in another issue:
X = pd.DataFrame({"A": np.random.permutation(['a', 'b', 'c', 'd'] * 75), # str
"B": np.random.permutation([1, 2, 3] * 100), # int
"C": np.random.permutation([0.1, 0.2, -0.1, -0.1, 0.2] * 60), # float
"D": np.random.permutation([True, False] * 150)}) # bool
y = np.random.permutation([0, 1] * 150)
X_test = pd.DataFrame({"A": np.random.permutation(['a', 'b', 'e'] * 20),
"B": np.random.permutation([1, 3] * 30),
"C": np.random.permutation([0.1, -0.1, 0.2, 0.2] * 15),
"D": np.random.permutation([True, False] * 30)})
for col in ["A", "B", "C", "D"]:
X[col] = X[col].astype('category')
X_test[col] = X_test[col].astype('category')
y_test = np.random.permutation([0, 1] * 30)
params = {
'objective': 'binary',
'metric': 'binary_logloss',
'verbose': -1
}
lgb_train = lgb.Dataset(X, y)
lgb_test = lgb.Dataset(X_test, y_test)
gbm0 = lgb.train(params, lgb_train, num_boost_round=100, verbose_eval=10, valid_sets=[lgb_train, lgb_test])
[10] training's binary_logloss: 0.668304 valid_1's binary_logloss: 0.710231
[20] training's binary_logloss: 0.645826 valid_1's binary_logloss: 0.71817
[30] training's binary_logloss: 0.627857 valid_1's binary_logloss: 0.74892
[40] training's binary_logloss: 0.617312 valid_1's binary_logloss: 0.777039
[50] training's binary_logloss: 0.611327 valid_1's binary_logloss: 0.805029
[60] training's binary_logloss: 0.605812 valid_1's binary_logloss: 0.836496
[70] training's binary_logloss: 0.601962 valid_1's binary_logloss: 0.852982
[80] training's binary_logloss: 0.599277 valid_1's binary_logloss: 0.867156
[90] training's binary_logloss: 0.596671 valid_1's binary_logloss: 0.881359
[100] training's binary_logloss: 0.594423 valid_1's binary_logloss: 0.89199
I'm of course assuming I'm doing something wrong but it's not clear where things are going wrong. I've tried using categorical types, column names instead of indices, etc. In all cases, one-hot delivers the expected improvement in the objective.
Seems to work fine under the last commit c2191aa, strangely does better than xgboost but this is just because of different default hyperparameters:
> library(lightgbm)
>
> set.seed(1)
> mat <- matrix(round(runif(20000, min = -0.5, max = 9.5)), nrow = 10000)
> table(mat)
mat
0 1 2 3 4 5 6 7 8 9
2041 1959 2037 1981 2023 1977 1961 1951 1994 2076
>
> y <- (mat[, 1] + runif(10000, 0, 1)) * (mat[, 2] * runif(10000, 0, 1))
> dtrain <- lgb.Dataset(data = mat, label = y)
>
> params <- list(objective = "regression", metric = "rmse")
> set.seed(1)
> model <- lgb.cv(params,
+ dtrain,
+ 10000,
+ nfold = 5,
+ learning_rate = 0.1,
+ early_stopping_rounds = 10,
+ verbose = -1)
> model$best_score
[1] -8.910689
>
>
> rm(dtrain)
> dtrain <- lgb.Dataset(data = mat, label = y, categorical_feature = c(1, 2), colnames = c("X1", "X2"))
> params <- list(objective = "regression", metric = "rmse")
> set.seed(1)
> model <- lgb.cv(params,
+ dtrain,
+ 10000,
+ nfold = 5,
+ learning_rate = 0.1,
+ early_stopping_rounds = 10,
+ verbose = -1)
> model$best_score
[1] -8.910112
>
>
> library(xgboost)
> dtrain <- xgb.DMatrix(data = mat, label = y)
> model <- xgb.cv(data = dtrain,
+ nrounds = 10000,
+ nfold = 5,
+ metrics = "rmse",
+ max_depth = 0,
+ max_leaves = 63,
+ eta = 0.1,
+ objective = "reg:linear",
+ early_stopping_rounds = 10,
+ tree_method = "hist",
+ grow_policy = "lossguide",
+ nthread = 1,
+ verbose = 0)
[18:04:37] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[18:04:37] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[18:04:37] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[18:04:37] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[18:04:37] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
> model$evaluation_log$test_rmse_mean[model$best_iteration]
[1] 8.933057
Copy & paste code in R:
library(lightgbm)
set.seed(1)
mat <- matrix(round(runif(20000, min = -0.5, max = 9.5)), nrow = 10000)
table(mat)
y <- (mat[, 1] + runif(10000, 0, 1)) * (mat[, 2] * runif(10000, 0, 1))
dtrain <- lgb.Dataset(data = mat, label = y)
params <- list(objective = "regression", metric = "rmse")
set.seed(1)
model <- lgb.cv(params,
dtrain,
10000,
nfold = 5,
learning_rate = 0.1,
early_stopping_rounds = 10,
verbose = -1)
model$best_score
rm(dtrain)
dtrain <- lgb.Dataset(data = mat, label = y, categorical_feature = c(1, 2), colnames = c("X1", "X2"))
params <- list(objective = "regression", metric = "rmse")
set.seed(1)
model <- lgb.cv(params,
dtrain,
10000,
nfold = 5,
learning_rate = 0.1,
early_stopping_rounds = 10,
verbose = -1)
model$best_score
library(xgboost)
dtrain <- xgb.DMatrix(data = mat, label = y)
model <- xgb.cv(data = dtrain,
nrounds = 10000,
nfold = 5,
metrics = "rmse",
max_depth = 0,
max_leaves = 63,
eta = 0.1,
objective = "reg:linear",
early_stopping_rounds = 10,
tree_method = "hist",
grow_policy = "lossguide",
nthread = 1,
verbose = 0)
model$evaluation_log$test_rmse_mean[model$best_iteration]
Thanks again, I really appreciate your quick response. I've reinstalled the python package as below, yet still receive the same results. Do you see anything that suggests this installation could yield different results?
$ git clone --recursive https://github.com/Microsoft/LightGBM.git
Cloning into 'LightGBM'...
remote: Counting objects: 8922, done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 8922 (delta 1), reused 2 (delta 0), pack-reused 8899
Receiving objects: 100% (8922/8922), 7.27 MiB | 0 bytes/s, done.
Resolving deltas: 100% (6205/6205), done.
Checking connectivity... done.
Submodule 'include/boost/compute' (https://github.com/boostorg/compute) registered for path 'compute'
Cloning into 'compute'...
remote: Counting objects: 21244, done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 21244 (delta 3), reused 9 (delta 3), pack-reused 21230
Receiving objects: 100% (21244/21244), 8.41 MiB | 0 bytes/s, done.
Resolving deltas: 100% (17249/17249), done.
Checking connectivity... done.
Submodule path 'compute': checked out '6de7f6448796f67958dde8de4569fb1ae649ee91'
stefan@applied-ai:~/src$ cd LightGBM/python-package
stefan@applied-ai:~/src/LightGBM/python-package$ python setup.py install
running install
creating compile
creating compile/include
creating compile/include/LightGBM
copying ../include/LightGBM/R_object_helper.h -> ./compile/include/LightGBM
copying ../include/LightGBM/lightgbm_R.h -> ./compile/include/LightGBM
copying ../include/LightGBM/network.h -> ./compile/include/LightGBM
copying ../include/LightGBM/dataset.h -> ./compile/include/LightGBM
copying ../include/LightGBM/config.h -> ./compile/include/LightGBM
copying ../include/LightGBM/meta.h -> ./compile/include/LightGBM
copying ../include/LightGBM/feature_group.h -> ./compile/include/LightGBM
copying ../include/LightGBM/prediction_early_stop.h -> ./compile/include/LightGBM
creating compile/include/LightGBM/utils
copying ../include/LightGBM/utils/text_reader.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/pipeline_reader.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/log.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/threading.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/array_args.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/openmp_wrapper.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/common.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/random.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/boosting.h -> ./compile/include/LightGBM
copying ../include/LightGBM/application.h -> ./compile/include/LightGBM
copying ../include/LightGBM/c_api.h -> ./compile/include/LightGBM
copying ../include/LightGBM/tree.h -> ./compile/include/LightGBM
copying ../include/LightGBM/bin.h -> ./compile/include/LightGBM
copying ../include/LightGBM/objective_function.h -> ./compile/include/LightGBM
copying ../include/LightGBM/metric.h -> ./compile/include/LightGBM
copying ../include/LightGBM/export.h -> ./compile/include/LightGBM
copying ../include/LightGBM/tree_learner.h -> ./compile/include/LightGBM
copying ../include/LightGBM/dataset_loader.h -> ./compile/include/LightGBM
creating compile/src
creating compile/src/metric
copying ../src/metric/regression_metric.hpp -> ./compile/src/metric
copying ../src/metric/dcg_calculator.cpp -> ./compile/src/metric
copying ../src/metric/map_metric.hpp -> ./compile/src/metric
copying ../src/metric/binary_metric.hpp -> ./compile/src/metric
copying ../src/metric/rank_metric.hpp -> ./compile/src/metric
copying ../src/metric/xentropy_metric.hpp -> ./compile/src/metric
copying ../src/metric/multiclass_metric.hpp -> ./compile/src/metric
copying ../src/metric/metric.cpp -> ./compile/src/metric
creating compile/src/treelearner
copying ../src/treelearner/parallel_tree_learner.h -> ./compile/src/treelearner
copying ../src/treelearner/serial_tree_learner.cpp -> ./compile/src/treelearner
copying ../src/treelearner/feature_parallel_tree_learner.cpp -> ./compile/src/treelearner
copying ../src/treelearner/feature_histogram.hpp -> ./compile/src/treelearner
copying ../src/treelearner/gpu_tree_learner.cpp -> ./compile/src/treelearner
copying ../src/treelearner/split_info.hpp -> ./compile/src/treelearner
copying ../src/treelearner/leaf_splits.hpp -> ./compile/src/treelearner
copying ../src/treelearner/voting_parallel_tree_learner.cpp -> ./compile/src/treelearner
copying ../src/treelearner/data_parallel_tree_learner.cpp -> ./compile/src/treelearner
copying ../src/treelearner/serial_tree_learner.h -> ./compile/src/treelearner
copying ../src/treelearner/gpu_tree_learner.h -> ./compile/src/treelearner
copying ../src/treelearner/data_partition.hpp -> ./compile/src/treelearner
creating compile/src/treelearner/ocl
copying ../src/treelearner/ocl/histogram16.cl -> ./compile/src/treelearner/ocl
copying ../src/treelearner/ocl/histogram64.cl -> ./compile/src/treelearner/ocl
copying ../src/treelearner/ocl/histogram256.cl -> ./compile/src/treelearner/ocl
copying ../src/treelearner/tree_learner.cpp -> ./compile/src/treelearner
creating compile/src/boosting
copying ../src/boosting/boosting.cpp -> ./compile/src/boosting
copying ../src/boosting/dart.hpp -> ./compile/src/boosting
copying ../src/boosting/gbdt.cpp -> ./compile/src/boosting
copying ../src/boosting/goss.hpp -> ./compile/src/boosting
copying ../src/boosting/rf.hpp -> ./compile/src/boosting
copying ../src/boosting/gbdt.h -> ./compile/src/boosting
copying ../src/boosting/gbdt_prediction.cpp -> ./compile/src/boosting
copying ../src/boosting/score_updater.hpp -> ./compile/src/boosting
copying ../src/boosting/prediction_early_stop.cpp -> ./compile/src/boosting
copying ../src/boosting/gbdt_model_text.cpp -> ./compile/src/boosting
copying ../src/lightgbm_R.cpp -> ./compile/src
creating compile/src/objective
copying ../src/objective/binary_objective.hpp -> ./compile/src/objective
copying ../src/objective/multiclass_objective.hpp -> ./compile/src/objective
copying ../src/objective/xentropy_objective.hpp -> ./compile/src/objective
copying ../src/objective/objective_function.cpp -> ./compile/src/objective
copying ../src/objective/regression_objective.hpp -> ./compile/src/objective
copying ../src/objective/rank_objective.hpp -> ./compile/src/objective
copying ../src/c_api.cpp -> ./compile/src
creating compile/src/application
copying ../src/application/predictor.hpp -> ./compile/src/application
copying ../src/application/application.cpp -> ./compile/src/application
copying ../src/main.cpp -> ./compile/src
creating compile/src/io
copying ../src/io/dense_bin.hpp -> ./compile/src/io
copying ../src/io/parser.hpp -> ./compile/src/io
copying ../src/io/dataset.cpp -> ./compile/src/io
copying ../src/io/ordered_sparse_bin.hpp -> ./compile/src/io
copying ../src/io/tree.cpp -> ./compile/src/io
copying ../src/io/bin.cpp -> ./compile/src/io
copying ../src/io/dense_nbits_bin.hpp -> ./compile/src/io
copying ../src/io/metadata.cpp -> ./compile/src/io
copying ../src/io/sparse_bin.hpp -> ./compile/src/io
copying ../src/io/dataset_loader.cpp -> ./compile/src/io
copying ../src/io/config.cpp -> ./compile/src/io
copying ../src/io/parser.cpp -> ./compile/src/io
creating compile/src/network
copying ../src/network/linkers_mpi.cpp -> ./compile/src/network
copying ../src/network/network.cpp -> ./compile/src/network
copying ../src/network/linkers_socket.cpp -> ./compile/src/network
copying ../src/network/linkers.h -> ./compile/src/network
copying ../src/network/linker_topo.cpp -> ./compile/src/network
copying ../src/network/socket_wrapper.hpp -> ./compile/src/network
copying ../windows/LightGBM.sln -> ./compile/windows
copying ../windows/LightGBM.vcxproj -> ./compile/windows
copying ../CMakeLists.txt -> ./compile/
copying ../LICENSE -> ./
INFO:LightGBM:Starting to compile the library.
INFO:LightGBM:Starting to compile with CMake.
running build
running build_py
INFO:root:Generating grammar tables from /usr/lib/python3.5/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.5/lib2to3/PatternGrammar.txt
creating build
creating build/lib
creating build/lib/lightgbm
copying lightgbm/libpath.py -> build/lib/lightgbm
copying lightgbm/engine.py -> build/lib/lightgbm
copying lightgbm/basic.py -> build/lib/lightgbm
copying lightgbm/sklearn.py -> build/lib/lightgbm
copying lightgbm/callback.py -> build/lib/lightgbm
copying lightgbm/compat.py -> build/lib/lightgbm
copying lightgbm/__init__.py -> build/lib/lightgbm
copying lightgbm/plotting.py -> build/lib/lightgbm
running egg_info
creating lightgbm.egg-info
writing top-level names to lightgbm.egg-info/top_level.txt
writing dependency_links to lightgbm.egg-info/dependency_links.txt
writing requirements to lightgbm.egg-info/requires.txt
writing lightgbm.egg-info/PKG-INFO
writing manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'build'
warning: no files found matching '*.txt'
warning: no files found matching '*.so' under directory 'lightgbm'
warning: no files found matching '*.dll' under directory 'compile/Release'
warning: no files found matching '*' under directory 'compile/compute'
warning: no files found matching 'LightGBM.vcxproj.filters' under directory 'compile/windows'
warning: no files found matching '*.dll' under directory 'compile/windows/x64/DLL'
warning: no previously-included files matching '*.py[co]' found anywhere in distribution
writing manifest file 'lightgbm.egg-info/SOURCES.txt'
copying lightgbm/VERSION.txt -> build/lib/lightgbm
running install_lib
creating /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/libpath.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/engine.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/basic.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/sklearn.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/callback.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/VERSION.txt -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/compat.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/__init__.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/plotting.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
INFO:root:Installing lib_lightgbm from: ['compile/lib_lightgbm.so']
copying compile/lib_lightgbm.so -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/libpath.py to libpath.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/engine.py to engine.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/basic.py to basic.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/sklearn.py to sklearn.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/callback.py to callback.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/compat.py to compat.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/__init__.py to __init__.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/plotting.py to plotting.cpython-35.pyc
running install_egg_info
Copying lightgbm.egg-info to /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm-2.0.11-py3.5.egg-info
running install_scripts
@stefai
from your result, they are all overfitting, you can see the validation error increasing.
I guess your data is too small. All my test is based on datasets with millions instances.
For the small dataset, maybe the new algorithm is easily overfitting.
I think you may need to tune the parameters to avoid overfitting, Or only use the new algorithm for large datasets.
@guolinke
I posted a toy example to illustrate the issue, but I came across this using 1.5m observations and 5,000 categories (at least 50 obs/category). It works ok using 1-hot but fails to improve on even a single step using categorical_feature, it rather deteriorates dramatically. I am using version 2.11 and have tried a range of parameters and am at a loss what's going on here.
Setup:
lgb_params = {'learning_rate' : 0.05,
'boosting' : 'dart',
'metric' : 'rmse',
'feature_fraction' : 0.75,
'bagging_fraction' : 0.75,
'max_depth': 10,
'num_leaves' : 61,
'objective' : 'regression',
'bagging_freq' : 1,
'min_data_per_leaf': 250}
brand_freq = train_df.brand_name.value_counts()
rare_brands = brand_freq[brand_freq < 50].index.tolist()
train_df.loc[train_df.brand_name.isin(rare_brands), 'brand_name'] == 'Other'
train_df.brand_name.fillna('Unknown', inplace=True)
print('Unique brands:', train_df.brand_name.nunique())
print('Number of obs:', train_df.brand_name.count())
Unique brands: 4810
Number of obs: 1482535
Example using categorical_feature:
train = lgb.Dataset(
data=pd.Series(LabelEncoder().fit_transform(train_df.brand_name)).to_frame('brand'),
label=np.log1p(train_df.price),
categorical_feature=['brand']
)
cv = lgb.cv(lgb_params,
num_boost_round=100,
train_set=train,
stratified=False,
verbose_eval=1,
categorical_feature=['brand'],
early_stopping_rounds=25)
[1] cv_agg's rmse: 0.749208 + 0.00158038
[2] cv_agg's rmse: 3.07182 + 0.00108038
[3] cv_agg's rmse: 6.00504 + 0.00150103
[4] cv_agg's rmse: 8.96853 + 0.00180245
[5] cv_agg's rmse: 11.9398 + 0.0020732
[6] cv_agg's rmse: 14.9141 + 0.00233171
[7] cv_agg's rmse: 17.89 + 0.00258413
[8] cv_agg's rmse: 20.8669 + 0.00283307
[9] cv_agg's rmse: 23.8442 + 0.00307984
[10] cv_agg's rmse: 26.822 + 0.00332518
[11] cv_agg's rmse: 29.8 + 0.00356951
[12] cv_agg's rmse: 32.7782 + 0.00381312
[13] cv_agg's rmse: 35.7566 + 0.00405617
[14] cv_agg's rmse: 38.735 + 0.00429881
[15] cv_agg's rmse: 41.7136 + 0.00454112
[16] cv_agg's rmse: 44.6922 + 0.00478316
[17] cv_agg's rmse: 47.6708 + 0.00502499
[18] cv_agg's rmse: 50.6495 + 0.00526665
[19] cv_agg's rmse: 53.6283 + 0.00550815
[20] cv_agg's rmse: 56.6071 + 0.00574954
[21] cv_agg's rmse: 59.5859 + 0.00599082
[22] cv_agg's rmse: 62.5647 + 0.00623201
[23] cv_agg's rmse: 65.5436 + 0.00647312
[24] cv_agg's rmse: 68.5225 + 0.00671417
[25] cv_agg's rmse: 71.5013 + 0.00695516
[26] cv_agg's rmse: 74.4802 + 0.00719609
and using dummy variables:
train = lgb.Dataset(
data=csr_matrix(pd.get_dummies(train_df.brand_name, sparse=True).to_coo(), dtype=np.float32),
label=np.log1p(train_df.price)
)
cv = lgb.cv(lgb_params,
num_boost_round=100,
train_set=train,
stratified=False,
verbose_eval=1,
early_stopping_rounds=25)
[1] cv_agg's rmse: 0.746013 + 0.00157357
[2] cv_agg's rmse: 0.743435 + 0.00156749
[3] cv_agg's rmse: 0.740547 + 0.0015579
[4] cv_agg's rmse: 0.737792 + 0.00154393
[5] cv_agg's rmse: 0.735673 + 0.00152661
[6] cv_agg's rmse: 0.733435 + 0.00152012
[7] cv_agg's rmse: 0.731248 + 0.00151158
[8] cv_agg's rmse: 0.732042 + 0.00151096
[9] cv_agg's rmse: 0.729926 + 0.00149891
[10] cv_agg's rmse: 0.727943 + 0.00149317
[11] cv_agg's rmse: 0.726249 + 0.00148875
[12] cv_agg's rmse: 0.726776 + 0.00149299
[13] cv_agg's rmse: 0.725156 + 0.0014948
[14] cv_agg's rmse: 0.723699 + 0.00149512
[15] cv_agg's rmse: 0.722188 + 0.00147971
[16] cv_agg's rmse: 0.720832 + 0.0014719
[17] cv_agg's rmse: 0.719681 + 0.00146852
[18] cv_agg's rmse: 0.718513 + 0.00146325
[19] cv_agg's rmse: 0.717504 + 0.00145445
[20] cv_agg's rmse: 0.716506 + 0.00144417
[21] cv_agg's rmse: 0.716806 + 0.00144845
[22] cv_agg's rmse: 0.715941 + 0.00143478
[23] cv_agg's rmse: 0.715052 + 0.00143763
[24] cv_agg's rmse: 0.714228 + 0.00143889
[25] cv_agg's rmse: 0.713396 + 0.00143754
[26] cv_agg's rmse: 0.712635 + 0.00143152
[27] cv_agg's rmse: 0.711891 + 0.00142661
[28] cv_agg's rmse: 0.71223 + 0.00142976
[29] cv_agg's rmse: 0.711512 + 0.00142371
[30] cv_agg's rmse: 0.710858 + 0.00143041
[31] cv_agg's rmse: 0.71102 + 0.00143371
[32] cv_agg's rmse: 0.710363 + 0.00143784
[33] cv_agg's rmse: 0.70973 + 0.00143083
[34] cv_agg's rmse: 0.709121 + 0.00143233
[35] cv_agg's rmse: 0.709464 + 0.00143547
[36] cv_agg's rmse: 0.710055 + 0.00143896
[37] cv_agg's rmse: 0.709456 + 0.00143036
[38] cv_agg's rmse: 0.708874 + 0.00142435
[39] cv_agg's rmse: 0.708316 + 0.00142331
[40] cv_agg's rmse: 0.708216 + 0.00141978
[41] cv_agg's rmse: 0.708609 + 0.00142266
[42] cv_agg's rmse: 0.708075 + 0.00142478
[43] cv_agg's rmse: 0.708427 + 0.00142704
[44] cv_agg's rmse: 0.70788 + 0.00142551
[45] cv_agg's rmse: 0.707351 + 0.00141685
[46] cv_agg's rmse: 0.791467 + 0.00170709
[47] cv_agg's rmse: 0.783244 + 0.00170456
[48] cv_agg's rmse: 0.77969 + 0.00169885
[49] cv_agg's rmse: 0.778683 + 0.00169826
[50] cv_agg's rmse: 0.779115 + 0.00169785
[51] cv_agg's rmse: 0.771977 + 0.00169143
[52] cv_agg's rmse: 0.765475 + 0.00168099
[53] cv_agg's rmse: 0.766355 + 0.00168197
[54] cv_agg's rmse: 0.760354 + 0.00168411
[55] cv_agg's rmse: 0.754844 + 0.00167271
[56] cv_agg's rmse: 0.755328 + 0.00167475
[57] cv_agg's rmse: 0.750248 + 0.00165985
[58] cv_agg's rmse: 0.930166 + 0.00171325
[59] cv_agg's rmse: 0.928975 + 0.00171676
[60] cv_agg's rmse: 0.90931 + 0.00171309
[61] cv_agg's rmse: 1.12278 + 0.00164735
[62] cv_agg's rmse: 1.08892 + 0.00165832
[63] cv_agg's rmse: 1.05741 + 0.00166842
[64] cv_agg's rmse: 1.0518 + 0.00167101
[65] cv_agg's rmse: 1.05332 + 0.00166839
[66] cv_agg's rmse: 1.02436 + 0.00168856
[67] cv_agg's rmse: 0.997464 + 0.00169999
[68] cv_agg's rmse: 0.972527 + 0.00170774
[69] cv_agg's rmse: 0.971796 + 0.00170561
[70] cv_agg's rmse: 0.969266 + 0.00170443
@stefai I guess there are something wrong for CV in python package.
The result is very abnormal. Can you try the train-valid ?
@guolinke
I'm afraid I'm getting the same result using train/valid instead of cv. I tried all objective options with the same result (loss remains constant for rf. Not sure if the warning is indicative of anything material ging wrong. Also attaching the data so you can check if this is due to my setup.
brand_freq = train_df.brand_name.value_counts()
rare_brands = brand_freq[brand_freq < 50].index.tolist()
train_df.loc[train_df.brand_name.isin(rare_brands), 'brand_name'] == 'Other'
train_df.brand_name.fillna('Unknown', inplace=True)
print('Unique brands:', train_df.brand_name.nunique())
print('Number of obs:', train_df.brand_name.count())
X = pd.Series(LabelEncoder().fit_transform(train_df.brand_name)).to_frame('brand')
y = np.log1p(train_df.price)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
train = lgb.Dataset(
data=X_train,
label=y_train,
categorical_feature=['brand']
)
valid = lgb.Dataset(
data=X_test,
label=y_test,
categorical_feature=['brand'],
reference=train
)
model = lgb.train(lgb_params,
num_boost_round=num_boost_round,
train_set=train,
valid_sets=[train, valid],
verbose_eval=1,
categorical_feature=['brand'],
early_stopping_rounds=25)
Unique brands: 4810
Number of obs: 1482535
/home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/basic.py:662: UserWarning: categorical_feature in param dict is overrided.
warnings.warn('categorical_feature in param dict is overrided.')
[1] training's rmse: 0.748742 valid_1's rmse: 0.751076
Training until validation scores don't improve for 25 rounds.
[2] training's rmse: 3.07115 valid_1's rmse: 3.0689
[3] training's rmse: 6.00383 valid_1's rmse: 6.00124
[4] training's rmse: 8.96675 valid_1's rmse: 8.96406
[5] training's rmse: 11.9374 valid_1's rmse: 11.9347
[6] training's rmse: 14.9112 valid_1's rmse: 14.9084
[7] training's rmse: 17.8866 valid_1's rmse: 17.8838
[8] training's rmse: 20.8628 valid_1's rmse: 20.86
[9] training's rmse: 23.8396 valid_1's rmse: 23.8368
[10] training's rmse: 26.8168 valid_1's rmse: 26.8139
[11] training's rmse: 29.7942 valid_1's rmse: 29.7914
[12] training's rmse: 32.7718 valid_1's rmse: 32.769
[13] training's rmse: 35.7496 valid_1's rmse: 35.7467
[14] training's rmse: 38.7275 valid_1's rmse: 38.7246
[15] training's rmse: 41.7054 valid_1's rmse: 41.7026
[16] training's rmse: 44.6835 valid_1's rmse: 44.6806
[17] training's rmse: 47.6615 valid_1's rmse: 47.6587
[18] training's rmse: 50.6397 valid_1's rmse: 50.6368
[19] training's rmse: 53.6178 valid_1's rmse: 53.615
[20] training's rmse: 56.5961 valid_1's rmse: 56.5932
[21] training's rmse: 59.5743 valid_1's rmse: 59.5714
[22] training's rmse: 62.5525 valid_1's rmse: 62.5497
[23] training's rmse: 65.5308 valid_1's rmse: 65.5279
[24] training's rmse: 68.5091 valid_1's rmse: 68.5062
[25] training's rmse: 71.4874 valid_1's rmse: 71.4845
[26] training's rmse: 74.4657 valid_1's rmse: 74.4629
Early stopping, best iteration is:
[1] training's rmse: 0.748742 valid_1's rmse: 0.751076
@stefai can you provide full script ? It seems I cannot run your script directly
@guolinke
Here you go. Thanks again for your quick response!
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
df = pd.read_csv('data.csv.gz', compression='gzip')
print(df.info())
brand_freq = df.brand_name.value_counts()
rare_brands = brand_freq[brand_freq < 50].index.tolist()
df.loc[df.brand_name.isin(rare_brands), 'brand_name'] == 'Other'
df.brand_name.fillna('Unknown', inplace=True)
print('Unique brands:', df.brand_name.nunique())
print('Number of obs:', df.brand_name.count())
X = pd.Series(LabelEncoder().fit_transform(df.brand_name)).to_frame('brand')
y = np.log1p(df.price)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
train = lgb.Dataset(
data=X_train,
label=y_train,
categorical_feature=['brand']
)
valid = lgb.Dataset(
data=X_test,
label=y_test,
categorical_feature=['brand'],
reference=train
)
params = {'learning_rate' : 0.05,
'boosting' : 'gbdt',
'metric' : 'rmse',
'feature_fraction' : 0.75,
'bagging_fraction' : 0.75,
'max_depth': 10,
'num_leaves' : 61,
'objective' : 'regression',
'bagging_freq' : 1,
'min_data_per_leaf': 250}
model = lgb.train(params,
num_boost_round=100,
train_set=train,
valid_sets=[train, valid],
verbose_eval=1,
categorical_feature=['brand'],
early_stopping_rounds=25)
@stefai
the reason is the "feature_fraction". If you set it to "1", you can find the result is much better.
I guess there is a bug for the "feature_fraction", when #feature is very small, it may use zero features, which causes the bad accuracy.
@guolinke thanks a lot, much appreciated, works fine now. I guess I could have figured this out myself, but perhaps it may make sense to add a note in the docs until this is fixed?
example code:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
df = pd.read_csv('data.csv.gz', compression='gzip')
print(df.info())
print('Unique brands:', df.brand_name.nunique())
print('Number of obs:', df.brand_name.count())
df["brand_name"] = df["brand_name"].astype("category")
X = df[["brand_name"]]
y = np.log1p(df.price)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
train = lgb.Dataset(
data=X_train,
label=y_train
)
valid = lgb.Dataset(
data=X_test,
label=y_test,
reference=train
)
params = {'learning_rate' : 0.05,
'boosting' : 'gbdt',
'metric' : 'rmse',
'feature_fraction' : 1,
'bagging_fraction' : 1,
'max_depth': 6,
'num_leaves' : 31,
'objective' : 'regression',
'bagging_freq' : 1,
"verbose": -1,
'min_data_per_leaf': 100}
model = lgb.train(params,
num_boost_round=500,
train_set=train,
valid_sets=[train, valid],
verbose_eval=50,
early_stopping_rounds=25)
categories = X_train["brand_name"].cat.categories
X_test["brand_name"] = X_test["brand_name"].cat.set_categories(categories)
train = lgb.Dataset(
data=pd.get_dummies(X_train["brand_name"], sparse=True),
label=y_train
)
valid = lgb.Dataset(
data=pd.get_dummies(X_test["brand_name"], sparse=True),
label=y_test,
reference=train
)
model = lgb.train(params,
num_boost_round=500,
train_set=train,
valid_sets=[train, valid],
verbose_eval=50,
early_stopping_rounds=25)
Most helpful comment
example code: