Lightgbm: Dummies result much better than categorical

Created on 2 Dec 2017 · 13Comments · Source: microsoft/LightGBM

Thanks a lot for making lightgbm available, I am very impressed by the performance!

I am trying to use the built-in categorical features support, but am getting consistently much better result using one-hot encoding instead. I did not expect this from the various comments on the implementation, but after reviewing the related docs & issues I'm not sure where I am going wrong.

As a simple regression example with two categorical variables:

X1= np.repeat(np.arange(10), 1000)
X2= np.repeat(np.arange(10), 1000)
np.random.shuffle(X2)
y = (X1 + np.random.randn(10000)) * (X2  + np.random.randn(10000))
data = pd.DataFrame({'y': y, 'X1': X1, 'X2': X2});

running

lgb_params = {'learning_rate'    : 0.1,
              'boosting'         : 'dart',
              'objective'        : 'regression',
              'metric'           : 'rmse',
              'feature_fraction' : 0.9,
              'bagging_fraction' : 0.75,
              'num_leaves'       : 31,
              'bagging_freq'     : 1,
              'min_data_per_leaf': 250}

lgb_train = lgb.Dataset(data=data[['X1', 'X2']], label=data.y, categorical_feature=['X1', 'X2'])

cv = lgb.cv(lgb_params, 
              lgb_train, 
              num_boost_round=100, 
              early_stopping_rounds=15,
              stratified=False, 
              verbose_eval=50) 

pd.DataFrame(cv).min()

yields:

[50]    cv_agg's rmse: 12.4452 + 0.0824018
Out[265]:
rmse-mean    12.063244
rmse-stdv     0.058267

but

lgb_train = lgb.Dataset(data=pd.get_dummies(data, columns=['X1', 'X2']), label=data.y)

cv = lgb.cv(lgb_params, 
              lgb_train, 
              num_boost_round=100, 
              early_stopping_rounds=25,
              stratified=False, 
              verbose_eval=50) 

pd.DataFrame(cv).min()

yields:

[50]    cv_agg's rmse: 9.08683 + 0.860496
rmse-mean    8.666307
rmse-stdv    0.385004

Source

stefan-jansen

Most helpful comment

example code:

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv('data.csv.gz', compression='gzip')
print(df.info())


print('Unique brands:', df.brand_name.nunique())
print('Number of obs:', df.brand_name.count())

df["brand_name"] = df["brand_name"].astype("category")
X = df[["brand_name"]]
y = np.log1p(df.price)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train = lgb.Dataset(
            data=X_train,
            label=y_train
        )


valid = lgb.Dataset(
            data=X_test,
            label=y_test,
            reference=train
        )

params = {'learning_rate'    : 0.05,
          'boosting'         : 'gbdt',
          'metric'           : 'rmse',
          'feature_fraction' : 1,
          'bagging_fraction' : 1,
          'max_depth': 6,
          'num_leaves'       : 31,
          'objective'        : 'regression',
          'bagging_freq'     : 1,
          "verbose": -1,
          'min_data_per_leaf': 100}

model = lgb.train(params, 
            num_boost_round=500,
            train_set=train,
            valid_sets=[train, valid],
            verbose_eval=50,
            early_stopping_rounds=25)

categories = X_train["brand_name"].cat.categories
X_test["brand_name"] = X_test["brand_name"].cat.set_categories(categories)

train = lgb.Dataset(
            data=pd.get_dummies(X_train["brand_name"], sparse=True), 
            label=y_train   
        )

valid = lgb.Dataset(
            data=pd.get_dummies(X_test["brand_name"], sparse=True), 
            label=y_test,
            reference=train   
        )


model = lgb.train(params, 
            num_boost_round=500,
            train_set=train,
            valid_sets=[train, valid],
            verbose_eval=50,
            early_stopping_rounds=25)

guolinke on 4 Dec 2017

👍4

All 13 comments

There is no free lunch. A algorithm cannot work in all situations. However, I test it in many real world datasets, most of the new results are better.

In your case, the categories are small, as a result, the one hot can work well.

As for new categorical feature algorithm, can you provide the training error? it can be overfitting.

guolinke on 3 Dec 2017

Fair enough. I did create the mock example because I came across this when trying to use a variable with 3,000 distinct categories and found the same behavior. The data is from the kaggle mercari competition: https://www.kaggle.com/c/mercari-price-suggestion-challenge/data.

I'm using the brand_name to predict the log price:

brand_freq = train_df.brand_name.value_counts()
rare_brands = brand_freq[brand_freq < 15].index.tolist()
train_df.loc[train_df.brand_name.isin(rare_brands), 'brand_name'] == 'Other'

print(train_df.brand_name.nunique())

train_df.brand_name.fillna('Unknown', inplace=True)

train = lgb.Dataset(
            data=LabelEncoder().fit_transform(train_df.brand_name).reshape(-1, 1),
            label=np.log1p(train_df.price),
    categorical_feature=[0]    
        )

lgb_params = {'learning_rate'    : 0.05,
              'boosting'         : 'dart',
              'metric'           : 'rmse',
              'feature_fraction' : 0.75,
              'bagging_fraction' : 0.75,
              'max_depth': 10,
              'num_leaves'       : 31,
              'objective'        : 'regression',
              'bagging_freq'     : 1,
              'min_data_per_leaf': 250}

cv = lgb.cv(lgb_params, 
            num_boost_round=100,
            train_set=train, 
            stratified=False, 
            verbose_eval=1,
            early_stopping_rounds=25)

which results in:

[1] cv_agg's rmse: 0.748823 + 0.00217222
[2] cv_agg's rmse: 3.07093 + 0.00207028
[3] cv_agg's rmse: 6.00335 + 0.00256017
[4] cv_agg's rmse: 8.96602 + 0.00298171
[5] cv_agg's rmse: 11.9364 + 0.00338412
[6] cv_agg's rmse: 14.91 + 0.00377867
[7] cv_agg's rmse: 17.8851 + 0.00416925
[8] cv_agg's rmse: 20.8611 + 0.00455755
[9] cv_agg's rmse: 23.8376 + 0.00494443
[10]    cv_agg's rmse: 26.8145 + 0.00533036
[11]    cv_agg's rmse: 29.7917 + 0.00571563
[12]    cv_agg's rmse: 32.7691 + 0.00610041
[13]    cv_agg's rmse: 35.7466 + 0.00648482
[14]    cv_agg's rmse: 38.7242 + 0.00686896
[15]    cv_agg's rmse: 41.702 + 0.00725288
[16]    cv_agg's rmse: 44.6797 + 0.00763663
[17]    cv_agg's rmse: 47.6576 + 0.00802023
[18]    cv_agg's rmse: 50.6355 + 0.00840372
[19]    cv_agg's rmse: 53.6134 + 0.00878711
[20]    cv_agg's rmse: 56.5913 + 0.00917041
[21]    cv_agg's rmse: 59.5693 + 0.00955365
[22]    cv_agg's rmse: 62.5473 + 0.00993683
[23]    cv_agg's rmse: 65.5254 + 0.01032
[24]    cv_agg's rmse: 68.5034 + 0.010703
[25]    cv_agg's rmse: 71.4815 + 0.0110861
[26]    cv_agg's rmse: 74.4595 + 0.0114691

I find the same using the lightgbm test function referenced in another issue:

X = pd.DataFrame({"A": np.random.permutation(['a', 'b', 'c', 'd'] * 75),  # str 
               "B": np.random.permutation([1, 2, 3] * 100),  # int 
               "C": np.random.permutation([0.1, 0.2, -0.1, -0.1, 0.2] * 60),  # float 
               "D": np.random.permutation([True, False] * 150)})  # bool 
y = np.random.permutation([0, 1] * 150) 
X_test = pd.DataFrame({"A": np.random.permutation(['a', 'b', 'e'] * 20), 
                    "B": np.random.permutation([1, 3] * 30), 
                    "C": np.random.permutation([0.1, -0.1, 0.2, 0.2] * 15), 
                    "D": np.random.permutation([True, False] * 30)}) 
for col in ["A", "B", "C", "D"]: 
    X[col] = X[col].astype('category') 
    X_test[col] = X_test[col].astype('category') 
y_test = np.random.permutation([0, 1] * 30)


params = { 
 'objective': 'binary', 
 'metric': 'binary_logloss', 
 'verbose': -1 
}

lgb_train = lgb.Dataset(X, y)
lgb_test = lgb.Dataset(X_test, y_test)

gbm0 = lgb.train(params, lgb_train, num_boost_round=100, verbose_eval=10, valid_sets=[lgb_train, lgb_test]) 

[10]    training's binary_logloss: 0.668304 valid_1's binary_logloss: 0.710231
[20]    training's binary_logloss: 0.645826 valid_1's binary_logloss: 0.71817
[30]    training's binary_logloss: 0.627857 valid_1's binary_logloss: 0.74892
[40]    training's binary_logloss: 0.617312 valid_1's binary_logloss: 0.777039
[50]    training's binary_logloss: 0.611327 valid_1's binary_logloss: 0.805029
[60]    training's binary_logloss: 0.605812 valid_1's binary_logloss: 0.836496
[70]    training's binary_logloss: 0.601962 valid_1's binary_logloss: 0.852982
[80]    training's binary_logloss: 0.599277 valid_1's binary_logloss: 0.867156
[90]    training's binary_logloss: 0.596671 valid_1's binary_logloss: 0.881359
[100]   training's binary_logloss: 0.594423 valid_1's binary_logloss: 0.89199

I'm of course assuming I'm doing something wrong but it's not clear where things are going wrong. I've tried using categorical types, column names instead of indices, etc. In all cases, one-hot delivers the expected improvement in the objective.

stefan-jansen on 3 Dec 2017

Seems to work fine under the last commit c2191aa, strangely does better than xgboost but this is just because of different default hyperparameters:

> library(lightgbm)
> 
> set.seed(1)
> mat <- matrix(round(runif(20000, min = -0.5, max = 9.5)), nrow = 10000)
> table(mat)
mat
   0    1    2    3    4    5    6    7    8    9 
2041 1959 2037 1981 2023 1977 1961 1951 1994 2076 
> 
> y <- (mat[, 1] + runif(10000, 0, 1)) * (mat[, 2] * runif(10000, 0, 1))
> dtrain <- lgb.Dataset(data = mat, label = y)
> 
> params <- list(objective = "regression", metric = "rmse")
> set.seed(1)
> model <- lgb.cv(params,
+                 dtrain,
+                 10000,
+                 nfold = 5,
+                 learning_rate = 0.1,
+                 early_stopping_rounds = 10,
+                 verbose = -1)
> model$best_score
[1] -8.910689
> 
> 
> rm(dtrain)
> dtrain <- lgb.Dataset(data = mat, label = y, categorical_feature = c(1, 2), colnames = c("X1", "X2"))
> params <- list(objective = "regression", metric = "rmse")
> set.seed(1)
> model <- lgb.cv(params,
+                 dtrain,
+                 10000,
+                 nfold = 5,
+                 learning_rate = 0.1,
+                 early_stopping_rounds = 10,
+                 verbose = -1)
> model$best_score
[1] -8.910112
> 
> 
> library(xgboost)
> dtrain <- xgb.DMatrix(data = mat, label = y)
> model <- xgb.cv(data = dtrain,
+                 nrounds = 10000,
+                 nfold = 5,
+                 metrics = "rmse",
+                 max_depth = 0,
+                 max_leaves = 63,
+                 eta = 0.1,
+                 objective = "reg:linear",
+                 early_stopping_rounds = 10,
+                 tree_method = "hist",
+                 grow_policy = "lossguide",
+                 nthread = 1,
+                 verbose = 0)
[18:04:37] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[18:04:37] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[18:04:37] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[18:04:37] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[18:04:37] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
> model$evaluation_log$test_rmse_mean[model$best_iteration]
[1] 8.933057

Copy & paste code in R:

library(lightgbm)

set.seed(1)
mat <- matrix(round(runif(20000, min = -0.5, max = 9.5)), nrow = 10000)
table(mat)

y <- (mat[, 1] + runif(10000, 0, 1)) * (mat[, 2] * runif(10000, 0, 1))
dtrain <- lgb.Dataset(data = mat, label = y)

params <- list(objective = "regression", metric = "rmse")
set.seed(1)
model <- lgb.cv(params,
                dtrain,
                10000,
                nfold = 5,
                learning_rate = 0.1,
                early_stopping_rounds = 10,
                verbose = -1)
model$best_score


rm(dtrain)
dtrain <- lgb.Dataset(data = mat, label = y, categorical_feature = c(1, 2), colnames = c("X1", "X2"))
params <- list(objective = "regression", metric = "rmse")
set.seed(1)
model <- lgb.cv(params,
                dtrain,
                10000,
                nfold = 5,
                learning_rate = 0.1,
                early_stopping_rounds = 10,
                verbose = -1)
model$best_score


library(xgboost)
dtrain <- xgb.DMatrix(data = mat, label = y)
model <- xgb.cv(data = dtrain,
                nrounds = 10000,
                nfold = 5,
                metrics = "rmse",
                max_depth = 0,
                max_leaves = 63,
                eta = 0.1,
                objective = "reg:linear",
                early_stopping_rounds = 10,
                tree_method = "hist",
                grow_policy = "lossguide",
                nthread = 1,
                verbose = 0)
model$evaluation_log$test_rmse_mean[model$best_iteration]

Laurae2 on 3 Dec 2017

Thanks again, I really appreciate your quick response. I've reinstalled the python package as below, yet still receive the same results. Do you see anything that suggests this installation could yield different results?

$ git clone --recursive https://github.com/Microsoft/LightGBM.git
Cloning into 'LightGBM'...
remote: Counting objects: 8922, done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 8922 (delta 1), reused 2 (delta 0), pack-reused 8899
Receiving objects: 100% (8922/8922), 7.27 MiB | 0 bytes/s, done.
Resolving deltas: 100% (6205/6205), done.
Checking connectivity... done.
Submodule 'include/boost/compute' (https://github.com/boostorg/compute) registered for path 'compute'
Cloning into 'compute'...
remote: Counting objects: 21244, done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 21244 (delta 3), reused 9 (delta 3), pack-reused 21230
Receiving objects: 100% (21244/21244), 8.41 MiB | 0 bytes/s, done.
Resolving deltas: 100% (17249/17249), done.
Checking connectivity... done.
Submodule path 'compute': checked out '6de7f6448796f67958dde8de4569fb1ae649ee91'
stefan@applied-ai:~/src$ cd LightGBM/python-package
stefan@applied-ai:~/src/LightGBM/python-package$ python setup.py install
running install
creating compile
creating compile/include
creating compile/include/LightGBM
copying ../include/LightGBM/R_object_helper.h -> ./compile/include/LightGBM
copying ../include/LightGBM/lightgbm_R.h -> ./compile/include/LightGBM
copying ../include/LightGBM/network.h -> ./compile/include/LightGBM
copying ../include/LightGBM/dataset.h -> ./compile/include/LightGBM
copying ../include/LightGBM/config.h -> ./compile/include/LightGBM
copying ../include/LightGBM/meta.h -> ./compile/include/LightGBM
copying ../include/LightGBM/feature_group.h -> ./compile/include/LightGBM
copying ../include/LightGBM/prediction_early_stop.h -> ./compile/include/LightGBM
creating compile/include/LightGBM/utils
copying ../include/LightGBM/utils/text_reader.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/pipeline_reader.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/log.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/threading.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/array_args.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/openmp_wrapper.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/common.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/utils/random.h -> ./compile/include/LightGBM/utils
copying ../include/LightGBM/boosting.h -> ./compile/include/LightGBM
copying ../include/LightGBM/application.h -> ./compile/include/LightGBM
copying ../include/LightGBM/c_api.h -> ./compile/include/LightGBM
copying ../include/LightGBM/tree.h -> ./compile/include/LightGBM
copying ../include/LightGBM/bin.h -> ./compile/include/LightGBM
copying ../include/LightGBM/objective_function.h -> ./compile/include/LightGBM
copying ../include/LightGBM/metric.h -> ./compile/include/LightGBM
copying ../include/LightGBM/export.h -> ./compile/include/LightGBM
copying ../include/LightGBM/tree_learner.h -> ./compile/include/LightGBM
copying ../include/LightGBM/dataset_loader.h -> ./compile/include/LightGBM
creating compile/src
creating compile/src/metric
copying ../src/metric/regression_metric.hpp -> ./compile/src/metric
copying ../src/metric/dcg_calculator.cpp -> ./compile/src/metric
copying ../src/metric/map_metric.hpp -> ./compile/src/metric
copying ../src/metric/binary_metric.hpp -> ./compile/src/metric
copying ../src/metric/rank_metric.hpp -> ./compile/src/metric
copying ../src/metric/xentropy_metric.hpp -> ./compile/src/metric
copying ../src/metric/multiclass_metric.hpp -> ./compile/src/metric
copying ../src/metric/metric.cpp -> ./compile/src/metric
creating compile/src/treelearner
copying ../src/treelearner/parallel_tree_learner.h -> ./compile/src/treelearner
copying ../src/treelearner/serial_tree_learner.cpp -> ./compile/src/treelearner
copying ../src/treelearner/feature_parallel_tree_learner.cpp -> ./compile/src/treelearner
copying ../src/treelearner/feature_histogram.hpp -> ./compile/src/treelearner
copying ../src/treelearner/gpu_tree_learner.cpp -> ./compile/src/treelearner
copying ../src/treelearner/split_info.hpp -> ./compile/src/treelearner
copying ../src/treelearner/leaf_splits.hpp -> ./compile/src/treelearner
copying ../src/treelearner/voting_parallel_tree_learner.cpp -> ./compile/src/treelearner
copying ../src/treelearner/data_parallel_tree_learner.cpp -> ./compile/src/treelearner
copying ../src/treelearner/serial_tree_learner.h -> ./compile/src/treelearner
copying ../src/treelearner/gpu_tree_learner.h -> ./compile/src/treelearner
copying ../src/treelearner/data_partition.hpp -> ./compile/src/treelearner
creating compile/src/treelearner/ocl
copying ../src/treelearner/ocl/histogram16.cl -> ./compile/src/treelearner/ocl
copying ../src/treelearner/ocl/histogram64.cl -> ./compile/src/treelearner/ocl
copying ../src/treelearner/ocl/histogram256.cl -> ./compile/src/treelearner/ocl
copying ../src/treelearner/tree_learner.cpp -> ./compile/src/treelearner
creating compile/src/boosting
copying ../src/boosting/boosting.cpp -> ./compile/src/boosting
copying ../src/boosting/dart.hpp -> ./compile/src/boosting
copying ../src/boosting/gbdt.cpp -> ./compile/src/boosting
copying ../src/boosting/goss.hpp -> ./compile/src/boosting
copying ../src/boosting/rf.hpp -> ./compile/src/boosting
copying ../src/boosting/gbdt.h -> ./compile/src/boosting
copying ../src/boosting/gbdt_prediction.cpp -> ./compile/src/boosting
copying ../src/boosting/score_updater.hpp -> ./compile/src/boosting
copying ../src/boosting/prediction_early_stop.cpp -> ./compile/src/boosting
copying ../src/boosting/gbdt_model_text.cpp -> ./compile/src/boosting
copying ../src/lightgbm_R.cpp -> ./compile/src
creating compile/src/objective
copying ../src/objective/binary_objective.hpp -> ./compile/src/objective
copying ../src/objective/multiclass_objective.hpp -> ./compile/src/objective
copying ../src/objective/xentropy_objective.hpp -> ./compile/src/objective
copying ../src/objective/objective_function.cpp -> ./compile/src/objective
copying ../src/objective/regression_objective.hpp -> ./compile/src/objective
copying ../src/objective/rank_objective.hpp -> ./compile/src/objective
copying ../src/c_api.cpp -> ./compile/src
creating compile/src/application
copying ../src/application/predictor.hpp -> ./compile/src/application
copying ../src/application/application.cpp -> ./compile/src/application
copying ../src/main.cpp -> ./compile/src
creating compile/src/io
copying ../src/io/dense_bin.hpp -> ./compile/src/io
copying ../src/io/parser.hpp -> ./compile/src/io
copying ../src/io/dataset.cpp -> ./compile/src/io
copying ../src/io/ordered_sparse_bin.hpp -> ./compile/src/io
copying ../src/io/tree.cpp -> ./compile/src/io
copying ../src/io/bin.cpp -> ./compile/src/io
copying ../src/io/dense_nbits_bin.hpp -> ./compile/src/io
copying ../src/io/metadata.cpp -> ./compile/src/io
copying ../src/io/sparse_bin.hpp -> ./compile/src/io
copying ../src/io/dataset_loader.cpp -> ./compile/src/io
copying ../src/io/config.cpp -> ./compile/src/io
copying ../src/io/parser.cpp -> ./compile/src/io
creating compile/src/network
copying ../src/network/linkers_mpi.cpp -> ./compile/src/network
copying ../src/network/network.cpp -> ./compile/src/network
copying ../src/network/linkers_socket.cpp -> ./compile/src/network
copying ../src/network/linkers.h -> ./compile/src/network
copying ../src/network/linker_topo.cpp -> ./compile/src/network
copying ../src/network/socket_wrapper.hpp -> ./compile/src/network
copying ../windows/LightGBM.sln -> ./compile/windows
copying ../windows/LightGBM.vcxproj -> ./compile/windows
copying ../CMakeLists.txt -> ./compile/
copying ../LICENSE -> ./
INFO:LightGBM:Starting to compile the library.
INFO:LightGBM:Starting to compile with CMake.
running build
running build_py
INFO:root:Generating grammar tables from /usr/lib/python3.5/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.5/lib2to3/PatternGrammar.txt
creating build
creating build/lib
creating build/lib/lightgbm
copying lightgbm/libpath.py -> build/lib/lightgbm
copying lightgbm/engine.py -> build/lib/lightgbm
copying lightgbm/basic.py -> build/lib/lightgbm
copying lightgbm/sklearn.py -> build/lib/lightgbm
copying lightgbm/callback.py -> build/lib/lightgbm
copying lightgbm/compat.py -> build/lib/lightgbm
copying lightgbm/__init__.py -> build/lib/lightgbm
copying lightgbm/plotting.py -> build/lib/lightgbm
running egg_info
creating lightgbm.egg-info
writing top-level names to lightgbm.egg-info/top_level.txt
writing dependency_links to lightgbm.egg-info/dependency_links.txt
writing requirements to lightgbm.egg-info/requires.txt
writing lightgbm.egg-info/PKG-INFO
writing manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'build'
warning: no files found matching '*.txt'
warning: no files found matching '*.so' under directory 'lightgbm'
warning: no files found matching '*.dll' under directory 'compile/Release'
warning: no files found matching '*' under directory 'compile/compute'
warning: no files found matching 'LightGBM.vcxproj.filters' under directory 'compile/windows'
warning: no files found matching '*.dll' under directory 'compile/windows/x64/DLL'
warning: no previously-included files matching '*.py[co]' found anywhere in distribution
writing manifest file 'lightgbm.egg-info/SOURCES.txt'
copying lightgbm/VERSION.txt -> build/lib/lightgbm
running install_lib
creating /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/libpath.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/engine.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/basic.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/sklearn.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/callback.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/VERSION.txt -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/compat.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/__init__.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
copying build/lib/lightgbm/plotting.py -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
INFO:root:Installing lib_lightgbm from: ['compile/lib_lightgbm.so']
copying compile/lib_lightgbm.so -> /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/libpath.py to libpath.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/engine.py to engine.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/basic.py to basic.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/sklearn.py to sklearn.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/callback.py to callback.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/compat.py to compat.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/__init__.py to __init__.cpython-35.pyc
byte-compiling /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/plotting.py to plotting.cpython-35.pyc
running install_egg_info
Copying lightgbm.egg-info to /home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm-2.0.11-py3.5.egg-info
running install_scripts

stefan-jansen on 3 Dec 2017

@stefai
from your result, they are all overfitting, you can see the validation error increasing.

I guess your data is too small. All my test is based on datasets with millions instances.
For the small dataset, maybe the new algorithm is easily overfitting.

I think you may need to tune the parameters to avoid overfitting, Or only use the new algorithm for large datasets.

guolinke on 3 Dec 2017

@guolinke
I posted a toy example to illustrate the issue, but I came across this using 1.5m observations and 5,000 categories (at least 50 obs/category). It works ok using 1-hot but fails to improve on even a single step using categorical_feature, it rather deteriorates dramatically. I am using version 2.11 and have tried a range of parameters and am at a loss what's going on here.

Setup:

lgb_params = {'learning_rate'    : 0.05,
              'boosting'         : 'dart',
              'metric'           : 'rmse',
              'feature_fraction' : 0.75,
              'bagging_fraction' : 0.75,
              'max_depth': 10,
              'num_leaves'       : 61,
              'objective'        : 'regression',
              'bagging_freq'     : 1,
              'min_data_per_leaf': 250}

brand_freq = train_df.brand_name.value_counts()
rare_brands = brand_freq[brand_freq < 50].index.tolist()

train_df.loc[train_df.brand_name.isin(rare_brands), 'brand_name'] == 'Other'
train_df.brand_name.fillna('Unknown', inplace=True)

print('Unique brands:', train_df.brand_name.nunique())
print('Number of obs:', train_df.brand_name.count())

Unique brands: 4810
Number of obs: 1482535

Example using categorical_feature:

train = lgb.Dataset(
            data=pd.Series(LabelEncoder().fit_transform(train_df.brand_name)).to_frame('brand'),
            label=np.log1p(train_df.price),
            categorical_feature=['brand']    
        )

cv = lgb.cv(lgb_params, 
            num_boost_round=100,
            train_set=train, 
            stratified=False, 
            verbose_eval=1,
            categorical_feature=['brand'],
            early_stopping_rounds=25)
[1] cv_agg's rmse: 0.749208 + 0.00158038
[2] cv_agg's rmse: 3.07182 + 0.00108038
[3] cv_agg's rmse: 6.00504 + 0.00150103
[4] cv_agg's rmse: 8.96853 + 0.00180245
[5] cv_agg's rmse: 11.9398 + 0.0020732
[6] cv_agg's rmse: 14.9141 + 0.00233171
[7] cv_agg's rmse: 17.89 + 0.00258413
[8] cv_agg's rmse: 20.8669 + 0.00283307
[9] cv_agg's rmse: 23.8442 + 0.00307984
[10]    cv_agg's rmse: 26.822 + 0.00332518
[11]    cv_agg's rmse: 29.8 + 0.00356951
[12]    cv_agg's rmse: 32.7782 + 0.00381312
[13]    cv_agg's rmse: 35.7566 + 0.00405617
[14]    cv_agg's rmse: 38.735 + 0.00429881
[15]    cv_agg's rmse: 41.7136 + 0.00454112
[16]    cv_agg's rmse: 44.6922 + 0.00478316
[17]    cv_agg's rmse: 47.6708 + 0.00502499
[18]    cv_agg's rmse: 50.6495 + 0.00526665
[19]    cv_agg's rmse: 53.6283 + 0.00550815
[20]    cv_agg's rmse: 56.6071 + 0.00574954
[21]    cv_agg's rmse: 59.5859 + 0.00599082
[22]    cv_agg's rmse: 62.5647 + 0.00623201
[23]    cv_agg's rmse: 65.5436 + 0.00647312
[24]    cv_agg's rmse: 68.5225 + 0.00671417
[25]    cv_agg's rmse: 71.5013 + 0.00695516
[26]    cv_agg's rmse: 74.4802 + 0.00719609

and using dummy variables:

train = lgb.Dataset(
            data=csr_matrix(pd.get_dummies(train_df.brand_name, sparse=True).to_coo(), dtype=np.float32),
            label=np.log1p(train_df.price)   
        )
cv = lgb.cv(lgb_params, 
            num_boost_round=100,
            train_set=train, 
            stratified=False, 
            verbose_eval=1,
            early_stopping_rounds=25)

[1] cv_agg's rmse: 0.746013 + 0.00157357
[2] cv_agg's rmse: 0.743435 + 0.00156749
[3] cv_agg's rmse: 0.740547 + 0.0015579
[4] cv_agg's rmse: 0.737792 + 0.00154393
[5] cv_agg's rmse: 0.735673 + 0.00152661
[6] cv_agg's rmse: 0.733435 + 0.00152012
[7] cv_agg's rmse: 0.731248 + 0.00151158
[8] cv_agg's rmse: 0.732042 + 0.00151096
[9] cv_agg's rmse: 0.729926 + 0.00149891
[10]    cv_agg's rmse: 0.727943 + 0.00149317
[11]    cv_agg's rmse: 0.726249 + 0.00148875
[12]    cv_agg's rmse: 0.726776 + 0.00149299
[13]    cv_agg's rmse: 0.725156 + 0.0014948
[14]    cv_agg's rmse: 0.723699 + 0.00149512
[15]    cv_agg's rmse: 0.722188 + 0.00147971
[16]    cv_agg's rmse: 0.720832 + 0.0014719
[17]    cv_agg's rmse: 0.719681 + 0.00146852
[18]    cv_agg's rmse: 0.718513 + 0.00146325
[19]    cv_agg's rmse: 0.717504 + 0.00145445
[20]    cv_agg's rmse: 0.716506 + 0.00144417
[21]    cv_agg's rmse: 0.716806 + 0.00144845
[22]    cv_agg's rmse: 0.715941 + 0.00143478
[23]    cv_agg's rmse: 0.715052 + 0.00143763
[24]    cv_agg's rmse: 0.714228 + 0.00143889
[25]    cv_agg's rmse: 0.713396 + 0.00143754
[26]    cv_agg's rmse: 0.712635 + 0.00143152
[27]    cv_agg's rmse: 0.711891 + 0.00142661
[28]    cv_agg's rmse: 0.71223 + 0.00142976
[29]    cv_agg's rmse: 0.711512 + 0.00142371
[30]    cv_agg's rmse: 0.710858 + 0.00143041
[31]    cv_agg's rmse: 0.71102 + 0.00143371
[32]    cv_agg's rmse: 0.710363 + 0.00143784
[33]    cv_agg's rmse: 0.70973 + 0.00143083
[34]    cv_agg's rmse: 0.709121 + 0.00143233
[35]    cv_agg's rmse: 0.709464 + 0.00143547
[36]    cv_agg's rmse: 0.710055 + 0.00143896
[37]    cv_agg's rmse: 0.709456 + 0.00143036
[38]    cv_agg's rmse: 0.708874 + 0.00142435
[39]    cv_agg's rmse: 0.708316 + 0.00142331
[40]    cv_agg's rmse: 0.708216 + 0.00141978
[41]    cv_agg's rmse: 0.708609 + 0.00142266
[42]    cv_agg's rmse: 0.708075 + 0.00142478
[43]    cv_agg's rmse: 0.708427 + 0.00142704
[44]    cv_agg's rmse: 0.70788 + 0.00142551
[45]    cv_agg's rmse: 0.707351 + 0.00141685
[46]    cv_agg's rmse: 0.791467 + 0.00170709
[47]    cv_agg's rmse: 0.783244 + 0.00170456
[48]    cv_agg's rmse: 0.77969 + 0.00169885
[49]    cv_agg's rmse: 0.778683 + 0.00169826
[50]    cv_agg's rmse: 0.779115 + 0.00169785
[51]    cv_agg's rmse: 0.771977 + 0.00169143
[52]    cv_agg's rmse: 0.765475 + 0.00168099
[53]    cv_agg's rmse: 0.766355 + 0.00168197
[54]    cv_agg's rmse: 0.760354 + 0.00168411
[55]    cv_agg's rmse: 0.754844 + 0.00167271
[56]    cv_agg's rmse: 0.755328 + 0.00167475
[57]    cv_agg's rmse: 0.750248 + 0.00165985
[58]    cv_agg's rmse: 0.930166 + 0.00171325
[59]    cv_agg's rmse: 0.928975 + 0.00171676
[60]    cv_agg's rmse: 0.90931 + 0.00171309
[61]    cv_agg's rmse: 1.12278 + 0.00164735
[62]    cv_agg's rmse: 1.08892 + 0.00165832
[63]    cv_agg's rmse: 1.05741 + 0.00166842
[64]    cv_agg's rmse: 1.0518 + 0.00167101
[65]    cv_agg's rmse: 1.05332 + 0.00166839
[66]    cv_agg's rmse: 1.02436 + 0.00168856
[67]    cv_agg's rmse: 0.997464 + 0.00169999
[68]    cv_agg's rmse: 0.972527 + 0.00170774
[69]    cv_agg's rmse: 0.971796 + 0.00170561
[70]    cv_agg's rmse: 0.969266 + 0.00170443

stefan-jansen on 3 Dec 2017

@stefai I guess there are something wrong for CV in python package.
The result is very abnormal. Can you try the train-valid ?

guolinke on 4 Dec 2017

@guolinke
I'm afraid I'm getting the same result using train/valid instead of cv. I tried all objective options with the same result (loss remains constant for rf. Not sure if the warning is indicative of anything material ging wrong. Also attaching the data so you can check if this is due to my setup.

brand_freq = train_df.brand_name.value_counts()
rare_brands = brand_freq[brand_freq < 50].index.tolist()

train_df.loc[train_df.brand_name.isin(rare_brands), 'brand_name'] == 'Other'
train_df.brand_name.fillna('Unknown', inplace=True)

print('Unique brands:', train_df.brand_name.nunique())
print('Number of obs:', train_df.brand_name.count())

X = pd.Series(LabelEncoder().fit_transform(train_df.brand_name)).to_frame('brand')
y = np.log1p(train_df.price)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train = lgb.Dataset(
            data=X_train,
            label=y_train,
            categorical_feature=['brand']    
        )

valid = lgb.Dataset(
            data=X_test,
            label=y_test,
            categorical_feature=['brand'], 
            reference=train
        )

model = lgb.train(lgb_params, 
            num_boost_round=num_boost_round,
            train_set=train,
            valid_sets=[train, valid],
            verbose_eval=1,
            categorical_feature=['brand'],
            early_stopping_rounds=25)

Unique brands: 4810
Number of obs: 1482535
/home/stefan/.virtualenvs/kaggle/lib/python3.5/site-packages/lightgbm/basic.py:662: UserWarning: categorical_feature in param dict is overrided.
  warnings.warn('categorical_feature in param dict is overrided.')
[1] training's rmse: 0.748742   valid_1's rmse: 0.751076
Training until validation scores don't improve for 25 rounds.
[2] training's rmse: 3.07115    valid_1's rmse: 3.0689
[3] training's rmse: 6.00383    valid_1's rmse: 6.00124
[4] training's rmse: 8.96675    valid_1's rmse: 8.96406
[5] training's rmse: 11.9374    valid_1's rmse: 11.9347
[6] training's rmse: 14.9112    valid_1's rmse: 14.9084
[7] training's rmse: 17.8866    valid_1's rmse: 17.8838
[8] training's rmse: 20.8628    valid_1's rmse: 20.86
[9] training's rmse: 23.8396    valid_1's rmse: 23.8368
[10]    training's rmse: 26.8168    valid_1's rmse: 26.8139
[11]    training's rmse: 29.7942    valid_1's rmse: 29.7914
[12]    training's rmse: 32.7718    valid_1's rmse: 32.769
[13]    training's rmse: 35.7496    valid_1's rmse: 35.7467
[14]    training's rmse: 38.7275    valid_1's rmse: 38.7246
[15]    training's rmse: 41.7054    valid_1's rmse: 41.7026
[16]    training's rmse: 44.6835    valid_1's rmse: 44.6806
[17]    training's rmse: 47.6615    valid_1's rmse: 47.6587
[18]    training's rmse: 50.6397    valid_1's rmse: 50.6368
[19]    training's rmse: 53.6178    valid_1's rmse: 53.615
[20]    training's rmse: 56.5961    valid_1's rmse: 56.5932
[21]    training's rmse: 59.5743    valid_1's rmse: 59.5714
[22]    training's rmse: 62.5525    valid_1's rmse: 62.5497
[23]    training's rmse: 65.5308    valid_1's rmse: 65.5279
[24]    training's rmse: 68.5091    valid_1's rmse: 68.5062
[25]    training's rmse: 71.4874    valid_1's rmse: 71.4845
[26]    training's rmse: 74.4657    valid_1's rmse: 74.4629
Early stopping, best iteration is:
[1] training's rmse: 0.748742   valid_1's rmse: 0.751076

data.csv.gz

stefan-jansen on 4 Dec 2017

@stefai can you provide full script ? It seems I cannot run your script directly

guolinke on 4 Dec 2017

@guolinke
Here you go. Thanks again for your quick response!

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv('data.csv.gz', compression='gzip')
print(df.info())

brand_freq = df.brand_name.value_counts()
rare_brands = brand_freq[brand_freq < 50].index.tolist()

df.loc[df.brand_name.isin(rare_brands), 'brand_name'] == 'Other'
df.brand_name.fillna('Unknown', inplace=True)

print('Unique brands:', df.brand_name.nunique())
print('Number of obs:', df.brand_name.count())

X = pd.Series(LabelEncoder().fit_transform(df.brand_name)).to_frame('brand')
y = np.log1p(df.price)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train = lgb.Dataset(
            data=X_train,
            label=y_train,
            categorical_feature=['brand']    
        )


valid = lgb.Dataset(
            data=X_test,
            label=y_test,
            categorical_feature=['brand'], 
            reference=train
        )

params = {'learning_rate'    : 0.05,
          'boosting'         : 'gbdt',
          'metric'           : 'rmse',
          'feature_fraction' : 0.75,
          'bagging_fraction' : 0.75,
          'max_depth': 10,
          'num_leaves'       : 61,
          'objective'        : 'regression',
          'bagging_freq'     : 1,
          'min_data_per_leaf': 250}

model = lgb.train(params, 
            num_boost_round=100,
            train_set=train,
            valid_sets=[train, valid],
            verbose_eval=1,
            categorical_feature=['brand'],
            early_stopping_rounds=25)

stefan-jansen on 4 Dec 2017

@stefai
the reason is the "feature_fraction". If you set it to "1", you can find the result is much better.

I guess there is a bug for the "feature_fraction", when #feature is very small, it may use zero features, which causes the bad accuracy.

guolinke on 4 Dec 2017

@guolinke thanks a lot, much appreciated, works fine now. I guess I could have figured this out myself, but perhaps it may make sense to add a note in the docs until this is fixed?

stefan-jansen on 4 Dec 2017

example code:

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv('data.csv.gz', compression='gzip')
print(df.info())


print('Unique brands:', df.brand_name.nunique())
print('Number of obs:', df.brand_name.count())

df["brand_name"] = df["brand_name"].astype("category")
X = df[["brand_name"]]
y = np.log1p(df.price)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train = lgb.Dataset(
            data=X_train,
            label=y_train
        )


valid = lgb.Dataset(
            data=X_test,
            label=y_test,
            reference=train
        )

params = {'learning_rate'    : 0.05,
          'boosting'         : 'gbdt',
          'metric'           : 'rmse',
          'feature_fraction' : 1,
          'bagging_fraction' : 1,
          'max_depth': 6,
          'num_leaves'       : 31,
          'objective'        : 'regression',
          'bagging_freq'     : 1,
          "verbose": -1,
          'min_data_per_leaf': 100}

model = lgb.train(params, 
            num_boost_round=500,
            train_set=train,
            valid_sets=[train, valid],
            verbose_eval=50,
            early_stopping_rounds=25)

categories = X_train["brand_name"].cat.categories
X_test["brand_name"] = X_test["brand_name"].cat.set_categories(categories)

train = lgb.Dataset(
            data=pd.get_dummies(X_train["brand_name"], sparse=True), 
            label=y_train   
        )

valid = lgb.Dataset(
            data=pd.get_dummies(X_test["brand_name"], sparse=True), 
            label=y_test,
            reference=train   
        )


model = lgb.train(params, 
            num_boost_round=500,
            train_set=train,
            valid_sets=[train, valid],
            verbose_eval=50,
            early_stopping_rounds=25)

guolinke on 4 Dec 2017

👍4

Was this page helpful?

0 / 5 - 0 ratings