Hi, my question is about the linear booster. I have posted it on Stack Overflow too but have not gotten an answer yet; maybe it is OK to post it here as well?
Looking on the web, I am still confused about what the linear booster gblinear precisely is, and I am not alone.
According to the documentation it only has 3 parameters: lambda, lambda_bias and alpha (maybe it should say "additional parameters").
If I understand this correctly, the linear booster does (rather standard) linear boosting with regularization. In this context I can only make sense of the 3 parameters above and eta (the learning rate).
That's also how it is described on GitHub.
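To make that concrete, here is a minimal single-threaded sketch in R of what I understand one gblinear boosting round to be: a coordinate-descent pass over the linear coefficients, with lambda acting as L2 shrinkage, alpha as L1 soft-thresholding and lambda_bias regularizing the intercept. I am assuming squared-error loss for simplicity, and gblinear_round is my own name for illustration, not an xgboost function:
# Sketch only: one gblinear-style round under squared-error loss
# (gradient g = pred - y, hessian h = 1). X: numeric matrix, y: numeric
# vector, w: coefficient vector, b: intercept.
gblinear_round <- function(X, y, w, b, eta, lambda, alpha, lambda_bias) {
  pred <- as.numeric(X %*% w) + b
  g <- pred - y
  h <- rep(1, length(y))
  for (j in seq_len(ncol(X))) {
    grad <- sum(g * X[, j]) + lambda * w[j]   # L2 penalty enters here
    hess <- sum(h * X[, j] ^ 2) + lambda
    # L1 penalty (alpha) via soft-thresholding of the proposed step
    delta <- -sign(grad) * max(abs(grad) - alpha, 0) / hess
    w[j] <- w[j] + eta * delta                # eta shrinks each update
    g <- g + h * X[, j] * (eta * delta)       # keep gradients current
  }
  b <- b - eta * sum(g) / (sum(h) + lambda_bias)  # regularized bias step
  list(w = w, b = b)
}
Nothing in such an update has any use for gamma, max_depth or min_child_weight.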
Nevertheless, I see that the tree parameters gamma, max_depth and min_child_weight also have an impact on the algorithm.
How can this be? Is there a completely clear description of the linear booster anywhere on the web?
See my examples:
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
Then the setup
set.seed(100)
model <- xgboost(data = train$data, label = train$label, nrounds = 5,
objective = "binary:logistic",
params = list(booster = "gblinear", eta = 0.5, lambda = 1, lambda_bias = 1,gamma = 2,
early_stopping_rounds = 3))
gives
> [1] train-error:0.018271
> [2] train-error:0.003071
> [3] train-error:0.001075
> [4] train-error:0.001075
> [5] train-error:0.000614
while gamma = 1
set.seed(100)
model <- xgboost(data = train$data, label = train$label, nrounds = 5,
objective = "binary:logistic",
params = list(booster = "gblinear", eta = 0.5, lambda = 1, lambda_bias = 1,gamma = 1,
early_stopping_rounds = 3))
leads to
> [1] train-error:0.013051
> [2] train-error:0.001842
> [3] train-error:0.001075
> [4] train-error:0.001075
> [5] train-error:0.001075
which is another "path".
Similarly for max_depth:
set.seed(100)
model <- xgboost(data = train$data, label = train$label, nrounds = 5,
objective = "binary:logistic",
params = list(booster = "gblinear", eta = 0.5, lambda = 1, lambda_bias = 1, max_depth = 3,
early_stopping_rounds = 3))
> [1] train-error:0.016122
> [2] train-error:0.002764
> [3] train-error:0.001075
> [4] train-error:0.001075
> [5] train-error:0.000768
and
set.seed(100)
model <- xgboost(data = train$data, label = train$label, nrounds = 10,
objective = "binary:logistic",
params = list(booster = "gblinear", eta = 0.5, lambda = 1, lambda_bias = 1, max_depth = 4,
early_stopping_rounds = 3))
> [1] train-error:0.014740
> [2] train-error:0.004453
> [3] train-error:0.001228
> [4] train-error:0.000921
> [5] train-error:0.000614
See this topic.
Since you fixed the seed, you must also set the number of threads to 1 to get reproducibility; you will then notice that some of your parameters have no effect.
Multithreaded gblinear will never reproduce the same results, even when the seed is identical.
> library(xgboost)
>
> data(agaricus.train, package='xgboost')
> data(agaricus.test, package='xgboost')
> train <- agaricus.train
> test <- agaricus.test
> set.seed(100)
> model <- xgboost(data = train$data, label = train$label, nrounds = 5, nthread = 1,
+ objective = "binary:logistic",
+ params = list(booster = "gblinear", eta = 0.5, lambda = 1, lambda_bias = 1,gamma = 2,
+ early_stopping_rounds = 3))
[1] train-error:0.006142
[2] train-error:0.002917
[3] train-error:0.001842
[4] train-error:0.001228
[5] train-error:0.000768
> set.seed(100)
> model <- xgboost(data = train$data, label = train$label, nrounds = 5, nthread = 1,
+ objective = "binary:logistic",
+ params = list(booster = "gblinear", eta = 0.5, lambda = 1, lambda_bias = 1,gamma = 1,
+ early_stopping_rounds = 3))
[1] train-error:0.006142
[2] train-error:0.002917
[3] train-error:0.001842
[4] train-error:0.001228
[5] train-error:0.000768
>
> set.seed(100)
> model <- xgboost(data = train$data, label = train$label, nrounds = 5, nthread = 1,
+ objective = "binary:logistic",
+ params = list(booster = "gblinear", eta = 0.5, lambda = 1, lambda_bias = 1, max_depth = 3,
+ early_stopping_rounds = 3))
[1] train-error:0.006142
[2] train-error:0.002917
[3] train-error:0.001842
[4] train-error:0.001228
[5] train-error:0.000768
>
> set.seed(100)
> model <- xgboost(data = train$data, label = train$label, nrounds = 5, nthread = 1,
+ objective = "binary:logistic",
+ params = list(booster = "gblinear", eta = 0.5, lambda = 1, lambda_bias = 1, max_depth = 4,
+ early_stopping_rounds = 3))
[1] train-error:0.006142
[2] train-error:0.002917
[3] train-error:0.001842
[4] train-error:0.001228
[5] train-error:0.000768
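Conversely, you can see the thread effect directly by training the same model twice with an identical seed but nthread > 1 and comparing the predictions. A small sketch (fit_once is just a local helper; on a machine with a single available core the two runs may actually coincide):
library(xgboost)
data(agaricus.train, package = 'xgboost')
train <- agaricus.train

fit_once <- function() {
  set.seed(100)
  xgboost(data = train$data, label = train$label, nrounds = 5, nthread = 4,
          objective = "binary:logistic",
          params = list(booster = "gblinear", eta = 0.5, lambda = 1, lambda_bias = 1))
}
m1 <- fit_once()
m2 <- fit_once()

# With nthread > 1 the parallel coordinate updates are applied in a
# nondeterministic order, so the learned coefficients, and hence the
# predictions, will typically differ even though the seed is the same:
identical(predict(m1, train$data), predict(m2, train$data))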
Thank you!