At least in the R package, it seems that the base_score defaults to 0.5. For a classification problem, this seems to be a reasonable choice. However, this strikes me as a bizarre choice for a regression problem. Would it make more sense to set base_score=mean(label) by default? This choice might even be an improvement over the default in both regression and classification settings.
If the learning rate/shrinkage were set to 1, the choice of base_score would be irrelevant, as the resulting model would simply compensate for any change in the mean. However, essentially all reasonable implementations of boosted trees use a learning rate much smaller than 1, so in principle the choice of base_score should affect the final model.
I understand that it is extremely straightforward to set the base_score manually, but if there is a way to pick a better default, it may be worth implementing.
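For concreteness, a minimal sketch of setting it manually in R (toy data, purely illustrative; parameter names match the R interface used later in this thread):
library(xgboost)
# toy regression data where the labels are centered far from the 0.5 default
x <- matrix(rnorm(300 * 3), ncol = 3)
y <- 50 + rowSums(x)
bst <- xgboost(data = x, label = y, nrounds = 100, eta = 0.1,
               base_score = mean(y),  # instead of the default 0.5
               verbose = 0)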
Any thoughts?
Normally the choice of base_score won't affect final performance much as long as enough boosting steps are run. But I agree that allowing it to be set to mean(label) could be helpful in some cases.
Agree that the effects should be minimal. I'll run some tests when I have some time
For highly imbalanced data, I sometimes restrict max_delta_step to smaller values. And then it is not so rare to see xgboost struggling for several iterations to produce any improvement. But the learning progress is much better if I provide a meaningful initial base_score value. So, having an adaptive initialization for it would definitely be useful.
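To illustrate the kind of setup I mean (made-up data; parameter values only for illustration):
# hypothetical illustration: heavily imbalanced labels with a restricted max_delta_step
y <- rbinom(1000, 1, 0.02)               # roughly 2% positives
x <- matrix(rnorm(1000 * 5), ncol = 5)
pars <- list(objective = "binary:logistic",
             eta = 0.05,
             max_delta_step = 1,         # restricted, as described above
             base_score = mean(y))       # observed positive rate instead of 0.5 (probability scale)
cv <- xgb.cv(pars, data = xgb.DMatrix(x, label = y),
             nrounds = 200, nfold = 5, verbose = FALSE)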
For highly imbalanced data, why not tune the "scale_pos_weight" parameter for classification problems? In which cases should "scale_pos_weight" be tuned, and in which cases "max_delta_step"?
For a regression task, base_score = mean(label) reduces the number of epochs by about 25 percent (on my dataset). I always set this manually for regression, so I'm voting for this as well.
@gugatr0n1c I am also noticing substantial differences in the number of trees required, in both classification and regression: the number of trees is smallest when I set base_score=mean(label), while performance (in terms of CV error) doesn't change much. R code is below. Any objections to making mean(label) the default base_score?
# - runs 2 tests for the xgboost R package (one for regression, one for classification)
# - tests how cv error and the optimal number of trees depend on base_score for a simple problem
library(xgboost)
library(MASS)
library(boot)
######################
# REGRESSION EXAMPLE #
######################
#try a few base_scores ranging between +/- base_score_range
base_score_range <- 10
base_score_test_regression <- c(-base_score_range, -base_score_range/2, 0, base_score_range/2, base_score_range)
xg_pars_regression <- list(eta=0.01, max.depth=3)
n_trees_max_regression <- 1000
#run xgboost cross validation for each choice of base_score
NRuns_regression <- 10 #number of runs for each base score
cv_err_regression <- matrix(0, nrow=NRuns_regression, ncol=length(base_score_test_regression))
n_trees_opt_regression <- matrix(0, nrow=NRuns_regression, ncol=length(base_score_test_regression))
for (i in 1:NRuns_regression) {
  #generate data from a simple polynomial function plus noise
  n <- 1000
  x <- mvrnorm(n=n, mu=c(0,0,0), Sigma=diag(3))
  y_regression <- 2 * (x[,1] >= 0.3) + 2 * (x[,2] < 0) + x[,3]^2 + x[,1] * x[,2] + x[,2] * x[,3] + 10 * rnorm(n=nrow(x))
  y_regression <- y_regression - mean(y_regression) #center the labels at 0
  for (j in 1:length(base_score_test_regression)) {
    print(paste0("Run ", i, " of ", NRuns_regression, ", base_score ", j, " of ", length(base_score_test_regression)))
    #set base score
    xg_pars_regression$base_score <- base_score_test_regression[j]
    #run cross validation
    xg_cv_results_regression <- xgb.cv(xg_pars_regression
      , data = xgb.DMatrix(data.matrix(x), label=data.matrix(y_regression), missing=NA)
      , nrounds=n_trees_max_regression
      , nfold=3
      , verbose=FALSE
    )
    #keep track of cv error and the optimal number of trees
    cv_err_regression[i,j] <- min(xg_cv_results_regression$test.rmse.mean)
    n_trees_opt_regression[i,j] <- which.min(xg_cv_results_regression$test.rmse.mean)
  }
}
cv_reg_df <- as.data.frame(cv_err_regression)
colnames(cv_reg_df) <- as.character(base_score_test_regression)
write.csv(colMeans(cv_reg_df), file="cv_reg_df.csv")
ntrees_reg_df <- as.data.frame(n_trees_opt_regression)
colnames(ntrees_reg_df) <- as.character(base_score_test_regression)
write.csv(colMeans(ntrees_reg_df), file="ntrees_reg_df.csv")
##########################
# CLASSIFICATION EXAMPLE #
##########################
base_score_test_classification <- seq(from=0.01, to=0.99, length.out=5)
xg_pars_classification <- list(eta=0.01, max.depth=3, objective="binary:logistic", eval_metric="logloss")
n_trees_max_classification <- 1000
#run xgboost cross validation for each choice of base_score
NRuns_classification <- 10 #number of runs for each base score
cv_err_classification <- matrix(0, nrow=NRuns_classification, ncol=length(base_score_test_classification))
n_trees_opt_classification <- matrix(0, nrow=NRuns_classification, ncol=length(base_score_test_classification))
for (i in 1:NRuns_classification) {
  #generate data from a simple polynomial function plus noise
  n <- 1000
  x <- mvrnorm(n=n, mu=c(0,0,0), Sigma=diag(3))
  y_classification_logit <- 2 * (x[,1] >= 0.3) + 2 * (x[,2] < 0) + x[,3]^2 + x[,1] * x[,2] + x[,2] * x[,3] + 10 * rnorm(n=nrow(x))
  #shift the mean logit down to -20, which makes P(y=1) decidedly different from 0.5
  y_classification_logit <- y_classification_logit - mean(y_classification_logit) - 20
  y_classification <- round(inv.logit(y_classification_logit))
  for (j in 1:length(base_score_test_classification)) {
    print(paste0("Run ", i, " of ", NRuns_classification, ", base_score ", j, " of ", length(base_score_test_classification)))
    #set base score
    xg_pars_classification$base_score <- base_score_test_classification[j]
    #run cross validation
    xg_cv_results_classification <- xgb.cv(xg_pars_classification
      , data = xgb.DMatrix(data.matrix(x), label=data.matrix(y_classification), missing=NA)
      , nrounds=n_trees_max_classification
      , nfold=5
      , verbose=FALSE
      , nthread=1)
    #keep track of cv error and the optimal number of trees
    cv_err_classification[i,j] <- min(xg_cv_results_classification$test.logloss.mean)
    n_trees_opt_classification[i,j] <- which.min(xg_cv_results_classification$test.logloss.mean)
  }
}
cv_class_df <- as.data.frame(cv_err_classification)
colnames(cv_class_df) <- as.character(base_score_test_classification)
write.csv(colMeans(cv_class_df), file="cv_class_df.csv")
ntrees_class_df <- as.data.frame(n_trees_opt_classification)
colnames(ntrees_class_df) <- as.character(base_score_test_classification)
write.csv(colMeans(ntrees_class_df), file="ntrees_class_df.csv")
Yeah, I agree.
It makes sense: when base_score is set to 0.5 and we have a regression problem where np.average(label) is very high and eta is very small (I used eta = 0.0015 and num_tree = 5000), the first hundreds of trees just try to catch this big mean. With eta = 0.0015 it takes a long time to solve that problem alone, whereas with base_score set to mean(label) it is solved right away at the beginning, and the next trees can work on the original task.
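A rough back-of-the-envelope for why this happens (assuming squared-error loss, and that each round can do no better than shrink a constant residual by a factor of 1 - eta):
# the gap between base_score and mean(label) decays roughly like m * (1 - eta)^k
eta <- 0.0015
m   <- 100                           # hypothetical gap between 0.5 and mean(label)
k   <- log(0.01 / m) / log(1 - eta)  # rounds needed to shrink the gap to 0.01
round(k)                             # ~6100 rounds spent just absorbing the mean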
+1
And with multiclass the issue is even more painful, see #1380.
Starting with a prior estimate of each class's probability seems reasonable.
+1
If you think of all the computation time (and electric power) wasted by people not adjusting their base_score manually, this is an important (ecological) issue!
Think of all the marketing scores, credit scoring, or insurance pricing models where the final prediction distributions are centered around values below 10%.
I think Kaggle could sponsor the fix, considering how much money they would save with all those people running public scripts with xgboost and its default params ^^
Hi all,
I am new to xgboost and am interested in an automatic choice for the base_score parameter.
The choice I would make (as some have mentioned above) is based on class prior probabilities (label frequencies) in the multi-class case and on the signal mean for regression (equivalent to centering).
These choices are consistent with each other: a regression on a discrete 0-1 target has the frequency of 1s as its target mean, and produces the same model as a classification on the same dataset, which is an interesting invariance of the algorithm.
This would remove the need for the end user to set the parameter at all (it becomes an internal choice); a sketch of what it could compute is below.
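A sketch of what such an internal default could compute (hypothetical helper, not part of xgboost; note that xgboost's base_score is currently a single scalar, which is part of the multiclass pain in #1380):
# hypothetical helper illustrating the proposed defaults
default_base_score <- function(label, objective) {
  if (objective == "binary:logistic") {
    mean(label)                               # observed frequency of class 1
  } else if (objective == "multi:softprob") {
    as.numeric(table(label) / length(label))  # per-class priors (would need vector support)
  } else {
    mean(label)                               # regression: center on the signal mean
  }
}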
Thanks
Antoine
It seems to me that setting base_score to mean(label) improves results considerably for regression with 'reg:gamma' as the objective; see https://stats.stackexchange.com/a/365803/219262 for an example.
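Along the same lines as the linked answer, a minimal sketch (made-up gamma data; only base_score differs between the two runs):
# gamma response whose mean is far from the 0.5 default
y <- rgamma(1000, shape = 2, rate = 0.01)   # mean around 200
x <- matrix(rnorm(1000 * 4), ncol = 4)
pars <- list(objective = "reg:gamma", eta = 0.1)

pars$base_score <- 0.5                      # default: early rounds climb toward the mean
cv_default <- xgb.cv(pars, data = xgb.DMatrix(x, label = y),
                     nrounds = 200, nfold = 3, verbose = FALSE)

pars$base_score <- mean(y)                  # start at the response mean
cv_mean <- xgb.cv(pars, data = xgb.DMatrix(x, label = y),
                  nrounds = 200, nfold = 3, verbose = FALSE)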