I stumbled over the default metric of the `binary:logistic` objective. It seems to be 1-accuracy, which is a rather unfortunate choice. Accuracy is not even a proper scoring rule; see e.g. Wikipedia's article on scoring rules.
In my view, the default should be "logloss", which is a strictly proper scoring rule for estimating the expectation under the binary objective.
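To see the difference in one toy computation, here is a minimal sketch (hypothetical numbers, not part of the original argument): with a true event probability of p = 0.7, the expected log loss is uniquely minimized by forecasting q = p, while the expected error (1 - accuracy) is identical for every forecast on the correct side of 0.5.

```{r}
p <- 0.7                         # true event probability (assumed for illustration)
q <- seq(0.01, 0.99, by = 0.01)  # candidate probability forecasts

# Expected log loss of forecast q: -(p*log(q) + (1-p)*log(1-q))
expected_logloss <- -(p * log(q) + (1 - p) * log(1 - q))
q[which.min(expected_logloss)]   # 0.7: log loss recovers the true probability

# Expected error with a 0.5 threshold: constant on each side of the threshold
expected_error <- ifelse(q > 0.5, 1 - p, p)
unique(expected_error[q > 0.5])  # 0.3 for every q > 0.5: error cannot discriminate
```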
-- Edit
The problem occurs with early stopping without manually setting the `eval_metric`. The default evaluation metric should at least be a strictly consistent scoring rule.
I am using R with XGBoost version 1.1.1.1.
The log loss is actually what's being optimized internally, since the accuracy metric is not differentiable and cannot be directly optimized. The accuracy metric is only used to monitor the performance of the model and potentially perform early stopping.
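For illustration, the smoothness argument can be made explicit with the usual custom-objective sketch (this mirrors the well-known xgboost custom objective demo, not the internal C++ code): log loss has a well-defined gradient and Hessian with respect to the raw margin, which is exactly what boosting needs.

```{r}
# Gradient and Hessian of the log loss w.r.t. the raw margin,
# in the signature expected by xgboost's custom objective interface.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  p <- 1 / (1 + exp(-preds))  # margin -> probability
  grad <- p - labels          # first derivative of log loss
  hess <- p * (1 - p)         # second derivative of log loss
  list(grad = grad, hess = hess)
}
# A 0/1 error has zero gradient almost everywhere, so there is nothing to
# boost on; it can only be evaluated after the fact, e.g. for early stopping.
```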
See edit (sorry)
Can you clarify more? What goes wrong if you perform early stopping with the accuracy metric?
I'm hesitant about changing the default value, since this is going to be a breaking change (i.e. it changes behavior of existing code). Our policy is that all breaking changes should have a very good reason.
I perfectly agree that changing this default is potentially "breaking". Still, it might be worth considering it.
Here is a famous example: the Titanic data set.
```{r}
library(titanic)
library(xgboost)

head(titanic_train)

y <- "Survived"
x <- c("Pclass", "Sex", "SibSp", "Parch", "Fare")
dtrain <- xgb.DMatrix(data.matrix(titanic_train[, x]), label = titanic_train[[y]])

params <- list(objective = "binary:logistic",
               eval_metric = "logloss",
               learning_rate = 0.1)

fit <- xgb.cv(params = params,
              data = dtrain,
              nrounds = 1000,  # upper bound; actual value selected by early stopping
              nfold = 5,
              verbose = FALSE,
              stratified = TRUE,
              early_stopping_rounds = 1)
fit
```
With `eval_metric = "logloss"`, there is some training and we stop after 25 rounds. With the default metric, there is no training at all and the algorithm stops after the first round.
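For reference, the comparison run described above looks like this (hypothetical names `params_default` and `fit_default`; same setup as before, only `eval_metric` is left unset):

```{r}
params_default <- list(objective = "binary:logistic",
                       learning_rate = 0.1)  # no eval_metric -> default error metric

fit_default <- xgb.cv(params = params_default,
                      data = dtrain,
                      nrounds = 1000,
                      nfold = 5,
                      verbose = FALSE,
                      stratified = TRUE,
                      early_stopping_rounds = 1)
fit_default  # early stopping triggers right after the first round
```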
Do you think the binary logistic case is the only one where the default metric is inconsistent with the objective?
I understand that changing a default value is better done hesitantly and well thought through. I think that in this case, stopping early based on accuracy while really optimizing log loss is not very consistent. On top of that, I consider log loss a better metric than accuracy in general.
Changing a default would not break code in the strict sense: it still executes, it just potentially delivers different results, and here only if early stopping applies. Are those results then "better" or "worse"? I'm just trying to justify such a change of the default value. WDYT?
@jameslamb Do you have any opinion on this? Should we change the default evaluation metric to logloss?
I've been thinking through this. I think it is ok to change the default to `logloss` in the next minor release (1.3.x).
I think it is ok for the same training code given the same data to produce a different model between minor releases (1.2.x to 1.3.x).
That won't cause anyone's code to raise an exception, won't have any effect on loading previously-trained models from older versions, and any retraining code should be looking at the performance of a new model based on a validation set and a fixed metric anyway.
As long as the changelog in the release makes it clear that this default was changed and that it only affects the case where you are using early stopping, I don't think it'll cause problems.
@jameslamb Thanks for your thoughtful reply. Indeed, the change will only affect newly trained models. My only concern now is that some users may want to re-run their existing code for reproducibility purposes and would find that their code behaves differently. If we were to change the default, how should we make the transition as painless as possible?
Also, does LightGBM use logloss for L2 regression objective?
Thanks for the discussion. LGB seems to use logloss for binary objective:
```{r}
library(lightgbm)
library(ggplot2)  # for the diamonds data

X <- data.matrix(diamonds[, c("carat")])
y <- as.numeric(diamonds$price > 1000)  # 0/1 label instead of logical

dtrain <- lgb.Dataset(X, label = y)

fit <- lgb.cv(list(objective = "binary"),
              data = dtrain,
              nrounds = 100,
              nfold = 5)
```
This reports `binary_logloss` during cross-validation. They also use (multi) log loss for multi-class classification. What does XGBoost use in that case?
XGBoost uses `merror` by default, which is the classification error metric for multi-class classification.
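As a minimal sketch (using iris, which is not part of this thread) of how to opt into multiclass log loss today, you can override the default explicitly:

```{r}
library(xgboost)

X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species) - 1L  # xgboost expects 0-based class labels

params <- list(objective = "multi:softprob",
               num_class = 3,
               eval_metric = "mlogloss")  # override the merror default

fit <- xgb.cv(params = params,
              data = xgb.DMatrix(X, label = y),
              nrounds = 100,
              nfold = 5,
              verbose = FALSE,
              early_stopping_rounds = 5)
```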
@mayer79 How common do you think it is to use early stopping without explicitly specifying the evaluation metric?
> If we were to change the default, how should we make the transition as painless as possible?

I think you can use `missing()` to check if `eval_metric` was not passed, and do something like this:
```{r}
if (missing(eval_metric)) {
  # warn, but keep running, so existing code is not broken
  print("Using early stopping without specifying an eval metric. In XGBoost 1.3.0, the default metric used for early stopping was changed from 'error' to 'logloss'. To suppress this warning, explicitly provide an eval_metric")
}
```
> does LightGBM use logloss for L2 regression objective?

In LightGBM, if you use `objective = "regression"` and don't provide a `metric`, L2 is used both as the objective and as the evaluation metric for early stopping. With {lightgbm} 3.0.0 in R, you can test this with something like the following:
```{r}
library(lightgbm)

data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test

bst <- lgb.train(
  params = list(
    objective = "regression"
    , early_stopping_rounds = 5L
  )
  , data = lgb.Dataset(train$data, label = train$label)
  , valids = list(
    "valid1" = lgb.Dataset(
      test$data
      , label = test$label
    )
  )
  , nrounds = 10L
)
```
@jameslamb Nice. Yes, let's throw a warning for a missing `eval_metric` when early stopping is used. With the warning, the case I mentioned (reproducibility) is also covered, and we can change the default metric.
@mayer79 @lorentzenchr Thanks to the recent discussion, I changed my mind. Let us change the default metric with a clear documentation as well as a run-time warning.
@hcho3: Hard to say. I prefer to use the default because it makes the code more generic. Should we also consider switching to multi-logloss for multiclass classification? I like the idea with the run-time warning very much.
@mayer79 Yes, let's change the default for multiclass classification as well.
To new contributors: If you're reading this and interested in contributing this feature, please comment here. Feel free to ping me with questions.