Xgboost: Change default eval_metric for binary:logistic objective + add warning for missing eval_metric when early stopping is enabled

Created on 28 Aug 2020 · 17 comments · Source: dmlc/xgboost

I stumbled over the default metric of the binary:logistic objective. It seems to be 1 - accuracy (the classification error), which is a rather unfortunate choice. Accuracy is not even a proper scoring rule; see e.g. Wikipedia.

In my view, it should be "logloss", which is a strictly proper scoring rule for estimating the expectation of a binary outcome.
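
For reference (a standard definition, not part of the original post): for a label y in {0, 1} and a predicted probability p, the log loss is -(y * log(p) + (1 - y) * log(1 - p)); its expected value is minimized only by the true probability, which is what makes it strictly proper, whereas accuracy depends only on which side of the classification threshold p falls on.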

-- Edit

The problem occurs when early stopping is used without manually setting the eval_metric. The default evaluation metric should at least be a strictly consistent scoring rule.

I am using R with XGBoost version 1.1.1.1.

Labels: good first issue, hacktoberfest

All 17 comments

The log loss is actually what's being optimized internally, since the accuracy metric is not differentiable and cannot be directly optimized. The accuracy metric is only used to monitor the performance of the model and potentially perform early stopping.
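
As a side note, the metric used for monitoring and early stopping can be requested explicitly through the params list, independently of the objective. A minimal sketch (my own, not from the thread); if several metrics are listed, the last one should be the one used for early stopping, per my reading of the xgb.train documentation:

```r
library(xgboost)

# Hypothetical params: keep the current default (error) for monitoring, but let
# early stopping act on logloss by listing it last. The R interface accepts
# repeated eval_metric entries in the params list.
params <- list(objective = "binary:logistic",
               eval_metric = "error",    # 1 - accuracy, the current default
               eval_metric = "logloss")  # what the booster actually optimizes
```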

See edit (sorry)

Can you clarify more? What goes wrong if you perform early stopping with the accuracy metric?

I'm hesitant about changing the default value, since this is going to be a breaking change (i.e. it changes behavior of existing code). Our policy is that all breaking changes should have a very good reason.

I perfectly agree that changing this default is potentially "breaking". Still, it might be worth considering it.

Here is a famous example: the Titanic data.

```r
library(titanic)
library(xgboost)

head(titanic_train)

y <- "Survived"
x <- c("Pclass", "Sex", "SibSp", "Parch", "Fare")

dtrain <- xgb.DMatrix(data.matrix(titanic_train[, x]), label = titanic_train[[y]])

params <- list(objective = "binary:logistic",
               eval_metric = "logloss",
               learning_rate = 0.1)

fit <- xgb.cv(params = params,
              data = dtrain,
              nrounds = 1000, # selected by early stopping
              nfold = 5,
              verbose = FALSE,
              stratified = TRUE,
              early_stopping_rounds = 1)

fit
```

There is some training; we stop after 25 rounds.

With the default metric, there is no training and the algorithm stops after the first round.
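
For comparison, a sketch of the same cross-validation call relying on the default metric, i.e. simply omitting eval_metric; it reuses dtrain from the setup above, and params_default is my own name:

```r
# Same call as above, but without eval_metric, so the default (error) drives
# early stopping.
params_default <- list(objective = "binary:logistic",
                       learning_rate = 0.1)

fit_default <- xgb.cv(params = params_default,
                      data = dtrain,
                      nrounds = 1000,
                      nfold = 5,
                      verbose = FALSE,
                      stratified = TRUE,
                      early_stopping_rounds = 1)

fit_default
```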

Do you think the binary logistic case is the only one where the default metric is inconsistent with the objective?

I understand that changing a default value is better done hesitantly and well thought through. I think in this case, stopping early due to accuracy but really optimizing log-loss is not very consistent. On top of that, I consider log-loss a better metric in general compared to accuracy.

Changing a default would not break code: the code still executes, it just potentially delivers different results, and in this case only if early stopping applies. Are those results then "better" or "worse"? I'm just trying to justify such a change of the default value. WDYT?

@jameslamb Do you have any opinion on this? Should we change the default evaluation metric to logloss?

I've been thinking through this. I think it is ok to change the default to logloss in the next minor release (1.3.x).

I think it is ok for the same training code given the same data to produce a different model between minor releases (1.2.x to 1.3.x).

That won't cause anyone's code to raise an exception, won't have any effect on loading previously-trained models from older versions, and any retraining code should be looking at the performance of a new model based on a validation set and a fixed metric anyway.

As long as the changelog in the release makes it clear that that default was changed and that it only affects the case where you are using early stopping, I don't think it'll cause problems.

@jameslamb Thanks for your thoughtful reply. Indeed, the change will only affect the newly trained models. My only concern now is that some users may want to re-run their existing code for reproducibility purposes and would find their code to behave differently. If we were to change the default, how should we make the transition as painless as possible?

Also, does LightGBM use logloss for the L2 regression objective?

Thanks for the discussion. LGB seems to use logloss for the binary objective:

```r
library(lightgbm)
library(ggplot2)

X <- data.matrix(diamonds[, c("carat")])
y <- diamonds$price > 1000
dtrain <- lgb.Dataset(X, label = y)

fit <- lgb.cv(list(objective = "binary"),
              data = dtrain,
              nrounds = 100,
              nfold = 5)
```

This gives cross-validation output reporting binary_logloss as the evaluation metric (screenshot from the original thread omitted).

They also use (multi-class) log loss for multi-class classification. What does XGBoost do in that case?

XGBoost uses merror by default, i.e. the multi-class classification error rate.
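
For illustration, a small sketch (hypothetical, not from the thread) of overriding that default with mlogloss for a multi-class objective, using the built-in iris data:

```r
library(xgboost)

X <- data.matrix(iris[, 1:4])
y <- as.integer(iris$Species) - 1L  # class labels must be 0-based

dtrain_multi <- xgb.DMatrix(X, label = y)

params_multi <- list(objective = "multi:softprob",
                     num_class = 3,
                     eval_metric = "mlogloss")  # instead of the default merror

fit_multi <- xgb.cv(params = params_multi,
                    data = dtrain_multi,
                    nrounds = 100,
                    nfold = 5,
                    verbose = FALSE,
                    early_stopping_rounds = 5)
```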

@mayer79 How common do you think it is to use early stopping without explicitly specifying the evaluation metric?

> If we were to change the default, how should we make the transition as painless as possible?

I think you can use missing() to check if eval_metric was not passed, and do something like this:

```r
if (missing(eval_metric)) {
  warning("Using early stopping without specifying an eval_metric. ",
          "In XGBoost 1.3.0, the default metric used for early stopping ",
          "was changed from 'error' (1 - accuracy) to 'logloss'. ",
          "To suppress this warning, explicitly provide an eval_metric.")
}
```
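
One caveat (my note, not from the thread): in the R interface, eval_metric is usually passed through the params list rather than as a formal argument, so missing() would not apply directly. A rough sketch of an equivalent check inside the training wrapper, with the exact wording and placement left open:

```r
# Hypothetical check inside xgb.train()/xgb.cv(): warn when early stopping is
# requested but no explicit evaluation metric was supplied in params.
if (!is.null(early_stopping_rounds) && is.null(params[["eval_metric"]])) {
  warning("Using early stopping without an explicit 'eval_metric'. ",
          "The default evaluation metric for this objective is scheduled to ",
          "change from 'error' (1 - accuracy) to 'logloss'; set 'eval_metric' ",
          "explicitly to keep the old behaviour and silence this warning.")
}
```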

> does LightGBM use logloss for the L2 regression objective?

In LightGBM, if you use objective = "regression" and don't provide a metric, L2 is used both as the objective and as the evaluation metric for early stopping.

For example, with {lightgbm} 3.0.0 in R, you can test this with something like the following:

```r
library(lightgbm)

data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test

bst <- lgb.train(
    params = list(
        objective = "regression"
        , early_stopping_rounds = 5L
    )
    , data = lgb.Dataset(train$data, label = train$label)
    , valids = list(
        "valid1" = lgb.Dataset(
            test$data
            , label = test$label
        )
    )
    , nrounds = 10L
)
```

(Screenshot of the training log omitted; it shows l2 reported on valid1 as the early-stopping metric.)

@jameslamb Nice. Yes, let's throw a warning for a missing eval_metric when early stopping is used. With the warning, the case I mentioned (reproducibility) is also covered, and we can change the default metric.

@mayer79 @lorentzenchr Thanks to the recent discussion, I changed my mind. Let us change the default metric, with clear documentation as well as a run-time warning.

@hcho3: Hard to say. I prefer to use the default because it makes the code more generic. Should we also consider switching to multi-class logloss for multi-class classification? I like the idea of the run-time warning very much.

@mayer79 Yes, let's change the default for multiclass classification as well.

To new contributors: If you're reading this and interested in contributing this feature, please comment here. Feel free to ping me with questions.

