The example below reports an incorrect validation logloss metric when setting `stratified = FALSE`. If I set `stratified = TRUE`, the metric is calculated as expected. While debugging, I found this appears to be related to the stratified fold indices being sorted while the unstratified fold indices are not. Does slicing a `lgb.Dataset` require indices to be sorted? If I edit the `generate.cv.folds` function (https://github.com/microsoft/LightGBM/blob/master/R-package/R/lgb.cv.R#L362) to sort the unstratified fold indices, the metrics are also reported correctly.
Using v2.3.0:
```r
library(lightgbm)

# Simulate data
set.seed(1)
n <- 2000
m <- 10000
data <- data.frame(
  x = runif(m)
  , x2 = c(rep(1, n), rep(0, m - n))
)
dtrain_label <- as.integer(data$x + data$x2^2 + runif(m) > 1)
data_matrix <- as.matrix(data)
dtrain <- lgb.Dataset(data_matrix, label = dtrain_label)

nfold <- 3
model <- lgb.cv(
  params = list(objective = "binary", metric = "binary_logloss")
  , data = dtrain
  , nrounds = 1
  , nfold = nfold
  , stratified = FALSE # validation metric is correct if this is set to TRUE
  , verbose = -1
)

# Reported logloss = 0.6451437
model$record_evals$valid$binary_logloss$eval

# Manually calculated logloss = 0.6286639
compute_logloss <- function(predicted, actual) {
  -sum(actual * log(predicted) + (1 - actual) * log(1 - predicted)) / length(actual)
}
mean(sapply(seq_len(nfold), function(i) {
  booster <- model$boosters[[i]]$booster
  fold_indices <- booster$.__enclos_env__$private$valid_sets[[1]]$.__enclos_env__$private$used_indices
  fold_data <- data_matrix[fold_indices, ]
  fold_label <- dtrain_label[fold_indices]
  fold_preds <- predict(booster, fold_data, rawscore = FALSE)
  compute_logloss(fold_preds, fold_label)
}))
```
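For intuition, here is a minimal NumPy sketch (not LightGBM code, and the names are my own) of why unsorted fold indices could corrupt the metric: if the subset implementation silently stores rows in sorted index order while the labels are kept in the caller's original order, each prediction ends up scored against the wrong label.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
labels = rng.integers(0, 2, n).astype(float)
preds = np.clip(rng.random(n), 1e-6, 1 - 1e-6)

def logloss(p, y):
    # standard binary logloss
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

fold = rng.permutation(n)[:30]  # unsorted validation indices

# Correct pairing: predictions and labels taken in the same order.
correct = logloss(preds[fold], labels[fold])

# If one side is silently re-ordered (rows pulled in sorted index
# order) while the other keeps the original order, the pairing breaks:
mismatched = logloss(preds[np.sort(fold)], labels[fold])

# Re-ordering BOTH sides consistently leaves the metric unchanged:
assert np.isclose(logloss(preds[np.sort(fold)], labels[np.sort(fold)]), correct)
```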
@rgranvil yes, the subset requires sorted indices.
@StrikerRUS could you check this on the Python side? It may have the same problem.
@guolinke Yes, indeed, indices are not sorted:
```python
import lightgbm as lgb
from sklearn.datasets import load_iris

X, y = load_iris(True)
lgb_data = lgb.Dataset(X, y)
lgb.cv({'objective': 'binary'}, lgb_data, num_boost_round=1, nfold=3, stratified=False)
```

```text
train_idx: [ 69 135  56  80 123 133 106 146  50 147  85  30 101  94  64  89  91 125
  48  13 111  95  20  15  52   3 149  98   6  68 109  96  12 102 120 104
 128  46  11 110 124  41 148   1 113 139  42   4 129  17  38   5  53 143
 105   0  34  28  55  75  35  23  74  31 118  57 131  65  32 138  14 122
  19  29 130  49 136  99  82  79 115 145  72  77  25  81 140 142  39  58
  88  70  87  36  21   9 103  67 117  47]
test_idx: [114  62  33 107   7 100  40  86  76  71 134  51  73  54  63  37  78  90
  45  16 121  66  24   8 126  22  44  97  93  26 137  84  27 127 132  59
  18  83  61  92 112   2 141  43  10  60 116 144 119 108]
train_idx: [114  62  33 107   7 100  40  86  76  71 134  51  73  54  63  37  78  90
  45  16 121  66  24   8 126  22  44  97  93  26 137  84  27 127 132  59
  18  83  61  92 112   2 141  43  10  60 116 144 119 108  38   5  53 143
 105   0  34  28  55  75  35  23  74  31 118  57 131  65  32 138  14 122
  19  29 130  49 136  99  82  79 115 145  72  77  25  81 140 142  39  58
  88  70  87  36  21   9 103  67 117  47]
test_idx: [ 69 135  56  80 123 133 106 146  50 147  85  30 101  94  64  89  91 125
  48  13 111  95  20  15  52   3 149  98   6  68 109  96  12 102 120 104
 128  46  11 110 124  41 148   1 113 139  42   4 129  17]
train_idx: [114  62  33 107   7 100  40  86  76  71 134  51  73  54  63  37  78  90
  45  16 121  66  24   8 126  22  44  97  93  26 137  84  27 127 132  59
  18  83  61  92 112   2 141  43  10  60 116 144 119 108  69 135  56  80
 123 133 106 146  50 147  85  30 101  94  64  89  91 125  48  13 111  95
  20  15  52   3 149  98   6  68 109  96  12 102 120 104 128  46  11 110
 124  41 148   1 113 139  42   4 129  17]
test_idx: [ 38   5  53 143 105   0  34  28  55  75  35  23  74  31 118  57 131  65
  32 138  14 122  19  29 130  49 136  99  82  79 115 145  72  77  25  81
 140 142  39  58  88  70  87  36  21   9 103  67 117  47]
```
I guess sorting here should be enough:
https://github.com/microsoft/LightGBM/blob/b1c50d07cee4f6042a5d50c2c92ca81b321b000b/python-package/lightgbm/engine.py#L342-L343
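A hedged sketch of that change — the helper name `make_folds` is hypothetical, but the shuffle-and-chunk logic mirrors the unstratified split in `engine.py`, with `np.sort` applied to each chunk so that the Dataset subset receives sorted indices:

```python
import numpy as np

def make_folds(nrows, nfold, seed=0):
    """Hypothetical illustration of the proposed fix: build unstratified
    CV folds by shuffling row ids and cutting them into nfold chunks,
    then sort each chunk before it is used to subset the Dataset."""
    rng = np.random.default_rng(seed)
    randidx = rng.permutation(nrows)
    kstep = nrows // nfold
    folds = []
    for i in range(nfold):
        test_id = np.sort(randidx[i * kstep:(i + 1) * kstep])  # sorted, per the fix
        train_id = np.setdiff1d(randidx, test_id)              # complement; setdiff1d also sorts
        folds.append((train_id, test_id))
    return folds
```

With sorted indices, the subset sees rows in the same order the caller uses for the labels, so the prediction/label pairing in the metric stays consistent.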
@jameslamb @Laurae2 Can you please fix it ASAP? Refer to https://github.com/microsoft/LightGBM/pull/2510#issuecomment-544485006.
Thanks @rgranvil for reporting. Apologies for the delayed response, as I just returned from a few weeks of travel.
@StrikerRUS thanks for the ping. I can take a look ASAP.
Opened #2524 to hopefully address this. @rgranvil running your example code on that branch, I get the same metric from both approaches.
Thanks for providing such a clean reproducible example we could use!