The example below reports an incorrect validation logloss metric when setting `stratified = FALSE`. If I set `stratified = TRUE`, the metric is calculated as expected. While debugging, I found this appears to be related to the stratified fold indices being sorted while the unstratified fold indices are not. Does slicing a `lgb.Dataset` require indices to be sorted? If I edit the `generate.cv.folds` function (https://github.com/microsoft/LightGBM/blob/master/R-package/R/lgb.cv.R#L362) to sort the unstratified fold indices, the metrics are also reported correctly.
Using v2.3.0:
```r
library(lightgbm)

# Simulate data
set.seed(1)
n <- 2000
m <- 10000
data <- data.frame(
  x = runif(m)
  , x2 = c(rep(1, n), rep(0, m - n))
)
dtrain_label <- as.integer(data$x + data$x2^2 + runif(m) > 1)
data_matrix <- as.matrix(data)
dtrain <- lgb.Dataset(data_matrix, label = dtrain_label)

nfold <- 3
model <- lgb.cv(
  params = list(objective = "binary", metric = "binary_logloss")
  , data = dtrain
  , nrounds = 1
  , nfold = nfold
  , stratified = FALSE # validation metric is correct if this is set to TRUE
  , verbose = -1
)

# Reported logloss = 0.6451437
model$record_evals$valid$binary_logloss$eval

# Manually calculated logloss = 0.6286639
compute_logloss <- function(predicted, actual) {
  -sum(actual * log(predicted) + (1 - actual) * log(1 - predicted)) / length(actual)
}
mean(sapply(seq_len(nfold), function(i) {
  booster <- model$boosters[[i]]$booster
  fold_indices <- booster$.__enclos_env__$private$valid_sets[[1]]$.__enclos_env__$private$used_indices
  fold_data <- data_matrix[fold_indices, ]
  fold_label <- dtrain_label[fold_indices]
  fold_preds <- predict(booster, fold_data, rawscore = FALSE)
  compute_logloss(fold_preds, fold_label)
}))
```
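For intuition, here is a minimal NumPy sketch (not LightGBM code, and the names are my own) of why unsorted fold indices could corrupt the metric: if the subset implementation silently stores rows in sorted index order while the labels are kept in the caller's original order, each prediction ends up scored against the wrong label.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
labels = rng.integers(0, 2, n).astype(float)
preds = np.clip(rng.random(n), 1e-6, 1 - 1e-6)

def logloss(p, y):
    # standard binary logloss
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

fold = rng.permutation(n)[:30]  # unsorted validation indices

# Correct pairing: predictions and labels taken in the same order.
correct = logloss(preds[fold], labels[fold])

# If one side is silently re-ordered (rows pulled in sorted index
# order) while the other keeps the original order, the pairing breaks:
mismatched = logloss(preds[np.sort(fold)], labels[fold])

# Re-ordering BOTH sides consistently leaves the metric unchanged:
assert np.isclose(logloss(preds[np.sort(fold)], labels[np.sort(fold)]), correct)
```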
@rgranvil yes, the subset requires sorted indices.
@StrikerRUS could you check this on the Python side? It may have the same problem.
@guolinke Yes, indeed, indices are not sorted:
```python
import lightgbm as lgb
from sklearn.datasets import load_iris

X, y = load_iris(True)
lgb_data = lgb.Dataset(X, y)
lgb.cv({'objective': 'binary'}, lgb_data, num_boost_round=1, nfold=3, stratified=False)
```

```text
train_idx: [ 69 135  56  80 123 133 106 146  50 147  85  30 101  94  64  89  91 125
  48  13 111  95  20  15  52   3 149  98   6  68 109  96  12 102 120 104
 128  46  11 110 124  41 148   1 113 139  42   4 129  17  38   5  53 143
 105   0  34  28  55  75  35  23  74  31 118  57 131  65  32 138  14 122
  19  29 130  49 136  99  82  79 115 145  72  77  25  81 140 142  39  58
  88  70  87  36  21   9 103  67 117  47]
test_idx: [114  62  33 107   7 100  40  86  76  71 134  51  73  54  63  37  78  90
  45  16 121  66  24   8 126  22  44  97  93  26 137  84  27 127 132  59
  18  83  61  92 112   2 141  43  10  60 116 144 119 108]
train_idx: [114  62  33 107   7 100  40  86  76  71 134  51  73  54  63  37  78  90
  45  16 121  66  24   8 126  22  44  97  93  26 137  84  27 127 132  59
  18  83  61  92 112   2 141  43  10  60 116 144 119 108  38   5  53 143
 105   0  34  28  55  75  35  23  74  31 118  57 131  65  32 138  14 122
  19  29 130  49 136  99  82  79 115 145  72  77  25  81 140 142  39  58
  88  70  87  36  21   9 103  67 117  47]
test_idx: [ 69 135  56  80 123 133 106 146  50 147  85  30 101  94  64  89  91 125
  48  13 111  95  20  15  52   3 149  98   6  68 109  96  12 102 120 104
 128  46  11 110 124  41 148   1 113 139  42   4 129  17]
train_idx: [114  62  33 107   7 100  40  86  76  71 134  51  73  54  63  37  78  90
  45  16 121  66  24   8 126  22  44  97  93  26 137  84  27 127 132  59
  18  83  61  92 112   2 141  43  10  60 116 144 119 108  69 135  56  80
 123 133 106 146  50 147  85  30 101  94  64  89  91 125  48  13 111  95
  20  15  52   3 149  98   6  68 109  96  12 102 120 104 128  46  11 110
 124  41 148   1 113 139  42   4 129  17]
test_idx: [ 38   5  53 143 105   0  34  28  55  75  35  23  74  31 118  57 131  65
  32 138  14 122  19  29 130  49 136  99  82  79 115 145  72  77  25  81
 140 142  39  58  88  70  87  36  21   9 103  67 117  47]
```
I guess sorting here should be enough:
https://github.com/microsoft/LightGBM/blob/b1c50d07cee4f6042a5d50c2c92ca81b321b000b/python-package/lightgbm/engine.py#L342-L343
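A hedged sketch of that change — the helper name `make_folds` is hypothetical, but the shuffle-and-chunk logic mirrors the unstratified split in `engine.py`, with `np.sort` applied to each chunk so that the Dataset subset receives sorted indices:

```python
import numpy as np

def make_folds(nrows, nfold, seed=0):
    """Hypothetical illustration of the proposed fix: build unstratified
    CV folds by shuffling row ids and cutting them into nfold chunks,
    then sort each chunk before it is used to subset the Dataset."""
    rng = np.random.default_rng(seed)
    randidx = rng.permutation(nrows)
    kstep = nrows // nfold
    folds = []
    for i in range(nfold):
        test_id = np.sort(randidx[i * kstep:(i + 1) * kstep])  # sorted, per the fix
        train_id = np.setdiff1d(randidx, test_id)              # complement; setdiff1d also sorts
        folds.append((train_id, test_id))
    return folds
```

With sorted indices, the subset sees rows in the same order the caller uses for the labels, so the prediction/label pairing in the metric stays consistent.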
@jameslamb @Laurae2 Can you please fix it ASAP? Refer to https://github.com/microsoft/LightGBM/pull/2510#issuecomment-544485006.
Thanks @rgranvil for reporting. Apologies for the delayed response, as I just returned from a few weeks of travel.
@StrikerRUS thanks for the ping. I can take a look ASAP.
Opened #2524 to hopefully address this. @rgranvil running your example code on that branch, I get the same metric from both approaches.
Thanks for providing such a clean reproducible example we could use!