LightGBM: Add option to keep CV predicted values

Created on 5 Feb 2017 · 15 comments · Source: microsoft/LightGBM

Forgive me if I missed something, but I have reviewed the code and documentation and didn't see a way to keep the CV probabilities.

feature request help wanted

Most helpful comment

lgb.cv would indeed be much more useful if it returned the final predictions. That would, for example, allow stacking.

All 15 comments

@JackStat
I think a simple solution is saving all CV models, then getting the predictions from those models.
Contributions for this are welcome; I think it is easy to implement.

Absolutely. I looked through the code and thought that would be a good strategy as well, but I could not find an object that holds the holdout data.frame.

So it looks like this chunk creates the three boosters (assuming 3-fold CV):

```

# construct booster

bst_folds <- lapply(seq_along(folds), function(k) {
dtest <- slice(data, folds[[k]])
dtrain <- slice(data, unlist(folds[-k]))
booster <- Booster$new(params, dtrain)
booster$add_valid(dtest, "valid")
list(booster = booster)
})
```

Then you can run something with lapply to get predictions from each booster using bst_folds[[1]]$booster$predict. Now I just need to know where the CV data.frames are kept so I can apply the predictions to those. I dug into the objects and couldn't see them.

Any help would be appreciated, and I will open the pull request.
Thanks

@guolinke I looked through the code and found that predicting from an lgb.Dataset isn't supported yet. Could you support that when you get time? Otherwise we cannot use all CV models to predict on each fold.

Below is a simple function that generates CV predictions from the original dataset. @JackStat, you can use it for your problem, though I think you may have already figured it out yourself.

```
LGB_CV_Predict <- function(lgb_cv, data, num_iteration = NULL, folds) {
  # %do% is an infix operator exported by foreach, so attach the package
  library(foreach)
  if (is.null(num_iteration)) {
    num_iteration <- lgb_cv$best_iter
  }
  # predict each held-out fold with the booster trained on the other folds
  cv_pred_mat <- foreach(i = seq_along(lgb_cv$boosters), .combine = "rbind") %do% {
    lgb_tree <- lgb_cv$boosters[[i]][[1]]
    predict(lgb_tree,
            data[folds[[i]], ],
            num_iteration = num_iteration,
            rawscore = FALSE, predleaf = FALSE, header = FALSE, reshape = TRUE)
  }
  # restore the original row order
  if (ncol(cv_pred_mat) == 1) {
    as.double(cv_pred_mat)[order(unlist(folds))]
  } else {
    cv_pred_mat[order(unlist(folds)), , drop = FALSE]
  }
}
```
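For readers on the Python side, the reassembly step the R function performs (collect per-fold predictions, then restore them to the original row order) can be sketched in plain Python. `predict_fold` here is a stand-in callable for illustration, not a LightGBM API:

```python
# Sketch of the out-of-fold reassembly idea in plain Python.
# `predict_fold` stands in for a real per-fold booster's predict();
# the point is restoring predictions to the original row order.

def oof_predictions(fold_indices, predict_fold, n_rows):
    """Assemble out-of-fold predictions back into original row order.

    fold_indices: list of lists of row indices, one list per fold
    predict_fold: callable(fold_id, rows) -> predictions for those rows
    """
    preds = [None] * n_rows
    for fold_id, rows in enumerate(fold_indices):
        fold_preds = predict_fold(fold_id, rows)
        for row, p in zip(rows, fold_preds):
            preds[row] = p
    return preds

# Toy usage: 6 rows, 3 folds; the "prediction" is just fold_id for visibility.
folds = [[0, 3], [1, 4], [2, 5]]
preds = oof_predictions(folds, lambda fid, rows: [fid] * len(rows), 6)
print(preds)  # [0, 1, 2, 0, 1, 2]
```

With a real CV result, `predict_fold` would call the booster trained without that fold on the fold's rows.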

@yanyachen
Actually, we can get predictions for the training and validation datasets by using these functions:
R: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Booster.R#L454-L495
python: https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L1768-L1793

I think using them is enough to obtain the CV prediction scores.

lgb.cv would indeed be much more useful if it returned the final predictions. That would, for example, allow stacking.

Here is an R function that will do it if you pass in an object returned by lgb.cv:

```
get_lgbm_cv_preds <- function(cv) {
  # reach into the private environments to recover each fold's held-out rows
  valid_idx <- function(i) {
    cv$boosters[[i]]$booster$.__enclos_env__$private$valid_sets[[1]]$.__enclos_env__$private$used_indices
  }
  train_idx <- cv$boosters[[1]]$booster$.__enclos_env__$private$train_set$.__enclos_env__$private$used_indices
  rows <- length(valid_idx(1)) + length(train_idx)
  preds <- numeric(rows)
  for (i in seq_along(cv$boosters)) {
    # inner_predict(2) returns predictions on the first validation set
    preds[valid_idx(i)] <- cv$boosters[[i]]$booster$.__enclos_env__$private$inner_predict(2)
  }
  preds
}
```

Great job, @programmersims !

Does the function get the best cv prediction?

Pretty sure it gets the CV prediction from the last stopping round.


What a pity!

How can I keep just the best CV prediction, @programmersims?

Closed in favor of #2302. We decided to keep all feature requests in one place.

Contributions for this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

My sincere thanks to @StrikerRUS for unlocking.

Motivation and requirements

I know that people (especially some Kagglers) want this feature, and I want to fix it. There are probably two reasons why people might want to get the prediction values of trained models from the cv() function.

req1. to analyze out-of-fold predictions for the training data in more detail
req2. to apply ensemble techniques (stacking, averaging, etc.) using the trained models from the cv() function

How to fix it

I agree with the plan @guolinke mentioned; in other words, add a simple way to get the trained models.

req1: the cv() function can accept 'folds' (the data-split context), so users can predict out-of-fold with the trained models.
req2: users are free to apply any ensemble technique with the trained models.
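As a toy illustration of req2, averaging the per-fold models' predictions becomes trivial once the trained models are available to the caller. The "models" below are stand-in callables, not real boosters:

```python
# Illustrative sketch of ensemble averaging over the models trained by
# cv(). The "models" here are stand-in callables; with real CV boosters
# you would call each booster's predict() instead.

def average_predictions(models, rows):
    """Average the predictions of several fold models on new data."""
    per_model = [m(rows) for m in models]
    n = len(models)
    return [sum(vals) / n for vals in zip(*per_model)]

# Toy usage: three "fold models" that each shift the input by a constant.
models = [lambda rows, c=c: [x + c for x in rows] for c in (0.0, 1.0, 2.0)]
print(average_predictions(models, [10.0, 20.0]))  # [11.0, 21.0]
```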

Steps to fix it

I want to follow the scikit-learn way; in other words, the trained models are included in the returned dictionary.
ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

I suggest the following steps:

  1. Add an option named 'return_cvbooster' to the cv() function.

    • Add the trained '_CVBooster' object (cvfolds) to the returned dict (results) with the key 'cvbooster'
    • NOTE: I am not particular about parameter names.
  2. Change the name of '_CVBooster' to 'CVBooster'.

    • In other words, _CVBooster will be treated as public API
    • This step also fixes #2105
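The proposed return shape could be sketched with stand-in classes. CVBooster here is a minimal mock for illustration, not the real LightGBM class, and this cv() trains nothing:

```python
# Mock sketch of the proposed return value of cv() with
# return_cvbooster=True. CVBooster is a minimal stand-in: the real class
# would hold trained Booster objects, one per fold.

class CVBooster:
    def __init__(self, boosters):
        self.boosters = boosters  # one trained model per fold

def cv(params, train_set, nfold=3, return_cvbooster=False):
    """Stand-in for lgb.cv(): trains nothing, just shows the return shape."""
    results = {"metric-mean": [0.5], "metric-stdv": [0.01]}  # eval history
    if return_cvbooster:
        results["cvbooster"] = CVBooster([object()] * nfold)
    return results

res = cv({}, None, nfold=5, return_cvbooster=True)
print(sorted(res.keys()))              # ['cvbooster', 'metric-mean', 'metric-stdv']
print(len(res["cvbooster"].boosters))  # 5
```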

I would like to hear your opinion.

@momijiame Thank you very much for your detailed plan! It looks good to me! Looking forward to your PR.

@matsuken92 Maybe you have something in mind that could improve the proposed PR's plan?

@StrikerRUS Okay, I will review this plan!

Implemented in #3204.
