Machinelearning: LightGBM trainer exception

Created on 15 Nov 2018 · 17 Comments · Source: dotnet/machinelearning

System information

  • OS version/distro: Windows 10
  • .NET Version (eg., dotnet --info): .NET Core 2.1

Issue

  • What did you do?
    Ran MML command line: execgraph "C:\Benchmarking\automl_graph.json"

Contents of automl_graph.json:

{
  "Inputs": {
    "file_train": "D:\\SplitDatasets\\ExcitementFG2_train.csv",
    "file_test": "D:\\SplitDatasets\\ExcitementFG2_valid.csv"
  },
  "Nodes": [
    {
      "Inputs": {
        "CustomSchema": "sep=, col=Label:R4:78 col=Features1:R4:0-77 col=Features2:R4:79-202 header=+",
        "InputFile": "$file_train"
      },
      "Name": "Data.CustomTextLoader",
      "Outputs": {
        "Data": "$data_train"
      }
    },
    {
      "Inputs": {
        "CustomSchema": "sep=, col=Label:R4:78 col=Features1:R4:0-77 col=Features2:R4:79-202 header=+",
        "InputFile": "$file_test"
      },
      "Name": "Data.CustomTextLoader",
      "Outputs": {
        "Data": "$data_test"
      }
    },
    {
      "Inputs": {
        "BatchSize": 3,
        "StateArguments": {
          "Name": "AutoMlState",
          "Settings": {
            "Engine": {
              "Name": "Rocket",
              "Settings": {}
            },
            "Metric": "Accuracy",
            "TerminatorArgs": {
              "Name": "IterationLimited",
              "Settings": {
                "FinalHistoryLength": 100
              }
            },
            "TrainerKind": "SignatureBinaryClassifierTrainer"
          }
        },
        "TestingData": "$data_test",
        "TrainingData": "$data_train",
        "IgnoreColumns": ["cost"]
      },
      "Name": "Models.PipelineSweeper",
      "Outputs": {
        "Results": "$output_data",
        "State": "$xyz"
      }
    }
  ],
  "Outputs": {
    "output_data": "C:\\Benchmarking\\01-ResultsOut.csv"
  }
}
  • What happened?
    Encountered an exception in LightGBM trainer

  • What did you expect?
    A run to completion, without an exception

Source code / logs

--- Command line args ---
dotnet MML.dll execgraph C:\Benchmarking\automl_graph.json

--- Exception message ---

System.InvalidOperationException
  HResult=0x80131509
  Message=Categorical split features is zero length
  Source=Microsoft.ML.Core
  StackTrace:
   at Microsoft.ML.Runtime.Contracts.Check(Boolean f, String msg) in C:\MLDotNet\src\Microsoft.ML.Core\Utilities\Contracts.cs:line 497
   at Microsoft.ML.Trainers.FastTree.Internal.RegressionTree.CheckValid(Action`2 checker) in C:\MLDotNet\src\Microsoft.ML.FastTree\TreeEnsemble\RegressionTree.cs:line 469
   at Microsoft.ML.Trainers.FastTree.Internal.RegressionTree..ctor(Int32[] splitFeatures, Double[] splitGain, Double[] gainPValue, Single[] rawThresholds, Single[] defaultValueForMissing, Int32[] lteChild, Int32[] gtChild, Double[] leafValues, Int32[][] categoricalSplitFeatures, Boolean[] categoricalSplit) in C:\MLDotNet\src\Microsoft.ML.FastTree\TreeEnsemble\RegressionTree.cs:line 223
   at Microsoft.ML.Trainers.FastTree.Internal.RegressionTree.Create(Int32 numLeaves, Int32[] splitFeatures, Double[] splitGain, Single[] rawThresholds, Single[] defaultValueForMissing, Int32[] lteChild, Int32[] gtChild, Double[] leafValues, Int32[][] categoricalSplitFeatures, Boolean[] categoricalSplit) in C:\MLDotNet\src\Microsoft.ML.FastTree\TreeEnsemble\RegressionTree.cs:line 189
   at Microsoft.ML.Runtime.LightGBM.Booster.GetModel(Int32[] categoricalFeatureBoudaries) in C:\MLDotNet\src\Microsoft.ML.LightGBM\WrappedLightGbmBooster.cs:line 241
   at Microsoft.ML.Runtime.LightGBM.LightGbmTrainerBase`3.TrainCore(IChannel ch, IProgressChannel pch, Dataset dtrain, CategoricalMetaData catMetaData, Dataset dvalid) in C:\MLDotNet\src\Microsoft.ML.LightGBM\LightGbmTrainerBase.cs:line 378
   at Microsoft.ML.Runtime.LightGBM.LightGbmTrainerBase`3.TrainModelCore(TrainContext context) in C:\MLDotNet\src\Microsoft.ML.LightGBM\LightGbmTrainerBase.cs:line 126
   at Microsoft.ML.Runtime.Training.TrainerEstimatorBase`2.Train(TrainContext context) in C:\MLDotNet\src\Microsoft.ML.Data\Training\TrainerEstimatorBase.cs:line 92
   at Microsoft.ML.Runtime.Training.TrainerEstimatorBase`2.Microsoft.ML.Runtime.ITrainer.Train(TrainContext context) in C:\MLDotNet\src\Microsoft.ML.Data\Training\TrainerEstimatorBase.cs:line 158
   at Microsoft.ML.Runtime.Data.TrainUtils.TrainCore(IHostEnvironment env, IChannel ch, RoleMappedData data, ITrainer trainer, RoleMappedData validData, IComponentFactory`1 calibrator, Int32 maxCalibrationExamples, Nullable`1 cacheData, IPredictor inputPredictor) in C:\MLDotNet\src\Microsoft.ML.Data\Commands\TrainCommand.cs:line 254
   at Microsoft.ML.Runtime.Data.TrainUtils.Train(IHostEnvironment env, IChannel ch, RoleMappedData data, ITrainer trainer, IComponentFactory`1 calibrator, Int32 maxCalibrationExamples) in C:\MLDotNet\src\Microsoft.ML.Data\Commands\TrainCommand.cs:line 223
   at Microsoft.ML.Runtime.EntryPoints.LearnerEntryPointsUtils.Train[TArg,TOut](IHost host, TArg input, Func`1 createTrainer, Func`1 getLabel, Func`1 getWeight, Func`1 getGroup, Func`1 getName, Func`1 getCustom, ICalibratorTrainerFactory calibrator, Int32 maxCalibrationExamples) in C:\MLDotNet\src\Microsoft.ML.Data\EntryPoints\InputBase.cs:line 189
   at Microsoft.ML.Runtime.LightGBM.LightGbm.TrainBinary(IHostEnvironment env, LightGbmArguments input) in C:\MLDotNet\src\Microsoft.ML.LightGBM\LightGbmBinaryTrainer.cs:line 189
P0 bug

All 17 comments

@codemzs, @guolinke: Any ideas? (related to categorical handling in LightGBM)

Thanks @justinormont for the contacts. Thanks @codemzs for all your help on this today. @guolinke -- tr=LightGBMBinary{UseCat:+ CatSmooth:1} repeatedly fails, but tr=LightGBMBinary{UseCat:+} repeatedly works. Perhaps this is a LightGBM cat smoothing bug?

Unfortunately, I think I was wrong, and I do not think this bug is confined to smoothing. @guolinke -- I have some concrete reproducible data where this fails. Let me know if you want to sync up to debug.

It seems there is a bug in model conversion from LightGBM to FastTree.

https://github.com/dotnet/machinelearning/blob/18f7acc6b2f6e1cada41bb7ad2e03e53ae381849/src/Microsoft.ML.FastTree/TreeEnsemble/RegressionTree.cs#L469

@codemzs any idea about this bug?

@justinormont can you share the dataset on which you get the exception?
I've tried to run LightGBM with UseCat on small datasets (like adult), but I can't hit that exception.

This is still happening often, unfortunately.
LightGBM is one of the trainers that often produces the best results, so this exception may be an important one.
I have a consistent repro and would love to work with someone to help debug, if possible.

@vinodshanbhag for visibility

@Ivanidzo4ka we are still seeing this issue in more than one dataset. Can we please have this investigated? @daholste can share the dataset if you need it.

I got the same issue when I tried to train a model with empty values in the department feature. I used OneHotEncoding. After I replaced all empty strings with "-1", the issue went away.
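The workaround described in this comment (filling empty category values with a sentinel such as "-1" before one-hot encoding) can be sketched roughly as below. This is a minimal, library-independent illustration in Python, not ML.NET code; the function name `fill_empty_categories`, the row-as-dict layout, and the sample data are assumptions made for the example.

```python
def fill_empty_categories(rows, column, sentinel="-1"):
    """Return copies of `rows` with empty or missing values in `column` replaced.

    `rows` is assumed to be a list of dicts (one dict per record); this layout
    is hypothetical and chosen only to keep the sketch self-contained.
    """
    cleaned = []
    for row in rows:
        value = row.get(column)
        # Treat None, "", and whitespace-only strings as "empty".
        if value is None or str(value).strip() == "":
            row = {**row, column: sentinel}  # copy, so the input is not mutated
        cleaned.append(row)
    return cleaned


# Hypothetical sample data with an empty and a missing department value.
rows = [{"department": "sales"}, {"department": ""}, {"department": None}]
cleaned = fill_empty_categories(rows, "department")
print([r["department"] for r in cleaned])
```

The point of the sentinel is only that every row ends up with a non-empty category before encoding; whether that avoids the exception in a given pipeline would need to be verified case by case.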

I think @justinormont sent me a repro file some time ago, but I lost it. If someone can provide a reproducible code snippet, I would be more than happy to fix it.

Thanks a lot, @Ivanidzo4ka . Sent!

@justinormont Any updates on this issue? Just ran into it again. Thanks!

Same here. If it helps, I now get this when my features have extremely high cardinality. When I print the schema of the categorical feature (from the data view, using my own printing), I see 99,999 different values in that categorical feature.
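For anyone wanting to check whether their own data falls into the high-cardinality case mentioned here, a quick count of distinct values per categorical column may help. The sketch below is plain Python rather than ML.NET; `column_cardinality`, the row-as-dict layout, and the sample values are hypothetical.

```python
def column_cardinality(rows, column):
    """Count the distinct values that appear in `column` across `rows`.

    `rows` is assumed to be an iterable of dicts; rows that lack the column
    are counted under the value None.
    """
    return len({row.get(column) for row in rows})


# Hypothetical sample: three distinct department values across four rows.
rows = [
    {"department": "sales"},
    {"department": "ops"},
    {"department": "sales"},
    {"department": "hr"},
]
print(column_cardinality(rows, "department"))
```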

I have this issue too - any ETA for a fix?

@daholste can you send me the dataset and code with which I can reproduce this issue? The same that you sent to Ivan :)

Hey, sent!

Hey @daholste, I wasn't able to reproduce this at all, neither in TLC nor in ML.NET. And it looks like the Models.PipelineSweeper and Rocket components in the graph (along with the execgraph command) were removed in ML.NET some time ago. In any case, there was no repro even when using LightGbm from the command line or API since the dataset is only numerical columns, and the Categorical split features is zero length error isn't applicable so I'm not sure why you were seeing that in the first place.

I do, however, have the same error reproduced in #3659, and I believe the underlying cause is the same. It deterministically happens when there is only one categorical feature and UseCategoricalSplit is true in LightGbm, and it is likely a bug in model conversion from LightGbm to FastTree. Please follow #3659 for details and updates. I am closing this issue. Please feel free to reopen if you find a repro that is distinct from the conditions described in the other issue.

cc: @vinodshanbhag @justinormont @guolinke @vKuryshev @mayoatte @rauhs @eyvindwa
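To illustrate the kind of validity check that produces this message, here is a rough Python sketch. It is not the actual RegressionTree.CheckValid implementation; it only assumes, based on the constructor parameters visible in the stack trace (categoricalSplit and categoricalSplitFeatures), that each node flagged as a categorical split is expected to carry a non-empty list of features. The function name and the sample trees are hypothetical.

```python
def check_categorical_splits(categorical_split, categorical_split_features):
    """Raise ValueError if a categorical split node has no features attached.

    categorical_split: list of bools, one per internal node.
    categorical_split_features: list of per-node feature-index lists
        (None or an empty list for nodes without categorical features).
    """
    for node, is_categorical in enumerate(categorical_split):
        if not is_categorical:
            continue  # numerical split, nothing to validate here
        features = categorical_split_features[node]
        if not features:  # None or zero-length list
            raise ValueError(
                f"Categorical split features is zero length (node {node})"
            )


# Hypothetical well-formed tree: node 1 is categorical and has features.
check_categorical_splits([False, True], [None, [3, 4]])

# Hypothetical malformed tree, as a faulty model conversion might produce.
try:
    check_categorical_splits([True], [[]])
except ValueError as err:
    print(err)
```

Under this reading, a conversion step that marks a node as categorical but emits an empty feature list for it would trip the check regardless of how the original model was trained.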
