machinelearning: LightGBM error when using UseCat with count feature selection

Created on 3 May 2019 · 9 comments · Source: dotnet/machinelearning

Issue

version: 0.11

  • What did you do?
  • I trained a LightGBM multi-class classifier with UseCat set to true.
  • I added a SelectFeaturesBasedOnCount transform on the final Features column.

If it matters: I also use early stopping (does this prune some trees?).

  • What happened?

I got an exception after training completed (successfully), when ML.NET tried to construct the InternalRegressionTree:

System.InvalidOperationException: 'Categorical split features is zero length'

Stack

>   Microsoft.ML.Core.dll!Microsoft.ML.Contracts.Check(bool f, string msg) Line 491 C#
    Microsoft.ML.FastTree.dll!Microsoft.ML.Trainers.FastTree.InternalRegressionTree.CheckValid(System.Action<bool, string> checker) Line 471    C#
    Microsoft.ML.FastTree.dll!Microsoft.ML.Trainers.FastTree.InternalRegressionTree.InternalRegressionTree(int[] splitFeatures, double[] splitGain, double[] gainPValue, float[] rawThresholds, float[] defaultValueForMissing, int[] lteChild, int[] gtChild, double[] leafValues, int[][] categoricalSplitFeatures, bool[] categoricalSplit) Line 224 C#
    Microsoft.ML.FastTree.dll!Microsoft.ML.Trainers.FastTree.InternalRegressionTree.Create(int numLeaves, int[] splitFeatures, double[] splitGain, float[] rawThresholds, float[] defaultValueForMissing, int[] lteChild, int[] gtChild, double[] leafValues, int[][] categoricalSplitFeatures, bool[] categoricalSplit) Line 188   C#
    Microsoft.ML.LightGBM.dll!Microsoft.ML.LightGBM.Booster.GetModel(int[] categoricalFeatureBoudaries) Line 257    C#
    Microsoft.ML.LightGBM.dll!Microsoft.ML.LightGBM.LightGbmTrainerBase<Microsoft.ML.Data.VBuffer<float>, Microsoft.ML.Data.MulticlassPredictionTransformer<Microsoft.ML.Trainers.OvaModelParameters>, Microsoft.ML.Trainers.OvaModelParameters>.TrainCore(Microsoft.ML.IChannel ch, Microsoft.ML.IProgressChannel pch, Microsoft.ML.LightGBM.Dataset dtrain, Microsoft.ML.LightGBM.LightGbmTrainerBase<Microsoft.ML.Data.VBuffer<float>, Microsoft.ML.Data.MulticlassPredictionTransformer<Microsoft.ML.Trainers.OvaModelParameters>, Microsoft.ML.Trainers.OvaModelParameters>.CategoricalMetaData catMetaData, Microsoft.ML.LightGBM.Dataset dvalid) Line 375    C#
    Microsoft.ML.LightGBM.dll!Microsoft.ML.LightGBM.LightGbmTrainerBase<Microsoft.ML.Data.VBuffer<float>, Microsoft.ML.Data.MulticlassPredictionTransformer<Microsoft.ML.Trainers.OvaModelParameters>, Microsoft.ML.Trainers.OvaModelParameters>.TrainModelCore(Microsoft.ML.TrainContext context) Line 117 C#
    Microsoft.ML.Data.dll!Microsoft.ML.Trainers.TrainerEstimatorBase<Microsoft.ML.Data.MulticlassPredictionTransformer<Microsoft.ML.Trainers.OvaModelParameters>, Microsoft.ML.Trainers.OvaModelParameters>.TrainTransformer(Microsoft.Data.DataView.IDataView trainSet, Microsoft.Data.DataView.IDataView validationSet, Microsoft.ML.IPredictor initPredictor) Line 148   C#
    MlnEval.exe!ConsoleApp1.MlNetSpecific.MlNetLightGbmMultiClassTrainer.TrainAndEval(ConsoleApp1.Dev.AppState app) Line 107    C#
    MlnEval.exe!ConsoleApp1.Program.Main(string[] args) Line 116    C#

Unfortunately, I failed to come up with a minimal reproducible example; it seems this requires a certain data setup.

Partial analysis

It looks like here:
https://github.com/dotnet/machinelearning/blob/c5aab770622f1f56bddf8bbaf96f7798762c45ff/src/Microsoft.ML.LightGbm/WrappedLightGbmBooster.cs#L236-L251

the cats array is actually zero-length, but categoricalSplit[node] is still set to true,

which later throws here:

https://github.com/dotnet/machinelearning/blob/c5aab770622f1f56bddf8bbaf96f7798762c45ff/src/Microsoft.ML.FastTree/TreeEnsemble/InternalRegressionTree.cs#L465-L469
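For illustration, here is a minimal sketch of the invariant that CheckValid enforces (paraphrased from the behavior described above, not the exact ML.NET source): every node flagged as a categorical split must carry a non-empty list of categorical split features.

    // Paraphrase of the consistency check in InternalRegressionTree.CheckValid:
    // a node marked as a categorical split must have a non-empty cats array.
    for (int node = 0; node < categoricalSplit.Length; node++)
    {
      if (categoricalSplit[node])
      {
        // categoricalSplitFeatures[node] is the cats array built in GetModel; when it
        // comes back empty, this is the check that fires with
        // "Categorical split features is zero length".
        checker(categoricalSplitFeatures[node] != null && categoricalSplitFeatures[node].Length > 0,
          "Categorical split features is zero length");
      }
    }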

P0 bug


All 9 comments

FWIW, the same problem occurs when using HandleMissingValue and assigning nulls to some features.

Duplicate of https://github.com/dotnet/machinelearning/issues/1625, though additional details are provided in this new bug report.

Any chance this can be looked at?

@rauhs can you give me a sample of the dataset on which this can be reproduced?

I'm sure there is a smaller example, but this reproduces it with version 1.5.0-preview:

    // Assumed shape of the sample class (not shown in the original report):
    public class GenericSample
    {
      public string A { get; set; }
      public string B { get; set; }
      public string Label { get; set; }
    }

    // Builds a vector of at least `total` string values cycling through `card` distinct categories.
    public static string[] FeatureVector(int card, int total)
    {
      var inner = Enumerable.Range(1, card).Select(x => x.ToString());
      var repeatCount = (int)Math.Ceiling((double)total / card) + 1;
      return Enumerable.Repeat(inner, repeatCount).SelectMany(x => x).ToArray();
    }

    public static void ReproduceLightGbmFeatureSelByCountUseCatBug()
    {
      var numInstances = 10_000;
      var axs = FeatureVector(1000, numInstances);   // high-cardinality categorical A
      var bxs = FeatureVector(200, numInstances);    // categorical B
      var labels = FeatureVector(100, numInstances); // 100 classes
      var data = Enumerable.Range(1, numInstances).Select(x => new GenericSample { A = axs[x], B = bxs[x], Label = labels[x] });
      var ctx = new MLContext();
      ctx.Log += (sender, e) =>
      {
        // Suppress trace-level log noise.
        if (e.Message.Contains("Kind=Trace]"))
        {
          return;
        }
        Console.WriteLine(e.Message);
      };

      var options = new LightGbmMulticlassTrainer.Options
      {
        UseCategoricalSplit = true,
        MinimumExampleCountPerLeaf = 1,
        MinimumExampleCountPerGroup = 1,
        EarlyStoppingRound = 10,
      };
      options.Booster = new GradientBooster.Options();

      // One-hot encode A and B, concatenate them into Features, then filter slots by count.
      var pipe = ctx.Transforms.Conversion.MapValueToKey("A")
        .Append(ctx.Transforms.Conversion.MapValueToKey("B"))
        .Append(ctx.Transforms.Conversion.MapKeyToVector("A"))
        .Append(ctx.Transforms.Conversion.MapKeyToVector("B"))
        .Append(ctx.Transforms.Concatenate("Features", "A", "B"))
        .Append(ctx.Transforms.FeatureSelection.SelectFeaturesBasedOnCount("Features", "Features", 10))
        .Append(ctx.Transforms.Conversion.MapValueToKey("Label"));
      var dataView = ctx.Data.LoadFromEnumerable(data);
      var trainer = ctx.MulticlassClassification.Trainers.LightGbm(options);
      var dataSplit = ctx.Data.TrainTestSplit(dataView);
      // Note: the encoder is fitted on the test set and applied to the train set,
      // so the train set can contain values the encoder never saw.
      var encoder = pipe.Fit(dataSplit.TestSet);
      var trainEncoded = encoder.Transform(dataSplit.TrainSet);
      var testEncoded = encoder.Transform(dataSplit.TestSet);
      var model = trainer.Fit(trainEncoded, testEncoded);
      var scores = model.Transform(trainEncoded).GetColumn<float[]>("Score").ToArray();

      Console.WriteLine($"Min: {scores.Select(x => x.Min()).Min()}");
      Console.WriteLine($"Max: {scores.Select(x => x.Max()).Max()}");
    }

Upon further investigation, I have found that this is not because of CountFeatureSelection, but because of how categorical splits are handled when there is only one categorical column: specifically, how CategoricalSlotRanges is determined in the schema annotation, and how catBoundaries and catThresholds are used:
https://github.com/dotnet/machinelearning/blob/f94f359a5c52ce797b747e1b240c789696c31985/src/Microsoft.ML.LightGbm/WrappedLightGbmBooster.cs#L189-L190
https://github.com/dotnet/machinelearning/blob/f94f359a5c52ce797b747e1b240c789696c31985/src/Microsoft.ML.LightGbm/WrappedLightGbmBooster.cs#L231-L232
which affects the calculation of cats:
https://github.com/dotnet/machinelearning/blob/f94f359a5c52ce797b747e1b240c789696c31985/src/Microsoft.ML.LightGbm/WrappedLightGbmBooster.cs#L235-L245

Here's what happens in three situations:
| Situation | CategoricalSlotRanges | categoricalFeatureBoundaries | catBoundaries | catThresholds | cats |
|-|-|-|-|-|-|
| 1. Concatenate A into Features | [0, 665] | [0, 666] | [0, 1] | [1] | empty [] |
| 2. Concatenate A and B into Features | [0, 665, 666, 864] | [0, 666, 865] | [0, 2] | [2, 4096] | [1, 44] |
| 3. Concatenate A and B into Features and apply SelectFeaturesBasedOnCount | 9 slots are selected [0, 8] | [0, 9] | [0, 1] | [1] | empty [] |

Situations 1 and 3 fail, but 2 succeeds. In 1 and 3, cats ends up an empty array while categoricalSplit is still set to true, as pointed out by @rauhs, which leads to the error in
https://github.com/dotnet/machinelearning/blob/f94f359a5c52ce797b747e1b240c789696c31985/src/Microsoft.ML.FastTree/TreeEnsemble/InternalRegressionTree.cs#L474-L475
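To make the table concrete, here is a hedged sketch of the bitset decoding behind cats (illustrative, not the exact ML.NET source): LightGBM packs a categorical split's threshold into 32-bit words, catBoundaries delimits each node's range of words, and the set bits name the categories that go left. Category 0 appears to be skipped, which is why a threshold of exactly [1] decodes to an empty array.

    // Illustrative decoding of one node's categorical threshold bitset.
    static int[] DecodeCats(uint[] catThreshold, int wordLo, int wordHi)
    {
      var cats = new List<int>();
      for (int word = wordLo; word < wordHi; word++)
        for (int bit = 0; bit < 32; bit++)
        {
          int cat = (word - wordLo) * 32 + bit;
          // Category 0 is skipped, so a threshold whose only set bit is bit 0
          // yields an empty array -- the state that trips CheckValid.
          if (cat > 0 && (catThreshold[word] & (1u << bit)) != 0)
            cats.Add(cat);
        }
      return cats.ToArray();
    }

    // DecodeCats(new uint[] { 1 }, 0, 1)       -> {}       (situations 1 and 3)
    // DecodeCats(new uint[] { 2, 4096 }, 0, 2) -> { 1, 44 } (situation 2: bit 1 -> 1, bit 12 of word 1 -> 44)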

The error arises not from the presence of SelectFeaturesBasedOnCount as such, but because SelectFeaturesBasedOnCount leaves only one implied categorical feature.

I will look deeper into it and propose a fix.

cc: @guolinke @harishsk

More on this, regarding the values of the variables in (the relevant rows are those where the two columns differ):
https://github.com/dotnet/machinelearning/blob/f94f359a5c52ce797b747e1b240c789696c31985/src/Microsoft.ML.LightGbm/WrappedLightGbmBooster.cs#L208-L247

| | A only | A and B |
|-|-|-|
| numberOfLeaves | 2 | 2 |
| numCat | 1 | 1 |
| leftChild | int[] {-1} | int[] {-1} |
| rightChild | int[] {-2} | int[] {-2} |
| splitFeature at line 215 | int[] {0} | int[] {1} |
| splitFeature after update in line 227 | int[] {0} | int[] {666} |
| threshold | double[] {0} | double[] {0} |
| splitGain | double[] {1.13603} | double[] {250.777} |
| leafOutput | double[] {-9.1865…, -9.2197…} | double[] {-8.7746…, -9.3438…} |
| decisionType | uint[] {5} | uint[] {1} |
| defaultValue | double[] {0} | double[] {0} |
| categoricalSplitFeatures at line 221 | int[][] {null} | int[][] {null} |
| categoricalSplitFeatures after update at line 243 | int[][] { {} } | int[][] { {666, 709} } |
| categoricalSplit at line 222 | bool[] {false} | bool[] {false} |
| categoricalSplit after update in line 246 | bool[] {true} | bool[] {true} |
| categoricalFeatureBoundaries | int[] {0, 666} | int[] {0, 666, 865} |
| numCat | 1 | 1 |
| catBoundaries | int[] {0, 1} | int[] {0, 2} |
| catThreshold | uint[] {1} | uint[] {2, 4096} |
| cats | int[] {} (empty) | int[] {1, 44} |
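As an aside on the splitFeature remapping visible above, here is a hedged sketch of the index arithmetic (illustrative; the actual logic is in the linked GetModel code): LightGBM reports categorical splits in terms of categorical-column indices, which GetModel maps back to slot offsets in Features through categoricalFeatureBoundaries.

    // Illustrative remapping of a categorical split's column index to a slot offset.
    // With A only:  boundaries {0, 666},      splitFeature 0 -> boundaries[0] = 0
    // With A and B: boundaries {0, 666, 865}, splitFeature 1 -> boundaries[1] = 666
    static int RemapCategoricalSplitFeature(int splitFeature, int[] categoricalFeatureBoundaries)
      => categoricalFeatureBoundaries[splitFeature];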

cc: @guolinke @harishsk

@najeeb-kazmi From the log, it seems the bug is caused by a wrong mapping from feature index to cat index.

Circling back on this.

Upon still further investigation, I found that the issue was not due to the presence of a single categorical feature. Instead, it is triggered when there is a one-hot encoded categorical feature for which some rows in the training set have all slots set to 0. This can happen when the one-hot encoder is fitted on a smaller dataset and then applied to the larger training dataset, as in the repro example here.
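For context, here is a hedged sketch of how those all-zero rows arise (names mirror the repro above; smallerSet and largerSet are illustrative): values that MapValueToKey never saw during Fit map to the missing key, and MapKeyToVector turns the missing key into an all-zero one-hot vector.

    // Fit the encoder on a subset, then transform a superset (as the repro does with
    // TestSet vs. TrainSet). Unseen values of A map to the missing key, whose one-hot
    // encoding is all zeros.
    var encoderPipe = ctx.Transforms.Conversion.MapValueToKey("A")
      .Append(ctx.Transforms.Conversion.MapKeyToVector("A"));
    var fitted = encoderPipe.Fit(smallerSet);   // smallerSet: IDataView missing some A values
    var encoded = fitted.Transform(largerSet);  // rows with unseen A -> all-zero slots in A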

This was not being handled correctly in the internal data representation in LightGbm. This has been fixed in #5048, thanks to @guolinke.
