machinelearning: AutoML.NET version 0.17.1, training a Binary Classification model returns misleading quality metrics

Created on 24 Aug 2020  ·  9 Comments  ·  Source: dotnet/machinelearning

System information

  • OS version/distro:
  • .NET Version (eg., dotnet --info):

Issue

  • What did you do?
    The AutoML API stops after one iteration when training a Binary Classification model, and the best run's model score is set to 1. As a result, the quality metrics are always set to perfect values, which is misleading.

  • What happened?

// Debugging the source code, I can see: if the model is perfect, break
if (_metricsAgent.IsModelPerfect(suggestedPipelineRunDetail.Score))
{
    break;
}

suggestedPipelineRunDetail.Score is always 1:

   Trainer                   Accuracy  AUC     AUPRC   F1-score  Duration
1  AveragedPerceptronBinary  1.0000    1.0000  1.0000  1.0000    0.5

  • What did you expect?

    If you run with ML.NET for the same training dataset:

Accuracy  AUC     F1-Score  Positive Precision  Positive Recall  Negative Precision  Negative Recall
52.26%    52.86%  0.82%     1.00                0.00             0.52                100.00%

Source code / logs

Please paste or attach the code or logs or traces that would be helpful to diagnose the issue you are reporting.

Labels: bug, need info

All 9 comments

Are you able to share your dataset or code you used to create the model?

I have created a cut-down sample file and console app. What is the best way of getting it to you?

@aforoughi1 : You can attach your .zip file to a comment using drag/drop. It's also useful to put the body of the training code in the comment itself (for easier readability).

Generally, perfect metrics are caused by a too-small validation/test dataset, very high label skew, too easy a problem, or leakage.

Info about leakage: https://en.wikipedia.org/wiki/Leakage_(machine_learning)
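To make the leakage failure mode concrete, here is a small illustrative sketch (plain Python, not from this issue): a feature that is a deterministic function of the label lets even a trivial threshold rule score perfectly, while an honest feature scores near chance.

```python
import random

random.seed(0)

# Synthetic binary labels and two features:
#  - "honest" is pure noise, unrelated to the label
#  - "leaked" is a proxy for the label (like DISignal in this issue)
labels = [random.choice([0, 1]) for _ in range(1000)]
honest = [random.random() for _ in labels]
leaked = [1.0 if y == 1 else -1.0 for y in labels]

def accuracy(feature, labels, threshold=0.0):
    """Score a trivial classifier: predict 1 when feature > threshold."""
    preds = [1 if x > threshold else 0 for x in feature]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(accuracy(leaked, labels))                  # 1.0: the feature encodes the label
print(accuracy(honest, labels, threshold=0.5))   # near chance
```

No learning algorithm is needed to "win" on the leaked column, which is why every trainer AutoML tries will report perfect metrics.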

The zip file is attached, which includes the sample data and code.
#5362 sample code.zip.

What pipeline did you build for ML.NET to get those results? The pipeline that AutoML is building (at least on my machine; yours may differ) is this:

var pipeline = mlContext.Transforms.Concatenate("Features", "Open", "High", "Low", "Close", "AdjClose", "Volume", "DISignal")
                .Append(mlContext.Transforms.NormalizeMinMax("Features"))
                .Append(mlContext.BinaryClassification.Trainers.AveragedPerceptron());

When I run that pipeline I get identical results to what AutoML is giving.

The DISignal column is a proxy for the Label column. Using that column as a feature will cause leakage, and a perfect score.

Few options:

  • Drop the old label -- .Append(mlContext.Transforms.DropColumns("DISignal"))
  • Overwrite the old label when you convert to boolean
  • Tell AutoML to ignore "DISignal" by moving the column name from columnInference.ColumnInformation.NumericColumnNames to .IgnoredColumnNames
  • Use multi-class classification instead of binary (so you don't need to convert to Boolean)
  • Remove missing values (then directly parse as Boolean - see: Dataset cleanup below)
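The third option above (moving the column between the inferred column lists) can be sketched generically. This is a plain-Python stand-in that mirrors the shape of AutoML's `ColumnInformation`, not the real ML.NET API:

```python
# Plain-Python stand-in for AutoML's inferred ColumnInformation
# (NumericColumnNames / IgnoredColumnNames); not the real ML.NET API.
class ColumnInformation:
    def __init__(self, numeric, ignored=None):
        self.numeric_column_names = list(numeric)
        self.ignored_column_names = list(ignored or [])

def ignore_column(info, name):
    """Move a column from the numeric-features list to the ignored list."""
    if name in info.numeric_column_names:
        info.numeric_column_names.remove(name)
        info.ignored_column_names.append(name)
    return info

info = ColumnInformation(
    ["Open", "High", "Low", "Close", "AdjClose", "Volume", "DISignal"])
ignore_column(info, "DISignal")   # the leaky proxy is no longer a feature
print(info.numeric_column_names)  # DISignal removed
print(info.ignored_column_names)  # ['DISignal']
```

The key point is that the column must end up outside the set that gets concatenated into "Features"; whether you drop it, overwrite it, or ignore it is secondary.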

Dataset cleanup:
The last line in your dataset has a missing label value.

The Boolean type in ML.NET doesn't support missing values, so I expect AutoML will assume the column is of type Single (which does support missing values). You may want to fill in this missing value, or remove the line. Once the missing value is fixed in the dataset, a side benefit is that ML.NET's loader should properly read your {-1, 1} values directly into a Boolean (supported Boolean values), and you can remove the custom conversion code.
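The cleanup described above can be sketched in a few lines. This is a Python stand-in for the loader-side fix, with an assumed two-column layout; it is not the ML.NET loader itself:

```python
import csv
import io

# Toy rows shaped like the issue's dataset: the last row has a missing label.
raw = """Open,DISignal
1.0,-1
1.1,1
1.2,
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Drop rows whose label is missing, then map {-1, 1} -> {False, True},
# mirroring what a Boolean loader can do once no missing values remain.
clean = [r for r in rows if r["DISignal"].strip() != ""]
labels = [r["DISignal"].strip() == "1" for r in clean]
print(labels)  # [False, True]
```

Once the missing row is handled at the cleaning stage, the custom float-to-bool conversion step becomes unnecessary.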

Another route is switching to multi-class classification, which will cleanly support the missing value and will remove the need to convert to Boolean.

Improvement for the ML.NET code:
We may want to implement automated leakage detection, which can detect this type of column-wise leakage, and then warn the user. One method of implementing this check is purposefully adding the label column into the existing features, training a FastForestRegression (with label shuffling if multi-class), then checking the model weights and alerting on any columns having close to 1.0 feature importance (besides the label itself).

As was requested, below is the pipeline for the data-processing part in ML.NET:

var dataProcessPipeline = mlContext.Transforms.CopyColumns("Label", TargetName())
    .Append(FloatToBoolLabelNormalizer())
    .Append(mlContext.Transforms.Concatenate("Features", StockFeatures.Indicators))
    .Append(mlContext.Transforms.NormalizeMinMax("MinMaxNormalized", "Features"))
    .AppendCacheCheckpoint(mlContext);

I have 15 technical indicators, which are used to train Regression, Binary, and Multiclass models. The results of all the multiclass models/predictions fit well when compared against the real-time technical indicators/charts.

I was surprised to see a difference between the AutoML and ML.NET quality metrics when I retrained the best-run model using ML.NET. I read in the latest release of the API that the number of iterations has changed to 10, and was wondering under what conditions it would retry again.

The missing data rows are dropped at the cleaning stage, e.g. public holidays and NAs caused by data-engineering calculations such as moving averages. The last value is an NA because the values are shifted by one day (lagged variable). The actual code fills this value.

I will look at some of your suggestions to avoid leakage.

I have dropped the proxy for the label and it works now.

@aforoughi1 I'm glad that resolved your issue. I'm going to close this for now; if you have any more problems, feel free to reopen it.

@justinormont Let's sync to discuss your suggestions.
