I have tried creating a simple data set and performing the training like so:
dotnet .\mlnet.dll auto-train --task binary-classification
--dataset "logons.csv" --label-column-index 0
--has-header true --max-exploration-time 10
Here is an example of the data set, which is reduced from my original but shows the format:
Valid Data
0 09:00
0 09:01
0 09:02
0 09:03
0 09:04
0 09:05
0 09:06
0 09:07
1 12:08
0 09:09
0 09:10
0 09:00
0 09:01
0 09:02
0 09:03
0 09:04
0 09:05
0 09:06
0 09:07
1 13:08
0 09:09
0 09:10
0 09:00
0 09:01
0 09:02
0 09:03
0 09:04
0 09:05
0 09:06
0 09:07
1 14:08
0 09:09
0 09:10
Every time I try to run the command I get the following error:
Exception occured while exploring pipelines:
Training failed with the exception:
System.ArgumentOutOfRangeException: AUC is not definied
when there is no positive class in the data
Parameter name: PosSample
I originally tried this via VS2019 and the latest version of ML.NET, but that failed, so I then tried using the binary directly.
This is likely an instance of a cross-validation fold failing: with so few positive samples, not every fold will contain both classes.
This is being fixed in https://github.com/dotnet/machinelearning/pull/3794
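This isn't the ML.NET internals (which are C#), just a toy Python sketch of why the error shows up on a small, imbalanced sample like the one above: with only 3 positives among 33 rows, 10 random folds cannot all contain a positive, and AUC is undefined on any fold without one.

```python
import random

# Mirrors the sample data above: 3 positives, 30 negatives.
labels = [1] * 3 + [0] * 30

rng = random.Random(0)
idx = list(range(len(labels)))
rng.shuffle(idx)

# Round-robin the shuffled indices into 10 folds (a stand-in for
# whatever fold assignment AutoML actually uses).
k = 10
folds = [idx[i::k] for i in range(k)]

folds_without_pos = sum(
    1 for fold in folds if not any(labels[i] == 1 for i in fold)
)

# Pigeonhole: 3 positives can land in at most 3 of the 10 folds,
# so at least 7 folds hold no positive at all.
assert folds_without_pos >= 7
```

However the rows are shuffled, at least 7 of the 10 folds end up positive-free, which is exactly the "no positive class in the data" situation the exception reports.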
My original file had over 200 training lines, which is similar in size to the Wikipedia training set?
I'll transfer this issue to the ML.NET repo since it is related to the framework, not the samples, ok?
I have now altered my test data to have a 30+% split of positive results, and the training works. Thanks!
@woanware: You may also want to set a weight column, which will preserve the original true/false ratio.
Upsampling your positive class (or downsampling your negative class) changes the ratio of true/false that your trainer sees. This will cause the model to predict true more often than your original dataset. If that is unwanted, you can use a weight column to down-weight your positive class.
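To make the weight idea concrete, here is a small Python sketch (the helper name `weights_for_upsampled` is hypothetical, not an ML.NET API): if each positive row was duplicated k times, giving every positive copy weight 1/k makes the weighted class ratio match the original data again.

```python
# Hypothetical helper: after duplicating each positive `upsample_factor`
# times, down-weight the positives so the weighted ratio is unchanged.
def weights_for_upsampled(labels, upsample_factor):
    # each positive copy gets 1/factor; negatives keep weight 1.0
    return [1.0 / upsample_factor if y == 1 else 1.0 for y in labels]

# original data: 3 positives, 30 negatives
# after 10x upsampling of positives: 30 positives, 30 negatives
upsampled_labels = [1] * 30 + [0] * 30
w = weights_for_upsampled(upsampled_labels, upsample_factor=10)

pos_weight = sum(wi for wi, y in zip(w, upsampled_labels) if y == 1)
neg_weight = sum(wi for wi, y in zip(w, upsampled_labels) if y == 0)

# weighted ratio is 3:30, matching the original class balance
assert abs(pos_weight - 3.0) < 1e-9
assert abs(neg_weight - 30.0) < 1e-9
```

The trainer still sees plenty of positive rows in every fold, but the weights stop it from over-predicting the positive class relative to your real data.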
Also, if you're upsampling, ensure you split your dataset first, then upsample. Otherwise duplicate rows will reappear in the test set, so your metrics will no longer be representative; this is a form of data leakage.
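A quick Python sketch of the correct ordering (the function name and parameters are illustrative, not part of ML.NET): split first, then duplicate positives only inside the training portion, so no duplicated row can leak into the test set.

```python
import random

# Illustrative only: rows are (label, unique_row_id) pairs so we can
# check afterwards that no row ends up on both sides of the split.
def split_then_upsample(rows, test_frac=0.3, factor=5, seed=42):
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, train = shuffled[:n_test], shuffled[n_test:]
    # duplicate each training positive (factor - 1) extra times,
    # AFTER the split, so duplicates stay on the training side
    train = train + [r for r in train if r[0] == 1] * (factor - 1)
    return train, test

# 3 positives, 30 negatives, like the sample data above
rows = [(1, i) for i in range(3)] + [(0, i) for i in range(3, 33)]
train, test = split_then_upsample(rows)

train_ids = {rid for _, rid in train}
test_ids = {rid for _, rid in test}
assert train_ids.isdisjoint(test_ids)  # no leakage into the test set
```

Doing it the other way round (upsample, then split) would scatter copies of the same positive row across both sides, so the model is effectively evaluated on rows it has already seen.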