I have tried creating a simple data set and performing the training like so:
dotnet .\mlnet.dll auto-train --task binary-classification
--dataset "logons.csv" --label-column-index 0
--has-header true --max-exploration-time 10
Here is an example of the data set, which is reduced from my original but shows the format:
Valid Data
0 09:00
0 09:01
0 09:02
0 09:03
0 09:04
0 09:05
0 09:06
0 09:07
1 12:08
0 09:09
0 09:10
0 09:00
0 09:01
0 09:02
0 09:03
0 09:04
0 09:05
0 09:06
0 09:07
1 13:08
0 09:09
0 09:10
0 09:00
0 09:01
0 09:02
0 09:03
0 09:04
0 09:05
0 09:06
0 09:07
1 14:08
0 09:09
0 09:10
Every time I try to run the command I get the following error:
Exception occured while exploring pipelines:
Training failed with the exception:
System.ArgumentOutOfRangeException: AUC is not definied
when there is no positive class in the data
Parameter name: PosSample
I originally tried this via VS2019 and the latest version of ML.NET, but that failed, so I then tried using the binary directly.
This is likely an instance of a cross-validation fold failing: with so few positive samples, not every fold will contain both classes.
This is being fixed in https://github.com/dotnet/machinelearning/pull/3794
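This isn't the ML.NET internals (which are C#), just a toy Python sketch of why the error shows up on a small, imbalanced sample like the one above: with only 3 positives among 33 rows, 10 random folds cannot all contain a positive, and AUC is undefined on any fold without one.

```python
import random

# Mirrors the sample data above: 3 positives, 30 negatives.
labels = [1] * 3 + [0] * 30

rng = random.Random(0)
idx = list(range(len(labels)))
rng.shuffle(idx)

# Round-robin the shuffled indices into 10 folds (a stand-in for
# whatever fold assignment AutoML actually uses).
k = 10
folds = [idx[i::k] for i in range(k)]

folds_without_pos = sum(
    1 for fold in folds if not any(labels[i] == 1 for i in fold)
)

# Pigeonhole: 3 positives can land in at most 3 of the 10 folds,
# so at least 7 folds hold no positive at all.
assert folds_without_pos >= 7
```

However the rows are shuffled, at least 7 of the 10 folds end up positive-free, which is exactly the "no positive class in the data" situation the exception reports.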
My original file had over 200 training lines, which is similar in size to the Wikipedia training set?
I'll transfer this issue to the ML.NET repo since it is related to the framework, not the samples, ok?
I have now altered my test data to have a 30+% split of positive results, and the training works. Thanks!
@woanware: You may also want to set a weight column, which will preserve the original true/false ratio.
Upsampling your positive class (or downsampling your negative class) changes the ratio of true/false that your trainer sees. This will cause the model to predict true more often than your original dataset. If that is unwanted, you can use a weight column to down-weight your positive class.
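To make the weight idea concrete, here is a small Python sketch (the helper name `weights_for_upsampled` is hypothetical, not an ML.NET API): if each positive row was duplicated k times, giving every positive copy weight 1/k makes the weighted class ratio match the original data again.

```python
# Hypothetical helper: after duplicating each positive `upsample_factor`
# times, down-weight the positives so the weighted ratio is unchanged.
def weights_for_upsampled(labels, upsample_factor):
    # each positive copy gets 1/factor; negatives keep weight 1.0
    return [1.0 / upsample_factor if y == 1 else 1.0 for y in labels]

# original data: 3 positives, 30 negatives
# after 10x upsampling of positives: 30 positives, 30 negatives
upsampled_labels = [1] * 30 + [0] * 30
w = weights_for_upsampled(upsampled_labels, upsample_factor=10)

pos_weight = sum(wi for wi, y in zip(w, upsampled_labels) if y == 1)
neg_weight = sum(wi for wi, y in zip(w, upsampled_labels) if y == 0)

# weighted ratio is 3:30, matching the original class balance
assert abs(pos_weight - 3.0) < 1e-9
assert abs(neg_weight - 30.0) < 1e-9
```

The trainer still sees plenty of positive rows in every fold, but the weights stop it from over-predicting the positive class relative to your real data.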
Also, if you're upsampling, ensure you split your dataset first, then upsample. Otherwise duplicate rows will reappear in the test set, so your metrics will no longer be representative; this is a form of data leakage.
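A quick Python sketch of the correct ordering (the function name and parameters are illustrative, not part of ML.NET): split first, then duplicate positives only inside the training portion, so no duplicated row can leak into the test set.

```python
import random

# Illustrative only: rows are (label, unique_row_id) pairs so we can
# check afterwards that no row ends up on both sides of the split.
def split_then_upsample(rows, test_frac=0.3, factor=5, seed=42):
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, train = shuffled[:n_test], shuffled[n_test:]
    # duplicate each training positive (factor - 1) extra times,
    # AFTER the split, so duplicates stay on the training side
    train = train + [r for r in train if r[0] == 1] * (factor - 1)
    return train, test

# 3 positives, 30 negatives, like the sample data above
rows = [(1, i) for i in range(3)] + [(0, i) for i in range(3, 33)]
train, test = split_then_upsample(rows)

train_ids = {rid for _, rid in train}
test_ids = {rid for _, rid in test}
assert train_ids.isdisjoint(test_ids)  # no leakage into the test set
```

Doing it the other way round (upsample, then split) would scatter copies of the same positive row across both sides, so the model is effectively evaluated on rows it has already seen.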