Machinelearning: TrainTestSplit is not working properly when the column name is provided

Created on 20 Dec 2020 · 7 Comments · Source: dotnet/machinelearning

I think there is an issue with the "TrainTestSplit" function.

When I split a dataset of 500 samples with an equal number of samples from each class, the test set comes back with 0 rows. However, when I split it without passing "label", at a 0.1 ratio, I get 49 rows in the test set and 451 rows in the train set.

Is there a way to solve this problem?

I'm using final packages.


tasmektep

All 7 comments

Hi @tasmektep,

Thank you for reporting this issue. Do you have a repro available for this specific bug? In our test cases for TrainTestSplit, we have not encountered such a bug. Thanks.

mstfbl on 22 Dec 2020

I created an example repository, linked below. (I don't know if there is a better way to share a repro, but I think this will be sufficient.)

https://github.com/tasmektep/DotNetMLSplit

tasmektep on 22 Dec 2020

@tasmektep: In your sample, you're using the SamplingKeyColumn with your Label in it.

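Using your Label as your SamplingKeyColumn will cause all rows with the same Label value to be placed together in the same split/fold (as you're seeing). With only a few distinct Label values, every group can land in the train set, which leaves the test set empty.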

Description from docs:

SamplingKeyColumn:
Name of a column to use for grouping rows. If two examples share the same value of the samplingKeyColumnName, they are guaranteed to appear in the same subset (train or test). This can be used to ensure no label leakage from the train to the test set. Note that when performing a Ranking Experiment, the samplingKeyColumnName must be the GroupId column. If null no row grouping will be performed.

https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.dataoperationscatalog.traintestsplit?view=ml-dotnet
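The grouping behaviour described in the docs can be illustrated with a small plain-Python sketch. This is not ML.NET's implementation; it only assumes that a grouped split decides train vs. test from the key value alone, so every row with the same key gets the same answer. The helper names (`key_to_unit_interval`, `grouped_split`) and the toy data with 5 classes are hypothetical.

```python
import hashlib


def key_to_unit_interval(key):
    """Map a key to a deterministic pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def grouped_split(rows, key_of, test_fraction=0.1):
    """Send a row to the test set based only on its key.

    Rows that share a key always end up on the same side.
    """
    train, test = [], []
    for row in rows:
        if key_to_unit_interval(key_of(row)) < test_fraction:
            test.append(row)
        else:
            train.append(row)
    return train, test


# Hypothetical data: 500 rows, 5 classes, 100 rows per class.
rows = [{"id": i, "label": i % 5} for i in range(500)]

# Keyed on the label: only 5 distinct keys, so whole classes move together
# and the test set can easily come out empty.
by_label_train, by_label_test = grouped_split(rows, key_of=lambda r: r["label"])

# Keyed on a per-row value: 500 distinct keys, so roughly 10% reach the test set.
by_row_train, by_row_test = grouped_split(rows, key_of=lambda r: r["id"])

print(len(by_label_train), len(by_label_test))
print(len(by_row_train), len(by_row_test))
```

With the label as the key there are only as many independent train/test decisions as there are classes, which is why a 0.1 test fraction often selects none of them.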

You are likely thinking of the related, but inverse, concept of Stratification, where each label is evenly represented across the splits/folds. Stratification has some downsides that make it less helpful.
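For contrast, here is a minimal plain-Python sketch of a stratified split, where every class contributes the same share of its rows to the test set. It is an illustration of the concept only, not an ML.NET call; `stratified_split` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.1, seed=0):
    """Split each class separately so both sides keep the class proportions."""
    rng = random.Random(seed)

    # Bucket the rows by label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical data: 500 rows, 5 classes, 100 rows per class.
rows = [{"id": i, "label": i % 5} for i in range(500)]
train, test = stratified_split(rows, label_of=lambda r: r["label"])

print(len(train), len(test))  # prints "450 50"
```

Here each class of 100 rows gives exactly 10 rows to the test set, the opposite of the grouped behaviour, where a class lands entirely on one side.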

justinormont on 22 Dec 2020

@tasmektep: Keep posting issues that you run into. And thanks for posting your repro.

Work on the ML.NET side:

- Warning -- Have the splitter warn when a split contains zero rows, with a special warning if SamplingKeyColumn is used. In the same fix, we could warn of unbalanced splits/folds to help https://github.com/dotnet/machinelearning/issues/3711. The downside is that the user would need to attach a logger to see the warning.
- Documentation
  - Param hover -- Ensure the hover description for SamplingKeyColumn in Visual Studio is well worded to explain the concept, and perhaps mention what it does not do.
  - Samples/main docs -- Further explain the concept of SamplingKeyColumn, why it's useful, and also what it does not do.
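The kind of check proposed in the warning item can be sketched on the caller side in plain Python. This is only an illustration of the idea, not the ML.NET change itself; `check_split`, its parameters, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("split_check")


def check_split(train_count, test_count, sampling_key=None):
    """Log and return warning messages for splits that contain zero rows."""
    messages = []
    for name, count in (("train", train_count), ("test", test_count)):
        if count == 0:
            msg = f"The {name} split has zero rows."
            if sampling_key is not None:
                # Extra hint when a grouping column is in use.
                msg += (
                    f" Rows sharing a value of '{sampling_key}' always land in"
                    " the same split; check that it has enough distinct values."
                )
            messages.append(msg)
    for msg in messages:
        logger.warning(msg)
    return messages


# The situation from this issue: 500 rows in train, none in test.
issues = check_split(train_count=500, test_count=0, sampling_key="Label")
```

Returning the messages as well as logging them means the caller can still react to an empty split when no logger is attached.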

justinormont on 22 Dec 2020


Thank you for your response

tasmektep on 23 Dec 2020


Thank you @justinormont for your help and input! I'm closing this issue as @tasmektep's query has been answered.

mstfbl on 23 Dec 2020

"Thank you @justinormont for your help and input! I'm closing this issue as @tasmektep's query has been answered."
-- @mstfbl

"Thank you for your response"
-- @tasmektep

Quite welcome.

I filed a follow-up tracking issue:
Improve SamplingKeyColumn documentation and usability: https://github.com/dotnet/machinelearning/issues/5567

justinormont on 23 Dec 2020
