Machinelearning: TrainTestSplit is not working properly when the column name is provided

Created on 20 Dec 2020 · 7 Comments · Source: dotnet/machinelearning

I think there is an issue with the "TrainTestSplit" function.

When I tried to split data consisting of 500 samples, with an equal number of samples from each class, it returned a test set with 0 rows. However, when I split it without "label" at a 0.1 ratio, it returned 49 rows for the test set and 451 rows for the train set.

Is there a way to solve this problem?

I'm using the latest packages.

loadsave

All 7 comments

Hi @tasmektep,

Thank you for reporting this issue. Do you have a repro available for this specific bug? In our test cases for TrainTestSplit, we have not encountered such a bug. Thanks.

I created an example below. (I don't know whether there is a better way to add a repro, but I think this will be sufficient.)

https://github.com/tasmektep/DotNetMLSplit

@tasmektep: In your sample, you're using the SamplingKeyColumn with your Label in it.

Using your Label as your SamplingKeyColumn will cause all rows with the same Label value to be placed together in the same splits/folds (as you're seeing).

Description from docs:

SamplingKeyColumn:
Name of a column to use for grouping rows. If two examples share the same value of the samplingKeyColumnName, they are guaranteed to appear in the same subset (train or test). This can be used to ensure no label leakage from the train to the test set. Note that when performing a Ranking Experiment, the samplingKeyColumnName must be the GroupId column. If null no row grouping will be performed.

https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.dataoperationscatalog.traintestsplit?view=ml-dotnet
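To make the grouping behavior concrete, here is a minimal sketch in plain Python. It is not ML.NET's implementation and not its API: the function name `grouped_split`, the per-key random draw, and the choice of 5 classes are all assumptions made for illustration. The only property it shares with the documented behavior is that rows with the same key value always land in the same subset.

```python
import random


def grouped_split(rows, key, test_fraction, seed=0):
    """Split rows so that all rows sharing a key value end up on the same side.

    Hypothetical illustration only: each distinct key value is sent to the
    test side with probability test_fraction, and every row with that key
    follows it.
    """
    rng = random.Random(seed)
    side_of_key = {}
    train, test = [], []
    for row in rows:
        k = key(row)
        if k not in side_of_key:
            # The decision is made once per distinct key, not once per row.
            side_of_key[k] = "test" if rng.random() < test_fraction else "train"
        (test if side_of_key[k] == "test" else train).append(row)
    return train, test


# 500 rows, 5 classes with 100 rows each (the class count is an assumption).
rows = [(i, i % 5) for i in range(500)]
train, test = grouped_split(rows, key=lambda r: r[1], test_fraction=0.1)
print(len(train), len(test))
```

With the label as the key there are only 5 independent decisions, so the test set can only hold 0, 100, 200, ... rows, and with a 0.1 test fraction an empty test set is a likely outcome.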

You are likely thinking of the related, but inverse, concept of Stratification, where each label is evenly represented across the splits/folds. Stratification has some downsides that make it less helpful.
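For contrast, stratification could be sketched like this in plain Python. This is not an ML.NET call; `stratified_split` is a hypothetical helper, and the 5-class, 500-row data set is an assumption modeled on the question.

```python
import random
from collections import defaultdict


def stratified_split(rows, label, test_fraction, seed=0):
    """Split each label's rows separately so both sides keep the class ratio.

    Hypothetical illustration only, not part of any library.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label(row)].append(row)

    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        # Take the same fraction from every class.
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# 500 rows, 5 classes with 100 rows each (the class count is an assumption).
rows = [(i, i % 5) for i in range(500)]
train, test = stratified_split(rows, label=lambda r: r[1], test_fraction=0.1)
print(len(train), len(test))
```

Here every class contributes 10 of its 100 rows to the test side, which is the opposite of keeping all rows with the same label together.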

@tasmektep: Keep posting issues that you run into. And thanks for posting your repro.

Work on the ML.NET side:

  • Warning -- Have the splitter warn when zero rows are present in a split, with a special warning if SamplingKeyColumn is used. In the same fix, we could warn of unbalanced splits/folds to help https://github.com/dotnet/machinelearning/issues/3711. The downside is that the user would need to attach a logger to see the warning.
  • Documentation

    • Param hover -- Ensure the hover description for SamplingKeyColumn in Visual Studio is well worded to explain the concept, and perhaps mention what it does not do.

    • Samples/main docs -- Further explain the concept of SamplingKeyColumn, why it's useful, and also what it does not do.
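The warning idea above might look roughly like the following check, sketched in plain Python rather than in ML.NET itself. Everything here is hypothetical: the function name `split_warnings`, the message wording, and the `tolerance` threshold for what counts as unbalanced are assumptions, not anything ML.NET ships.

```python
import logging

logger = logging.getLogger("split_check")


def split_warnings(train_rows, test_rows, test_fraction,
                   sampling_key_column=None, tolerance=0.5):
    """Return (and log) warnings about empty or unbalanced splits.

    Hypothetical sketch: tolerance is the allowed relative deviation of the
    actual test fraction from the requested one.
    """
    messages = []
    for name, count in (("train", train_rows), ("test", test_rows)):
        if count == 0:
            msg = f"The {name} split contains zero rows."
            if sampling_key_column is not None:
                # Special case: grouping by a key with few distinct values.
                msg += (f" Rows sharing a value of '{sampling_key_column}' are"
                        " kept together, so a key column with few distinct"
                        " values can leave a split empty.")
            messages.append(msg)

    total = train_rows + test_rows
    if train_rows > 0 and test_rows > 0:
        actual = test_rows / total
        if abs(actual - test_fraction) > tolerance * test_fraction:
            messages.append(
                f"Unbalanced split: requested test fraction {test_fraction},"
                f" got {actual:.3f}.")

    for msg in messages:
        logger.warning(msg)
    return messages


# The two cases from the report: 500/0 with a key column, 451/49 without.
print(split_warnings(500, 0, 0.1, sampling_key_column="Label"))
print(split_warnings(451, 49, 0.1))
```

As noted in the list, the caller still has to attach a logger (or inspect the returned messages in this sketch) to see anything.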

Thank you for your response

Thank you @justinormont for your help and input! I'm closing this issue as @tasmektep's query has been answered.

Thank you @justinormont for your help and input! I'm closing this issue as @tasmektep's query has been answered.

-- @mstfbl

Thank you for your response

-- @tasmektep

Quite welcome.

I filed a follow-up tracking issue:
    Improve SamplingKeyColumn documentation and usability https://github.com/dotnet/machinelearning/issues/5567
