I think there is an issue with the "TrainTestSplit" function.
When I split a dataset of 500 samples with an equal number of samples from each class, the test set comes back with 0 rows. However, when I split it without "label" at a 0.1 ratio, I get 49 rows in the test set and 451 rows in the train set.
Is there a way to solve this problem?
I'm using final packages.
Hi @tasmektep,
Thank you for reporting this issue. Do you have a repro available for this specific bug? In our test cases for TrainTestSplit, we have not encountered such a bug. Thanks.
I created an example file below. (I don't know whether there is a better way to add a repro, but I think this will be sufficient.)
@tasmektep: In your sample, you're using the SamplingKeyColumn with your Label in it.
Using your Label as your SamplingKeyColumn will cause all rows with the same Label value to be placed together in the same splits/folds (as you're seeing).
Description from docs:
SamplingKeyColumn:
Name of a column to use for grouping rows. If two examples share the same value of the samplingKeyColumnName, they are guaranteed to appear in the same subset (train or test). This can be used to ensure no label leakage from the train to the test set. Note that when performing a Ranking Experiment, the samplingKeyColumnName must be the GroupId column. If null no row grouping will be performed.
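To make the grouping behavior concrete, here is a minimal sketch in plain Python. It is not ML.NET's implementation: the helper names (`key_to_unit_interval`, `grouped_split`) and the hash-based assignment are assumptions chosen only to illustrate the documented guarantee that rows sharing a key land in the same subset.

```python
import hashlib


def key_to_unit_interval(key):
    """Map a key deterministically to a number in [0, 1)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def grouped_split(rows, key_of, test_fraction=0.1):
    """Split rows so that all rows sharing a key end up in the same subset."""
    train, test = [], []
    for row in rows:
        # The decision depends only on the key, never on the individual row.
        in_test = key_to_unit_interval(key_of(row)) < test_fraction
        (test if in_test else train).append(row)
    return train, test


# Hypothetical data: 500 rows, two balanced classes.
rows = [{"id": i, "label": i % 2} for i in range(500)]

# Grouping by label: only two distinct keys, so each class moves as one block.
by_label_train, by_label_test = grouped_split(rows, key_of=lambda r: r["label"])

# Grouping by a per-row key: every row is assigned independently.
by_id_train, by_id_test = grouped_split(rows, key_of=lambda r: r["id"])

print(len(by_label_train), len(by_label_test))
print(len(by_id_train), len(by_id_test))
```

With only two distinct label values, the test set in this sketch can only contain 0, 250, or 500 rows, which is the same kind of all-or-nothing outcome as the empty test set reported above. With a per-row key, the test set ends up near the requested fraction instead.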
You are likely thinking of the related, but inverse, concept of stratification, where the rows are evenly represented between the splits/folds. Stratification has some downsides, causing it to be less helpful.
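For contrast, stratification can be sketched as follows. This is plain Python under stated assumptions, not an ML.NET API: `stratified_split` is a hypothetical helper that shuffles each class separately and takes the same fraction from each one.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_of, test_fraction=0.1, seed=0):
    """Take roughly the same fraction of every class for the test set."""
    rng = random.Random(seed)

    # Bucket the rows by their label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]  # copy so the caller's data is not reordered
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Hypothetical data: 500 rows, two balanced classes.
rows = [{"id": i, "label": i % 2} for i in range(500)]

strat_train, strat_test = stratified_split(rows, label_of=lambda r: r["label"])
test_counts = Counter(r["label"] for r in strat_test)
print(len(strat_train), len(strat_test), dict(test_counts))
```

On the balanced 500-row example, each class contributes 25 rows to the test set, giving 450 train rows and 50 test rows. This is the opposite of what a sampling key does: the label is spread across both subsets rather than kept together.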
@tasmektep: Keep posting issues that you run into. And thanks for posting your repro.
Work on ML.NET side:
Thank you @justinormont for your help and input! I'm closing this issue as @tasmektep's query has been answered.
-- @mstfbl
Thank you for your response
-- @tasmektep
Quite welcome.
I filed a follow up tracking issue:
Improve SamplingKeyColumn documentation and usability https://github.com/dotnet/machinelearning/issues/5567