Command: mlnet auto-train --task multiclass-classification --dataset "SampleTrainDataset.txt" --label-column-name "label" --has-header true --max-exploration-time 60 -V diag
When I call mlContext.Auto().InferColumns(SampleTrainDatasetPath, "label", separatorChar: '\t'), the first column ("dataValue") goes into the IgnoredColumnNames collection. I want to know why.
Logs: debug_log.txt
Generally a column is ignored if it looks like an ID. Code for purpose detection: https://github.com/dotnet/machinelearning/blob/d518b587b06ac3896a48646622b0f2169a230855/src/Microsoft.ML.AutoML/ColumnInference/PurposeInference.cs#L158
The column statistics are computed on the first 10k rows only. In that sample, all values of your column are unique, short, and contain no spaces, so the column is assumed to be an ID and gets ignored.
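To make the reasoning concrete, here is a rough Python sketch of that kind of check. It is not the ML.NET code (see the PurposeInference.cs link for that); the function name `looks_like_id` and the `max_mean_length=20` cutoff are made up for illustration, and the real thresholds may differ. It only shows why a head-of-file sample can misclassify a text column.

```python
from itertools import islice


def looks_like_id(values, sample_size=10_000, max_mean_length=20):
    """Illustrative heuristic: True if the first `sample_size` values are
    all unique, short on average, and contain no spaces.

    `max_mean_length` is a hypothetical cutoff, not the ML.NET threshold.
    """
    sample = list(islice(values, sample_size))  # only the head of the column
    if not sample:
        return False
    all_unique = len(set(sample)) == len(sample)
    mean_length = sum(len(v) for v in sample) / len(sample)
    has_spaces = any(" " in v for v in sample)
    return all_unique and mean_length <= max_mean_length and not has_spaces


# The head of the column holds short single tokens; multi-word text comes later.
column = [f"term{i}" for i in range(10_000)] + ["a longer phrase with spaces"] * 5_000

head_only = looks_like_id(column)                 # sample never sees the phrases
phrases_first = looks_like_id(reversed(column))   # sample starts with the phrases
print(head_only, phrases_first)
```

With the multi-word rows at the end, the sample never sees them and the column looks like an ID; with them at the front, it does not.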
For your case:
You can re-order your file so more of the terms with spaces are near the top. Try running AutoML on a reordered version of your dataset: SampleTrainDataset.REORDERED.txt.
Shuffling your rows should also work, but in the random ordering I received, it did not lead to the column being predicted as a text feature. This implies our thresholds should be adjusted.
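If you want to try the shuffle yourself, a minimal Python sketch is below. The helper name `shuffle_rows` is hypothetical, and it assumes the file fits in memory and that the first line is the header (as with --has-header true). As noted above, a given random ordering is not guaranteed to change the inferred purpose.

```python
import random


def shuffle_rows(lines, has_header=True, seed=None):
    """Return the lines in random order, keeping the header line first."""
    rows = list(lines)
    header, body = (rows[:1], rows[1:]) if has_header else ([], rows)
    random.Random(seed).shuffle(body)  # shuffle only the data rows
    return header + body


example = ["dataValue\tlabel\n", "a\t1\n", "b\t2\n", "c c\t3\n", "d d\t4\n"]
shuffled = shuffle_rows(example, seed=0)
print("".join(shuffled))
```

For a real file, read it with `open(path, encoding="utf-8").readlines()`, pass the result to `shuffle_rows`, and write the returned lines to a new file with `writelines`.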
To improve AutoML:
We should compute the column statistics on a random subsample of the dataset (likely a reservoir sample) rather than on the first rows. We may also need to adjust the thresholds.
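For reference, a reservoir sample can be drawn in a single pass without knowing the row count up front. The sketch below is the textbook "Algorithm R" in Python, not a proposed AutoML implementation; the function name `reservoir_sample` and its parameters are illustrative.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Draw a uniform random sample of up to k rows from an iterable in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep row i with probability k / (i + 1), replacing a random slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


sample = reservoir_sample((f"row{i}" for i in range(100_000)), k=10, seed=42)
print(sample)
```

Every row ends up in the sample with equal probability, so rows with spaces that only appear late in the file have the same chance of being seen as rows at the top.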
--
What is your dataset looking to predict?
In case you're looking for more ideas on text stats to produce in your dataset, here are some text statistics I often use:
{ length, vowelCount, consonantCount, numberCount, underscoreCount, letterCount, startsWithVowel, endsInVowel, endsInVowelNumber, maxRepeatingChar, maxRepeatingVowel, lowerCaseCount, upperCaseCount, upperCasePercent, letterPercent, numberPercent, longestRepeatingChar, longestRepeatingVowel } (gist)
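A few of these are easy to compute per value. The Python sketch below covers a subset of the list; the exact definitions are my assumptions (the linked gist may define them differently), for example percentages are taken as a fraction of all characters and only a/e/i/o/u count as vowels.

```python
from itertools import groupby

VOWELS = set("aeiouAEIOU")


def text_stats(s):
    """Compute a subset of the text statistics above for one string.

    Definitions are assumptions: percents are fractions of the total length,
    and any letter that is not a/e/i/o/u counts as a consonant.
    """
    n = len(s)
    letters = [c for c in s if c.isalpha()]
    vowel_count = sum(1 for c in letters if c in VOWELS)
    number_count = sum(1 for c in s if c.isdigit())
    upper_count = sum(1 for c in letters if c.isupper())
    lower_count = sum(1 for c in letters if c.islower())
    # Length of the longest run of one repeated character, e.g. "ll" -> 2.
    longest_run = max((len(list(group)) for _, group in groupby(s)), default=0)
    return {
        "length": n,
        "vowelCount": vowel_count,
        "consonantCount": len(letters) - vowel_count,
        "numberCount": number_count,
        "underscoreCount": s.count("_"),
        "letterCount": len(letters),
        "startsWithVowel": bool(s) and s[0] in VOWELS,
        "endsInVowel": bool(s) and s[-1] in VOWELS,
        "lowerCaseCount": lower_count,
        "upperCaseCount": upper_count,
        "upperCasePercent": upper_count / n if n else 0.0,
        "letterPercent": len(letters) / n if n else 0.0,
        "numberPercent": number_count / n if n else 0.0,
        "longestRepeatingChar": longest_run,
    }


stats = text_stats("Hello_World42")
print(stats)
```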
Yes, you are right! Thanks for your help! I tried shuffling my dataset and it works now!
And thanks for your text statistics!
@daholste: We need reservoir sampling for dataset statistics. https://github.com/dotnet/machinelearning/issues/3778