Machinelearning: [AutoML/CLI] Error running multiclass training on a dataset

Created on 9 Jul 2019 · 3 Comments · Source: dotnet/machinelearning

System information

OS version/distro: Windows 10
.NET Version (eg., dotnet --info): 3.0.0-preview6-27804-01

Issue

What did you do?
I ran the following command on my dataset (SampleTrainDataset.txt):
mlnet auto-train --task multiclass-classification --dataset "SampleTrainDataset.txt" --label-column-name "label" --has-header true --max-exploration-time 60 -V diag
What happened?

The ColumnConcatenating set only contains 4 columns. It ignores the first "dataValue" column for no reason.
What did you expect?
I tried debugging it with the source code. I found that after the call to mlContext.Auto().InferColumns(SampleTrainDatasetPath, "label", separatorChar: '\t'), the first column ("dataValue") ends up in the IgnoredColumnNames collection. I would like to know why.

Source code / logs

debug_log.txt

AutoML.NET command-line question

Source

darren-zdc

Most helpful comment

Generally a column is ignored if it looks like an ID. Code for purpose detection: https://github.com/dotnet/machinelearning/blob/d518b587b06ac3896a48646622b0f2169a230855/src/Microsoft.ML.AutoML/ColumnInference/PurposeInference.cs#L158

The column statistics are computed on the first 10k rows, and in that sample all values are unique, short, and have no spaces. The assumption is therefore made that your column is an ID, which then gets ignored.
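
To make the reasoning above concrete, here is a minimal Python sketch of an "looks like an ID" check. It is not the actual PurposeInference.cs logic: the function name looks_like_id and the max_len threshold are assumptions for illustration, and only the three criteria named above (unique, short, no spaces) and the 10k-row head sample come from this thread.

```python
def looks_like_id(values, sample_size=10_000, max_len=20):
    """Hypothetical heuristic, not the real AutoML code.

    Flags a column as ID-like when every value in the head sample is
    unique, short, and contains no spaces. max_len is an assumed threshold.
    """
    sample = values[:sample_size]  # only the first rows are inspected
    if not sample:
        return False
    all_unique = len(set(sample)) == len(sample)
    all_short = all(len(v) <= max_len for v in sample)
    no_spaces = all(" " not in v for v in sample)
    return all_unique and all_short and no_spaces


# Values with spaces that appear only after the sampled head are never
# seen, so the column is still flagged as an ID.
column = ["abc", "def", "ghi", "some longer phrase"]
print(looks_like_id(column, sample_size=3))  # head only: flagged
print(looks_like_id(column, sample_size=4))  # phrase seen: not flagged
```

Under these assumptions, the outcome depends entirely on which rows happen to come first in the file.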

For your case:
You can re-order your file so more of the terms with spaces are near the top. Try running AutoML on a reordered version of your dataset: SampleTrainDataset.REORDERED.txt.

Shuffling your rows should also work, but in the random ordering I received, it did not lead to the column being predicted as a text feature. This implies our thresholds should be adjusted.
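
Shuffling the rows while keeping the header line in place (the command in the issue uses --has-header true) could be sketched as below. The helper name shuffle_rows and the output file name in the comment are hypothetical, and as noted above, a given random ordering is not guaranteed to change the inferred column purpose.

```python
import random


def shuffle_rows(lines, seed=0):
    """Return the lines with the header kept first and the data rows shuffled."""
    if not lines:
        return []
    header, rows = lines[0], list(lines[1:])
    random.Random(seed).shuffle(rows)  # seeded so the result is reproducible
    return [header] + rows


# Example usage (file names are placeholders):
# with open("SampleTrainDataset.txt", encoding="utf-8") as f:
#     shuffled = shuffle_rows(f.readlines())
# with open("SampleTrainDataset.SHUFFLED.txt", "w", encoding="utf-8") as f:
#     f.writelines(shuffled)
```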

To improve AutoML:
We should move the column statistics to be calculated on a random subsample (likely a reservoir sample) of the dataset. We may also need to adjust the thresholds.
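
For reference, a reservoir sample can be drawn in a single pass without knowing the dataset size up front. The sketch below uses the classic Algorithm R in Python. It illustrates the technique only and is not a proposed ML.NET implementation; the function name and parameters are assumptions.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Uniformly sample up to k items from an iterable in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep row i with probability k / (i + 1), replacing a random slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir
```

Column statistics computed on such a sample would no longer depend on how the file happens to be ordered.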

What is your dataset looking to predict?

In case you're looking for more ideas on text stats to produce in your dataset, here are some text statistics I often use:

{ length, vowelCount, consonantCount, numberCount, underscoreCount, letterCount, startsWithVowel, endsInVowel, endsInVowelNumber, maxRepeatingChar, maxRepeatingVowel, lowerCaseCount, upperCaseCount, upperCasePercent, letterPercent, numberPercent, longestRepeatingChar, longestRepeatingVowel } (gist)
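
A few of these statistics could be computed along the following lines. This is a Python sketch under assumed definitions, not the code from the linked gist: vowels are taken to be a, e, i, o, u, the "Percent" features are returned as fractions between 0 and 1, and longestRepeatingChar is taken to be the length of the longest run of one repeated character.

```python
from itertools import groupby

VOWELS = set("aeiouAEIOU")


def text_stats(s):
    """Compute a subset of the listed text statistics for one string."""
    n = len(s)
    letters = [c for c in s if c.isalpha()]
    vowels = sum(c in VOWELS for c in letters)
    digits = sum(c.isdigit() for c in s)
    upper = sum(c.isupper() for c in s)
    # Length of the longest run of a single repeated character.
    longest_run = max((len(list(g)) for _, g in groupby(s)), default=0)
    return {
        "length": n,
        "vowelCount": vowels,
        "consonantCount": len(letters) - vowels,
        "numberCount": digits,
        "underscoreCount": s.count("_"),
        "letterCount": len(letters),
        "startsWithVowel": n > 0 and s[0] in VOWELS,
        "endsInVowel": n > 0 and s[-1] in VOWELS,
        "lowerCaseCount": sum(c.islower() for c in s),
        "upperCaseCount": upper,
        "upperCasePercent": upper / n if n else 0.0,  # fraction, not 0-100
        "letterPercent": len(letters) / n if n else 0.0,
        "numberPercent": digits / n if n else 0.0,
        "longestRepeatingChar": longest_run,
    }
```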

justinormont on 9 Jul 2019

❤2

All 3 comments

Yes, you are right!! Thanks for your help!! I tried shuffling my dataset and it works now!

And thanks for your text statistics!

darren-zdc on 10 Jul 2019

❤1

@daholste: We need reservoir sampling for dataset statistics. https://github.com/dotnet/machinelearning/issues/3778

justinormont on 12 Jul 2019
