Machinelearning: [AutoML/CLI] Error in running a multiclass training for a dataset

Created on 9 Jul 2019 · 3 Comments · Source: dotnet/machinelearning

System information

  • OS version/distro: Windows 10
  • .NET Version (eg., dotnet --info): 3.0.0-preview6-27804-01

Issue

  • What did you do?
    I ran the command mlnet auto-train --task multiclass-classification --dataset "SampleTrainDataset.txt" --label-column-name "label" --has-header true --max-exploration-time 60 -V diag on this dataset.
  • What happened?
    The ColumnConcatenating set contains only 4 columns. The first column, "dataValue", is ignored for no apparent reason.
  • What did you expect?
    I tried to debug it by stepping through the source code. I found that after the call to mlContext.Auto().InferColumns(SampleTrainDatasetPath, "label", separatorChar: '\t'), the first column ("dataValue") ended up in the IgnoredColumnNames collection. I would like to know why.

Source code / logs

debug_log.txt

AutoML.NET command-line question

All 3 comments

Generally a column is ignored if it looks like an ID. Code for purpose detection: https://github.com/dotnet/machinelearning/blob/d518b587b06ac3896a48646622b0f2169a230855/src/Microsoft.ML.AutoML/ColumnInference/PurposeInference.cs#L158

The column statistics are computed on the first 10k rows, and in that sample all values are unique, short, and have no spaces. The assumption is therefore made that your column is an ID, which then gets ignored.
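To illustrate why row order matters here, below is a minimal Python sketch of an "ID-like column" check computed on only the head of the data. It is not the ML.NET implementation (see the PurposeInference.cs link above for that). The function name looks_like_id and all three thresholds are assumptions chosen for illustration.

```python
from itertools import islice
from typing import Iterable


def looks_like_id(
    values: Iterable[str],
    sample_size: int = 10_000,       # only the first rows are inspected
    min_unique_ratio: float = 0.9,   # hypothetical threshold
    max_avg_length: float = 30.0,    # hypothetical threshold
    max_avg_spaces: float = 1.0,     # hypothetical threshold
) -> bool:
    """Return True if the head of the column is unique, short and space-free."""
    sample = list(islice(values, sample_size))
    if not sample:
        return False
    unique_ratio = len(set(sample)) / len(sample)
    avg_length = sum(len(v) for v in sample) / len(sample)
    avg_spaces = sum(v.count(" ") for v in sample) / len(sample)
    return (
        unique_ratio >= min_unique_ratio
        and avg_length <= max_avg_length
        and avg_spaces < max_avg_spaces
    )


# Short single-token values first, multi-word text later in the file.
ids = [f"item{i}" for i in range(100)]
texts = [f"this is sentence number {i}" for i in range(100)]

head_only = looks_like_id(ids + texts, sample_size=100)   # sees only the tokens
whole_file = looks_like_id(ids + texts, sample_size=200)  # sees the text rows too
```

With these made-up thresholds, head_only comes out True and whole_file comes out False: the same column is classified differently depending on which rows land in the inspected sample.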

For your case:
You can re-order your file so more of the terms with spaces are near the top. Try running AutoML on a reordered version of your dataset: SampleTrainDataset.REORDERED.txt.

Shuffling your rows should also work, but in the random ordering I received, it did not lead to the column being predicted as a text feature. This implies our thresholds should be adjusted.
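If you want to try the shuffle yourself, here is a small Python sketch that shuffles the data rows while keeping the header line in place (the command above uses --has-header true). The function names and the output file name are assumptions, and as noted, whether a given random ordering changes the inferred purpose depends on which rows end up near the top.

```python
import random
from typing import List, Optional


def shuffle_rows(
    lines: List[str], has_header: bool = True, seed: Optional[int] = None
) -> List[str]:
    """Return the lines with data rows shuffled; the header (if any) stays first."""
    header = lines[:1] if has_header else []
    rows = list(lines[1:] if has_header else lines)
    random.Random(seed).shuffle(rows)
    return header + rows


def shuffle_file(src: str, dst: str, has_header: bool = True,
                 seed: Optional[int] = None) -> None:
    """Read src, shuffle its rows, and write the result to dst."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    # Make sure the last row ends with a newline so shuffled rows do not merge.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    with open(dst, "w", encoding="utf-8") as f:
        f.writelines(shuffle_rows(lines, has_header, seed))


# Example call (hypothetical output file name):
# shuffle_file("SampleTrainDataset.txt", "SampleTrainDataset.SHUFFLED.txt", seed=42)
```

Trying a few different seeds is a cheap way to see how sensitive the column inference is to row order.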

To improve AutoML:
We should compute the column statistics on a random subsample (likely a reservoir sample) of the dataset instead of the first rows. We may also need to adjust the thresholds.
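For reference, reservoir sampling picks a uniform random sample of k rows in a single pass without knowing the total row count up front. A minimal Python sketch of the classic variant (often called Algorithm R) follows. This is only an illustration of the technique, not a proposed AutoML implementation, and the function name and parameters are assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from stream in one pass."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Sample 10 row indices from a "file" of 100,000 rows without loading it all.
sample = reservoir_sample(range(100_000), k=10, seed=0)
```

Statistics computed on such a sample reflect the whole file, so a block of ID-like values at the top would no longer dominate.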

--

What is your dataset looking to predict?

In case you're looking for more ideas on text stats to produce for your dataset, here are some text statistics I often use:

{ length, vowelCount, consonantCount, numberCount, underscoreCount, letterCount, startsWithVowel, endsInVowel, endsInVowelNumber, maxRepeatingChar, maxRepeatingVowel, lowerCaseCount, upperCaseCount, upperCasePercent, letterPercent, numberPercent, longestRepeatingChar, longestRepeatingVowel } (gist)
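A rough Python sketch of how several of these could be computed is below. It is not the linked gist. It covers only the statistics whose meaning is clear from the name, and the exact definitions are assumptions: "Percent" values are fractions of the string length in the range 0 to 1, "number" means a digit character, and longestRepeatingChar is the length of the longest run of one repeated character.

```python
from itertools import groupby
from typing import Dict, Union

VOWELS = set("aeiouAEIOU")


def text_stats(s: str) -> Dict[str, Union[int, float, bool]]:
    """Compute simple per-string character statistics."""
    n = len(s)
    letter_count = sum(c.isalpha() for c in s)
    vowel_count = sum(c in VOWELS for c in s)
    number_count = sum(c.isdigit() for c in s)
    upper_count = sum(c.isupper() for c in s)

    def frac(count: int) -> float:
        # Fraction of the whole string; 0.0 for the empty string.
        return count / n if n else 0.0

    return {
        "length": n,
        "vowelCount": vowel_count,
        "consonantCount": letter_count - vowel_count,
        "numberCount": number_count,
        "underscoreCount": s.count("_"),
        "letterCount": letter_count,
        "startsWithVowel": bool(s) and s[0] in VOWELS,
        "endsInVowel": bool(s) and s[-1] in VOWELS,
        "lowerCaseCount": sum(c.islower() for c in s),
        "upperCaseCount": upper_count,
        "upperCasePercent": frac(upper_count),
        "letterPercent": frac(letter_count),
        "numberPercent": frac(number_count),
        # Length of the longest run of one repeated character.
        "longestRepeatingChar": max(
            (len(list(group)) for _, group in groupby(s)), default=0
        ),
    }


stats = text_stats("Hello_World42")
```

Each value can become its own numeric column next to the raw text column.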

Yes, you are right! Thanks for your help! I tried shuffling my dataset and it works now!

And thanks for your text statistics!

@daholste: We have need of reservoir sampling for dataset statistics. https://github.com/dotnet/machinelearning/issues/3778
