Machinelearning: Output vector size of FeaturizeText does not respect MaximumNgramsCount

Created on 9 Jun 2020 · 2 Comments · Source: dotnet/machinelearning

System information

.NET Core SDK (reflecting any global.json):
 Version:   3.1.201
 Commit:    b1768b4ae7

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.18362
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\3.1.201\

Host (useful for support):
  Version: 3.1.3
  Commit:  4a9f85e9f8

.NET Core SDKs installed:
  3.0.100 [C:\Program Files\dotnet\sdk]
  3.1.201 [C:\Program Files\dotnet\sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.17 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.17 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.17 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

Issue

  • I build a pipeline using FeaturizeText with MaximumNgramsCount = 50

let featureEstimator = 
    let wordBagOptions = WordBagEstimator.Options(MaximumNgramsCount = [|50|]) 
    let textFeaturizeOptions =  
        TextFeaturizingEstimator.Options(
            OutputTokensColumnName = "OutputTokens",
            CaseMode = Microsoft.ML.Transforms.Text.TextNormalizingEstimator.CaseMode.Lower,
            KeepNumbers = true,
            KeepPunctuations = false,
            WordFeatureExtractor = wordBagOptions)
    EstimatorChain()
        .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName = "Features", options = textFeaturizeOptions, inputColumnNames =  [|"Text"|]))
  • The Features vector has a size of 1266 (Features: Vector<Single, 1266>)
  • I expect the Features vector to be of size 50 (MaximumNgramsCount)
    BowVectorSizeRepro.zip
F# P1 bug

All 2 comments

Hi @IvanAntipov , I have reproduced your issue, and am working on a fix. Thanks!

Hi @IvanAntipov,

In your FeaturizeText call, you set WordFeatureExtractor correctly, but you do not set CharFeatureExtractor to null. The default CharFeatureExtractor is therefore still used, and this is what produces the unexpected Features vector size. I have confirmed locally that setting CharFeatureExtractor = null gives the expected Vector<Single, 50>. (Thank you @ganik for your help on this.) :)

According to the FeaturizeText documentation, WordFeatureExtractor and CharFeatureExtractor are instantiated by default with the following values:

WordFeatureExtractor: NgramLength = 1
CharFeatureExtractor: NgramLength = 3, UseAllLengths = false

My local reproduction of your code in C#, with CharFeatureExtractor=null:

using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;

internal static class Program
{
    internal static void Main()
    {
        MLContext mlContext = new MLContext(1);

        // Build 200 rows of 100 random "words" each, so there are many more than 50 distinct tokens.
        List<Input> randomLines = new List<Input>();
        var rnd = new Random();
        for (int i = 0; i < 200; i++)
        {
            string str = "";
            for (int j = 0; j < 100; j++)
            {
                str += "word" + rnd.Next(7777777) + "word ";
            }
            randomLines.Add(new Input(str));
        }
        var dataReader = mlContext.Data.LoadFromEnumerable<Input>(randomLines);
        var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", new TextFeaturizingEstimator.Options()
        {
            OutputTokensColumnName = "OutputTokens",
            CaseMode = TextNormalizingEstimator.CaseMode.Lower,
            KeepNumbers = true,
            KeepPunctuations = false,
            WordFeatureExtractor = new WordBagEstimator.Options()
            {
                MaximumNgramsCount = new int[] { 50 }
            },
            // Disable the default character n-gram extractor so only word features remain.
            CharFeatureExtractor = null
        }, "Text");
        var model = pipeline.Fit(dataReader);
        var outSchema = model.GetOutputSchema(dataReader.Schema);
        // Print the type of the Features column to check the vector size.
        Console.WriteLine(outSchema["Features"].Type);
    }
}

public class Input
{
    [ColumnName("Text")]
    public string Text { get; set; }

    public Input(string t)
    {
        Text = t;
    }
}
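
Since the original report was written in F#, the same change can be sketched there too. This is a minimal, untested sketch rather than the code from BowVectorSizeRepro.zip: it assumes an F# script run with `dotnet fsi` (which needs a newer SDK than the 3.1.201 listed above for `#r "nuget: ..."`), and the `Input` record and the random test data are hypothetical stand-ins modelled on the C# reproduction. The only substantive difference from the reported pipeline is `CharFeatureExtractor = null`.

```fsharp
#r "nuget: Microsoft.ML"

open System
open Microsoft.ML
open Microsoft.ML.Transforms.Text

// Hypothetical input row type (not taken from the original repro).
[<CLIMutable>]
type Input = { Text : string }

let mlContext = MLContext()

// Random "words", mirroring the C# reproduction, so there are many distinct tokens.
let rnd = Random()
let rows =
    [ for _ in 1 .. 200 ->
        { Text = String.concat " " [ for _ in 1 .. 100 -> sprintf "word%dword" (rnd.Next(7777777)) ] } ]
let data = mlContext.Data.LoadFromEnumerable(rows)

let wordBagOptions = WordBagEstimator.Options(MaximumNgramsCount = [| 50 |])

let textFeaturizeOptions =
    TextFeaturizingEstimator.Options(
        OutputTokensColumnName = "OutputTokens",
        CaseMode = TextNormalizingEstimator.CaseMode.Lower,
        KeepNumbers = true,
        KeepPunctuations = false,
        WordFeatureExtractor = wordBagOptions,
        // The fix described above: switch off the default character n-gram extractor.
        CharFeatureExtractor = null)

let estimator =
    mlContext.Transforms.Text.FeaturizeText(
        outputColumnName = "Features",
        options = textFeaturizeOptions,
        inputColumnNames = [| "Text" |])

let model = estimator.Fit(data)
let schema = model.GetOutputSchema(data.Schema)

// Print the type of the Features column to inspect the vector size.
printfn "%O" (schema.["Features"].Type)
```

If character n-grams are actually wanted, an alternative would presumably be to pass a second `WordBagEstimator.Options` with its own `MaximumNgramsCount` as `CharFeatureExtractor` instead of null; that variant was not verified in this thread.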

I'm closing this issue as I have identified the bug in your code. Please feel free to reopen if you have further issues. Thanks!
