Machinelearning: Output vector size of FeaturizeText does not respect MaximumNgramsCount

Created on 9 Jun 2020 · 2 Comments · Source: dotnet/machinelearning

System information

.NET Core SDK (reflecting any global.json):
 Version:   3.1.201
 Commit:    b1768b4ae7

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.18362
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\3.1.201\

Host (useful for support):
  Version: 3.1.3
  Commit:  4a9f85e9f8

.NET Core SDKs installed:
  3.0.100 [C:\Program Files\dotnet\sdk]
  3.1.201 [C:\Program Files\dotnet\sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.17 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.17 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.17 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

Issue

I built a pipeline using FeaturizeText with MaximumNgramsCount = 50:


let featureEstimator = 
    let wordBagOptions = WordBagEstimator.Options(MaximumNgramsCount = [|50|]) 
    let textFeaturizeOptions =  
        TextFeaturizingEstimator.Options(
            OutputTokensColumnName = "OutputTokens",
            CaseMode = Microsoft.ML.Transforms.Text.TextNormalizingEstimator.CaseMode.Lower,
            KeepNumbers = true,
            KeepPunctuations = false,
            WordFeatureExtractor = wordBagOptions)
    EstimatorChain()
        .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName = "Features", options = textFeaturizeOptions, inputColumnNames =  [|"Text"|]))

The resulting Features vector has a size of 1266 (Features: Vector<Single, 1266>).
I expected the Features vector to have a size of 50 (the value of MaximumNgramsCount).
BowVectorSizeRepro.zip

F# P1 bug

IvanAntipov

All 2 comments

Hi @IvanAntipov, I have reproduced your issue and am working on a fix. Thanks!

mstfbl on 10 Jun 2020

Hi @IvanAntipov,

In your FeaturizeText call you set WordFeatureExtractor correctly, but you do not set CharFeatureExtractor to null. The default CharFeatureExtractor is therefore still used, and its character n-gram features are added to the output alongside the word features, which is why the Features vector is larger than 50. I have confirmed locally that setting CharFeatureExtractor = null gives the expected Vector<Single, 50> (thank you @ganik for your help on this). :)

According to the FeaturizeText documentation, WordFeatureExtractor and CharFeatureExtractor are both instantiated by default, with the following values:

WordFeatureExtractor: NgramLength = 1
CharFeatureExtractor: NgramLength = 3, UseAllLengths = false
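
For reference, the same fix can be applied to the original F# snippet. The following is a minimal sketch, not a verified reproduction: it assumes the Microsoft.ML package is referenced, creates its own mlContext, and reuses the column names from the report. The only functional change is the explicit CharFeatureExtractor = null.

```fsharp
open Microsoft.ML
open Microsoft.ML.Transforms.Text

// Assumption: a fresh context; the original report does not show how mlContext was created.
let mlContext = MLContext()

// Word n-gram extractor capped at 50 n-grams, as in the original report.
let wordBagOptions = WordBagEstimator.Options(MaximumNgramsCount = [| 50 |])

let textFeaturizeOptions =
    TextFeaturizingEstimator.Options(
        OutputTokensColumnName = "OutputTokens",
        CaseMode = TextNormalizingEstimator.CaseMode.Lower,
        KeepNumbers = true,
        KeepPunctuations = false,
        WordFeatureExtractor = wordBagOptions,
        // Turn off the default character n-gram extractor so that only
        // word features contribute to the output vector.
        CharFeatureExtractor = null)

let featureEstimator =
    mlContext.Transforms.Text.FeaturizeText(
        outputColumnName = "Features",
        options = textFeaturizeOptions,
        inputColumnNames = [| "Text" |])
```

With this configuration, the output should match the Vector<Single, 50> reported for the C# reproduction, provided the input contains at least 50 distinct words.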

My local reproduction of your code in C#, with CharFeatureExtractor=null:

using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;

internal static class Program
{
    internal static void Main()
    {
        MLContext mlContext = new MLContext(1);

        // Build 200 lines of 100 random "words" each, so the vocabulary
        // is far larger than the 50 n-grams we allow.
        List<Input> randomLines = new List<Input>();
        var rnd = new Random();
        for (int i = 0; i < 200; i++)
        {
            string str = "";
            for (int j = 0; j < 100; j++)
            {
                str += "word" + rnd.Next(7777777) + "word ";
            }
            randomLines.Add(new Input(str));
        }

        var dataReader = mlContext.Data.LoadFromEnumerable<Input>(randomLines);
        var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", new TextFeaturizingEstimator.Options()
        {
            OutputTokensColumnName = "OutputTokens",
            CaseMode = TextNormalizingEstimator.CaseMode.Lower,
            KeepNumbers = true,
            KeepPunctuations = false,
            WordFeatureExtractor = new WordBagEstimator.Options()
            {
                MaximumNgramsCount = new int[] { 50 }
            },
            // Disable the default character n-gram extractor.
            CharFeatureExtractor = null
        }, "Text");
        var model = pipeline.Fit(dataReader);

        // The "Features" column in this schema is Vector<Single, 50>.
        var outSchema = model.GetOutputSchema(dataReader.Schema);
    }
}

public class Input
{
    [ColumnName("Text")]
    public string Text { get; set; }

    public Input(string t)
    {
        Text = t;
    }
}
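
The reproduction above obtains the output schema but does not print the vector size. One way to read it is sketched below in F#, to match the original report. The helper name vectorSize is hypothetical and not part of ML.NET; it relies only on the DataViewSchema indexer and on the Size and IsKnownSize properties of VectorDataViewType.

```fsharp
open Microsoft.ML
open Microsoft.ML.Data

/// Hypothetical helper: returns Some size for a fixed-size vector column,
/// or None when the column is not a vector or its size is not known.
/// The schema indexer throws if the column name does not exist.
let vectorSize (schema: DataViewSchema) (columnName: string) : int option =
    match schema.[columnName].Type with
    | :? VectorDataViewType as v when v.IsKnownSize -> Some v.Size
    | _ -> None
```

Assuming a fitted model and its input data are in scope, vectorSize (model.GetOutputSchema(data.Schema)) "Features" would be expected to return Some 50 once CharFeatureExtractor is set to null.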

I'm closing this issue as I have identified the bug in your code. Please feel free to reopen if you have further issues. Thanks!

mstfbl on 11 Jun 2020
