.NET Core SDK (reflecting any global.json):
  Version: 3.1.201
  Commit: b1768b4ae7

Runtime Environment:
  OS Name: Windows
  OS Version: 10.0.18362
  OS Platform: Windows
  RID: win10-x64
  Base Path: C:\Program Files\dotnet\sdk\3.1.201\

Host (useful for support):
  Version: 3.1.3
  Commit: 4a9f85e9f8

.NET Core SDKs installed:
  3.0.100 [C:\Program Files\dotnet\sdk]
  3.1.201 [C:\Program Files\dotnet\sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.17 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.17 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.17 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.0.0 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
FeaturizeText with MaximumNgramsCount = 50
let featureEstimator =
    let wordBagOptions = WordBagEstimator.Options(MaximumNgramsCount = [|50|])
    let textFeaturizeOptions =
        TextFeaturizingEstimator.Options(
            OutputTokensColumnName = "OutputTokens",
            CaseMode = Microsoft.ML.Transforms.Text.TextNormalizingEstimator.CaseMode.Lower,
            KeepNumbers = true,
            KeepPunctuations = false,
            WordFeatureExtractor = wordBagOptions)
    EstimatorChain()
        .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName = "Features", options = textFeaturizeOptions, inputColumnNames = [|"Text"|]))
Features: Vector<Single, 1266>

Hi @IvanAntipov, I have reproduced your issue and am working on a fix. Thanks!
Hi @IvanAntipov,
In your FeaturizeText call you set WordFeatureExtractor correctly, but you do not set CharFeatureExtractor to null. The default CharFeatureExtractor is therefore still used, and its character n-gram features are added to the word features, which is why the Features vector has 1266 slots instead of 50. I have confirmed locally that setting CharFeatureExtractor=null gives the expected Vector<Single, 50>. (Thank you @ganik for your help on this.) :)
According to the FeaturizeText documentation, WordFeatureExtractor and CharFeatureExtractor are instantiated by default with the following values:
WordFeatureExtractor: NgramLength = 1
CharFeatureExtractor: NgramLength = 3, UseAllLengths = false
My local reproduction of your code in C#, with CharFeatureExtractor=null:
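using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;

// The enclosing class name is assumed; the original snippet omitted its declaration.
internal static class Program
{
    internal static void Main()
    {
        MLContext mlContext = new MLContext(1);

        // Build 200 rows, each containing 100 random "word<N>word" tokens.
        List<Input> randomLines = new List<Input>();
        var rnd = new Random();
        for (int i = 0; i < 200; i++)
        {
            string str = "";
            for (int j = 0; j < 100; j++)
            {
                str += "word" + rnd.Next(7777777) + "word ";
            }
            randomLines.Add(new Input(str));
        }

        var enumerable = randomLines.AsEnumerable();
        var preview = enumerable.Count();
        var dataReader = mlContext.Data.LoadFromEnumerable<Input>(enumerable);

        var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", new TextFeaturizingEstimator.Options()
        {
            OutputTokensColumnName = "OutputTokens",
            CaseMode = TextNormalizingEstimator.CaseMode.Lower,
            KeepNumbers = true,
            KeepPunctuations = false,
            WordFeatureExtractor = new WordBagEstimator.Options()
            {
                MaximumNgramsCount = new int[] { 50 }
            },
            // Disable the default character n-gram extractor so that
            // only the word n-grams contribute to the Features vector.
            CharFeatureExtractor = null
        }, "Text");

        var model = pipeline.Fit(dataReader);
        // The "Features" column in this schema is where the vector size shows up.
        var outSchema = model.GetOutputSchema(dataReader.Schema);
    }
}

public class Input
{
    [ColumnName("Text")]
    public string Text { get; set; }

    public Input(string t)
    {
        Text = t;
    }
}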
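For the F# pipeline from the original report, the same change might look like the sketch below. It is untested and rests on two assumptions: that `mlContext` is an ordinary `MLContext` instance (the report does not show how it was created), and that F# accepts `null` for the `CharFeatureExtractor` property in the same way C# does. Everything else is copied from the reported code.

```fsharp
open Microsoft.ML
open Microsoft.ML.Data
open Microsoft.ML.Transforms.Text

// Assumption: the original report does not show how mlContext was created.
let mlContext = MLContext()

let featureEstimator =
    let wordBagOptions = WordBagEstimator.Options(MaximumNgramsCount = [|50|])
    let textFeaturizeOptions =
        TextFeaturizingEstimator.Options(
            OutputTokensColumnName = "OutputTokens",
            CaseMode = TextNormalizingEstimator.CaseMode.Lower,
            KeepNumbers = true,
            KeepPunctuations = false,
            WordFeatureExtractor = wordBagOptions,
            // Disable the default character n-gram extractor so that
            // only the word n-grams contribute to the Features vector.
            CharFeatureExtractor = null)
    EstimatorChain()
        .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName = "Features", options = textFeaturizeOptions, inputColumnNames = [|"Text"|]))
```

The only difference from the reported code is the added `CharFeatureExtractor = null` line. With the equivalent C# change, the Features column was reported above as Vector<Single, 50>.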
I'm closing this issue as I have identified the bug in your code. Please feel free to reopen if you have further issues. Thanks!