Machinelearning: Request : Apply Lemma / stemming in FeaturizeText options

Created on 3 Jul 2020 · 4 Comments · Source: dotnet/machinelearning

Hi,
First, thank you for all the work done. I know that FeaturizeText applies NLP preprocessing such as stop-word removal for a specific language.

But is there a way to apply lemmatization / stemming in this function?

P2 enhancement

ErwanL08

👍5

Most helpful comment

I agree, there should be a direct lemmatizer/stemmer.

The default in the FeaturizeText transform uses unigrams (one word) + bigrams (two words) + tricharactergrams (three letter ngram).
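For reference, the n-gram setup described above can also be spelled out explicitly instead of relying on defaults. This is a minimal sketch modeled on the FeaturizeText options pattern from the ML.NET samples; the column names "Utterance" and "Features" and the class name FeaturizeTextSketch are assumptions for illustration, not something taken from the original poster's project.

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public static class FeaturizeTextSketch
{
    // Builds a FeaturizeText estimator with explicit word and character n-gram settings.
    // "Utterance" (input) and "Features" (output) are assumed column names.
    public static IEstimator<ITransformer> Build(MLContext mlContext)
    {
        var options = new TextFeaturizingEstimator.Options
        {
            CaseMode = TextNormalizingEstimator.CaseMode.Lower,
            // Built-in stop-word removal for a chosen language.
            StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options
            {
                Language = TextFeaturizingEstimator.Language.English
            },
            // Word n-grams: unigrams and bigrams.
            WordFeatureExtractor = new WordBagEstimator.Options
            {
                NgramLength = 2,
                UseAllLengths = true
            },
            // Character n-grams: trigrams only.
            CharFeatureExtractor = new WordBagEstimator.Options
            {
                NgramLength = 3,
                UseAllLengths = false
            }
        };

        return mlContext.Transforms.Text.FeaturizeText("Features", options, "Utterance");
    }
}
```

Note that none of these options performs stemming or lemmatization; they only control normalization, stop words, and n-gram extraction.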

The default tricharactergrams give a good part of the gains of a full stemmer.

For example, it will extract the same tricharactergram r|u|n from runner/running/runs. This allows the model to learn the common concept of "run" from all of these, and with the ngrams it maintains the original unstemmed words, allowing the model to also learn running (unigram) and i|n|g (tricharactergram).
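To make the runner/running/runs example concrete, here is a small standalone sketch of per-word character trigrams. It is a simplified illustration only: it does not reproduce how ML.NET extracts character n-grams internally (which works over the whole text rather than isolated words), and the names CharTrigramSketch, Trigrams, and Shared are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CharTrigramSketch
{
    // Returns the distinct three-character substrings of a single word.
    public static HashSet<string> Trigrams(string word)
    {
        var result = new HashSet<string>();
        for (int i = 0; i + 3 <= word.Length; i++)
        {
            result.Add(word.Substring(i, 3));
        }
        return result;
    }

    // Returns the trigrams that occur in every one of the given words.
    public static HashSet<string> Shared(params string[] words)
    {
        var shared = Trigrams(words[0]);
        foreach (var word in words.Skip(1))
        {
            shared.IntersectWith(Trigrams(word));
        }
        return shared;
    }
}
```

Calling CharTrigramSketch.Shared("runner", "running", "runs") leaves only "run", while Trigrams("running") still contains "ing", which is the extra signal the unstemmed form keeps.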

The word embedding transform can also help. The fastTextWikipedia300D model in particular has a large vocabulary and already has word vectors for runner/running/runs, which will sit in similar positions in the embedding space.
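A rough sketch of wiring up the word embedding transform, following the normalize, tokenize, embed pattern used in the ML.NET samples. The column names ("Utterance", "Normalized", "Tokens", "Features") and the class name WordEmbeddingSketch are assumptions. As far as I know, the pretrained model is fetched when the pipeline is fitted, so the first Fit call may need network access.

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public static class WordEmbeddingSketch
{
    // Builds a pipeline that turns the assumed "Utterance" text column into
    // embedding-based features using the FastTextWikipedia300D pretrained model.
    public static IEstimator<ITransformer> Build(MLContext mlContext)
    {
        return mlContext.Transforms.Text.NormalizeText("Normalized", "Utterance")
            // ApplyWordEmbedding expects tokenized text, so split into words first.
            .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Normalized"))
            .Append(mlContext.Transforms.Text.ApplyWordEmbedding(
                "Features",
                "Tokens",
                WordEmbeddingEstimator.PretrainedModelKind.FastTextWikipedia300D));
    }
}
```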

All this said, the world is moving towards transformer networks like BERT.
There's an external BERT implementation for ML.NET -- https://github.com/GerjanVlot/BERT-ML.NET by @GerjanVlot.

justinormont on 5 Dec 2020

👍4

All 4 comments

Hi, @ErwanL08. Unfortunately, there's no option for doing lemmatization or stemming in ML.NET, so I will mark this issue as a feature request so that we can take it into account when planning future features.

In the meantime, there are a couple of options you can explore:

Apply lemmatization/stemming before creating the input DataView. I notice in your screenshot that you're using LoadFromEnumerable<>() to get your data into a DataView. If possible, you can try to lemmatize/stem the strings in your input "Utterance" string field before creating the DataView. I'm not able to recommend any C# library for this, but a quick Google search points to some NLP-related NuGet packages that may have this functionality. I've also found some open-source implementations of basic English stemming in C#, which you might be able to add to your project without installing any NuGet package.
Apply lemmatization/stemming inside a CustomMappingTransformer. A CustomMappingTransformer lets the user define a method that will be used to apply transformations to every row of the input; this function is applied in a streaming fashion. You can create a function that does lemmatization/stemming (either using your own implementation or another library) and use it inside a CustomMappingTransformer. See more about this transformer in the docs.
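A minimal sketch of the second option, assuming an input class with an "Utterance" string property. The stemmer here (NaiveStem) is a deliberately crude, hypothetical suffix stripper used as a placeholder; a real project would swap in a proper stemming or lemmatization library. The class names UtteranceInput, StemmedOutput, and StemmingSketch are likewise made up for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

public class UtteranceInput { public string Utterance { get; set; } }
public class StemmedOutput { public string StemmedUtterance { get; set; } }

public static class StemmingSketch
{
    // Hypothetical placeholder stemmer: strips a few English suffixes.
    // Not linguistically sound; replace with a real stemmer or lemmatizer.
    public static string NaiveStem(string word)
    {
        var w = word.ToLowerInvariant();
        foreach (var suffix in new[] { "ing", "er", "s" })
        {
            if (w.Length > suffix.Length + 2 && w.EndsWith(suffix, StringComparison.Ordinal))
            {
                w = w.Substring(0, w.Length - suffix.Length);
                // Collapse a doubled final letter left by the strip ("runn" -> "run").
                if (w.Length > 2 && w[w.Length - 1] == w[w.Length - 2])
                {
                    w = w.Substring(0, w.Length - 1);
                }
                break;
            }
        }
        return w;
    }

    // Stems every space-separated token of a text.
    public static string StemText(string text)
    {
        var tokens = (text ?? string.Empty)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", tokens.Select(NaiveStem));
    }

    // Custom mapping that stems each row, followed by FeaturizeText on the result.
    // contractName is null here, which is enough for in-memory use; saving the
    // model would need a contract name and a registered custom mapping factory.
    public static IEstimator<ITransformer> BuildPipeline(MLContext mlContext)
    {
        return mlContext.Transforms
            .CustomMapping<UtteranceInput, StemmedOutput>(
                (input, output) => output.StemmedUtterance = StemText(input.Utterance),
                contractName: null)
            .Append(mlContext.Transforms.Text.FeaturizeText("Features", "StemmedUtterance"));
    }
}
```

The same StemText function also covers the first option: apply it to each row's Utterance before passing the collection to LoadFromEnumerable, and skip the custom mapping step.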

antoniovs1029 on 6 Jul 2020

This feature is very important; I'm impatient to see it inside the awesome ML.NET. NLP is essential today, and I hope it will be given serious attention.

AniaBerthelot on 26 Nov 2020

👍4

I totally agree with @AniaBerthelot: if ML.NET had an up-to-date .NET version of a stemmer / lemmatizer, the framework would be so awesome 👍

ErwanL08 on 19 Jan 2021

❤1
