For example, the Chinese text 长春市长春药店 can be tokenized in many different ways.
Bigram algorithm (simple and fast):
春市
市长
长春
春药
药店
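For readers unfamiliar with the technique, here is a minimal C# sketch of character-bigram extraction. This is illustrative only, not ML.NET's implementation; the `BigramDemo` class and `CharBigrams` helper are invented for this example:

```c#
using System;
using System.Collections.Generic;
using System.Linq;

static class BigramDemo
{
    // Slide a window of width 2 over the characters and keep distinct pairs.
    public static IEnumerable<string> CharBigrams(string text) =>
        Enumerable.Range(0, text.Length - 1)
                  .Select(i => text.Substring(i, 2))
                  .Distinct();

    static void Main() =>
        // Prints: 长春, 春市, 市长, 春药, 药店
        Console.WriteLine(string.Join(", ", CharBigrams("长春市长春药店")));
}
```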
Standard (dictionary-based) segmentation algorithm:
长春
药店
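As a rough illustration of how a dictionary-based segmenter can arrive at those tokens, here is a toy forward-maximum-matching sketch. The `MaxMatchDemo` class and its word list are made up for this example; real segmenters use large dictionaries and statistical models:

```c#
using System;
using System.Collections.Generic;
using System.Linq;

static class MaxMatchDemo
{
    // Toy dictionary, invented for this example.
    static readonly HashSet<string> Dict = new HashSet<string> { "长春", "药店" };

    // Greedy forward maximum matching: take the longest dictionary word at
    // each position, falling back to a single character when nothing matches.
    public static IEnumerable<string> Segment(string text)
    {
        int maxLen = Dict.Max(w => w.Length);
        for (int i = 0; i < text.Length; )
        {
            int len = Math.Min(maxLen, text.Length - i);
            while (len > 1 && !Dict.Contains(text.Substring(i, len)))
                len--;
            yield return text.Substring(i, len);
            i += len;
        }
    }

    static void Main() =>
        // Dropping single-character fallbacks and duplicates prints: 长春, 药店
        Console.WriteLine(string.Join(", ",
            Segment("长春市长春药店").Where(t => t.Length > 1).Distinct()));
}
```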
I noticed that ML.NET includes an NGramNgramExtractor class that supports the N-gram algorithm; does it support Chinese? Transforms.TextTransformLanguage only includes English, French, German, Dutch, Italian, Spanish, and Japanese.
If not, how can I implement custom text segmentation for other languages? I hope a future version can support a custom text-extraction feature.
Thanks.
I took a little time to read the source code and noticed a TextAnalytics file. It looks like it implements several different text tokenizers, but it is quite complicated. Is there any documentation describing how ML.NET implements text tokenization?
I found a temporary workaround that supports Chinese.
```c#
pipeline.Add(new TextFeaturizer("Features", "Text")
{
KeepDiacritics = false,
KeepPunctuations = false,
TextCase = TextNormalizerTransformCaseNormalizationMode.Lower,
OutputTokens = true,
Language = TextTransformLanguage.English,
StopWordsRemover = new PredefinedStopWordsRemover(),
VectorNormalizer = TextTransformTextNormKind.L2,
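// The next two settings are the key: extract character bigrams and disable
// word-level extraction, since no Chinese word tokenizer is available.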
CharFeatureExtractor = new NGramNgramExtractor() { NgramLength = 2, AllLengths = false },
WordFeatureExtractor = null
});
```
You just need to set the following:
- `CharFeatureExtractor = new NGramNgramExtractor() { NgramLength = 2, AllLengths = false }`
- `WordFeatureExtractor = null`
`长春市长春药店` will be converted into text tokens like the following:
```
{<␂>|长} Microsoft.ML.Runtime.Data.DvText
{长|春} Microsoft.ML.Runtime.Data.DvText
{春|市} Microsoft.ML.Runtime.Data.DvText
{市|长} Microsoft.ML.Runtime.Data.DvText
{春|药} Microsoft.ML.Runtime.Data.DvText
{药|店} Microsoft.ML.Runtime.Data.DvText
{店|<␃>} Microsoft.ML.Runtime.Data.DvText
```
I still hope ML.NET will support custom text tokenization.
Let me add some background here so that people who don't know Chinese well can still join the discussion.
There are tons of ways to tokenize Chinese sentences because we do NOT have separators like the space character in English. For example, “today’s weather is very good” in Chinese is “今天天氣很好”. A Chinese tokenizer needs to find the boundaries between meaningful tokens so that “今天天氣很好” can be tokenized into “今天” (today), “天氣” (weather), “很” (very), “好” (good). Because Chinese has its own way of doing tokenization, a customized tokenizer could be very helpful.
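Until custom tokenization is supported, one possible workaround (a sketch under assumptions, not an official ML.NET feature) is to run your own segmenter over the text before it enters the pipeline and join the tokens with spaces, so the built-in whitespace-based word tokenizer can split them. The `PreTokenizeDemo` class is hypothetical, and the placeholder segmenter below just returns fixed tokens; in practice you would substitute a real segmenter such as jieba.NET or the toy `MaxMatchDemo.Segment` above:

```c#
using System;
using System.Collections.Generic;

static class PreTokenizeDemo
{
    // Insert spaces between segmented words so a whitespace-based tokenizer
    // (e.g., TextFeaturizer with Language = English and a word extractor)
    // sees one token per Chinese word.
    static string PreTokenize(string text, Func<string, IEnumerable<string>> segment)
        => string.Join(" ", segment(text));

    static void Main()
    {
        // Placeholder segmenter for illustration only; substitute a real one.
        Func<string, IEnumerable<string>> segment =
            text => new[] { "今天", "天氣", "很", "好" };

        Console.WriteLine(PreTokenize("今天天氣很好", segment));
        // Prints: 今天 天氣 很 好
    }
}
```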
We don't support Chinese.
Are you also using this for a classification scenario? I'm doing multiclass classification with about 500 classes. I tried all the algorithms, and accuracy on the test data is only around 60%. I suspect it's related to word segmentation producing the wrong vectors. I was looking for a way to customize tokenization and found this issue; more than two years later, it's still not supported.