For example, the Chinese text 长春市长春药店 can be tokenized in many different ways.
Bigram algorithm (simple and fast):
春市
市长
长春
春药
药店
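For readers unfamiliar with the technique, here is a minimal C# sketch of character-bigram extraction. This is illustrative only, not ML.NET's implementation; the `BigramDemo` class and `CharBigrams` helper are invented for this example:

```c#
using System;
using System.Collections.Generic;
using System.Linq;

static class BigramDemo
{
    // Slide a window of width 2 over the characters and keep distinct pairs.
    public static IEnumerable<string> CharBigrams(string text) =>
        Enumerable.Range(0, text.Length - 1)
                  .Select(i => text.Substring(i, 2))
                  .Distinct();

    static void Main() =>
        // Prints: 长春, 春市, 市长, 春药, 药店
        Console.WriteLine(string.Join(", ", CharBigrams("长春市长春药店")));
}
```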
Standard (dictionary-based) segmentation algorithm:
长春
药店
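As a rough illustration of how a dictionary-based segmenter can arrive at those tokens, here is a toy forward-maximum-matching sketch. The `MaxMatchDemo` class and its word list are made up for this example; real segmenters use large dictionaries and statistical models:

```c#
using System;
using System.Collections.Generic;
using System.Linq;

static class MaxMatchDemo
{
    // Toy dictionary, invented for this example.
    static readonly HashSet<string> Dict = new HashSet<string> { "长春", "药店" };

    // Greedy forward maximum matching: take the longest dictionary word at
    // each position, falling back to a single character when nothing matches.
    public static IEnumerable<string> Segment(string text)
    {
        int maxLen = Dict.Max(w => w.Length);
        for (int i = 0; i < text.Length; )
        {
            int len = Math.Min(maxLen, text.Length - i);
            while (len > 1 && !Dict.Contains(text.Substring(i, len)))
                len--;
            yield return text.Substring(i, len);
            i += len;
        }
    }

    static void Main() =>
        // Dropping single-character fallbacks and duplicates prints: 长春, 药店
        Console.WriteLine(string.Join(", ",
            Segment("长春市长春药店").Where(t => t.Length > 1).Distinct()));
}
```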
I noticed that ML.NET includes an NGramNgramExtractor class that supports the N-gram algorithm; does it support Chinese? Transforms.TextTransformLanguage only includes English, French, German, Dutch, Italian, Spanish, and Japanese.
If not, how can I implement custom text segmentation for other languages? I hope a future version can support a custom text-extraction feature.
Thanks.
I took a little time to read the source code and noticed a TextAnalytics file. It looks like it implements several different text tokenizers, but it is quite complicated. Is there any documentation describing how ML.NET implements text tokenization?
I found a temporary workaround that supports Chinese.
```c#
pipeline.Add(new TextFeaturizer("Features", "Text")
{
KeepDiacritics = false,
KeepPunctuations = false,
TextCase = TextNormalizerTransformCaseNormalizationMode.Lower,
OutputTokens = true,
Language = TextTransformLanguage.English,
StopWordsRemover = new PredefinedStopWordsRemover(),
VectorNormalizer = TextTransformTextNormKind.L2,
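// The next two settings are the key: extract character bigrams and disable
// word-level extraction, since no Chinese word tokenizer is available.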
CharFeatureExtractor = new NGramNgramExtractor() { NgramLength = 2, AllLengths = false },
WordFeatureExtractor = null
});
```
You just need to set the following:
- `CharFeatureExtractor = new NGramNgramExtractor() { NgramLength = 2, AllLengths = false }`
- `WordFeatureExtractor = null`
`长春市长春药店` will be converted into text tokens like the following:
```
{<␂>|长} Microsoft.ML.Runtime.Data.DvText
{长|春} Microsoft.ML.Runtime.Data.DvText
{春|市} Microsoft.ML.Runtime.Data.DvText
{市|长} Microsoft.ML.Runtime.Data.DvText
{春|药} Microsoft.ML.Runtime.Data.DvText
{药|店} Microsoft.ML.Runtime.Data.DvText
{店|<␃>} Microsoft.ML.Runtime.Data.DvText
```
I still hope ML.NET will support custom text tokenization.
Let me add some background here so that people who don't know Chinese well can still join the discussion.
There are tons of ways to tokenize Chinese sentences because we do NOT have separators like the space character in English. For example, “today’s weather is very good” in Chinese is “今天天氣很好”. A Chinese tokenizer needs to find the boundaries between meaningful tokens so that “今天天氣很好” can be tokenized into “今天” (today), “天氣” (weather), “很” (very), “好” (good). Because Chinese has its own way of doing tokenization, a customized tokenizer could be very helpful.
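Until custom tokenization is supported, one possible workaround (a sketch under assumptions, not an official ML.NET feature) is to run your own segmenter over the text before it enters the pipeline and join the tokens with spaces, so the built-in whitespace-based word tokenizer can split them. The `PreTokenizeDemo` class is hypothetical, and the placeholder segmenter below just returns fixed tokens; in practice you would substitute a real segmenter such as jieba.NET or the toy `MaxMatchDemo.Segment` above:

```c#
using System;
using System.Collections.Generic;

static class PreTokenizeDemo
{
    // Insert spaces between segmented words so a whitespace-based tokenizer
    // (e.g., TextFeaturizer with Language = English and a word extractor)
    // sees one token per Chinese word.
    static string PreTokenize(string text, Func<string, IEnumerable<string>> segment)
        => string.Join(" ", segment(text));

    static void Main()
    {
        // Placeholder segmenter for illustration only; substitute a real one.
        Func<string, IEnumerable<string>> segment =
            text => new[] { "今天", "天氣", "很", "好" };

        Console.WriteLine(PreTokenize("今天天氣很好", segment));
        // Prints: 今天 天氣 很 好
    }
}
```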
We don't support Chinese.
Are you also using this for a classification scenario? I'm doing multiclass classification with about 500 classes. I tried all the algorithms, and accuracy on the test data is only around 60%. I suspect it's related to word segmentation producing the wrong vectors. I was looking for a way to customize tokenization and found this issue; more than two years later, it's still not supported.