Rasa: MITIE and Chinese support

Created on 6 Apr 2018  ·  23 Comments  ·  Source: RasaHQ/rasa

Attn: users who use Rasa NLU for Chinese. Could you please try your datasets (at least intent classification) with the new tensorflow_embedding pipeline? We would love to know how it performs.

We are thinking of dropping support for MITIE because training times are long, and in our regular performance benchmarks it doesn't show any advantages in terms of performance.

However, to my knowledge most users who use Rasa for Chinese NLU use MITIE, so I would love to understand how well the alternatives perform there.
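For anyone who wants to try, here is a minimal sketch of such a config for Chinese. This assumes the Jieba tokenizer component that was added to rasa_nlu around this time; the component names below match the 0.12-era naming and may differ in your release:

```yaml
language: "zh"

pipeline:
- name: "tokenizer_jieba"                        # segments Chinese text into tokens
- name: "intent_featurizer_count_vectors"        # bag-of-words counts over those tokens
- name: "intent_classifier_tensorflow_embedding" # supervised embedding intent classifier
```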

Labels: help wanted, type

Most helpful comment

As a spaCy contributor, I am currently working on adding Chinese language support to spaCy. I have already been in contact with the spaCy core developers about this; they are also working very hard on this topic, and I will cooperate with them to complete the project. I don't know the release date of spaCy's Chinese language support, but it will be released with good performance in the near future. If there are more details, I will keep the Rasa community updated.

All 23 comments

lol, always one for brevity. Though I am assuming the "No description provided" should read something like:

Since removing MITIE we've discovered that MITIE was the closest/easiest path for our users to get Chinese NLU working. Now that we've removed it, we may have to add it back for Chinese support, or work to get spaCy to understand Chinese.

Not trying to put words in your mouth or anything ;) Linking a couple of issues here just for cross-reference.

#975

#705

As a spaCy contributor, I am currently working on adding Chinese language support to spaCy. I have already been in contact with the spaCy core developers about this; they are also working very hard on this topic, and I will cooperate with them to complete the project. I don't know the release date of spaCy's Chinese language support, but it will be released with good performance in the near future. If there are more details, I will keep the Rasa community updated.

That sounds very promising!

@howl-anderson 👍
Roughly how long will it take? Would one month be feasible?

Q: How long will it take before the release of the spaCy model with Chinese language support? (2018-04-11)
A: It's hard to say when the model will be released, because the model must be tested to show good/acceptable performance. spaCy also needs to make several changes to support Chinese, Japanese, and Vietnamese, which will take time too.

@howl-anderson thank you!

spaCy supports the Chinese language now.
https://github.com/howl-anderson/Chinese_models_for_SpaCy

but entities cannot be detected. Need help!

Environment: Ubuntu, Python 3.5, "rasa_nlu_version": "0.12.3", spaCy 2.

Steps:
1. install zh_core_web_sm
2. python3 -m spacy link zh_core_web_sm zh
3. train

Issue: intent classification works, but no entities are ever returned.

@winner484 is it ner_crf that's not returning any entities, or ner_spacy?

@wrathagom I've never had an entity returned. I have tried many texts, but never got any entities back.

The metadata.json in the model is:
{
  "training_data": "training_data.json",
  "pipeline": [
    {
      "case_sensitive": false,
      "model": "zh",
      "class": "rasa_nlu.utils.spacy_utils.SpacyNLP",
      "name": "nlp_spacy"
    },
    {
      "class": "rasa_nlu.tokenizers.spacy_tokenizer.SpacyTokenizer",
      "name": "tokenizer_spacy"
    },
    {
      "class": "rasa_nlu.featurizers.spacy_featurizer.SpacyFeaturizer",
      "name": "intent_featurizer_spacy"
    },
    {
      "regex_file": "regex_featurizer.json",
      "class": "rasa_nlu.featurizers.regex_featurizer.RegexFeaturizer",
      "name": "intent_entity_featurizer_regex"
    },
    {
      "class": "rasa_nlu.extractors.crf_entity_extractor.CRFEntityExtractor",
      "max_iterations": 50,
      "features": [
        ["low", "title", "upper", "pos", "pos2"],
        ["bias", "low", "word3", "word2", "upper", "title", "digit", "pos", "pos2", "pattern"],
        ["low", "title", "upper", "pos", "pos2"]
      ],
      "L1_c": 1,
      "name": "ner_crf",
      "L2_c": 0.001,
      "BILOU_flag": true,
      "classifier_file": "crf_model.pkl"
    },
    {
      "class": "rasa_nlu.extractors.entity_synonyms.EntitySynonymMapper",
      "name": "ner_synonyms",
      "synonyms_file": "entity_synonyms.json"
    },
    {
      "class": "rasa_nlu.classifiers.sklearn_intent_classifier.SklearnIntentClassifier",
      "name": "intent_classifier_sklearn",
      "classifier_file": "intent_classifier_sklearn.pkl",
      "max_cross_validation_folds": 5,
      "C": [1, 2, 5, 10, 20, 100],
      "kernels": ["linear"]
    }
  ],
  "trained_at": "20180503-103724",
  "language": "zh",
  "rasa_nlu_version": "0.12.3"
}

$ curl -X POST localhost:5000/parse -d '{"q":"我想吃火锅"}' | python -m json.tool
{
  "entities": [],
  "intent": {
    "confidence": 0.45199854018449354,
    "name": "restaurant_search"
  },
  "intent_ranking": [
    {"confidence": 0.45199854018449354, "name": "restaurant_search"},
    {"confidence": 0.3750782818220956, "name": "medical"},
    {"confidence": 0.11279676245958703, "name": "affirm"},
    {"confidence": 0.04185093011383089, "name": "goodbye"},
    {"confidence": 0.018275485419993073, "name": "greet"}
  ],
  "model": "model_20180503-103724",
  "project": "default",
  "text": "\u6211\u60f3\u5403\u706b\u9505"
}
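As a side note, the \uXXXX escapes in the "text" field are just JSON-encoded Chinese, so the server did receive the query intact; only the entity extraction came back empty. A quick check with plain Python (nothing Rasa-specific assumed):

```python
import json

# the "text" field exactly as it appears in the curl response above
raw = '{"text": "\\u6211\\u60f3\\u5403\\u706b\\u9505", "entities": []}'
data = json.loads(raw)

print(data["text"])      # 我想吃火锅 -- the original query, decoded
print(data["entities"])  # [] -- no entities were extracted
```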

and the part of the training data for the entity "火锅" is here:

{
  "text": "我想吃火锅啊",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 2,
      "end": 5,
      "value": "火锅",
      "entity": "food"
    }
  ]
},

@winner484 speaking without being able to read the language 😅 are you providing more entity examples than just that one? Entities can take a lot of data to train. Also, if 火锅 really is the entity, then it is mislabeled: I believe the training data should have a range from 3 to 5 instead of 2 to 5.

{
  "text": "我想吃火锅啊",
    "intent": "restaurant_search",
    "entities": [
      {
        "start": 3,
        "end": 5,
        "value": "火锅",
        "entity": "food"
      }
    ]
},
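Offsets like these can be computed instead of counted by hand, which avoids exactly this kind of off-by-one. A small sketch in plain Python (not part of rasa_nlu, just string arithmetic):

```python
text = "我想吃火锅啊"
value = "火锅"

# rasa_nlu entity offsets are character-based, half-open: [start, end)
start = text.index(value)
end = start + len(value)
print(start, end)  # 3 5
```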

@wrathagom @winner484 Just for the record: although https://github.com/howl-anderson/Chinese_models_for_SpaCy is currently the only spaCy model that supports Chinese, it is not the official Chinese language model for spaCy, and most importantly it is still a work in progress. Named Entity Recognition (NER) is currently (2018-05-03) not supported; I am still working on it.

@howl-anderson thank you for your great work! May I learn from you? Maybe I could help you finish the job.

@wrathagom thank you

@amn41
I am using Rasa to do Japanese NLU with MITIE and the results are quite good. My config_mitie_ja.yml is:

language: "ja"

pipeline:
- name: "nlp_mitie"
  model: "mitie/total_word_feature_extractor_ja.dat"
- name: "tokenizer_japanese"        # I used tinysegmenter as the Japanese tokenizer
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn" # I modified the intent classifier: instead of GridSearchCV I used a linear logistic-regression model

My result after training the model:

{'entities': [{'extractor': 'ner_mitie', 'start': 0, 'confidence': None, 'value': '千葉', 'end': 2, 'entity': 'ロケーション'}],
 'intent': {'confidence': 0.9422146832263528, 'name': 'レストランを検索する'},
 'intent_ranking': [{'confidence': 0.9422146832263528, 'name': 'レストランを検索する'},
                    {'confidence': 0.038330105668737326, 'name': '肯定する'},
                    {'confidence': 0.011094799507902988, 'name': 'さようなら'},
                    {'confidence': 0.008360411597006933, 'name': '挨拶する'}],
 'text': '千葉にレストランを探したい。'}

very cool :+1: I think it might make sense to provide default configurations for different languages, to make it even easier to get started with a certain language. Thoughts?

@amn41 I don't yet understand how the supervised word vectors work before the corpus is fed into the tensorflow model. Could I just segment a Chinese sentence with a tokenizer such as Jieba, join the result with spaces, and feed that into the count_vectors_featurizer (maybe tweaking some parameters there)? The result would then go straight into the tensorflow_embedding part. Would that procedure work?
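The segment-then-join idea described above can be sketched in a few lines. Here `segment` is a hypothetical stand-in for `jieba.cut` (jieba itself is not assumed to be installed), and a Counter plays the role of the count featurizer's per-message token counts:

```python
from collections import Counter

def segment(text):
    # hypothetical stand-in for jieba.cut; a real pipeline would call jieba
    return ["我", "想", "吃", "火锅"]

# 1. segment, 2. re-join with spaces so downstream whitespace
#    tokenization recovers the same tokens
joined = " ".join(segment("我想吃火锅"))
print(joined)  # 我 想 吃 火锅

# 3. count occurrences per token, as a count-vectors featurizer would
counts = Counter(joined.split())
print(counts["火锅"])  # 1
```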

@geekboood As far as I know, it should work. I am also working on a PR to make sure count_vectors_featurizer can use tokens provided by tokenizers such as Jieba. It was released in #1115.

> I am using Rasa to do Japanese NLU with MITIE and the result is quite good. My config_mitie_ja.yml is: [...] model: "mitie/total_word_feature_extractor_ja.dat" [...]

Hi. Where can you get mitie/total_word_feature_extractor_ja.dat?

> Just for the record [...] Named Entity Recognition (NER) is currently (2018-05-03) not supported, I am still working on it.

Is there any progress?
