Rasa: MITIE and Chinese support

Created on 6 Apr 2018  ·  23 Comments  ·  Source: RasaHQ/rasa

Attn: users who use Rasa NLU for Chinese. Could you please try your datasets (at least intent classification) with the new tensorflow_embedding pipeline? We would love to know how it performs.

We are thinking of dropping support for MITIE because training times are long, and in our regular performance benchmarks it doesn't show any advantages in terms of performance.

However, to my knowledge most users who use Rasa for Chinese NLU use MITIE, so I would love to understand how well the alternatives perform there.
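For anyone who wants to try, here is a minimal sketch of such a config for Chinese. This assumes the Jieba tokenizer component that was added to rasa_nlu around this time; the component names below match the 0.12-era naming and may differ in your release:

```yaml
language: "zh"

pipeline:
- name: "tokenizer_jieba"                        # segments Chinese text into tokens
- name: "intent_featurizer_count_vectors"        # bag-of-words counts over those tokens
- name: "intent_classifier_tensorflow_embedding" # supervised embedding intent classifier
```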

Labels: help wanted, type

Most helpful comment

As a spaCy contributor, I am currently working on adding Chinese language support to spaCy. I have already been in contact with the spaCy core developers about this; they are also working very hard on this topic, and I will cooperate with them to complete the project. I don't know the release date of spaCy's Chinese language support, but it will be released with good performance in the near future. If there are more details, I will keep the Rasa community updated.

All 23 comments

lol, always one for brevity. Though I am assuming the "No description provided" should read something like:

Since removing MITIE we've discovered that MITIE was the closest/easiest path for our users to get Chinese NLU working. Now that we've removed it, we may have to add it back for Chinese support, or work to get spaCy to understand Chinese.

Not trying to put words in your mouth or anything ;) Linking a couple of issues here just for cross-reference.

#975

#705

As a spaCy contributor, I am currently working on adding Chinese language support to spaCy. I have already been in contact with the spaCy core developers about this; they are also working very hard on this topic, and I will cooperate with them to complete the project. I don't know the release date of spaCy's Chinese language support, but it will be released with good performance in the near future. If there are more details, I will keep the Rasa community updated.

That sounds very promising!

@howl-anderson 👍
Roughly how long will it take? Would one month be feasible?

Q: How long will it take before the release of the spaCy model with Chinese language support? (2018-04-11)
A: It's hard to say when the model will be released, because the model must be tested to show good/acceptable performance. spaCy also needs to make several changes to support Chinese, Japanese, and Vietnamese, which will take time too.

@howl-anderson thank you!

spaCy supports the Chinese language now.
https://github.com/howl-anderson/Chinese_models_for_SpaCy

but entities cannot be detected. Need help!

Environment: Ubuntu, Python 3.5, "rasa_nlu_version": "0.12.3", spaCy 2.

Steps:
1. install zh_core_web_sm
2. python3 -m spacy link zh_core_web_sm zh
3. train

Issue: intent classification works, but no entities are ever returned.

@winner484 is it ner_crf that's not returning any entities, or ner_spacy?

@wrathagom I've never had an entity returned. I have tried many texts, but never got any entities back.

The metadata.json in the model is:
{
  "training_data": "training_data.json",
  "pipeline": [
    {
      "case_sensitive": false,
      "model": "zh",
      "class": "rasa_nlu.utils.spacy_utils.SpacyNLP",
      "name": "nlp_spacy"
    },
    {
      "class": "rasa_nlu.tokenizers.spacy_tokenizer.SpacyTokenizer",
      "name": "tokenizer_spacy"
    },
    {
      "class": "rasa_nlu.featurizers.spacy_featurizer.SpacyFeaturizer",
      "name": "intent_featurizer_spacy"
    },
    {
      "regex_file": "regex_featurizer.json",
      "class": "rasa_nlu.featurizers.regex_featurizer.RegexFeaturizer",
      "name": "intent_entity_featurizer_regex"
    },
    {
      "class": "rasa_nlu.extractors.crf_entity_extractor.CRFEntityExtractor",
      "max_iterations": 50,
      "features": [
        ["low", "title", "upper", "pos", "pos2"],
        ["bias", "low", "word3", "word2", "upper", "title", "digit", "pos", "pos2", "pattern"],
        ["low", "title", "upper", "pos", "pos2"]
      ],
      "L1_c": 1,
      "name": "ner_crf",
      "L2_c": 0.001,
      "BILOU_flag": true,
      "classifier_file": "crf_model.pkl"
    },
    {
      "class": "rasa_nlu.extractors.entity_synonyms.EntitySynonymMapper",
      "name": "ner_synonyms",
      "synonyms_file": "entity_synonyms.json"
    },
    {
      "class": "rasa_nlu.classifiers.sklearn_intent_classifier.SklearnIntentClassifier",
      "name": "intent_classifier_sklearn",
      "classifier_file": "intent_classifier_sklearn.pkl",
      "max_cross_validation_folds": 5,
      "C": [1, 2, 5, 10, 20, 100],
      "kernels": ["linear"]
    }
  ],
  "trained_at": "20180503-103724",
  "language": "zh",
  "rasa_nlu_version": "0.12.3"
}

$ curl -X POST localhost:5000/parse -d '{"q":"我想吃火锅"}' | python -m json.tool
{
  "entities": [],
  "intent": {
    "confidence": 0.45199854018449354,
    "name": "restaurant_search"
  },
  "intent_ranking": [
    {"confidence": 0.45199854018449354, "name": "restaurant_search"},
    {"confidence": 0.3750782818220956, "name": "medical"},
    {"confidence": 0.11279676245958703, "name": "affirm"},
    {"confidence": 0.04185093011383089, "name": "goodbye"},
    {"confidence": 0.018275485419993073, "name": "greet"}
  ],
  "model": "model_20180503-103724",
  "project": "default",
  "text": "\u6211\u60f3\u5403\u706b\u9505"
}
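As a side note, the \uXXXX escapes in the "text" field are just JSON-encoded Chinese, so the server did receive the query intact; only the entity extraction came back empty. A quick check with plain Python (nothing Rasa-specific assumed):

```python
import json

# the "text" field exactly as it appears in the curl response above
raw = '{"text": "\\u6211\\u60f3\\u5403\\u706b\\u9505", "entities": []}'
data = json.loads(raw)

print(data["text"])      # 我想吃火锅 -- the original query, decoded
print(data["entities"])  # [] -- no entities were extracted
```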

and the part of the training data for the entity "火锅" is here:

{
  "text": "我想吃火锅啊",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 2,
      "end": 5,
      "value": "火锅",
      "entity": "food"
    }
  ]
},

@winner484 speaking without being able to read the language 😅 are you providing more entity examples than just that one? Entities can take a lot of data to train. Also, if 火锅 really is the entity, then it is mislabeled: I believe the training data should have a range from 3 to 5 instead of 2 to 5.

{
  "text": "我想吃火锅啊",
    "intent": "restaurant_search",
    "entities": [
      {
        "start": 3,
        "end": 5,
        "value": "火锅",
        "entity": "food"
      }
    ]
},
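Offsets like these can be computed instead of counted by hand, which avoids exactly this kind of off-by-one. A small sketch in plain Python (not part of rasa_nlu, just string arithmetic):

```python
text = "我想吃火锅啊"
value = "火锅"

# rasa_nlu entity offsets are character-based, half-open: [start, end)
start = text.index(value)
end = start + len(value)
print(start, end)  # 3 5
```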

@wrathagom @winner484 Just for the record: although https://github.com/howl-anderson/Chinese_models_for_SpaCy is currently the only spaCy model that supports Chinese, it is not the official Chinese language model for spaCy, and most importantly it is still a work in progress. Named Entity Recognition (NER) is currently (2018-05-03) not supported; I am still working on it.

@howl-anderson thank you for your great work! May I learn from you? Maybe I could help you finish the job.

@wrathagom thank you

@amn41
I am using Rasa to do Japanese NLU with MITIE and the results are quite good. My config_mitie_ja.yml is:

language: "ja"

pipeline:
- name: "nlp_mitie"
  model: "mitie/total_word_feature_extractor_ja.dat"
- name: "tokenizer_japanese"        # I used tinysegmenter as the Japanese tokenizer
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn" # I modified the intent classifier: instead of GridSearchCV I used a linear logistic-regression model

My result after training the model:

{'entities': [{'extractor': 'ner_mitie', 'start': 0, 'confidence': None, 'value': '千葉', 'end': 2, 'entity': 'ロケーション'}],
 'intent': {'confidence': 0.9422146832263528, 'name': 'レストランを検索する'},
 'intent_ranking': [{'confidence': 0.9422146832263528, 'name': 'レストランを検索する'},
                    {'confidence': 0.038330105668737326, 'name': '肯定する'},
                    {'confidence': 0.011094799507902988, 'name': 'さようなら'},
                    {'confidence': 0.008360411597006933, 'name': '挨拶する'}],
 'text': '千葉にレストランを探したい。'}

very cool :+1: I think it might make sense to provide default configurations for different languages, to make it even easier to get started with a certain language. Thoughts?

@amn41 I don't yet understand how the supervised word vectors work before the corpus is fed into the tensorflow model. Could I just segment a Chinese sentence with a tokenizer such as Jieba, join the result with spaces, and feed that into the count_vectors_featurizer (maybe tweaking some parameters there)? The result would then go straight into the tensorflow_embedding part. Would that procedure work?
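The segment-then-join idea described above can be sketched in a few lines. Here `segment` is a hypothetical stand-in for `jieba.cut` (jieba itself is not assumed to be installed), and a Counter plays the role of the count featurizer's per-message token counts:

```python
from collections import Counter

def segment(text):
    # hypothetical stand-in for jieba.cut; a real pipeline would call jieba
    return ["我", "想", "吃", "火锅"]

# 1. segment, 2. re-join with spaces so downstream whitespace
#    tokenization recovers the same tokens
joined = " ".join(segment("我想吃火锅"))
print(joined)  # 我 想 吃 火锅

# 3. count occurrences per token, as a count-vectors featurizer would
counts = Counter(joined.split())
print(counts["火锅"])  # 1
```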

@geekboood As far as I know, it should work. I am also working on a PR to make sure count_vectors_featurizer can use tokens provided by tokenizers such as Jieba. It was released in #1115.

> I am using Rasa to do Japanese NLU with MITIE and the result is quite good. My config_mitie_ja.yml is: [...] model: "mitie/total_word_feature_extractor_ja.dat" [...]

Hi. Where can you get mitie/total_word_feature_extractor_ja.dat?

> Just for the record [...] Named Entity Recognition (NER) is currently (2018-05-03) not supported, I am still working on it.

Is there any progress?
