Rasa: Rasa_nlu returns intent as null for training samples using tensorflow classifier (Chinese)

Created on 7 Nov 2018  ·  9 Comments  ·  Source: RasaHQ/rasa

Rasa NLU version: 0.13.0

Operating system (windows, osx, ...): ubuntu 18.04

Content of model configuration file:

language: "zh"
project: "ivr_nlu"
fixed_model_name: "demo"
path: "models"
pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
  dictionary_path: "data/userdict.txt"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
- name: "ner_mitie"
- name: "ner_synonyms"

Issue:
The language of my system is Chinese. Most of the training samples are classified correctly, but a few of them come back with a null intent. The model failed to classify the samples “在吗” (“are you there”) and “是” (“yes”). Could the model “data/total_word_feature_extractor_zh.dat” affect intent classification? Besides, this didn't happen when I used spacy_sklearn as the intent classifier. What is wrong with tf embedding or my configuration?

All 9 comments

Thanks for raising this issue; @MetcalfeTom will get back to you about it soon.

Hi @pfZhu,

Null intents are a result of out-of-vocabulary words when using the tensorflow embeddings. Are these samples in your training data?

In general, I have a couple of architecture questions - are you following this repo's advice? I would recommend using intent_classifier_mitie for this purpose, because total_word_feature_extractor_zh.dat was trained under the MITIE framework. It is possible that it produces features that are incompatible with tensorflow and are causing your null intents.
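
For illustration, a pipeline built around intent_classifier_mitie could look roughly like the sketch below. This is only a hedged adaptation of the configuration posted in this issue (the jieba dictionary_path and the MITIE model path are carried over from it), not an officially recommended setup:

```yaml
language: "zh"
pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
  dictionary_path: "data/userdict.txt"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_classifier_mitie"
```

intent_classifier_mitie classifies directly on the MITIE feature extractor loaded by nlp_mitie, so no separate intent featurizer is listed in this sketch.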

Thanks @MetcalfeTom,
Yes, the samples I mentioned come directly from the training data. At first I thought there might not be enough training data involving these words, so I copied them ten times into the training data (with ten different numbers appended to the string, such as "是0" "是1" ... "是9"), but the problem still exists.
And I did follow this advice. Most samples from the training data are classified perfectly, but just a few cause this problem. And to be honest, these problem samples don't seem like common words (in my scenario, a customer service system, they are common, but not in other Chinese systems).
I chose tf embedding because the Rasa docs say "The advantage of the tensorflow_embedding pipeline is that your word vectors will be customised for your domain.".
So you mean that, when I use this configuration, the word vectors of the tensorflow embedding come from the model total_word_feature_extractor_zh.dat, right?
Thank you for your advice. I will try the performance of intent_classifier_mitie, or try another total_word_feature_extractor_zh.dat trained on a different corpus, to see if I can fix this problem.

That is odd then. I am wondering if some additional features added by total_word_feature_extractor_zh.dat are causing tensorflow not to classify them properly.

You are correct about the pipeline customizing the word vectors to your domain; however, can I ask if you tried out the whole pipeline before using just the single component? The tensorflow pipeline looks like this:

pipeline:
- name: "tokenizer_whitespace"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"

To answer your question, the word vectors are calculated in the intent_classifier_tensorflow_embedding component. The components before that are for tokenization and featurization.

@howl-anderson do you have any insight on this?

@akelad OK, let me see.

@pfZhu My WeChat ID is here-we-meet. I need more info for this issue. We can talk on WeChat in Chinese directly. When I find where the problem is, I will update this thread in English.

The issue is caused by the intent_featurizer_count_vectors component. intent_featurizer_count_vectors internally uses CountVectorizer, which is provided by scikit-learn. CountVectorizer has an argument called token_pattern, a regular expression denoting what constitutes a “token”; its default value in Rasa NLU is r'(?u)\b\w\w+\b', and this default regexp selects only tokens of 2 or more alphanumeric characters.
But in @pfZhu's case, after tokenization the training texts are ['在 吗', '你好', '谢', '谢谢', '是', '是 的']. intent_featurizer_count_vectors will only treat 你好 and 谢谢 as tokens, because their length is 2 or greater, and it ignores the other characters. So after training, when we input '在吗' to the model, it is tokenized to '在 吗'; since neither token is in intent_featurizer_count_vectors's vocabulary, text_features will be set to [0, 0].
In the intent_classifier_tensorflow_embedding component, if a message's text_features are all zeros, it sets intent and intent_ranking to the default values {"name": None, "confidence": 0.0} and []. This is why the intent is null.
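
For illustration, the behaviour can be reproduced outside Rasa NLU with a minimal scikit-learn snippet that uses the same CountVectorizer settings described above (the example texts are the jieba-tokenized training samples from this issue):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Jieba-tokenized training texts, joined with spaces as Rasa NLU passes them on
texts = ["在 吗", "你好", "谢", "谢谢", "是", "是 的"]

# Default pattern: only tokens of 2 or more word characters make it into the vocabulary
default_vec = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b").fit(texts)
print(sorted(default_vec.vocabulary_))            # ['你好', '谢谢'] - single characters are dropped
print(default_vec.transform(["在 吗"]).toarray())  # [[0 0]] -> all-zero text_features

# Relaxed pattern: single-character tokens such as 在, 吗, 是 are kept as well
relaxed_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(texts)
print(sorted(relaxed_vec.vocabulary_))            # ['你好', '吗', '在', '是', '的', '谢', '谢谢']
```
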
One solution for this case is setting "token_pattern" to '(?u)\b\w+\b' for the intent_featurizer_count_vectors component in the pipeline, as sketched below.
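
Applied to the configuration from this issue, the relevant part of the pipeline would then look something like this (a sketch of the fix above; the other components stay unchanged):

```yaml
pipeline:
- name: "intent_featurizer_count_vectors"
  token_pattern: '(?u)\b\w+\b'
- name: "intent_classifier_tensorflow_embedding"
```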

Thanks @howl-anderson for your clear explanation. So nice of you @akelad @MetcalfeTom
