Rasa: DIET classifier _predict_entities: clean_up_entities produces wrong entities for Chinese

Created on 9 Jun 2020 · 3 Comments · Source: RasaHQ/rasa

Rasa version:
1.10.1
Rasa SDK version (if used & relevant):
1.10.1
Rasa X version (if used & relevant):

Python version:
3.7.3
Operating system (windows, osx, ...):
osx
Issue:
Chinese entities are predicted correctly by the DIET model, but are then changed to wrong values by entities = self.clean_up_entities(message, entities)

Error (including full traceback):

In rasa/nlu/extractors/extractor.py, the _token_clusters function treats the whole Chinese sentence as a single word, so the correctly predicted entities become wrong:
    def _token_clusters(tokens: List[Token]) -> List[List[Token]]:
        """Build clusters of tokens that belong to one word.

        Args:
            tokens: list of tokens

        Returns:
            Token clusters.

        """
        # token cluster = list of token indices that belong to one word
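
To illustrate the behaviour described above, here is a minimal sketch. It is not the Rasa implementation: the Token type, char_tokens and adjacency_clusters are hypothetical names, and the sketch assumes that tokens are grouped into one "word" whenever there is no gap between the end of one token and the start of the next. Under that assumption, a Chinese sentence (written without spaces, one token per character) collapses into a single cluster, while whitespace-separated text does not.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for a tokenizer token (not rasa's Token class)."""
    text: str
    start: int
    end: int


def char_tokens(text: str) -> List[Token]:
    """One token per character, as an approximation of sub-word tokens for Chinese."""
    return [Token(ch, i, i + 1) for i, ch in enumerate(text)]


def adjacency_clusters(tokens: List[Token]) -> List[List[Token]]:
    """Group tokens that touch each other (assumed rule: previous end == next start)."""
    clusters: List[List[Token]] = []
    for token in tokens:
        if clusters and clusters[-1][-1].end == token.start:
            clusters[-1].append(token)  # no gap: same "word"
        else:
            clusters.append([token])  # gap (e.g. a space): new "word"
    return clusters


# Chinese: every character touches the next one, so everything is one cluster.
chinese_clusters = adjacency_clusters(char_tokens("我想换成顺丰快递"))

# Whitespace-separated text: the space leaves a gap, so there are two clusters.
english_clusters = adjacency_clusters([Token("switch", 0, 6), Token("to", 7, 9)])

print(len(chinese_clusters), len(english_clusters))
```

With a rule like this, any clean-up that works per cluster sees the entire Chinese sentence as one word.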


The debug output shows the whole sentence "我想换成顺丰快递" ("I want to switch to SF Express") turned into a single entity:
2020-06-09 15:57:27 DEBUG    rasa.core.processor  - Received user message '我想换成顺丰快递' with intent '{'name': 'inform_choose_delivery', 'confide4753126800060272}' and entities '[{'entity': 'delivery', 'start': 0, 'end': 8, 'value': '我想换成顺丰快递', 'extractor': 'DIETClassifier'}]'

After commenting out line 806 of nlu/classifiers/diet_classifier.py:
entities = self.clean_up_entities(message, entities)
the entities are output correctly.
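
The effect of that clean-up step can be sketched as follows. This is a simplified model, not Rasa's code: expand_to_clusters is a hypothetical helper, the predicted span (characters 4 to 8) is an assumed model output for illustration, and the sketch assumes the clean-up widens an entity to cover every word cluster it overlaps.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of one word cluster


def expand_to_clusters(entity: Dict, clusters: List[Span]) -> Dict:
    """Widen an entity so it covers every cluster it overlaps (assumed clean-up rule)."""
    start, end = entity["start"], entity["end"]
    for c_start, c_end in clusters:
        if c_start < end and start < c_end:  # cluster overlaps the entity span
            start, end = min(start, c_start), max(end, c_end)
    return {**entity, "start": start, "end": end}


text = "我想换成顺丰快递"
# Hypothetical model prediction: only the last four characters are the entity.
predicted = {"entity": "delivery", "start": 4, "end": 8}

# Whole sentence treated as one word, as reported in this issue.
one_cluster = [(0, len(text))]
# One cluster per character, the granularity that would keep the span intact.
per_char = [(i, i + 1) for i in range(len(text))]

widened = expand_to_clusters(predicted, one_cluster)
unchanged = expand_to_clusters(predicted, per_char)

print(text[widened["start"]:widened["end"]])
print(text[unchanged["start"]:unchanged["end"]])
```

In this sketch the single-cluster case yields start 0 and end 8, the same span as in the debug output, while the per-character case leaves the predicted span as it was.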

Command or request that led to error:

Content of configuration file (config.yml) (if relevant):

language: zh

pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "bert-base-chinese"
    cache_dir: null
  - name: customrasa.printer.Printer
    alias: after HFTransformersNLP
#  - name: "JiebaTokenizer"
#    # Flag to check whether to split intents
#    "intent_tokenization_flag": False
#    # Symbol on which intent should be split
#    "intent_split_symbol": "_"
  - name: EntitySynonymMapper
  - name: "LanguageModelTokenizer"
    "intent_tokenization_flag": False
    # Symbol on which intent should be split
    "intent_split_symbol": "_"
  - name: customrasa.printer.Printer
    alias: after LanguageModelTokenizer
  - name: LanguageModelFeaturizer
#  - name: DucklingHTTPExtractor
#    url: http://localhost:8000
#    dimensions:
#      - number
  - name: customrasa.printer.Printer
    alias: after LanguageModelFeaturizer
  - name: DIETClassifier
    epochs: 100
  - name: customrasa.printer.Printer
    alias: after DIETClassifier

policies:
  - name: FormPolicy
  - name: FallbackPolicy
  - name: MemoizationPolicy
  - name: MappingPolicy
  - name: TEDPolicy

Content of domain file (domain.yml) (if relevant):

Source

johnson7788

Most helpful comment

@johnson7788 Thanks for submitting the issue. It was already solved in https://github.com/RasaHQ/rasa/pull/5756, and the fix will be released in the next minor release. It is not yet clear when that will happen, so please be patient. If you want to use the DIETClassifier with Chinese in the meantime, the best option is probably Rasa 1.9.7.