Rasa version:
1.10.1
Rasa SDK version (if used & relevant):
1.10.1
Rasa X version (if used & relevant):
Python version:
3.7.3
Operating system (windows, osx, ...):
osx
Issue:
Chinese entities are predicted correctly by the DIET model, but are then changed to wrong values by entities = self.clean_up_entities(message, entities)
Error (including full traceback):
No traceback. In rasa/nlu/extractors/extractor.py, the function _token_clusters treats the entire Chinese sentence as a single word, so the correct entities are turned into wrong ones.
def _token_clusters(tokens: List[Token]) -> List[List[Token]]:
    """Build clusters of tokens that belong to one word.

    Args:
        tokens: list of tokens

    Returns:
        Token clusters.
    """
    # token cluster = list of token indices that belong to one word
In the debug output, the whole sentence "我想换成顺丰快递" becomes a single entity:
2020-06-09 15:57:27 DEBUG rasa.core.processor - Received user message '我想换成顺丰快递' with intent '{'name': 'inform_choose_delivery', 'confide4753126800060272}' and entities '[{'entity': 'delivery', 'start': 0, 'end': 8, 'value': '我想换成顺丰快递', 'extractor': 'DIETClassifier'}]'
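To illustrate why the whole sentence can end up as one entity, here is a minimal sketch. It assumes that tokens are clustered into "words" whenever their character offsets touch (no whitespace between them) and that the cleanup step widens an entity to the full cluster it overlaps. This is not the actual Rasa code: `SimpleToken`, `char_tokens`, `cluster_adjacent_tokens`, and `expand_to_cluster` are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple, Tuple


class SimpleToken(NamedTuple):
    """Hypothetical stand-in for a tokenizer token (not Rasa's Token class)."""
    text: str
    start: int
    end: int


def char_tokens(text: str) -> List[SimpleToken]:
    """One token per non-space character, keeping character offsets."""
    return [SimpleToken(ch, i, i + 1) for i, ch in enumerate(text) if not ch.isspace()]


def cluster_adjacent_tokens(tokens: List[SimpleToken]) -> List[List[SimpleToken]]:
    """Group tokens whose offsets touch, i.e. with no gap between them (assumed rule)."""
    clusters: List[List[SimpleToken]] = []
    for token in tokens:
        if clusters and clusters[-1][-1].end == token.start:
            clusters[-1].append(token)
        else:
            clusters.append([token])
    return clusters


def expand_to_cluster(
    start: int, end: int, clusters: List[List[SimpleToken]]
) -> Tuple[int, int]:
    """Widen an entity span to cover every cluster it overlaps (assumed cleanup behaviour)."""
    new_start, new_end = start, end
    for cluster in clusters:
        c_start, c_end = cluster[0].start, cluster[-1].end
        if c_start < end and start < c_end:
            new_start = min(new_start, c_start)
            new_end = max(new_end, c_end)
    return new_start, new_end


sentence = "我想换成顺丰快递"
clusters = cluster_adjacent_tokens(char_tokens(sentence))
print(len(clusters))                      # 1: no whitespace, so one "word"
print(expand_to_cluster(4, 8, clusters))  # (0, 8): "顺丰快递" grows to the full sentence
```

With a whitespace-separated input such as "ship via SF", the same clustering yields three separate clusters, which is why the problem only shows up for languages written without spaces.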
After commenting out the following line in nlu/classifiers/diet_classifier.py (line 806):
entities = self.clean_up_entities(message, entities)
the output is correct.
Command or request that led to error:
Content of configuration file (config.yml) (if relevant):
language: zh
pipeline:
- name: HFTransformersNLP
  model_name: "bert"
  model_weights: "bert-base-chinese"
  cache_dir: null
- name: customrasa.printer.Printer
  alias: after HFTransformersNLP
# - name: "JiebaTokenizer"
#   # Flag to check whether to split intents
#   "intent_tokenization_flag": False
#   # Symbol on which intent should be split
#   "intent_split_symbol": "_"
- name: EntitySynonymMapper
- name: "LanguageModelTokenizer"
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "_"
- name: customrasa.printer.Printer
  alias: after LanguageModelTokenizer
- name: LanguageModelFeaturizer
# - name: DucklingHTTPExtractor
#   url: http://localhost:8000
#   dimensions:
#   - number
- name: customrasa.printer.Printer
  alias: after LanguageModelFeaturizer
- name: DIETClassifier
  epochs: 100
- name: customrasa.printer.Printer
  alias: after DIETClassifier
policies:
- name: FormPolicy
- name: FallbackPolicy
- name: MemoizationPolicy
- name: MappingPolicy
- name: TEDPolicy
Content of domain file (domain.yml) (if relevant):
@johnson7788 Thanks for submitting the issue. The issue was already solved in https://github.com/RasaHQ/rasa/pull/5756. It will be released in the next minor release. It is not yet clear when this will happen, so please be patient. If you want to use the DIETClassifier and Chinese language, I guess best would be to use Rasa 1.9.7.
Thank you very much, high efficiency