Rasa version: 1.10.3
Python version: 3.7.7
Operating system (windows, osx, ...): Windows 10
Issue:
CRFEntityExtractor fails to extract entities from Chinese text: every predicted entity spans the entire utterance instead of the annotated span.
Content of test result file CRFEntityExtractor_errors.json:
'''json
[
{
"text": "我想查的号码是12260222425",
"entities": [
{
"start": 7,
"end": 18,
"value": "12260222425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 18,
"confidence_entity": 0.9842847574519757,
"value": "我想查的号码是12260222425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "我想查身份证441222296712305681",
"entities": [
{
"start": 6,
"end": 24,
"value": "441222296712305681",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 24,
"confidence_entity": 0.9912246457406692,
"value": "我想查身份证441222296712305681",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查电话11160222425",
"entities": [
{
"start": 1,
"end": 3,
"value": "电话",
"entity": "type"
},
{
"start": 3,
"end": 14,
"value": "11160222425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 14,
"confidence_entity": 0.9876110342010952,
"value": "查电话11160222425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "请告诉我电话号码为19860222425",
"entities": [
{
"start": 4,
"end": 8,
"value": "电话号码",
"entity": "type"
},
{
"start": 9,
"end": 20,
"value": "19860222425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 20,
"confidence_entity": 0.9923245728477686,
"value": "请告诉我电话号码为19860222425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查电话号码19800222425",
"entities": [
{
"start": 1,
"end": 5,
"value": "电话号码",
"entity": "type"
},
{
"start": 5,
"end": 16,
"value": "19800222425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 16,
"confidence_entity": 0.9947824185301072,
"value": "查电话号码19800222425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查询电话19862618425",
"entities": [
{
"start": 2,
"end": 4,
"value": "电话",
"entity": "type"
},
{
"start": 4,
"end": 15,
"value": "19862618425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 15,
"confidence_entity": 0.9904694037192807,
"value": "查询电话19862618425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查询身份证",
"entities": [
{
"start": 2,
"end": 5,
"value": "身份证",
"entity": "type"
}
],
"predicted_entities": [
{
"entity": "type",
"start": 0,
"end": 5,
"confidence_entity": 0.8171515788864899,
"value": "查询身份证",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "我要查下身份证",
"entities": [
{
"start": 4,
"end": 7,
"value": "身份证",
"entity": "type"
}
],
"predicted_entities": [
{
"entity": "type",
"start": 0,
"end": 7,
"confidence_entity": 0.8545899093913647,
"value": "我要查下身份证",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查询的号码是11160222425",
"entities": [
{
"start": 6,
"end": 17,
"value": "11160222425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 17,
"confidence_entity": 0.9835158735904833,
"value": "查询的号码是11160222425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "12260222425这手机号",
"entities": [
{
"start": 0,
"end": 11,
"value": "12260222425",
"entity": "number"
},
{
"start": 12,
"end": 15,
"value": "手机号",
"entity": "type"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 15,
"confidence_entity": 0.955705773586323,
"value": "12260222425这手机号",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "我想查身份证",
"entities": [
{
"start": 3,
"end": 6,
"value": "身份证",
"entity": "type"
}
],
"predicted_entities": [
{
"entity": "type",
"start": 0,
"end": 6,
"confidence_entity": 0.7205535113669648,
"value": "我想查身份证",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "帮我查询身份证",
"entities": [
{
"start": 4,
"end": 7,
"value": "身份证",
"entity": "type"
}
],
"predicted_entities": [
{
"entity": "type",
"start": 0,
"end": 7,
"confidence_entity": 0.8164589270039054,
"value": "帮我查询身份证",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "最近的违章记录",
"entities": [
{
"start": 3,
"end": 7,
"value": "违章记录",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 7,
"confidence_entity": 0.984410185837026,
"value": "最近的违章记录",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "住宿信息怎么样",
"entities": [
{
"start": 0,
"end": 4,
"value": "住宿信息",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 7,
"confidence_entity": 0.8651053068267097,
"value": "住宿信息怎么样",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查违法信息",
"entities": [
{
"start": 1,
"end": 5,
"value": "违法信息",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 5,
"confidence_entity": 0.9656421865682633,
"value": "查违法信息",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "我要查询这个人的犯罪记录",
"entities": [
{
"start": 8,
"end": 12,
"value": "犯罪记录",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 12,
"confidence_entity": 0.9757523659603139,
"value": "我要查询这个人的犯罪记录",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "这个人有没有违法",
"entities": [
{
"start": 6,
"end": 8,
"value": "违法",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 8,
"confidence_entity": 0.768765304063969,
"value": "这个人有没有违法",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查下教育信息",
"entities": [
{
"start": 2,
"end": 6,
"value": "教育信息",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 6,
"confidence_entity": 0.9995679280138035,
"value": "查下教育信息",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查下今天上海的天气",
"entities": [
{
"start": 2,
"end": 4,
"value": "今天",
"entity": "date_time"
},
{
"start": 4,
"end": 6,
"value": "上海",
"entity": "address"
}
],
"predicted_entities": [
{
"entity": "address",
"start": 0,
"end": 9,
"confidence_entity": 0.979807108097493,
"value": "查下今天上海的天气",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "深圳昨天下冰雹了,明天还会下吗?",
"entities": [
{
"start": 0,
"end": 2,
"value": "深圳",
"entity": "address"
},
{
"start": 9,
"end": 11,
"value": "明天",
"entity": "date_time"
}
],
"predicted_entities": [
{
"entity": "date_time",
"start": 0,
"end": 16,
"confidence_entity": 0.8427497095495378,
"value": "深圳昨天下冰雹了,明天还会下吗?",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查下广州的天气怎么样",
"entities": [
{
"start": 2,
"end": 4,
"value": "广州",
"entity": "address"
}
],
"predicted_entities": [
{
"entity": "address",
"start": 0,
"end": 10,
"confidence_entity": 0.8881646255574454,
"value": "查下广州的天气怎么样",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "我现在在上海",
"entities": [
{
"start": 4,
"end": 6,
"value": "上海",
"entity": "address"
}
],
"predicted_entities": [
{
"entity": "address",
"start": 0,
"end": 6,
"confidence_entity": 0.9762155698306253,
"value": "我现在在上海",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "常德的天气怎么样?",
"entities": [
{
"start": 0,
"end": 2,
"value": "常德",
"entity": "address"
}
],
"predicted_entities": [
{
"entity": "address",
"start": 0,
"end": 9,
"confidence_entity": 0.765170136595972,
"value": "常德的天气怎么样?",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "今天的",
"entities": [
{
"start": 0,
"end": 2,
"value": "今天",
"entity": "date_time"
}
],
"predicted_entities": [
{
"entity": "date_time",
"start": 0,
"end": 3,
"confidence_entity": 0.7083242453974274,
"value": "今天的",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查询深圳的天气",
"entities": [
{
"start": 2,
"end": 4,
"value": "深圳",
"entity": "address"
}
],
"predicted_entities": [
{
"entity": "address",
"start": 0,
"end": 7,
"confidence_entity": 0.9755224976844531,
"value": "查询深圳的天气",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "帮我查下今天的",
"entities": [
{
"start": 4,
"end": 6,
"value": "今天",
"entity": "date_time"
}
],
"predicted_entities": [
{
"entity": "date_time",
"start": 0,
"end": 7,
"confidence_entity": 0.9040225226085086,
"value": "帮我查下今天的",
"extractor": "CRFEntityExtractor"
}
]
}
]
'''
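The pattern in this file can be checked mechanically: in each record the predicted entity starts at 0 and ends at the length of the text. The following is a minimal sketch of such a check, not part of Rasa; the helper names `spans_whole_text` and `count_whole_text_predictions` are made up for illustration, the first two sample records are abridged from the file above, and the third is a hypothetical correct prediction added only for contrast.

```python
def spans_whole_text(record):
    """True if some predicted entity covers the utterance from 0 to len(text)."""
    text_len = len(record["text"])
    return any(
        pred["start"] == 0 and pred["end"] == text_len
        for pred in record["predicted_entities"]
    )


def count_whole_text_predictions(records):
    """Return (records with a whole-utterance prediction, total records)."""
    hits = sum(1 for record in records if spans_whole_text(record))
    return hits, len(records)


sample = [
    # Abridged from the error file above.
    {"text": "今天的",
     "predicted_entities": [{"entity": "date_time", "start": 0, "end": 3}]},
    {"text": "我现在在上海",
     "predicted_entities": [{"entity": "address", "start": 0, "end": 6}]},
    # Hypothetical correct prediction, added only for contrast.
    {"text": "我现在在上海",
     "predicted_entities": [{"entity": "address", "start": 4, "end": 6}]},
]

hits, total = count_whole_text_predictions(sample)
print(f"{hits} of {total} records have a whole-utterance prediction")
```

To run the same check on the real output, load CRFEntityExtractor_errors.json with `json.load` and pass the resulting list in place of `sample`.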
Content of nlu model files (*.md) :
weather_nlu.md:
'''md
number_nlu.md:
'''md
开房信息
chitchat_nlu.md
'''md
Content of configuration file (config.yml) :
language: "zh"
pipeline:
- name: "MitieNLP"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "JiebaTokenizer"
  dictionary_path: "data/dict"
- name: "RegexFeaturizer"
- name: "MitieFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"
policies:
- name: KerasPolicy
  epochs: 500
  max_history: 5
- name: FallbackPolicy
  fallback_action_name: 'action_default_fallback'
- name: MemoizationPolicy
  max_history: 5
- name: FormPolicy
Content of domain file (domain.yml) :
session_config:
  session_expiration_time: 60
  carry_over_slots_to_new_session: true
intents:
- affirm
- deny
- greet
- goodbye
- thanks
- whoareyou
- whattodo
- request_weather
- request_number
- inform
- inform_business
- stop
- chitchat
entities:
- date_time
- address
- type
- number
- business
- data_time
slots:
  address:
    type: unfeaturized
    auto_fill: false
  business:
    type: unfeaturized
    auto_fill: false
  date_time:
    type: unfeaturized
    auto_fill: false
  number:
    type: unfeaturized
    auto_fill: false
  requested_slot:
    type: unfeaturized
  type:
    type: unfeaturized
    auto_fill: false
responses:
  utter_answer_affirm:
  - text: 嗯嗯,好的!
  - text: 嗯嗯,很开心能够帮您解决问题~
  - text: 嗯嗯,还需要什么我能够帮助您的呢?
  utter_answer_greet:
  - text: 您好!请问我可以帮到您吗?
  - text: 您好!很高兴为您服务。请说出您要查询的功能?
  utter_answer_goodbye:
  - text: 再见
  - text: 拜拜
  - text: 虽然我有万般舍不得,但是天下没有不散的宴席~祝您安好!
  - text: 期待下次再见!
  - text: 嗯嗯,下次需要时随时记得我哟~
  - text: see you!
  utter_answer_deny:
  - text: 主人,您不开心吗?不要离开我哦
  - text: 怎么了,主人?
  utter_answer_thanks:
  - text: 嗯呢。不用客气~
  - text: 这是我应该做的,主人~
  - text: 嗯嗯,合作愉快!
  utter_answer_whoareyou:
  - text: 您好!我是小蒋呀,您的AI智能助理
  utter_answer_whattodo:
  - text: 您好!很高兴为您服务,我目前只支持查询天气哦。
  utter_ask_date_time:
  - text: 请问您要查询哪一天的天气?
  utter_ask_address:
  - text: 请问您要查下哪里的天气?
  utter_ask_number:
  - text: 请问您要查的{type}号码是多少?
  utter_ask_business:
  - text: 请问您要查询什么业务呢?
  utter_default:
  - text: 没听懂,请换种说法吧~
  utter_ask_continue:
  - text: 请问您还要继续吗?
  utter_noworries:
  - text: 不用客气 :)
  - text: 没事啦
  - text: 不客气哈,都是老朋友了 :)
  utter_wrong_business:
  - text: 当前还不支持{business}业务,请重新输入。
  utter_wrong_type:
  - text: 当前还不支持查询{type}。
  utter_wrong_number:
  - text: 您输入的{number}有误,请重新输入。
  utter_chitchat:
  - text: 呃呃呃呃呃
  - text: 您这是在尬聊吗?
actions:
- utter_answer_affirm
- utter_answer_deny
- utter_answer_greet
- utter_answer_goodbye
- utter_answer_thanks
- utter_answer_whoareyou
- utter_answer_whattodo
- utter_ask_date_time
- utter_ask_address
- utter_ask_number
- utter_ask_business
- utter_ask_type
- action_default_fallback
- utter_default
- utter_ask_continue
- utter_noworries
- utter_wrong_business
- utter_wrong_type
- utter_wrong_number
- utter_chitchat
forms:
- weather_form
- number_form
What result does the model trained on this data and pipeline return, and what result did you expect?
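Hi, I remember your name. The test results of the trained model are listed above.
For example, my input is: 今天天气怎么样? and the entity the CRF outputs is the whole sentence: 今天天气怎么样。
The output I expect is: 今天. MitieEntityExtractor produces this correct output, but it is too slow to train.
Any advice would be appreciated, thanks!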
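If MITIE works well but CRF performs poorly, the biggest difference is most likely feature extraction: MITIE is pre-trained, so it comes with very good features, while the features the CRF itself provides are not rich enough (or may not be configured well), which leads to worse results. You can reach me on WeChat at here-we-meet; talking there will be more efficient.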
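To make the point about feature extraction concrete, here is a minimal sketch of the kind of hand-written, sliding-window features a CRF tagger typically consumes. It is an illustration only, not Rasa's implementation: the function `token_features`, the feature names, and the token segmentation of 查电话11160222425 are all assumptions made for this example.

```python
def token_features(tokens, i):
    """Build features for tokens[i] from a window of previous/current/next token."""
    feats = {"bias": 1.0}
    for offset, prefix in ((-1, "prev"), (0, "cur"), (1, "next")):
        j = i + offset
        if 0 <= j < len(tokens):
            tok = tokens[j]
            feats[f"{prefix}:text"] = tok
            feats[f"{prefix}:is_digit"] = tok.isdigit()
            feats[f"{prefix}:len"] = len(tok)
        else:
            # Window position falls outside the sentence.
            feats[f"{prefix}:pad"] = True
    return feats


# Hypothetical segmentation of "查电话11160222425".
tokens = ["查", "电话", "11160222425"]
features = [token_features(tokens, i) for i in range(len(tokens))]

for tok, feats in zip(tokens, features):
    print(tok, feats)
```

With only a handful of surface features like these, the tagger has far less to work with than a model that starts from pre-trained word representations, which is the contrast described above.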
We will discuss this over a local IM. If we find a good solution, I will report back here.
Many thanks to @howl-anderson for the help.
This issue has been solved in rasa 1.10.5.
The relevant code is at line 198 of crf_entity_extractor.py: the call to the function clean_up_entities has been removed.
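The thread does not say why removing that call helps. One plausible reading, stated here as an assumption rather than a description of Rasa's code, is that a clean-up step treats tokens with no whitespace between them as pieces of a single word and widens an entity to the whole word. Chinese tokens always touch, so the "word" becomes the whole sentence. The sketch below uses made-up helpers (`cluster_touching_tokens`, `widen_to_cluster`) and a hypothetical segmentation of 我现在在上海.

```python
def cluster_touching_tokens(tokens):
    """Group (start, end) offsets; a token joins the previous cluster
    when it starts exactly where the previous token ended."""
    clusters = []
    for start, end in tokens:
        if clusters and clusters[-1][-1][1] == start:
            clusters[-1].append((start, end))
        else:
            clusters.append([(start, end)])
    return clusters


def widen_to_cluster(entity, clusters):
    """Extend an entity span so it covers every cluster it overlaps."""
    start, end = entity
    for cluster in clusters:
        c_start, c_end = cluster[0][0], cluster[-1][1]
        if start < c_end and end > c_start:  # the spans overlap
            start, end = min(start, c_start), max(end, c_end)
    return start, end


# "我现在在上海", hypothetical segmentation: 我 / 现在 / 在 / 上海
zh_tokens = [(0, 1), (1, 3), (3, 4), (4, 6)]
widened_zh = widen_to_cluster((4, 6), cluster_touching_tokens(zh_tokens))

# "I am in Shanghai": whitespace keeps the tokens apart.
en_tokens = [(0, 1), (2, 4), (5, 7), (8, 16)]
widened_en = widen_to_cluster((8, 16), cluster_touching_tokens(en_tokens))

print(widened_zh, widened_en)
```

Under this assumption the Chinese span for 上海 grows from (4, 6) to (0, 6), which matches the start and end reported for that utterance in the error file, while the whitespace-separated English span is left unchanged.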