Rasa: CRFEntityExtractor predicted_entities problem

Created on 8 Jul 2020  ·  5 Comments  ·  Source: RasaHQ/rasa

Rasa version: 1.10.3

Python version: 3.7.7

Operating system (windows, osx, ...): Windows 10

Issue:
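CRFEntityExtractor does not extract entities correctly from Chinese text. In every failing test case below, the predicted entity spans the entire message instead of the annotated substring.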

Content of test result file CRFEntityExtractor_errors.json:
'''json
[
{
"text": "我想查的号码是12260222425",
"entities": [
{
"start": 7,
"end": 18,
"value": "12260222425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 18,
"confidence_entity": 0.9842847574519757,
"value": "我想查的号码是12260222425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "我想查身份证441222296712305681",
"entities": [
{
"start": 6,
"end": 24,
"value": "441222296712305681",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 24,
"confidence_entity": 0.9912246457406692,
"value": "我想查身份证441222296712305681",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查电话11160222425",
"entities": [
{
"start": 1,
"end": 3,
"value": "电话",
"entity": "type"
},
{
"start": 3,
"end": 14,
"value": "11160222425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 14,
"confidence_entity": 0.9876110342010952,
"value": "查电话11160222425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "请告诉我电话号码为19860222425",
"entities": [
{
"start": 4,
"end": 8,
"value": "电话号码",
"entity": "type"
},
{
"start": 9,
"end": 20,
"value": "19860222425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 20,
"confidence_entity": 0.9923245728477686,
"value": "请告诉我电话号码为19860222425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查电话号码19800222425",
"entities": [
{
"start": 1,
"end": 5,
"value": "电话号码",
"entity": "type"
},
{
"start": 5,
"end": 16,
"value": "19800222425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 16,
"confidence_entity": 0.9947824185301072,
"value": "查电话号码19800222425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查询电话19862618425",
"entities": [
{
"start": 2,
"end": 4,
"value": "电话",
"entity": "type"
},
{
"start": 4,
"end": 15,
"value": "19862618425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 15,
"confidence_entity": 0.9904694037192807,
"value": "查询电话19862618425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查询身份证",
"entities": [
{
"start": 2,
"end": 5,
"value": "身份证",
"entity": "type"
}
],
"predicted_entities": [
{
"entity": "type",
"start": 0,
"end": 5,
"confidence_entity": 0.8171515788864899,
"value": "查询身份证",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "我要查下身份证",
"entities": [
{
"start": 4,
"end": 7,
"value": "身份证",
"entity": "type"
}
],
"predicted_entities": [
{
"entity": "type",
"start": 0,
"end": 7,
"confidence_entity": 0.8545899093913647,
"value": "我要查下身份证",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查询的号码是11160222425",
"entities": [
{
"start": 6,
"end": 17,
"value": "11160222425",
"entity": "number"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 17,
"confidence_entity": 0.9835158735904833,
"value": "查询的号码是11160222425",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "12260222425这手机号",
"entities": [
{
"start": 0,
"end": 11,
"value": "12260222425",
"entity": "number"
},
{
"start": 12,
"end": 15,
"value": "手机号",
"entity": "type"
}
],
"predicted_entities": [
{
"entity": "number",
"start": 0,
"end": 15,
"confidence_entity": 0.955705773586323,
"value": "12260222425这手机号",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "我想查身份证",
"entities": [
{
"start": 3,
"end": 6,
"value": "身份证",
"entity": "type"
}
],
"predicted_entities": [
{
"entity": "type",
"start": 0,
"end": 6,
"confidence_entity": 0.7205535113669648,
"value": "我想查身份证",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "帮我查询身份证",
"entities": [
{
"start": 4,
"end": 7,
"value": "身份证",
"entity": "type"
}
],
"predicted_entities": [
{
"entity": "type",
"start": 0,
"end": 7,
"confidence_entity": 0.8164589270039054,
"value": "帮我查询身份证",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "最近的违章记录",
"entities": [
{
"start": 3,
"end": 7,
"value": "违章记录",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 7,
"confidence_entity": 0.984410185837026,
"value": "最近的违章记录",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "住宿信息怎么样",
"entities": [
{
"start": 0,
"end": 4,
"value": "住宿信息",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 7,
"confidence_entity": 0.8651053068267097,
"value": "住宿信息怎么样",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查违法信息",
"entities": [
{
"start": 1,
"end": 5,
"value": "违法信息",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 5,
"confidence_entity": 0.9656421865682633,
"value": "查违法信息",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "我要查询这个人的犯罪记录",
"entities": [
{
"start": 8,
"end": 12,
"value": "犯罪记录",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 12,
"confidence_entity": 0.9757523659603139,
"value": "我要查询这个人的犯罪记录",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "这个人有没有违法",
"entities": [
{
"start": 6,
"end": 8,
"value": "违法",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 8,
"confidence_entity": 0.768765304063969,
"value": "这个人有没有违法",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查下教育信息",
"entities": [
{
"start": 2,
"end": 6,
"value": "教育信息",
"entity": "business"
}
],
"predicted_entities": [
{
"entity": "business",
"start": 0,
"end": 6,
"confidence_entity": 0.9995679280138035,
"value": "查下教育信息",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查下今天上海的天气",
"entities": [
{
"start": 2,
"end": 4,
"value": "今天",
"entity": "date_time"
},
{
"start": 4,
"end": 6,
"value": "上海",
"entity": "address"
}
],
"predicted_entities": [
{
"entity": "address",
"start": 0,
"end": 9,
"confidence_entity": 0.979807108097493,
"value": "查下今天上海的天气",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "深圳昨天下冰雹了,明天还会下吗?",
"entities": [
{
"start": 0,
"end": 2,
"value": "深圳",
"entity": "address"
},
{
"start": 9,
"end": 11,
"value": "明天",
"entity": "date_time"
}
],
"predicted_entities": [
{
"entity": "date_time",
"start": 0,
"end": 16,
"confidence_entity": 0.8427497095495378,
"value": "深圳昨天下冰雹了,明天还会下吗?",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查下广州的天气怎么样",
"entities": [
{
"start": 2,
"end": 4,
"value": "广州",
"entity": "address"
}
],
"predicted_entities": [
{
"entity": "address",
"start": 0,
"end": 10,
"confidence_entity": 0.8881646255574454,
"value": "查下广州的天气怎么样",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "我现在在上海",
"entities": [
{
"start": 4,
"end": 6,
"value": "上海",
"entity": "address"
}
],
"predicted_entities": [
{
"entity": "address",
"start": 0,
"end": 6,
"confidence_entity": 0.9762155698306253,
"value": "我现在在上海",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "常德的天气怎么样?",
"entities": [
{
"start": 0,
"end": 2,
"value": "常德",
"entity": "address"
}
],
"predicted_entities": [
{
"entity": "address",
"start": 0,
"end": 9,
"confidence_entity": 0.765170136595972,
"value": "常德的天气怎么样?",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "今天的",
"entities": [
{
"start": 0,
"end": 2,
"value": "今天",
"entity": "date_time"
}
],
"predicted_entities": [
{
"entity": "date_time",
"start": 0,
"end": 3,
"confidence_entity": 0.7083242453974274,
"value": "今天的",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "查询深圳的天气",
"entities": [
{
"start": 2,
"end": 4,
"value": "深圳",
"entity": "address"
}
],
"predicted_entities": [
{
"entity": "address",
"start": 0,
"end": 7,
"confidence_entity": 0.9755224976844531,
"value": "查询深圳的天气",
"extractor": "CRFEntityExtractor"
}
]
},
{
"text": "帮我查下今天的",
"entities": [
{
"start": 4,
"end": 6,
"value": "今天",
"entity": "date_time"
}
],
"predicted_entities": [
{
"entity": "date_time",
"start": 0,
"end": 7,
"confidence_entity": 0.9040225226085086,
"value": "帮我查下今天的",
"extractor": "CRFEntityExtractor"
}
]
}
]
'''
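The pattern in the report above can be checked mechanically: each predicted entity starts at 0 and ends at the length of the message. The following is a minimal sketch, not part of the original report. The helper name `whole_text_predictions` is hypothetical, and the sample holds two entries copied from the file above with some fields trimmed.

```python
def whole_text_predictions(errors):
    """Return (text, entity) pairs where a prediction covers the whole message."""
    flagged = []
    for item in errors:
        text = item["text"]
        for pred in item.get("predicted_entities", []):
            # A span of [0, len(text)) means the entire message was tagged.
            if pred["start"] == 0 and pred["end"] == len(text):
                flagged.append((text, pred["entity"]))
    return flagged


# Two entries copied from CRFEntityExtractor_errors.json (fields trimmed).
sample = [
    {
        "text": "查电话11160222425",
        "entities": [
            {"start": 1, "end": 3, "value": "电话", "entity": "type"},
            {"start": 3, "end": 14, "value": "11160222425", "entity": "number"},
        ],
        "predicted_entities": [
            {"entity": "number", "start": 0, "end": 14, "value": "查电话11160222425"}
        ],
    },
    {
        "text": "今天的",
        "entities": [
            {"start": 0, "end": 2, "value": "今天", "entity": "date_time"}
        ],
        "predicted_entities": [
            {"entity": "date_time", "start": 0, "end": 3, "value": "今天的"}
        ],
    },
]

for text, entity in whole_text_predictions(sample):
    print(f"{entity!r} covers the whole message: {text}")
```

To run the same check on the real file, the list could be loaded with `json.load` from wherever the test run wrote CRFEntityExtractor_errors.json.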

Content of nlu model files (*.md) :
weather_nlu.md:
'''md

intent:request_weather

intent:inform

intent:greet

  • 你好
  • hello
  • 你好呀
    '''

number_nlu.md:
'''md

intent:request_number

intent:inform_business

synonym:身份证号码

  • 身份证
  • 身份证号码

synonym:电话号码

  • 电话
  • 电话号
  • 电话号码
  • 手机
  • 手机号
  • 手机号码

synonym:违章记录

  • 违章
  • 违章信息
  • 违章记录
  • 违章处罚
  • 违章情况

synonym:教育经历

  • 教育
  • 教育经历
  • 教育信息
  • 教育情况

synonym:出行轨迹

  • 出行
  • 出行路线
  • 出行信息
  • 出行轨迹
  • 出行记录

synonym:开房信息

  • 开房记录
  • 开房
  • 住宿记录
  • 住宿信息
  • 开房信息

  • synonym:犯罪记录

  • 犯罪
  • 犯罪信息
  • 犯罪记录
  • 违法记录
  • 违法
  • 违法犯罪

regex:phone_number

  • ((\d{3,4}-)?\d{7,8})|(((+86)|(86))?(1)\d{10})

regex:number

  • ([1-9]\d{5}(18|19|([23]\d))\d{2}((0[1-9])|(10|11|12))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx])|([1-9]\d{5}\d{2}((0[1-9])|(10|11|12))(([0-2][1-9])|10|20|30|31)\d{2}[0-9Xx])
    '''

chitchat_nlu.md:
'''md

intent:greet

  • 你好
  • 你好啊
  • 早上好
  • 晚上好
  • hello
  • hi
  • 嗨喽
  • 见到你很高兴
  • 上午好
  • hello哈喽
  • 哈喽哈喽
  • hello hello
  • 喂喂

intent:goodbye

  • goodbye
  • bye
  • bye bye
  • 88
  • 886
  • 再见
  • 拜拜
  • 拜拜,下次再聊
  • 下次见
  • 回头见
  • 下次再见
  • 下次再聊
  • 有空再聊
  • 先这样吧
  • 好了,就说这么多了
  • 好了,先这样
  • 没事

intent:whoareyou

  • 你是谁
  • 我知道你吗
  • 我认识你吗
  • 这是谁啊
  • 是谁
  • 请问你是谁
  • 请问我认识你吗
  • 你是哪位
  • 你是?
  • 是谁?
  • 可以告诉我你的名字吗
  • 你叫什么名字

intent:whattodo

  • 你支持什么功能
  • 你有什么功能
  • 你能干什么
  • 你能做什么

intent:thanks

  • 谢谢
  • thanks
  • thank you
  • 真的太感谢你了,帮了我大忙
  • 谢谢你帮了我大忙
  • 你帮了我大忙,谢谢你小智
  • 非常感谢
  • 谢了

intent:deny

  • 不了
  • 算了
  • 不用了
  • 不需要
  • 就这样吧
  • no
  • 不可以
  • 不是的
  • 不认同
  • 否定
  • 不是这样子的
  • 我不同意你的观点
  • 不同意
  • 不好
  • 你长得很美,就不要想得太美。
  • 拒绝
  • 不行

intent:affirm

  • 知道了
  • 是的
  • 当然
  • 好的
  • ok
  • 可以
  • 你可以这么做
  • 你做得可以啊
  • 同意
  • 听起来不错
  • 是这样的
  • 的确是这样子的
  • 我同意你的观点
  • 对的
  • 好滴
  • 还行
  • 当然可以

intent:chitchat

  • 你是不是傻
  • 你喜欢靓仔靓女吗
  • 能跟我说说你的男朋友吗
  • 嘿,朋友,你好啊?
  • 你是堆的
  • 你今天过得开心吗
  • 你没事吧
  • 今天感觉怎么样
  • 广州今天下雨了
  • 我能够知道谁邀请你吗
  • 请告诉我谁创造了你
  • 你的老板是谁
  • 我想知道谁发明了你
  • 你是哪家公司的
  • 你是谁发明的
  • 你是什么
  • 你还好吗
  • 你现在在干什么
  • 最近怎么样
  • 你最近过得怎么样
  • 谁是林俊杰
  • 卢十瓦是不是疯了
  • 你这个早上干什么了
  • 外面真安静
  • 现在的生活变得越来越好了
  • 是不是下雨了
  • 今天的天气很好,不是吗
  • 外面好冷啊
  • 你认识张天爱吗
  • 现在的世界真乱,你说是不是
  • 你今天看不起不错哟
  • 你觉得华为手机怎么样
  • 你知不知道新型冠形病毒
  • 你现在在干什么

intent:stop

  • 你是不是傻
  • 你就像个傻逼一样
  • 真差劲,什么都不懂
  • 真蠢
  • 什么都查不到
  • 你什么都不会
  • 感觉你什么都不知道
  • 你太蠢了
  • 我不查了,太傻了
  • 就这样吧
  • 感觉你什么都不会
  • 没有了
  • 什么都没查到
  • 我猜你是不是坏了
    '''

Content of configuration file (config.yml) :

language: "zh"

pipeline:
- name: "MitieNLP"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "JiebaTokenizer"
  dictionary_path: "data/dict"
- name: "RegexFeaturizer"
- name: "MitieFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"

policies:
  - name: KerasPolicy
    epochs: 500
    max_history: 5
  - name: FallbackPolicy
    fallback_action_name: 'action_default_fallback'
  - name: MemoizationPolicy
    max_history: 5
  - name: FormPolicy

Content of domain file (domain.yml) :

session_config:
  session_expiration_time: 60
  carry_over_slots_to_new_session: true
intents:
- affirm
- deny
- greet
- goodbye
- thanks
- whoareyou
- whattodo
- request_weather
- request_number
- inform
- inform_business
- stop
- chitchat
entities:
- date_time
- address
- type
- number
- business
- data_time
slots:
  address:
    type: unfeaturized
    auto_fill: false
  business:
    type: unfeaturized
    auto_fill: false
  date_time:
    type: unfeaturized
    auto_fill: false
  number:
    type: unfeaturized
    auto_fill: false
  requested_slot:
    type: unfeaturized
  type:
    type: unfeaturized
    auto_fill: false
responses:
  utter_answer_affirm:
  - text: 嗯嗯,好的!
  - text: 嗯嗯,很开心能够帮您解决问题~
  - text: 嗯嗯,还需要什么我能够帮助您的呢?
  utter_answer_greet:
  - text: 您好!请问我可以帮到您吗?
  - text: 您好!很高兴为您服务。请说出您要查询的功能?
  utter_answer_goodbye:
  - text: 再见
  - text: 拜拜
  - text: 虽然我有万般舍不得,但是天下没有不散的宴席~祝您安好!
  - text: 期待下次再见!
  - text: 嗯嗯,下次需要时随时记得我哟~
  - text: see you!
  utter_answer_deny:
  - text: 主人,您不开心吗?不要离开我哦
  - text: 怎么了,主人?
  utter_answer_thanks:
  - text: 嗯呢。不用客气~
  - text: 这是我应该做的,主人~
  - text: 嗯嗯,合作愉快!
  utter_answer_whoareyou:
  - text: 您好!我是小蒋呀,您的AI智能助理
  utter_answer_whattodo:
  - text: 您好!很高兴为您服务,我目前只支持查询天气哦。
  utter_ask_date_time:
  - text: 请问您要查询哪一天的天气?
  utter_ask_address:
  - text: 请问您要查下哪里的天气?
  utter_ask_number:
  - text: 请问您要查的{type}号码是多少?
  utter_ask_business:
  - text: 请问您要查询什么业务呢?
  utter_default:
  - text: 没听懂,请换种说法吧~
  utter_ask_continue:
  - text: 请问您还要继续吗?
  utter_noworries:
  - text: 不用客气 :)
  - text: 没事啦
  - text: 不客气哈,都是老朋友了 :)
  utter_wrong_business:
  - text: 当前还不支持{business}业务,请重新输入。
  utter_wrong_type:
  - text: 当前还不支持查询{type}。
  utter_wrong_number:
  - text: 您输入的{number}有误,请重新输入。
  utter_chitchat:
  - text: 呃呃呃呃呃
  - text: 您这是在尬聊吗?
actions:
- utter_answer_affirm
- utter_answer_deny
- utter_answer_greet
- utter_answer_goodbye
- utter_answer_thanks
- utter_answer_whoareyou
- utter_answer_whattodo
- utter_ask_date_time
- utter_ask_address
- utter_ask_number
- utter_ask_business
- utter_ask_type
- action_default_fallback
- utter_default
- utter_ask_continue
- utter_noworries
- utter_wrong_business
- utter_wrong_type
- utter_wrong_number
- utter_chitchat
forms:
- weather_form
- number_form

All 5 comments

What result does the model trained on this data and pipeline return, and what result did you expect?

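Hi, I remember your name! The test results from the trained model are listed above.
For example, when my input is 今天天气怎么样? ("How is the weather today?"), the entity that the CRF outputs is the whole sentence: 今天天气怎么样.
The output I expect is 今天 ("today"). MitieEntityExtractor produces this correct output, but it is too slow to train.
Any guidance would be appreciated. Thanks!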

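If MITIE works well but CRF works poorly, the biggest difference is probably feature extraction: MITIE is pre-trained, so it has very good features, while the features CRF itself provides are not rich enough (or may not be configured well), which leads to the poorer results. You can contact me on WeChat at here-we-meet; that will be a more efficient way to communicate.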
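To make the point about features concrete, here is a minimal sketch of the kind of hand-written surface features a CRF tagger typically sees for each token. It is illustrative only and is not Rasa's implementation: the function `surface_features`, the feature names, and the token split are all assumptions made for this example.

```python
def surface_features(token):
    """Hand-written, per-token features of the kind a CRF tagger consumes."""
    return {
        "low": token.lower(),        # lowercased form (a no-op for Chinese)
        "digit": token.isdigit(),    # True for purely numeric tokens
        "prefix2": token[:2],        # first two characters
        "suffix2": token[-2:],       # last two characters
        "length": len(token),
    }


# Assumed segmentation of "查电话11160222425"; a real tokenizer may differ.
tokens = ["查", "电话", "11160222425"]
for token in tokens:
    print(token, surface_features(token))
```

Features like these carry little information for short Chinese tokens (there is no casing, and a two-character prefix is often the whole word), whereas a pre-trained featurizer supplies dense vectors learned from a large corpus.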

We will communicate via a local IM. If we find a good solution to this issue, I will report back here.

Thanks for the issue, @wochinge will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

Many thanks to @howl-anderson for the help.
This issue is solved in Rasa 1.10.5.
The relevant code is at line 198 of crf_entity_extractor.py: the call to clean_up_entities has been removed.
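The thread does not say why removing that call fixes the output. One plausible reading, stated here as an assumption and not confirmed anywhere above, is that a clean-up step which widens each entity to the enclosing whitespace-delimited word would swallow the whole message in a language written without spaces. The sketch below illustrates that effect. The helper `expand_to_whitespace_word` is hypothetical and is not Rasa's code.

```python
def expand_to_whitespace_word(text, start, end):
    """Widen the span [start, end) to the enclosing whitespace-delimited word."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


# English: the partial span "Shan" grows only to the word "Shanghai".
english = "weather in Shanghai today"
s, e = expand_to_whitespace_word(english, 11, 15)
print(english[s:e])

# Chinese: there are no spaces, so the span for 广州 grows to the full message.
chinese = "查下广州的天气怎么样"
s, e = expand_to_whitespace_word(chinese, 2, 4)
print(chinese[s:e])
```

Under this assumption, the widened Chinese span is (0, 10), which matches the start and end values that the error report shows for this message.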
