Rasa: Rasa X NLU Inbox changes my entity strings to \u05d3\u05de\u05d9 in nlu.md

Created on 16 Jul 2020  ·  8 Comments  ·  Source: RasaHQ/rasa

Rasa version: Rasa 1.10.7

Rasa SDK version (if used & relevant):

Rasa X version (if used & relevant):

Python version: 3.8

Operating system (windows, osx, ...): ubuntu 18

Issue:
After I confirmed the checkbox in the Rasa X NLU Inbox, the entity values in the nlu.md file were written as Unicode escape sequences instead of Hebrew text:
{"entity": "insurance_money", "value": "\u05d3\u05de\u05d9 \u05d0\u05d1\u05d8\u05dc\u05d4"}
I do not have this problem in the story.md file.

My Python encoding is UTF-8, and so is PyCharm's.
I develop for Hebrew-speaking users.

Content of configuration file (config.yml) (if relevant):

language: he
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
  - name: FormPolicy
  - name: TwoStageFallbackPolicy
    nlu_threshold: 0.3
    core_threshold: 0.3
    fallback_core_action_name: "custom_action_fallback"
    fallback_nlu_action_name: "custom_action_fallback"
    deny_suggestion_intent_name: "out_of_scope"

Content of domain file (domain.yml) (if relevant):


All 8 comments

Thanks for the issue, @b-quachtran will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

@sara-tagger / @b-quachtran - I would be happy to add more detail to the issue:
I built a Rasa chatbot whose language is Hebrew.
The bug appears only in the nlu.md file.
Example:
'''

## intent:corona_who_deserve_unemployment_benefit

- עבדתי אצל המעסיק האחרון שלושה חודשים והוא הוציא אותי [לחל"ת]{"entity": "case", "value": "חופשה ללא תשלום"} . לפני כן עבדתי אצל מעסיק אחר [שנה שלמה]{"entity": "work_period_right", "value": "12 חודשיים"}. האם אני [זכאי לאבטלה]{"entity": "insurance_money", "value": "דמי אבטלה"} ?
- בגלל [הקורונה]{"entity": "disease", "value": "קורונה"} הוצאתי [לחופשה ללא תשלום]{"entity": "case", "value": "חופשה ללא תשלום"}, מגיע לי דמי אבטלה?
- אני עובדת יותר משנה במקום העבודה האחרון והוצאתי [לחל"ת]{"entity": "case", "value": "חופשה ללא תשלום"}, כיצד אקבל דמי אבטלה?

'''

After I confirmed the entities checkbox in the Rasa X NLU Inbox, the file looks like this:
'''

## intent:corona_who_deserve_unemployment_benefit

- עבדתי אצל המעסיק האחרון שלושה חודשים והוא הוציא אותי [לחל"ת]{"entity": "case", "value": "\u05d7\u05d5\u05e4\u05e9\u05d4 \u05dc\u05dc\u05d0 \u05ea\u05e9\u05dc\u05d5\u05dd"} . לפני כן עבדתי אצל מעסיק אחר [שנה שלמה]{"entity": "work_period_right", "value": "12 \u05d7\u05d5\u05d3\u05e9\u05d9\u05d9\u05dd"}. האם אני [זכאי לאבטלה]{"entity": "insurance_money", "value": "\u05d3\u05de\u05d9 \u05d0\u05d1\u05d8\u05dc\u05d4"} ?
- בגלל [הקורונה]{"entity": "disease", "value": "\u05e7\u05d5\u05e8\u05d5\u05e0\u05d4"} הוצאתי [לחופשה ללא תשלום]{"entity": "case", "value": "\u05d7\u05d5\u05e4\u05e9\u05d4 \u05dc\u05dc\u05d0 \u05ea\u05e9\u05dc\u05d5\u05dd"}, מגיע לי דמי אבטלה?
- אני עובדת יותר משנה במקום העבודה האחרון והוצאתי [לחל"ת]{"entity": "case", "value": "\u05d7\u05d5\u05e4\u05e9\u05d4 \u05dc\u05dc\u05d0 \u05ea\u05e9\u05dc\u05d5\u05dd"}, כיצד אקבל דמי אבטלה?

'''

You can see that Rasa X changed only the entity values to Unicode escape sequences; the annotated text itself is untouched.
When I run the command _rasa data split nlu_ in the terminal, the train and test md files in the train_test_split folder show the same problem.
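
The symptom is consistent with the annotation being written by a JSON encoder that escapes non-ASCII characters. Whether Rasa X actually does this internally is an assumption; the sketch below only shows that Python's standard `json.dumps` produces this kind of `\uXXXX` output by default, and that `ensure_ascii=False` keeps the Hebrew text readable.

```python
import json

# Entity annotation taken from the example above.
annotation = {"entity": "insurance_money", "value": "דמי אבטלה"}

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(annotation)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(annotation, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data with `json.loads`, so the escaped file is still valid for training; it is only hard to read and edit by hand.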

On the other hand, if I confirm a new story from Rasa X, the entity values are fine.
Example:
'''

## unemployment benefit + period + age

* greet
  - utter_greet
* unemployment_want_insurance_benefit{"unemployment_benefits": "דמי אבטלה", "case":"בחל"ת"}
  - utter_ask_work_period
* work_period{"work_period":"עבדתי יותר מעשר שנים"}
  - utter_ask_retirement_age
* under_retirement_age{"retirement_age":"אני לא בגיל הפרישה"}
  - utter_unemployment_benefit

## New Story

* greet
  - utter_greet
* unemployment_want_insurance_benefit{"disease":"קורונה","under_retirement":"לא","insurance_money":"אבטלה"}
  - slot{"disease":"קורונה"}
  - slot{"insurance_money":"אבטלה"}
  - utter_ask_case
* corona_who_deserve_unemployment_benefit{"case":"פיטורים"}
  - slot{"case":"פיטורים"}
  - utter_ask_work_period
* work_period{"work_period":"יותר משלוש שנים"}
  - utter_ask_retirement_age
* under_retirement_age{"under_retirement":"לא, אני לא בגיל פרישה"}
  - utter_unemployment_benefit

'''

Hi, is there any good news?

Same type of issue here!

Unicode / UTF-8 are sometimes mixed up in e2e stories and nlu:

text.decode('utf-8')
> b'* form: inform: Pb [erron\xc3\xa9]{"entity": "motif_mutation", "value": "PB erron\\u00e9"}'

Steps to reproduce:

  1. Open rasa interactive
  2. Type an intent example that contains "é"
  3. The NLU confirmation will contain \u00e9 instead of "é" encoded as \xc3\xa9:
    Is the intent 'inform' correct for 'Pb [erroné]{"entity": "motif_mutation", "value": "PB erron\u00e9"}' and are all entities labeled correctly? (Y/n)

To fix it, I had to manually convert the Unicode escapes back to UTF-8:

import re

res = text.decode('utf-8')
# Find literal \uXXXX escapes (exactly four hex digits) and replace each with its character
for escape in re.findall(r"\\u[0-9a-fA-F]{4}", res):
    res = res.replace(escape, escape.encode('ascii').decode('unicode_escape'))
res.encode('utf-8')
> b'* form: inform: Pb [erron\xc3\xa9]{"entity": "motif_mutation", "value": "PB erron\xc3\xa9"}'
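
As a stopgap until the bug is fixed, the same idea can be applied to a whole training file. The following is a sketch and not part of Rasa: `unescape_annotations` is a hypothetical helper, and it assumes each annotation is a flat JSON object (no nested braces) that directly follows a `[text]` span. It only works while the backslashes of the `\uXXXX` escapes are still present in the file.

```python
import json
import re

# Matches the JSON object that directly follows a "[text]" span.
# Assumption: annotations are flat objects without nested braces.
ANNOTATION = re.compile(r"(?<=\])\{[^{}]*\}")


def unescape_annotations(markdown: str) -> str:
    """Rewrite escaped annotations so non-ASCII characters are readable."""

    def fix(match: re.Match) -> str:
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            # Leave anything that is not valid JSON untouched.
            return match.group(0)
        return json.dumps(data, ensure_ascii=False)

    return ANNOTATION.sub(fix, markdown)


line = '- Pb [erroné]{"entity": "motif_mutation", "value": "PB erron\\u00e9"}'
print(unescape_annotations(line))
```

Parsing the annotation as JSON instead of pattern-matching the escapes means other JSON escapes, such as `\"`, are preserved correctly when the object is written back.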

Thanks for reporting this, it definitely looks like a formatting bug with Rasa X. One of our back-end engineers will take a look when they get a chance.

Hi @b-quachtran,
If you need more info about the bug, let me know.
Have a good day.

I have the same problem with entities containing non-ASCII characters. After saving the results from rasa interactive, the nlu.md file should contain

[küche](room)

but instead it contains

küche[]{"entity": "room", "value": "k\u00fcche"}

I use Rasa 1.10.10, I do not use Rasa X.
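
The two annotation styles in this comment can be produced mechanically. The sketch below is illustrative only: `format_entity` is a hypothetical function, not Rasa's own writer. It assumes the short `[text](entity)` form is appropriate whenever the entity value equals the annotated text, and otherwise falls back to the JSON form written without `\uXXXX` escapes.

```python
import json


def format_entity(text: str, entity: str, value: str) -> str:
    """Return a Markdown entity annotation for one span of text."""
    if value == text:
        # Value matches the annotated text: the short form is enough.
        return f"[{text}]({entity})"
    # Otherwise keep the explicit value, written without \uXXXX escapes.
    payload = json.dumps({"entity": entity, "value": value}, ensure_ascii=False)
    return f"[{text}]{payload}"


print(format_entity("küche", "room", "küche"))
print(format_entity("erroné", "motif_mutation", "PB erroné"))
```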

Hi @b-quachtran,

If you need more info, let me know.
