Rasa: rasa x, NLU Inbox change my entities string to \u05d3\u05de\u05d9 on the nlu.md

Created on 16 Jul 2020 · 8 Comments · Source: RasaHQ/rasa

Rasa version: Rasa 1.10.7

Rasa SDK version (if used & relevant):

Rasa X version (if used & relevant):

Python version: 3.8

Operating system (windows, osx, ...): ubuntu 18

Issue:
After I confirmed the checkbox in the Rasa X NLU Inbox, the entity values in the nlu.md file look very strange: {"entity": "insurance_money", "value": "\u05d3\u05de\u05d9 \u05d0\u05d1\u05d8\u05dc\u05d4"}
I do not have this problem in the story.md file.

My Python encoding is UTF-8, and so is PyCharm's.
I develop for Hebrew-speaking users.
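
Note: the strange-looking value is the JSON-escaped form of the original Hebrew text, so the data itself is not lost. The following minimal sketch checks this with Python's standard `json` module. The variable names are illustrative only and are not part of Rasa.

```python
import json

# Escaped entity value as it appears in nlu.md (taken from the report above).
escaped = r"\u05d3\u05de\u05d9 \u05d0\u05d1\u05d8\u05dc\u05d4"

# Wrapping the value in double quotes turns it into a JSON string literal,
# which json.loads decodes back into ordinary Unicode text.
decoded = json.loads(f'"{escaped}"')

print(decoded)  # prints the original Hebrew entity value
```

If the printed text matches the value you originally annotated, the file is only escaped, not corrupted.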

Content of configuration file (config.yml) (if relevant):

language: he
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
  - name: FormPolicy
  - name: TwoStageFallbackPolicy
    nlu_threshold: 0.3
    core_threshold: 0.3
    fallback_core_action_name: "custom_action_fallback"
    fallback_nlu_action_name: "custom_action_fallback"
    deny_suggestion_intent_name: "out_of_scope"

Content of domain file (domain.yml) (if relevant):

DavidKohav

All 8 comments

Thanks for the issue, @b-quachtran will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

sara-tagger on 17 Jul 2020

@sara-tagger / @b-quachtran - I would be happy to add more detail to the issue:
I am building a Rasa chatbot whose language is Hebrew.
The bug occurs only in the nlu.md file.
Example:
'''

intent:corona_who_deserve_unemployment_benefit

עבדתי אצל המעסיק האחרון שלושה חודשים והוא הוציא אותי [לחל"ת]{"entity": "case", "value": "חופשה ללא תשלום"} . לפני כן עבדתי אצל מעסיק אחר [שנה שלמה]{"entity": "work_period_right", "value": "12 חודשיים"}. האם אני [זכאי לאבטלה]{"entity": "insurance_money", "value": "דמי אבטלה"} ?
בגלל [הקורונה]{"entity": "disease", "value": "קורונה"} הוצאתי [לחופשה ללא תשלום]{"entity": "case", "value": "חופשה ללא תשלום"}, מגיע לי דמי אבטלה?
אני עובדת יותר משנה במקום העבודה האחרון והוצאתי [לחל"ת]{"entity": "case", "value": "חופשה ללא תשלום"}, כיצד אקבל דמי אבטלה?
'''

After I confirmed the entities checkbox in the Rasa X NLU Inbox,
the file looks like this:
'''

intent:corona_who_deserve_unemployment_benefit

עבדתי אצל המעסיק האחרון שלושה חודשים והוא הוציא אותי [לחל"ת]{"entity": "case", "value": "\u05d7\u05d5\u05e4\u05e9\u05d4 \u05dc\u05dc\u05d0 \u05ea\u05e9\u05dc\u05d5\u05dd"} . לפני כן עבדתי אצל מעסיק אחר [שנה שלמה]{"entity": "work_period_right", "value": "12 \u05d7\u05d5\u05d3\u05e9\u05d9\u05d9\u05dd"}. האם אני [זכאי לאבטלה]{"entity": "insurance_money", "value": "\u05d3\u05de\u05d9 \u05d0\u05d1\u05d8\u05dc\u05d4"} ?
בגלל [הקורונה]{"entity": "disease", "value": "\u05e7\u05d5\u05e8\u05d5\u05e0\u05d4"} הוצאתי [לחופשה ללא תשלום]{"entity": "case", "value": "\u05d7\u05d5\u05e4\u05e9\u05d4 \u05dc\u05dc\u05d0 \u05ea\u05e9\u05dc\u05d5\u05dd"}, מגיע לי דמי אבטלה?
אני עובדת יותר משנה במקום העבודה האחרון והוצאתי [לחל"ת]{"entity": "case", "value": "\u05d7\u05d5\u05e4\u05e9\u05d4 \u05dc\u05dc\u05d0 \u05ea\u05e9\u05dc\u05d5\u05dd"}, כיצד אקבל דמי אבטלה?
'''

As you can see, Rasa X converted only the entity values to Unicode escape sequences.
When I run the command _rasa data split nlu_ in the terminal, the same problem appears in the test and train .md files in the train_test_split folder.
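
A possible explanation, not confirmed anywhere in this thread, is that the entity annotation is serialized with Python's `json.dumps` using its default `ensure_ascii=True`, while the surrounding message text is written as-is. That would explain why only the values inside the `{...}` block are escaped. A minimal sketch of the behaviour, using only the standard library and not Rasa's actual code:

```python
import json

# An entity annotation like the ones in the nlu.md examples above.
annotation = {"entity": "insurance_money", "value": "דמי אבטלה"}

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(annotation)

# With ensure_ascii=False the characters are written out unchanged.
readable = json.dumps(annotation, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same dictionary, so training should see identical data either way. The difference is only in how readable the file is.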

On the other hand, if I confirm a New Story from Rasa X,
the entity values are fine.
Example:
'''

unemployment benefit + period + age

greet
- utter_greet
unemployment_want_insurance_benefit{"unemployment_benefits": "דמי אבטלה", "case":"בחל"ת"}
- utter_ask_work_period
work_period{"work_period":"עבדתי יותר מעשר שנים"}
- utter_ask_retirement_age
under_retirement_age{"retirement_age":"אני לא בגיל הפרישה"}
- utter_unemployment_benefit

New Story

greet
- utter_greet
unemployment_want_insurance_benefit{"disease":"קורונה","under_retirement":"לא","insurance_money":"אבטלה"}
- slot{"disease":"קורונה"}
- slot{"insurance_money":"אבטלה"}
- utter_ask_case
corona_who_deserve_unemployment_benefit{"case":"פיטורים"}
- slot{"case":"פיטורים"}
- utter_ask_work_period
work_period{"work_period":"יותר משלוש שנים"}
- utter_ask_retirement_age
under_retirement_age{"under_retirement":"לא, אני לא בגיל פרישה"}
- utter_unemployment_benefit

'''

DavidKohav on 17 Jul 2020

Hi, is there any news on this?

DavidKohav on 21 Jul 2020

Same type of issue here!

Unicode escape sequences and raw UTF-8 are sometimes mixed in e2e stories and NLU data:

# text holds the raw bytes of the line
text
> b'* form: inform: Pb [erron\xc3\xa9]{"entity": "motif_mutation", "value": "PB erron\\u00e9"}'

Steps to reproduce:

1. Open rasa interactive.
2. Type a message that contains "é".
3. The NLU confirmation prompt contains \u00e9 instead of "é" encoded as \xc3\xa9:
Is the intent 'inform' correct for 'Pb [erroné]{"entity": "motif_mutation", "value": "PB erron\u00e9"}' and are all entities labeled correctly? (Y/n)

To fix it, I had to manually convert the escape sequences back to UTF-8:

import re

# text holds the raw bytes of the line, as read from the data file
res = text.decode('utf-8')
# replace each literal \uXXXX escape with the character it encodes
for escape in set(re.findall(r"\\u[0-9a-fA-F]{4}", res)):
    res = res.replace(escape, escape.encode('ascii').decode('unicode_escape'))

res.encode('utf-8')
> b'* form: inform: Pb [erron\xc3\xa9]{"entity": "motif_mutation", "value": "PB erron\xc3\xa9"}'
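
An alternative to replacing escapes one by one is to re-serialize each `{...}` annotation with `ensure_ascii=False`. The sketch below assumes annotations are flat JSON objects, as in the examples in this thread. `unescape_annotations` is a hypothetical helper name, not a Rasa function.

```python
import json
import re

# Matches a flat JSON object such as {"entity": "...", "value": "..."}.
# Assumption: annotations are never nested, as in the examples above.
ANNOTATION = re.compile(r"\{[^{}]*\}")


def unescape_annotations(line: str) -> str:
    """Rewrite every JSON annotation in a line without \\uXXXX escapes."""

    def rewrite(match: re.Match) -> str:
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            # Not valid JSON: leave the text exactly as it was.
            return match.group(0)
        return json.dumps(data, ensure_ascii=False)

    return ANNOTATION.sub(rewrite, line)


line = 'Pb [erroné]{"entity": "motif_mutation", "value": "PB erron\\u00e9"}'
print(unescape_annotations(line))
```

Because the annotation is parsed as JSON rather than patched with string replacement, text outside the braces is never touched, and anything in braces that is not valid JSON is left alone.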

puechtom on 22 Jul 2020

👍1

Thanks for reporting this, it definitely looks like a formatting bug with Rasa X. One of our back-end engineers will take a look when they get a chance.

b-quachtran on 28 Jul 2020

👍2

Hi @b-quachtran,
If you need more info about the bug, let me know.
Have a good day.

DavidKohav on 17 Aug 2020

I have the same problem with entities containing non-ASCII characters. After saving the results from rasa interactive, it should be

[küche](room)

in the nlu.md file but it's

küche[]{"entity": "room", "value": "k\u00fcche"}

I use Rasa 1.10.10, I do not use Rasa X.

SuzanaK on 25 Aug 2020

Hi @b-quachtran,

If you need more info, let me know.

DavidKohav on 26 Sep 2020
