Rasa version: Rasa 1.10.7
Rasa SDK version (if used & relevant):
Rasa X version (if used & relevant):
Python version: 3.8
Operating system (windows, osx, ...): ubuntu 18
Issue:
After I confirm the checkbox for a message in the Rasa X NLU Inbox, the entities in the nlu.md file look very strange: {"entity": "insurance_money", "value": "\u05d3\u05de\u05d9 \u05d0\u05d1\u05d8\u05dc\u05d4"}
I do not have this problem in the story.md file.
My Python encoding is UTF-8, and PyCharm's as well.
I'm developing for Hebrew-speaking users.
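For reference, escapes of this shape are what Python's `json.dumps` emits by default for non-ASCII text. Whether Rasa X serializes entity values this way is an assumption, not something confirmed in this thread; the sketch below only shows how the two outputs differ. The `entity` dict is a made-up stand-in that reuses the code points from the report.

```python
import json

# Hypothetical annotation; the value uses the same Hebrew code points as the report.
entity = {
    "entity": "insurance_money",
    "value": "\u05d3\u05de\u05d9 \u05d0\u05d1\u05d8\u05dc\u05d4",
}

# Default (ensure_ascii=True): every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(entity)

# ensure_ascii=False: the Hebrew characters are written as they are.
readable = json.dumps(entity, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data with `json.loads`; only the text that ends up in the file differs.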
Content of configuration file (config.yml) (if relevant):
language: he
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
  - name: FormPolicy
  - name: TwoStageFallbackPolicy
    nlu_threshold: 0.3
    core_threshold: 0.3
    fallback_core_action_name: "custom_action_fallback"
    fallback_nlu_action_name: "custom_action_fallback"
    deny_suggestion_intent_name: "out_of_scope"
Content of domain file (domain.yml) (if relevant):
@sara-tagger / @b-quachtran - I would be happy to update the issue:
I am building a Rasa chatbot whose language is Hebrew.
The bug happens only in the nlu.md file.
Example:
'''
After confirming the entities checkbox in the Rasa X NLU Inbox, the file looks like this:
'''
You can see that Rasa X converted only the entity value to Unicode escape sequences.
When I run the command _rasa data split nlu_ in the terminal, the same problem appears in the test and train .md files in the train_test_split folder.
On the other hand, if I confirm a new story in Rasa X, the entity value is fine.
Example:
'''
'''
Hi, any good news?
Same type of issue here!
Unicode escapes and UTF-8 are sometimes mixed up in e2e stories and NLU data:
text.decode('utf-8')
> b'* form: inform: Pb [erron\xc3\xa9]{"entity": "motif_mutation", "value": "PB erron\\u00e9"}'
Steps to reproduce:
Is the intent 'inform' correct for 'Pb [erroné]{"entity": "motif_mutation", "value": "PB erron\u00e9"}' and are all entities labeled correctly? (Y/n)
To fix it, I had to manually re-encode the Unicode escapes as UTF-8:
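import re

# text holds the UTF-8 encoded line (bytes) shown above
res = text.decode('utf-8')
# Replace each literal \uXXXX escape with the character it encodes
for escape in re.findall(r"\\u[0-9a-fA-F]{4}", res):
    res = res.replace(escape, escape.encode('ascii').decode('unicode_escape'))
res.encode('utf-8')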
> b'* form: inform: Pb [erron\xc3\xa9]{"entity": "motif_mutation", "value": "PB erron\xc3\xa9"}'
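A slightly more defensive variant of the workaround above, as a sketch: it works on an already-decoded `str`, matches only escapes with exactly four hex digits, and leaves everything else untouched. The name `unescape_unicode` is made up for this example, and escapes for characters outside the Basic Multilingual Plane (surrogate pairs) are not handled.

```python
import re

# Matches a literal backslash-u followed by exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(line: str) -> str:
    """Replace literal \\uXXXX sequences with the characters they encode."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), line)


# The line from the report, with the escape still present in the entity value.
broken = '* form: inform: Pb [erroné]{"entity": "motif_mutation", "value": "PB erron\\u00e9"}'
fixed = unescape_unicode(broken)
print(fixed)
```

Using `re.sub` with a callback converts each match once, so a string that contains the same escape several times is not rescanned for every occurrence.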
Thanks for reporting this, it definitely looks like a formatting bug with Rasa X. One of our back-end engineers will take a look when they get a chance.
Hi @b-quachtran ,
If you need more info about the bug let me know.
Have a good day.
I have the same problem with entities containing non-ASCII characters. After saving the results from rasa interactive, it should be
[küche](room)
in the nlu.md file, but it's
küche[]{"entity": "room", "value": "k\u00fcche"}
I use Rasa 1.10.10, I do not use Rasa X.
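Until this is fixed, one possible cleanup for lines like this is to parse the JSON annotation and write it back without ASCII escaping. This is only a sketch: `reserialize_annotations` is a hypothetical helper, not part of Rasa, and it assumes the annotation is a flat JSON object with no nested braces.

```python
import json
import re

# Matches a flat JSON object such as {"entity": "room", "value": "..."}.
_ANNOTATION = re.compile(r"\{[^{}]*\}")


def reserialize_annotations(line: str) -> str:
    """Re-dump each JSON annotation in the line without ASCII escaping."""

    def _redump(match):
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            return match.group(0)  # not valid JSON, leave it untouched
        return json.dumps(data, ensure_ascii=False)

    return _ANNOTATION.sub(_redump, line)


line = 'küche[]{"entity": "room", "value": "k\\u00fcche"}'
print(reserialize_annotations(line))
# prints: küche[]{"entity": "room", "value": "küche"}
```

This only changes how the value is written; it does not move the brackets or otherwise alter the annotation.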
Hi @b-quachtran,
If you need more info, let me know.