Rasa: Rasa X Decoding error with German umlauts

Created on 1 Aug 2019  ·  10 Comments  ·  Source: RasaHQ/rasa

Rasa version: 1.1.7

Rasa X version (if used & relevant): 0.19.5

Python version: 3.7.0

Operating system (windows, osx, ...): windows

Issue:

I get a decoding error when I start Rasa X with German umlauts in domain.yml. If I remove the special characters, Rasa X starts without problems. The same issue has already been reported on the rasa-x-demo repository: https://github.com/RasaHQ/rasa-x-demo/issues/16

After testing, this error also occurs when running rasa train.

Error (including full traceback):

(base) C:\Users\Documents\workspace_python\FuBo\bot>rasa x
Starting Rasa X in local mode... 🚀
Traceback (most recent call last):
  File "c:\users\appdata\local\continuum\anaconda3\lib\site-packages\rasa\cli\x.py", line 322, in run_locally
    local.main(args, project_path, args.data, token=rasa_x_token)
  File "c:\users\appdata\local\continuum\anaconda3\lib\site-packages\rasax\community\local.py", line 190, in main
    project_path, data_path, session, args.port
  File "c:\users\appdata\local\continuum\anaconda3\lib\site-packages\rasax\community\local.py", line 139, in _initialize_with_local_data
    domain_path, domain_service, COMMUNITY_PROJECT_NAME, COMMUNITY_USERNAME
  File "c:\users\appdata\local\continuum\anaconda3\lib\site-packages\rasax\community\initialise.py", line 136, in inject_domain
    domain_yaml=read_file(domain_path),
  File "c:\users\appdata\local\continuum\anaconda3\lib\site-packages\rasa\utils\io.py", line 130, in read_file
    return f.read()
  File "c:\users\appdata\local\continuum\anaconda3\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 127: invalid start byte

Command or request that led to error:

rasa x

Content of configuration file (config.yml) (if relevant):

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: de
pipeline: pretrained_embeddings_spacy

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy

Content of domain file (domain.yml) (if relevant):

intents:
- affirm
- deny
- goodbye
- greet
templates:
  utter_greet:
  - text: Hallo ich bin dein persönlicher Assistent. Wie kann ich Dir helfen?
  utter_did_that_help:
  - text: Konnte ich Dir damit weiterhelfen?
  utter_goodbye:
  - text: Ich wünsche Dir noch einen schönen Tag!
actions:
- utter_did_that_help
- utter_goodbye
- utter_greet


All 10 comments

Thanks for raising this issue. @gausie will get back to you about it soon.

@kristiankolthoff Which encoding is your file in? Can you please save it as utf-8?

I saved it as UTF-8 explicitly and the error still remains.
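For anyone who wants to check what is actually on disk rather than what the editor reports, here is a minimal sketch. The helper name first_invalid_utf8_byte is hypothetical and not part of Rasa. It takes raw bytes and reports the offset and value of the first byte that is not valid UTF-8.

```python
from typing import Optional, Tuple


def first_invalid_utf8_byte(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first invalid UTF-8 byte, or None.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


# The same word stored in two encodings: only the first is valid UTF-8.
good = "persönlicher".encode("utf-8")
bad = "persönlicher".encode("latin-1")

print(first_invalid_utf8_byte(good))
print(first_invalid_utf8_byte(bad))
```

To inspect a real file, read it in binary mode first, for example by passing the result of open("domain.yml", "rb").read() to the helper.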

@erohmensing Can you please check whether you can reproduce that when I'm gone? Thanks!

Hm, I added ö to my domain.yml, nlu.md and stories.md and it loaded up with no problem. I can see the umlauts in all 3 of these places on the UI too. Of course I am also running the latest version, so you might want to try updating.

Can you run this in the console for me?

❯ python
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stdin.encoding
'UTF-8'

I think the user in the post you mentioned is probably right: Rasa X opens the domain.yml file, modifies it, and stores it with the system default encoding (ISO 8859-2 for me on Windows).

@erohmensing This error still reproduces for me on the latest version, using Windows in German locale. The stdin and stdout streams show UTF-8, but they are not the root cause here.

The underlying issue is that Python by default writes files in the system code page unless an encoding is passed when opening the file, and Rasa does not specify UTF-8. Additionally, when loading domain.yml, Rasa first reformats and saves the file before actually loading and parsing it. The first step loses the UTF-8 encoding, so the file is no longer UTF-8 when it is read back, which causes the error.
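The mismatch can be illustrated at the byte level. This is only a sketch: cp1252 is assumed here as a stand-in for a Western European Windows code page, and the sample text is one line from the domain file in the report.

```python
text = "- text: Hallo ich bin dein persönlicher Assistent."
pos = text.index("ö")  # every character before it is ASCII, so this is also the byte offset

# What ends up on disk under the two encodings.
as_utf8 = text.encode("utf-8")      # "ö" is stored as the two bytes 0xC3 0xB6
as_cp1252 = text.encode("cp1252")   # "ö" is stored as the single byte 0xF6

# Reading the cp1252 bytes as UTF-8 fails, because 0xF6 cannot start a UTF-8 sequence.
try:
    as_cp1252.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(as_utf8[pos:pos + 2].hex(), as_cp1252[pos:pos + 1].hex())
print(error)
```

The byte 0xF6 is the same one named in the traceback, which fits the idea that the file was rewritten in a single-byte code page before being read as UTF-8.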

Workaround (Python 3.7+ only): set the environment variable PYTHONUTF8 to 1 before running rasa. This forces Python to use UTF-8 as the default encoding. On Windows: set PYTHONUTF8=1
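To confirm that the workaround took effect, something like the following can be run in the same console session. It is a small sketch using only the standard library; default_text_encoding is a hypothetical helper name, and it assumes that open() without an encoding argument follows locale.getpreferredencoding(False).

```python
import locale
import sys


def default_text_encoding() -> str:
    """Return the encoding open() falls back to when none is passed (assumed)."""
    return locale.getpreferredencoding(False)


# sys.flags.utf8_mode is 1 when Python was started with PYTHONUTF8=1 or -X utf8.
utf8_mode = bool(sys.flags.utf8_mode)

print("UTF-8 mode enabled:", utf8_mode)
print("Default encoding for open():", default_text_encoding())
```

If UTF-8 mode is reported as disabled, the variable was probably set in a different shell than the one running rasa.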

Solution for Rasa / Rasa X: when saving the domain file (and other files as well), specify UTF-8 explicitly. Python 3.7+ only: enable UTF-8 mode in code.

Thanks for the very descriptive info, @taotsetung. I've tracked down the part where the domain gets written in Rasa X and you're right, the encoding isn't specified:

def dump_yaml_to_file(filename: Text, content: Any) -> Optional[str]:
    """Dump content to yaml."""
    with open(filename, "w") as f:
        f.write(dump_yaml(content))

I assume that with open(filename, "w", encoding="utf-8") as f: should do the job, but we'll check it out.
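A self-contained sketch of that change follows. It is not the actual Rasa X code: write_yaml_text and read_text are hypothetical stand-ins that take already-serialized YAML text, so the example does not depend on dump_yaml.

```python
import os
import tempfile
from typing import Text


def write_yaml_text(filename: Text, yaml_text: Text) -> None:
    """Write already-serialized YAML with an explicit UTF-8 encoding (hypothetical helper)."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write(yaml_text)


def read_text(filename: Text) -> Text:
    """Read a file as UTF-8, independent of the system code page (hypothetical helper)."""
    with open(filename, encoding="utf-8") as f:
        return f.read()


domain_text = (
    "templates:\n"
    "  utter_greet:\n"
    "  - text: Hallo ich bin dein persönlicher Assistent.\n"
)

# Round trip through a temporary file: write and read both pin the encoding.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "domain.yml")
    write_yaml_text(path, domain_text)
    round_tripped = read_text(path)

print(round_tripped == domain_text)
```

Because both sides pin the encoding, the result should not depend on the system code page or on whether PYTHONUTF8 is set.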

@erohmensing Yes, adding the encoding to the open call fixed the error. Rasa X is not open source, so we can't make a PR?

Yes, unfortunately. But we have actually already merged the PR that fixes this issue :) We'll close it when the fix is released.

The fix is part of Rasa X 0.20.3.
