Rasa: Failed parsing Markdown training section header with colon (:)

Created on 29 Oct 2019  ·  5 Comments  ·  Source: RasaHQ/rasa

Rasa version: 1.4.2

Rasa X version (if used & relevant):

Python version: 3.7

Operating system (windows, osx, ...): OS X

Issue:
A recently introduced regex in the Markdown training data reader splits section headers at the wrong colon when the value itself contains a colon.
This seems related to the following recently introduced code:

    def _find_section_header(self, line: Text) -> Optional[Tuple[Text, Text]]:
        """Checks if the current line contains a section header
        and returns the section and the title."""
        match = re.search(r"##\s*(.+):(.+)", line)
        if match is not None:
            return match.group(1), match.group(2)

        return None

The first group, (.+), is greedy, so it matches up to the last colon in the line rather than the first.
For a section header such as ## synonym:10:00 am, the section and value are reported as ('synonym:10', '00 am') instead of the expected ('synonym', '10:00 am').
This results in a failure to train.

Proposed Solution:
Make the first group non-greedy so that it stops at the first colon, i.e. change the regex to ##\s*(.+?):(.+)
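
To illustrate the difference, here is a minimal standalone sketch that runs both patterns against the failing header. The function name `find_section_header` and the constants `GREEDY` and `NON_GREEDY` are made up for this example; this is not the actual Rasa method, only the two regexes are taken from the issue.

```python
import re
from typing import Optional, Pattern, Tuple

# Current pattern: the first group is greedy and runs up to the LAST colon.
GREEDY = re.compile(r"##\s*(.+):(.+)")
# Proposed pattern: the first group is non-greedy and stops at the FIRST colon.
NON_GREEDY = re.compile(r"##\s*(.+?):(.+)")


def find_section_header(line: str, pattern: Pattern[str]) -> Optional[Tuple[str, str]]:
    """Return (section, title) if the line looks like a section header, else None."""
    match = pattern.search(line)
    if match is not None:
        return match.group(1), match.group(2)
    return None


header = "## synonym:10:00 am"
print(find_section_header(header, GREEDY))      # ('synonym:10', '00 am')
print(find_section_header(header, NON_GREEDY))  # ('synonym', '10:00 am')

# Headers with a single colon are parsed identically by both patterns.
print(find_section_header("## intent:greet", GREEDY))      # ('intent', 'greet')
print(find_section_header("## intent:greet", NON_GREEDY))  # ('intent', 'greet')
```

Since none of the allowed section names ('intent', 'synonym', 'regex', 'lookup') contains a colon, splitting at the first colon should be safe for existing training files.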

Error (including full traceback):

Traceback (most recent call last):
  File "/Users/ethan/src/doodle/svc-doodlebot/env/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/__main__.py", line 76, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/cli/train.py", line 76, in train
    kwargs=extract_additional_arguments(args),
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/train.py", line 45, in train
    kwargs=kwargs,
  File "uvloop/loop.pyx", line 1417, in uvloop.loop.Loop.run_until_complete
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/train.py", line 96, in train_async
    kwargs,
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/train.py", line 137, in _train_async_internal
    new_fingerprint = await model.model_fingerprint(file_importer)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/model.py", line 204, in model_fingerprint
    nlu_data = await file_importer.get_nlu_data()
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/importers/importer.py", line 269, in get_nlu_data
    nlu_data = await asyncio.gather(*nlu_data)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/importers/rasa.py", line 60, in get_nlu_data
    return utils.training_data_from_paths(self._nlu_files, language)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/importers/utils.py", line 9, in training_data_from_paths
    training_datas = [loading.load_data(nlu_file, language) for nlu_file in paths]
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/importers/utils.py", line 9, in <listcomp>
    training_datas = [loading.load_data(nlu_file, language) for nlu_file in paths]
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/loading.py", line 67, in load_data
    data_sets = [_load(f, language) for f in files]
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/loading.py", line 67, in <listcomp>
    data_sets = [_load(f, language) for f in files]
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/loading.py", line 138, in _load
    return reader.read(filename, language=language, fformat=fformat)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/formats/readerwriter.py", line 10, in read
    return self.reads(rasa.utils.io.read_file(filename), **kwargs)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/formats/markdown.py", line 73, in reads
    self._set_current_section(header[0], header[1])
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/formats/markdown.py", line 192, in _set_current_section
    "".format(section, "', '".join(available_sections))
ValueError: Found markdown section 'synonym:10' which is not in the allowed sections 'intent', 'synonym', 'regex', 'lookup'.

Command or request that led to error:

rasa train

Content of configuration file (config.yml) (if relevant):

language: en
pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: CRFEntityExtractor
- name: EntitySynonymMapper
- name: SklearnIntentClassifier
- name: CountVectorsFeaturizer
- name: EmbeddingIntentClassifier
- name: DucklingHTTPExtractor
  url: http://localhost:8000
  locale: en_US
  dimensions:
  - time
  - duration
  timezone: UTC
policies:
- name: KerasPolicy
- name: MappingPolicy

Content of domain file (domain.yml) (if relevant):


Content of training NLU Markdown

## synonym:10:00 am
- @10:00 am

(Some extra before and after)


All 5 comments

Thanks for the issue, @erohmensing will get back to you about it soon!

You may find help in the docs and the forum, too.

This seems reasonable; I don't believe there is any reason why the synonym value couldn't contain a colon. Would you like to contribute a PR to fix it?

Sure. Any special permissions I need?

Nope, just fork the repo, create a branch for your fix, and open a PR! 😃 I'll assign you the issue.
