Rasa: Fix `/train` endpoint (and potentially other server endpoints) for new data format

Created on 6 Aug 2020 · 17Comments · Source: RasaHQ/rasa

With the switch to YAML for 2.0, we forgot to update the /model/train endpoint in the server to be able to handle that, and so the tests are failing/hanging in the PR #6352

It still writes the files to markdown files, that should be yaml now and also the concept of nlu vs stories files doesn't exist anymore: https://github.com/RasaHQ/rasa/blob/master/rasa/server.py#L746

I haven't checked if there's other endpoints that incorrectly handle that, so we may have to address that too

area high type

Source

akelad

All 17 comments

I'd add another key rules here and change everything to yaml. We can also check the the header and change to md if it says markdown. If we decide to drop md support it's easy too remove (and not a lot of engineering overhead from our side now).

@akelad You mentioned

and yes we can probably change to yaml but also the concept of stories vs nlu files doesn’t exist anymore

Is the additional rules key solution fine? I wouldn't merge the rules and stories keys into a core key because they are still somewhat separated in yaml (stories: vs rules:)

wochinge on 6 Aug 2020

@wochinge i would avoid using the term "core" - why can't we just pass it all into one training_data_files key for example?

akelad on 7 Aug 2020

passing in rules: and stories doesn't make sense to me since you can have those in one file

akelad on 7 Aug 2020

passing in rules: and stories doesn't make sense to me since you can have those in one file

True. I assumed that when using the API (especially from Rasa X), we'd have everything separate anyway. Putting all in one ( core (stories+rules), nlu, domain) makes more sense when thinking from a disk perspective 👍

On the other hand: If you think of YAML as JSON then it's already separated by keys now. As our training data is YAML now it also makes sense to support it's JSON representation, no?

In addition we shouldn't simply change the API endpoint without at least support the old separated way as deprecated option. How about doing both:

Having everything as giant string in a training_data key (I slightly prefer that to training_data_files)
Having everything split by rules, stories and so on (maybe print a deprecation warning then)

We can add support for full JSON payloads later.

wochinge on 7 Aug 2020

sure, whichever works, as long as we make sure that if people run a rasa server without Rasa X, they can send a request to the /train endpoint without a huge hassle :D

akelad on 7 Aug 2020

👍1

What is the benefit of introducing the training_data key?

I am not sure I fully understand what you are proposing @wochinge, I think an example would help a lot.

I think the goal should be:

support the existing format at least for the old training data format
add an endpoint that supports submitting yaml only training data + domain + config

if possible:

support the new training data format with the old api

tmbo on 10 Aug 2020

I am not sure I fully understand what you are proposing @wochinge, I think an example would help a lot.

If I understood Akela correctly in this comment then the motivation is that - as all training data can be in one single file from with YAML - there is only one key for all training data. The request payload would look like as follows:

{
   "training_data": "... yaml string containing nlu, core (stories, rules)",
   "domain": "domain as string"
}

In addition I'd support the old way and distinguish MD / YAML based on the HTTP header (YAML as default):

{
  "nlu": "nlu data string",
  "stories": "stories as string",
  "responses": "responses as string",
  "rules": "rules as string",
  "domain": "domain as string"
}

Supporting they way you propose (see snippet below) makes a lot of sense, but I'd do it as a separate PR as it also changes the current behavior of the API and we don't need it atm (to be fair the same could be said about example 1). In that case I'd do it nevertheless as it improves the UX and Akela stated "as long as we make sure that if people run a rasa server without Rasa X, they can send a request to the /train endpoint without a huge hassle :D")

{
  "nlu": [{"intent": greet", examples: [...]}]
}

wochinge on 10 Aug 2020

If I understood Akela correctly in this comment then the motivation is that - as all training data can be in one single file from with YAML - there is only one key for all training data. The request payload would look like as follows:

I don't see a reason to split training data and domain there, I'd rather go for the payload being just yaml (without a json wrapper) since the information is already split in yaml using different toplevel keys.

In addition I'd support the old way and distinguish MD / YAML based on the HTTP header (YAML as default):

This will break compatibility, so needs proper explanation in migration guide (and probably a pointer to it in the exception that we will throw when we are trying to parse md as yaml when a user didn't send the header).

tmbo on 10 Aug 2020

I don't see a reason to split training data and domain there, I'd rather go for the payload being just yaml (without a json wrapper)

I understand that part, but what do you mean by "since the information is already split in yaml using different toplevel keys."? It's either a big string for each key or we make everything JSON, no?

This will break compatibility, so needs proper explanation in migration guide (and probably a pointer to it in the exception that we will throw when we are trying to parse md as yaml when a user didn't send the header).

We can also keep MD as default. Imo this depends how we move forward regarding keeping / dropping markdown support.

wochinge on 10 Aug 2020

wait, this doesn't make sense:

if we use json, e.g.

{
  "nlu": "nlu data string",
  "stories": "stories as string",
  "responses": "responses as string",
  "rules": "rules as string",
  "domain": "domain as string"
}

the HTTP content type header should be set to JSON (and not to markdown or YAML). I think it should work to keep MD the default here.

To support YAML, we can use the content type header, set it to YAML and avoid the json wrapper, e.g. we'd post something like this to the endpoint (with content type set to yaml):

intents:
- greet

stories:
- ...

responses:
- ...

tmbo on 10 Aug 2020

Got what you mean know 👍 🙈
Will implement it later today or tomorrow 👍

wochinge on 10 Aug 2020

sounds great 💯 let me know how things go so I can update the multimedia responses PR

tmbo on 10 Aug 2020

👍1

Other endpoints:

retrieve_story should be changed when story writer is done (TODO: Create issue)
evaluate_stories works with new payloads

wochinge on 11 Aug 2020

@tmbo The PR is now in review.

wochinge on 11 Aug 2020

@tmbo This is merged btw.

wochinge on 13 Aug 2020

@wochinge It looks like the example payload we provide on our docs page doesn't match the correct YAML spec for the endpoint.

Is application/json only compatible with markdown at the moment?

b-quachtran on 20 Oct 2020

Is application/json only compatible with markdown at the moment?

Yes!
You need to use application/x-yaml for the yaml payload

wochinge on 21 Oct 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Exit from bot with text "cancel" when working with FormAction

jahid-ict · 3Comments

DIET classifier _predict_entities function clean_up_entities for Chinese language issue

johnson7788 · 3Comments

Exception: Not all required packages are installed. To use this pipeline, you need to install the missing dependencies. Please install sklearn

irfan-zoefit · 3Comments

Slots not populating in Rasa X

anishrav · 3Comments

Regarding Multiple Entity Extraction

rayush7 · 3Comments