Rasa: Fix `/train` endpoint (and potentially other server endpoints) for new data format

Created on 6 Aug 2020  路  17Comments  路  Source: RasaHQ/rasa

With the switch to YAML for 2.0, we forgot to update the /model/train endpoint in the server to be able to handle that, and so the tests are failing/hanging in the PR #6352

It still writes the files to markdown files, that should be yaml now and also the concept of nlu vs stories files doesn't exist anymore: https://github.com/RasaHQ/rasa/blob/master/rasa/server.py#L746

I haven't checked if there's other endpoints that incorrectly handle that, so we may have to address that too

area high type

All 17 comments

I'd add another key rules here and change everything to yaml. We can also check the the header and change to md if it says markdown. If we decide to drop md support it's easy too remove (and not a lot of engineering overhead from our side now).

@akelad You mentioned

and yes we can probably change to yaml but also the concept of stories vs nlu files doesn鈥檛 exist anymore

Is the additional rules key solution fine? I wouldn't merge the rules and stories keys into a core key because they are still somewhat separated in yaml (stories: vs rules:)

@wochinge i would avoid using the term "core" - why can't we just pass it all into one training_data_files key for example?

passing in rules: and stories doesn't make sense to me since you can have those in one file

passing in rules: and stories doesn't make sense to me since you can have those in one file

True. I assumed that when using the API (especially from Rasa X), we'd have everything separate anyway. Putting all in one ( core (stories+rules), nlu, domain) makes more sense when thinking from a disk perspective 馃憤

On the other hand: If you think of YAML as JSON then it's already separated by keys now. As our training data is YAML now it also makes sense to support it's JSON representation, no?

In addition we shouldn't simply change the API endpoint without at least support the old separated way as deprecated option. How about doing both:

  • Having everything as giant string in a training_data key (I slightly prefer that to training_data_files)
  • Having everything split by rules, stories and so on (maybe print a deprecation warning then)

We can add support for full JSON payloads later.

sure, whichever works, as long as we make sure that if people run a rasa server without Rasa X, they can send a request to the /train endpoint without a huge hassle :D

What is the benefit of introducing the training_data key?

I am not sure I fully understand what you are proposing @wochinge, I think an example would help a lot.

I think the goal should be:

  • support the existing format at least for the old training data format
  • add an endpoint that supports submitting yaml only training data + domain + config

if possible:

  • support the new training data format with the old api

I am not sure I fully understand what you are proposing @wochinge, I think an example would help a lot.

If I understood Akela correctly in this comment then the motivation is that - as all training data can be in one single file from with YAML - there is only one key for all training data. The request payload would look like as follows:

{
   "training_data": "... yaml string containing nlu, core (stories, rules)",
   "domain": "domain as string"
}

In addition I'd support the old way and distinguish MD / YAML based on the HTTP header (YAML as default):

{
  "nlu": "nlu data string",
  "stories": "stories as string",
  "responses": "responses as string",
  "rules": "rules as string",
  "domain": "domain as string"
}

Supporting they way you propose (see snippet below) makes a lot of sense, but I'd do it as a separate PR as it also changes the current behavior of the API and we don't need it atm (to be fair the same could be said about example 1). In that case I'd do it nevertheless as it improves the UX and Akela stated "as long as we make sure that if people run a rasa server without Rasa X, they can send a request to the /train endpoint without a huge hassle :D")

{
  "nlu": [{"intent": greet", examples: [...]}]
}

If I understood Akela correctly in this comment then the motivation is that - as all training data can be in one single file from with YAML - there is only one key for all training data. The request payload would look like as follows:

I don't see a reason to split training data and domain there, I'd rather go for the payload being just yaml (without a json wrapper) since the information is already split in yaml using different toplevel keys.

In addition I'd support the old way and distinguish MD / YAML based on the HTTP header (YAML as default):

This will break compatibility, so needs proper explanation in migration guide (and probably a pointer to it in the exception that we will throw when we are trying to parse md as yaml when a user didn't send the header).

I don't see a reason to split training data and domain there, I'd rather go for the payload being just yaml (without a json wrapper)

I understand that part, but what do you mean by "since the information is already split in yaml using different toplevel keys."? It's either a big string for each key or we make everything JSON, no?

This will break compatibility, so needs proper explanation in migration guide (and probably a pointer to it in the exception that we will throw when we are trying to parse md as yaml when a user didn't send the header).

We can also keep MD as default. Imo this depends how we move forward regarding keeping / dropping markdown support.

wait, this doesn't make sense:

if we use json, e.g.

{
  "nlu": "nlu data string",
  "stories": "stories as string",
  "responses": "responses as string",
  "rules": "rules as string",
  "domain": "domain as string"
}

the HTTP content type header should be set to JSON (and not to markdown or YAML). I think it should work to keep MD the default here.

To support YAML, we can use the content type header, set it to YAML and avoid the json wrapper, e.g. we'd post something like this to the endpoint (with content type set to yaml):

intents:
- greet

stories:
- ...

responses:
- ...

Got what you mean know 馃憤 馃檲
Will implement it later today or tomorrow 馃憤

sounds great 馃挴 let me know how things go so I can update the multimedia responses PR

Other endpoints:

  • retrieve_story should be changed when story writer is done (TODO: Create issue)
  • evaluate_stories works with new payloads

@tmbo The PR is now in review.

@tmbo This is merged btw.

@wochinge It looks like the example payload we provide on our docs page doesn't match the correct YAML spec for the endpoint.

Is application/json only compatible with markdown at the moment?

Is application/json only compatible with markdown at the moment?

Yes!
You need to use application/x-yaml for the yaml payload

Was this page helpful?
0 / 5 - 0 ratings