Description of Problem:
Currently, the examples for NLU training data are all added under a single `examples` key, like this:
```yaml
nlu:
- intent: greet
  examples: |
    - hey
    - hi
    - whats up
    - Hello, how's it going?
```
This does not indicate which training examples were added by the builder of the assistant and which were added through annotations in the NLU Inbox, i.e. which ones come from real conversations. In short, there is no provenance of examples recorded in the training data.
If this distinction is made available in the NLU training file, a lot of tooling can be built to make CDD more efficient. For example, a flag for `rasa train nlu` could let the developer specify the ratio of examples from real vs. synthetic conversations to be picked up for downstream model training (see the sketch below).
Another example: bot builders could then simply eyeball their training data and see how actual user messages differ from the messages they added themselves.
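To make the ratio idea concrete, here is a minimal sketch, assuming each training example carries a hypothetical `source` field; neither the field nor the function exists in Rasa today:

```python
import random

def sample_by_source(examples, real_ratio=0.8, seed=42):
    """Hypothetical sketch: downsample synthetic examples so that roughly
    `real_ratio` of the returned data comes from real conversations.
    Assumes 0 < real_ratio <= 1."""
    real = [e for e in examples if e.get("source") == "users"]
    synthetic = [e for e in examples if e.get("source") != "users"]
    if not real or real_ratio >= 1.0:
        return real  # no real data to anchor the ratio, or only real data requested
    # Solve real / (real + synthetic) == real_ratio for the synthetic count.
    n_synthetic = min(len(synthetic), int(len(real) * (1 - real_ratio) / real_ratio))
    return real + random.Random(seed).sample(synthetic, n_synthetic)
```

A CLI flag (say, `--real-ratio 0.8`) could then simply forward its value to a function like this while the training data is loaded.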
Overview of the Solution:
The distinction between messages from real conversations and hand-crafted examples is already available in Rasa X inside the NLU Inbox. It would be beneficial if this distinction were made clear in the NLU training file as well, e.g. something like:
```yaml
nlu:
- intent: greet
  examples_from_builder: |
    - hey
    - hi
    - whats up
  examples_from_users: |
    - Hello, how's it going?
```
Slack thread of an ongoing discussion on this
Summarizing thoughts from different people below:
@TyDunn raised the concern that this could further motivate people to add more hand-written examples: if we make the distinction clearer and create a dedicated section for them, developers may feel the need to fill up that section even more. This could discourage developers from adding more data from real conversations.
@amn41 suggested that if this distinction is made clearer, we can report the 'health' of the training data during training, based on the number of examples coming from real vs. synthetic conversations: the more examples come from real conversations, the healthier the training data.
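As a rough illustration (the thread does not define the metric, so this formula is only an assumption), the simplest possible health score would be the share of examples that come from real conversations:

```python
def training_data_health(n_real: int, n_synthetic: int) -> float:
    """Share of training examples that come from real conversations.
    The exact 'health' metric is undecided; this is the simplest candidate."""
    total = n_real + n_synthetic
    return n_real / total if total else 0.0
```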
@philippwolf1 mentioned that this has come up a couple of times with prospects who want to understand which auto-generated data is simply not representative of real conversations.
My personal take goes a bit further:
This also lets us augment the existing training data with more examples (for example, through paraphrasing) while keeping them in a separate section, so that downstream training can intelligently pick the ratio of data points from each section to make the model more robust.
It also helps users see the benefit of CDD more quickly than they can right now. If they start building their assistant with, say, 2000 hand-written training examples, it takes a while for the CDD process to match that number of examples in the training data. And even once 2000 more examples from real conversations are available, the downstream models still suffer from the synthetic messages, so it is better to have a way to drop them during model training than to use all of them combined. This distinction would again enable that. See the Slack thread for a concrete real-world example of this.
If users still feel compelled to come up with training examples themselves, we may need to do more to show them the benefits of CDD.
Since we have adopted YAML as the training data format, we can simply add more attributes to an object without breaking the currently adopted format, e.g.:
```yaml
- intent: greet
  examples: |
    - hey
    - hi
    - whats up
  example_source: users
  examples: |
    - Hi bot!
    - Hello sir
    - Top of the morning
  example_source: augmentation
  examples: |
    - Hola!
```
Hey @dakshvar22
What about using the `metadata` field to distinguish training examples from each other? That's its main purpose.
@degiz Correct me if I'm wrong, but isn't metadata attached to each training example rather than to a group of them? As Tom mentioned in the Slack thread, I don't think it's the best use case for it.
There are two kinds of metadata: one at the training-example level, and another at the intent level.
But I see that it might not be the perfect solution.
@degiz Can you please show with an example how those would look in this context?
@dakshvar22 your example would look like this:
```yaml
nlu:
- intent: greet
  metadata:
    example_source: users
  examples: |
    - hey
    - hi
    - whats up
- intent: greet
  metadata:
    example_source: augmentation
  examples: |
    - Hi bot!
    - Hello sir
    - Top of the morning
- intent: greet
  examples: |
    - Hola!
```
This will not work today, because intent-level metadata gets overwritten every time the intent reappears. But we could change the implementation while keeping the same syntax, and apply the intent-level metadata to the training examples it belongs to.
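A minimal sketch of that implementation change, assuming the parsed YAML is available as a list of dicts (this helper is hypothetical, not part of Rasa's actual loader):

```python
def attach_intent_metadata(nlu_items):
    """Propagate each block's intent-level metadata onto its own examples,
    instead of letting later blocks of the same intent overwrite it."""
    examples = []
    for item in nlu_items:
        metadata = item.get("metadata", {})
        for line in item.get("examples", "").splitlines():
            text = line.strip().removeprefix("- ")  # drop the leading "- " bullet
            if text:
                examples.append({"intent": item["intent"], "text": text, "metadata": metadata})
    return examples
```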
I also don't understand one thing: should this "source of data" influence the training in any way, or is it just a visual distinction?
Thanks @degiz. The idea is to use this "source of data" to influence training in future features aligned with the concept of CDD. I see this as a prerequisite for them.
@TyDunn Did we decide something on this yet?
@dakshvar22 We did not. Let's put it into production inbox / 2.2 to start the discussion again