Description of Problem:
Currently, the examples for NLU training data are all added under a single `examples` key, like this:
```yaml
nlu:
- intent: greet
  examples: |
    - hey
    - hi
    - whats up
    - Hello, how's it going?
```
This does not indicate which training examples were added by the builder of the assistant and which were added through annotations in the NLU Inbox, i.e. which ones come from real conversations. In short, there is no provenance of examples recorded in the training data.
If this distinction is made available in the NLU training file, a lot of tooling can be built to make CDD more efficient. For example, a flag for `rasa train nlu` could let the developer specify the ratio of examples from real vs. synthetic conversations to be picked up for downstream model training (see the sketch below).
Another example: bot builders could then simply eyeball their training data and see how actual user messages differ from the messages they added themselves.
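To make the ratio idea concrete, here is a minimal sketch, assuming each training example carries a hypothetical `source` field; neither the field nor the function exists in Rasa today:

```python
import random

def sample_by_source(examples, real_ratio=0.8, seed=42):
    """Hypothetical sketch: downsample synthetic examples so that roughly
    `real_ratio` of the returned data comes from real conversations.
    Assumes 0 < real_ratio <= 1."""
    real = [e for e in examples if e.get("source") == "users"]
    synthetic = [e for e in examples if e.get("source") != "users"]
    if not real or real_ratio >= 1.0:
        return real  # no real data to anchor the ratio, or only real data requested
    # Solve real / (real + synthetic) == real_ratio for the synthetic count.
    n_synthetic = min(len(synthetic), int(len(real) * (1 - real_ratio) / real_ratio))
    return real + random.Random(seed).sample(synthetic, n_synthetic)
```

A CLI flag (say, `--real-ratio 0.8`) could then simply forward its value to a function like this while the training data is loaded.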
Overview of the Solution:
The distinction between messages from real conversations and hand-crafted examples is already available in Rasa X inside the NLU Inbox. It would be beneficial if this distinction were made clear in the NLU training file as well, e.g. something like:
```yaml
nlu:
- intent: greet
  examples_from_builder: |
    - hey
    - hi
    - whats up
  examples_from_users: |
    - Hello, how's it going?
```
Slack thread of an ongoing discussion on this
Summarizing thoughts from different people below:
@TyDunn raised the concern that this could further motivate people to add more hand-written examples: if we make the distinction clearer and create a dedicated section for them, developers may feel the need to fill up that section even more. This could discourage developers from adding more data from real conversations.
@amn41 suggested that if this distinction is made clearer, we can report the 'health' of the training data during training, based on the number of examples coming from real vs. synthetic conversations: the more examples come from real conversations, the healthier the training data.
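As a rough illustration (the thread does not define the metric, so this formula is only an assumption), the simplest possible health score would be the share of examples that come from real conversations:

```python
def training_data_health(n_real: int, n_synthetic: int) -> float:
    """Share of training examples that come from real conversations.
    The exact 'health' metric is undecided; this is the simplest candidate."""
    total = n_real + n_synthetic
    return n_real / total if total else 0.0
```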
@philippwolf1 mentioned that this has come up a couple of times with prospects who want to understand which auto-generated data is simply not representative of real conversations.
My personal take goes a bit further:
This also lets us augment the existing training data with more examples (for example, through paraphrasing) while keeping them in a separate section, so that downstream training can intelligently pick the ratio of data points from each section to make the model more robust.
It also helps users see the benefit of CDD more quickly than they can right now. If they start building their assistant with, say, 2000 hand-written training examples, it takes a while for the CDD process to match that number of examples in the training data. And even once 2000 more examples from real conversations are available, the downstream models still suffer from the synthetic messages, so it is better to have a way to drop them during model training than to use all of them combined. This distinction would again enable that. See the Slack thread for a concrete real-world example of this.
If users still feel compelled to come up with training examples themselves, we may need to do more to show them the benefits of CDD.
Since we have adopted YAML as the training data format, we can simply add more attributes to an object without breaking the currently adopted format, e.g.:
```yaml
- intent: greet
  examples: |
    - hey
    - hi
    - whats up
  example_source: users
  examples: |
    - Hi bot!
    - Hello sir
    - Top of the morning
  example_source: augmentation
  examples: |
    - Hola!
```
Hey @dakshvar22
What about using the `metadata` field to distinguish training examples from each other? That's its main purpose.
@degiz Correct me if I'm wrong, but isn't metadata attached to each training example rather than to a group of them? As Tom mentioned in the Slack thread, I don't think it's the best use case for it.
There are two kinds of metadata: one at the training-example level, and another at the intent level.
But I see that it might not be the perfect solution.
@degiz Can you please show with an example how those would look in this context?
@dakshvar22 your example would look like this:
```yaml
nlu:
- intent: greet
  metadata:
    example_source: users
  examples: |
    - hey
    - hi
    - whats up
- intent: greet
  metadata:
    example_source: augmentation
  examples: |
    - Hi bot!
    - Hello sir
    - Top of the morning
- intent: greet
  examples: |
    - Hola!
```
This will not work today, because intent-level metadata gets overwritten every time the intent reappears. But we could change the implementation while keeping the same syntax, and apply the intent-level metadata to the training examples it belongs to.
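A minimal sketch of that implementation change, assuming the parsed YAML is available as a list of dicts (this helper is hypothetical, not part of Rasa's actual loader):

```python
def attach_intent_metadata(nlu_items):
    """Propagate each block's intent-level metadata onto its own examples,
    instead of letting later blocks of the same intent overwrite it."""
    examples = []
    for item in nlu_items:
        metadata = item.get("metadata", {})
        for line in item.get("examples", "").splitlines():
            text = line.strip().removeprefix("- ")  # drop the leading "- " bullet
            if text:
                examples.append({"intent": item["intent"], "text": text, "metadata": metadata})
    return examples
```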
I also don't understand one thing: should this "source of data" influence the training in any way, or is it just a visual distinction?
Thanks @degiz. The idea is to use this "source of data" to influence training in future features aligned with the concept of CDD. I see this as a prerequisite for them.
@TyDunn Did we decide something on this yet?
@dakshvar22 We did not. Let's put it into production inbox / 2.2 to start the discussion again