rasa NLU version (e.g. 0.7.3):
Used backend / pipeline (spacy_sklearn):
Operating system (Ubuntu 64bit):
demo-rasa.json has duplicate training data:
https://github.com/golastmile/rasa_nlu/blob/master/data/examples/rasa/demo-rasa.json
I just noticed that demo-rasa.json has duplicate data in "common_examples": every example appears exactly twice.
I looked at the history of the file and found that this happened recently, when the two sections "entity_examples" and "intent_examples" were merged into "common_examples" (btw, both sections had the same data).
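For reference, removing the duplicates can be sketched roughly as below. This is only a minimal sketch: it assumes the file keeps its examples as a list under `rasa_nlu_data` / `common_examples`, and the function name `dedupe_common_examples` is purely illustrative, not part of rasa NLU.

```python
import json


def dedupe_common_examples(data):
    """Return a copy of the training data without exact duplicate examples.

    Assumes the layout {"rasa_nlu_data": {"common_examples": [...]}}.
    """
    examples = data["rasa_nlu_data"]["common_examples"]
    seen = set()
    unique = []
    for example in examples:
        # Serialize with sorted keys so identical examples map to the same key
        key = json.dumps(example, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(example)  # keep the first occurrence, preserve order
    # Shallow copies so the input dictionary is left unchanged
    result = dict(data)
    result["rasa_nlu_data"] = dict(data["rasa_nlu_data"], common_examples=unique)
    return result


# Possible usage (the path is the one from the repository layout linked above):
# with open("data/examples/rasa/demo-rasa.json") as f:
#     cleaned = dedupe_common_examples(json.load(f))
```

Only examples that are identical in every field are dropped, so two examples with the same text but different entity annotations would both be kept.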
But now, if I remove the duplicates from "common_examples" and train the model, I get the following warnings:
....../sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
....../sklearn/metrics/classification.py:1115: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.
'recall', 'true', average, warn_for)
Of course, the confidence level in the results is also badly affected: e.g. trying "chinese restaurant in the center" no longer classifies the "cuisine" entity, and the confidence level drops to under 4.
If I keep the duplicate data, it runs fine and without any warnings... This behavior is strange to me. Can someone please explain why the data duplication is required?
Thanks
Good catch!
So actually the duplication shouldn't be in there and we will remove it. The duplication of training examples also doesn't add any advantage during training (I know this sounds wrong given the change in confidence levels). Let me explain:
Thanks for the detailed explanation and for fixing it so quickly.