Rasa: Training on demo-rasa.json results in UndefinedMetricWarning

Created on 20 Apr 2017 · 2 comments · Source: RasaHQ/rasa

rasa NLU version (e.g. 0.7.3):

Used backend / pipeline (spacy_sklearn):

Operating system (Ubuntu 64-bit):

demo-rasa.json has duplicate training data:
https://github.com/golastmile/rasa_nlu/blob/master/data/examples/rasa/demo-rasa.json

I just noticed that demo-rasa.json has duplicate data in "common_examples": the whole set of examples is copied exactly twice.
Looking at the file's history, this happened when the two sections "entity_examples" and "intent_examples" were recently merged into "common_examples" (both sections contained the same data).
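
For anyone who wants to verify or remove the duplication, here is a minimal sketch (plain Python; it assumes the usual "rasa_nlu_data" → "common_examples" layout of the training file, and the paths are illustrative, not the ones from the repo):

    import json
    from collections import Counter

    # Illustrative path -- adjust to wherever your copy of demo-rasa.json lives.
    with open("data/examples/rasa/demo-rasa.json") as f:
        data = json.load(f)

    examples = data["rasa_nlu_data"]["common_examples"]

    # Serialize each example so identical entries (text + intent + entities) compare equal.
    counts = Counter(json.dumps(ex, sort_keys=True) for ex in examples)
    print("total examples:", len(examples))
    print("examples appearing more than once:", sum(1 for c in counts.values() if c > 1))

    # Keep a single copy of each example and write the cleaned file back out.
    data["rasa_nlu_data"]["common_examples"] = [json.loads(ex) for ex in counts]
    with open("demo-rasa-deduped.json", "w") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)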

But now if I remove the duplicates from "common_examples" and train the model, I get the following warnings:

....../sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for)
....../sklearn/metrics/classification.py:1115: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples. 'recall', 'true', average, warn_for)

Of course the confidence levels in the results are also badly affected: for example, with "chinese restaurant in the center" the "cuisine" entity is no longer classified, and the confidence drops to under 0.4.

If I keep the duplicate data, training runs fine without any warnings... This behavior is strange to me; can someone please explain why the data duplication is required?

Thanks

All 2 comments

Good catch!

So actually the duplication shouldn't be in there and we will remove it. The duplication of training examples also doesn't add any advantage during training (I know this sounds wrong given the change in confidence levels). Let me explain:

  • The warning is just that: a warning. It indicates that there are too few training examples for one or more of the intents. Adding more examples fixes it (that's why adding duplicates makes the warning go away, but you should really be adding different examples; see the toy sklearn sketch after this list).
  • The confidence level changes because the duplicated data gets "burned in" (more formally: overfitting) and the model learns to recognize exactly these examples. Usually you want to avoid overfitting, because it leads to worse generalization (the model performs worse on new data). That is the bad side of overfitting; the "good" side is that the model becomes more confident that it made the right choice on known training samples.
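
To make the first point concrete: the warnings come from scikit-learn's precision/recall computation (presumably from the cross-validated evaluation the sklearn intent classifier runs while training). Whenever a label never shows up among the predictions of a fold (or, for the second warning, among its true labels), the corresponding F-score is 0/0 and scikit-learn sets it to 0.0 while emitting exactly this warning. A toy sketch, independent of Rasa:

    from sklearn.metrics import precision_recall_fscore_support

    # Label 2 exists in y_true but is never predicted, so its precision and
    # F-score are 0/0 -- scikit-learn warns and reports 0.0 for that label.
    y_true = [0, 0, 1, 1, 2]
    y_pred = [0, 0, 1, 1, 1]

    precision, recall, fscore, support = precision_recall_fscore_support(
        y_true, y_pred, average=None)
    print(precision)  # approximately [1.0, 0.67, 0.0] -- the 0.0 is the ill-defined label
    print(fscore)

With only one or two examples per intent, cross-validation folds easily end up with intents that have no predicted (or no true) samples, which is exactly what the two warnings describe; adding more distinct examples per intent is the real fix.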

Thanks for the detailed explanation and for fixing it so quickly.
