Rasa NLU version: 1.9
Used backend / pipeline: spacy_sklearn
Operating system: OSX
Issue:
Creating this issue per @wrathagom's recommendation. I'm trying to determine how much training data is "enough training data to generalize" for entity extraction with the spacy_sklearn pipeline (using ner_crf).
How many examples are "enough"?
Should those examples include several different possible cuisine types?
And if you have enough, should we expect the model to generalize to new cuisine types it has not seen?
Some of these things are lacking in the documentation. I've been working on a model and have tried as few as 40 examples and as many as 200,000 examples, but can't seem to find the sweet spot that gets the model to extract entities correctly.
As noted on gitter, my training examples look like the following:
{
  "intent": "search",
  "entities": [
    {
      "start": 8,
      "end": 12,
      "value": "mail",
      "entity": "record_type_property"
    },
    {
      "start": 19,
      "end": 34,
      "value": "Subject.keyword",
      "entity": "search_property"
    },
    {
      "start": 47,
      "end": 56,
      "value": "Caulfield",
      "entity": "search_value"
    }
  ],
  "text": "Show me mail where Subject.keyword is equal to Caulfield"
}
I've then tried asking the models the following question: "Show me all of the calls with John in the Subject"
The intent matching seems to work, but the entity extraction usually falls down: sometimes it extracts 0 entities, other times only 1 or 2.
I'm wondering if there is better guidance on how my training examples should be formatted.
Thanks in advance!
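One thing worth ruling out with training examples in this format is an offset mismatch, where an entity's start/end do not slice out exactly its annotated value. The following is a minimal sketch in plain Python, not part of Rasa NLU; the helper name `find_offset_mismatches` is hypothetical, and it assumes `end` is exclusive, as in the example above.

```python
def find_offset_mismatches(example):
    """Return (entity, annotated_value, sliced_text) for each entity whose
    start/end offsets do not slice out its annotated value."""
    text = example["text"]
    mismatches = []
    for ent in example["entities"]:
        # Assumes "end" is exclusive, matching the training example above.
        sliced = text[ent["start"]:ent["end"]]
        if sliced != ent["value"]:
            mismatches.append((ent["entity"], ent["value"], sliced))
    return mismatches


example = {
    "intent": "search",
    "entities": [
        {"start": 8, "end": 12, "value": "mail", "entity": "record_type_property"},
        {"start": 19, "end": 34, "value": "Subject.keyword", "entity": "search_property"},
        {"start": 47, "end": 56, "value": "Caulfield", "entity": "search_value"},
    ],
    "text": "Show me mail where Subject.keyword is equal to Caulfield",
}

# An empty list means every annotated offset lines up with its value.
print(find_offset_mismatches(example))
```

For the example posted in this issue the check finds no mismatches, so the annotation offsets themselves do not appear to be the problem.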
@tmbo @amn41 I've been trying to help out with this on Gitter, but would definitely be interested in your input. Given the above, I asked what the cardinality of each of the entities shown was. Here was @timtutt's response:
record_type_properties: trained with 1 unique value
search_properties: trained with between 1 and 45 unique values
search_values: trained with between 1 and 10 unique values
number of unique phrases: between 40 and 200,000
Given the low number of values for the first few entities, I suggested moving them to intent variants instead. So rather than just having search as the intent, I suggested he have search_mail_by_subject as an intent and leave search_value as his only entity.
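As a rough illustration of that restructuring, the sketch below rewrites one training example so the low-cardinality entities become part of the intent name. This is plain Python, not a Rasa NLU API; the helper `fold_entities_into_intent` and the naming rule (lowercase the record type, take the part of the property before the first dot) are assumptions chosen only so the result matches the search_mail_by_subject name suggested above.

```python
def fold_entities_into_intent(example):
    """Fold record_type_property and search_property into the intent name,
    keeping search_value as the only annotated entity."""
    by_type = {ent["entity"]: ent for ent in example["entities"]}
    record = by_type["record_type_property"]["value"].lower()
    # Assumed naming rule: "Subject.keyword" -> "subject"
    prop = by_type["search_property"]["value"].split(".")[0].lower()
    return {
        "intent": f"{example['intent']}_{record}_by_{prop}",
        "entities": [e for e in example["entities"] if e["entity"] == "search_value"],
        "text": example["text"],
    }


original = {
    "intent": "search",
    "entities": [
        {"start": 8, "end": 12, "value": "mail", "entity": "record_type_property"},
        {"start": 19, "end": 34, "value": "Subject.keyword", "entity": "search_property"},
        {"start": 47, "end": 56, "value": "Caulfield", "entity": "search_value"},
    ],
    "text": "Show me mail where Subject.keyword is equal to Caulfield",
}

folded = fold_entities_into_intent(original)
print(folded["intent"])
print(folded["entities"])
```

The trade-off is that the number of intents grows with every record type and property combination, while the entity extractor is left with a single entity type to learn.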
@timtutt I saw your reply on Gitter that this change did result in an improvement. Can we close or still not quite there yet?
Last I knew you were going to try with a bigger dataset.
I'm working on the larger dataset - it actually is taking a lot longer to
train than before (I believe due to the increased number of intents), but I
think we can go ahead and close this as this method appears to work.
Thanks much for your help and insights here.
Alright, let us know if you need anything else.