test file for reference: https://github.com/RasaHQ/rasa/blob/2b12852ae04aa2d9de6bacdc5b44d1894295fb27/tests/nlu/extractors/test_extractor.py
    (
        "Amsterdam Berlin and London",
        {
            "entity": ["city", "city", "O", "city"],
            "role": ["O", "O", "O", "O"],
            "group": ["O", "O", "O", "O"],
        },
        None,
        [
            {"entity": "city", "start": 0, "end": 16, "value": "Amsterdam Berlin"},
            {"entity": "city", "start": 21, "end": 27, "value": "London"},
        ],
    ),
The expected output should be:

    {"entity": "city", "start": 0, "end": 9, "value": "Amsterdam"},
    {"entity": "city", "start": 10, "end": 16, "value": "Berlin"},
    {"entity": "city", "start": 21, "end": 27, "value": "London"}

because Amsterdam (U-city) and Berlin (U-city) are different city entities.
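As a quick check of the character offsets, here is a minimal sketch that computes them for the example sentence. It assumes plain whitespace tokenization and an exclusive `end`, as in the "London" entry (21 to 27); `token_spans` is a hypothetical helper, not part of Rasa.

```python
import re


def token_spans(text):
    """Return (token, start, end) per whitespace-separated token; end is exclusive."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


spans = token_spans("Amsterdam Berlin and London")
for token, start, end in spans:
    print(token, start, end)
```

Under these assumptions "Amsterdam" covers 0 to 9 and "Berlin" covers 10 to 16.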
Hi @tabergma, @tmbo
Would you be able to provide your input on this issue?
@praneethgb This is expected behaviour. For more explanation see this PR and the related forum post.
Hi @tabergma,
The problem is that "Amsterdam Berlin" is not a city name.
For example, consider this use case: my ingredients are eggs (ingredients), lemon juice (ingredients) and milk (ingredients).
The ASR output was "my ingredients are eggs lemon juice and milk".
When the DIET model is trained for NER, it learns to tag them as eggs: U-ingredients, lemon: B-ingredients, juice: L-ingredients, milk: U-ingredients.
After postprocessing, the expected result is three ingredients: eggs, lemon juice, and milk. Instead, eggs and lemon juice are merged into one ingredient.
DIET removes the BILOU tags here: https://github.com/RasaHQ/rasa/blob/master/rasa/nlu/classifiers/diet_classifier.py#L938
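To make the effect concrete, here is a minimal sketch of what happens when the BILOU prefixes are dropped before neighbouring tokens with the same label are merged. This is not the Rasa code; `strip_bilou` and `merge_same_label` are hypothetical helpers written only to illustrate the behaviour described above.

```python
def strip_bilou(tag):
    """Drop a BILOU prefix such as 'U-' or 'B-'; 'O' is returned unchanged."""
    if len(tag) > 2 and tag[1] == "-" and tag[0] in "BILU":
        return tag[2:]
    return tag


def merge_same_label(tokens, tags):
    """Merge runs of consecutive tokens whose prefix-free labels are equal."""
    entities = []
    previous = "O"
    for token, tag in zip(tokens, tags):
        label = strip_bilou(tag)
        if label != "O":
            if label == previous:
                # Same label as the previous token: extend the last entity.
                entities[-1]["value"] += " " + token
            else:
                entities.append({"entity": label, "value": token})
        previous = label
    return entities


tokens = ["eggs", "lemon", "juice", "and", "milk"]
tags = ["U-ingredients", "B-ingredients", "L-ingredients", "O", "U-ingredients"]
merged = merge_same_label(tokens, tags)
print(merged)
```

Once the prefixes are gone, the boundary between "eggs" (U-) and "lemon" (B-) is lost, so the sketch returns "eggs lemon juice" and "milk" instead of three ingredients.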
@praneethgb Not sure I understand what you are suggesting. I know that "Amsterdam Berlin" is not a city and that it is not ideal to capture it as one entity. The problem is that if we don't merge entities with the same tag when they appear right next to each other, we would not be able to detect "San Francisco", for example. It would always be detected as "San" and "Francisco", two independent entities, which is also not ideal.
What is your idea to support both cases? Please keep in mind that not all users are using BILOU tagging.
Also, if you add a comma in between your ingredients, everything should be extracted as expected. E.g. "my ingredients are eggs, lemon juice, and milk".
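The trade-off can be sketched as follows, assuming plain tags (no BILOU) and a merge rule that joins neighbouring tokens with the same label only when nothing but whitespace separates them. The rule, the `merge_adjacent` helper, and the word-to-label lookup that stands in for model predictions are all assumptions for illustration, not the actual Rasa implementation.

```python
import re


def merge_adjacent(text, word_labels):
    """Merge same-label neighbours that are separated only by whitespace."""
    entities = []
    for match in re.finditer(r"\w+", text):
        # Hypothetical stand-in for the model: look the word up in a dict.
        label = word_labels.get(match.group(), "O")
        if label == "O":
            continue
        last = entities[-1] if entities else None
        if (
            last is not None
            and last["entity"] == label
            and text[last["end"]:match.start()].strip() == ""
        ):
            # Only whitespace in between: extend the previous entity.
            last["end"] = match.end()
            last["value"] = text[last["start"]:last["end"]]
        else:
            entities.append(
                {
                    "entity": label,
                    "start": match.start(),
                    "end": match.end(),
                    "value": match.group(),
                }
            )
    return entities


labels = {w: "ingredients" for w in ("eggs", "lemon", "juice", "milk")}
without_commas = merge_adjacent("my ingredients are eggs lemon juice and milk", labels)
with_commas = merge_adjacent("my ingredients are eggs, lemon juice, and milk", labels)
print([e["value"] for e in without_commas])
print([e["value"] for e in with_commas])
```

With plain tags the sketch cannot tell "eggs lemon juice" apart from a genuine multi-word entity, while the comma breaks the adjacency and yields three separate ingredients.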
I am starting to bump into similar issues with some queries, so I am sharing my thoughts (I am not familiar enough with how BILOU tagging is used in DIET/NLU data).
> @praneethgb Not sure I understand what you are suggesting. I know that "Amsterdam Berlin" is not a city and that it is not ideal to capture it as one entity. The problem is that if we don't merge entities with the same tag when they appear right next to each other, we would not be able to detect "San Francisco", for example. It would always be detected as "San" and "Francisco", two independent entities, which is also not ideal.
This might be a bit optimistic but this is how I think the model should behave:
Input | Correct/expected prediction | Processed prediction (merging BILOU tags)
---|---|---
Amsterdam Berlin | [Amsterdam](U-city) [Berlin](U-city) | [Amsterdam](city) [Berlin](city)
San Francisco | [San](B-city) [Francisco](L-city) | [San Francisco](city)
> What is your idea to support both cases? Please keep in mind that not all users are using BILOU tagging.
Yes, I myself still get confused by BILOU tagging but doing a mapping like the one shown above would be convenient to the users.
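The mapping in the table above could be sketched roughly like this. It is only an illustration that assumes the model outputs BILOU tags; `bilou_to_entities` is a hypothetical helper, not an existing Rasa function.

```python
def bilou_to_entities(tokens, tags):
    """Group tokens into entities using the BILOU prefixes instead of plain labels."""
    entities = []
    current = None  # entity opened by a B- tag and not yet closed by an L- tag
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            # A unit-length entity always stands on its own.
            entities.append({"entity": label, "value": token})
            current = None
        elif prefix == "B":
            current = {"entity": label, "value": token}
        elif prefix in ("I", "L") and current is not None and current["entity"] == label:
            current["value"] += " " + token
            if prefix == "L":
                entities.append(current)
                current = None
        else:
            # "O" or an inconsistent tag: drop any unfinished span (a simplification).
            current = None
    return entities


two_cities = bilou_to_entities(["Amsterdam", "Berlin"], ["U-city", "U-city"])
one_city = bilou_to_entities(["San", "Francisco"], ["B-city", "L-city"])
print(two_cities)
print(one_city)
```

In this sketch two U-city tokens stay separate, while a B-city/L-city pair becomes a single entity. It has nothing to work with when the model is trained without BILOU tagging.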
> Also, if you add a comma in between your ingredients, everything should be extracted as expected. E.g. "my ingredients are eggs, lemon juice, and milk".

We (as developers, engineers, and researchers) can have our own guidelines, but it is sometimes frustrating for users who interact with the chatbot to have to follow a certain format. It would be better if we could support more ways of writing queries, e.g. both "eggs lemon juice and milk" and "eggs, lemon juice and milk", as long as doing so does not hurt the model's performance.
Yeah, I think we can update this for BILOU tagging, but I guess it will not be possible in case the model is trained without BILOU tagging. @praneethgb or @AMR-KELEG anyone of you willing to create a PR for this?
> Yeah, I think we can update this for BILOU tagging, but I guess it will not be possible in case the model is trained without BILOU tagging. @praneethgb or @AMR-KELEG anyone of you willing to create a PR for this?
I will need to have a look first but yes I am willing to work on it.
> Also, if you add a comma in between your ingredients, everything should be extracted as expected. E.g. "my ingredients are eggs, lemon juice, and milk".

Also, if we use voice input, the output of Automatic Speech Recognition models will not contain any commas at all.

> Yeah, I think we can update this for BILOU tagging, but I guess it will not be possible in case the model is trained without BILOU tagging
Yes.
Hi @tabergma,
I've created PR: https://github.com/RasaHQ/rasa/pull/6423 to support this use case.