Rasa: Unable to recognize multiple entities of same type in a sentence without any separation symbol (or a single space)

Created on 4 Aug 2020 · 10 Comments · Source: RasaHQ/rasa

Test file for reference: https://github.com/RasaHQ/rasa/blob/2b12852ae04aa2d9de6bacdc5b44d1894295fb27/tests/nlu/extractors/test_extractor.py

(
    "Amsterdam Berlin and London",
    {
        "entity": ["city", "city", "O", "city"],
        "role": ["O", "O", "O", "O"],
        "group": ["O", "O", "O", "O"],
    },
    None,
    [
        {"entity": "city", "start": 0, "end": 16, "value": "Amsterdam Berlin"},
        {"entity": "city", "start": 21, "end": 27, "value": "London"},
    ],
),

The expected output should be:
{"entity": "city", "start": 0, "end": 9, "value": "Amsterdam"},
{"entity": "city", "start": 10, "end": 16, "value": "Berlin"},
{"entity": "city", "start": 21, "end": 27, "value": "London"}

This is because Amsterdam (U-city) and Berlin (U-city) are two separate city entities.

praneethgb
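
The behaviour reported above can be reproduced with a few lines of plain Python. This is only a minimal sketch, not Rasa's actual implementation: the helper names `whitespace_tokens` and `merge_adjacent_same_label` are hypothetical, and the text is assumed to be tokenized on single spaces. It shows that once the per-token labels carry no BILOU prefix, merging neighbouring tokens with the same label necessarily fuses "Amsterdam" and "Berlin".

```python
from typing import Dict, List, Tuple


def whitespace_tokens(text: str) -> List[Tuple[str, int, int]]:
    """Split on single spaces and keep character offsets (end is exclusive)."""
    tokens, pos = [], 0
    for word in text.split(" "):
        tokens.append((word, pos, pos + len(word)))
        pos += len(word) + 1  # skip the separating space
    return tokens


def merge_adjacent_same_label(text: str, labels: List[str]) -> List[Dict]:
    """Merge consecutive tokens that carry the same non-"O" label (hypothetical helper)."""
    entities: List[Dict] = []
    previous = "O"
    for (_, start, end), label in zip(whitespace_tokens(text), labels):
        if label != "O":
            if label == previous:
                entities[-1]["end"] = end  # extend the running entity
            else:
                entities.append({"entity": label, "start": start, "end": end})
        previous = label
    for entity in entities:
        entity["value"] = text[entity["start"]:entity["end"]]
    return entities


merged = merge_adjacent_same_label(
    "Amsterdam Berlin and London", ["city", "city", "O", "city"]
)
for entity in merged:
    print(entity)
```

With the labels from the test case, the first entity spans characters 0 to 16 ("Amsterdam Berlin") and the second spans 21 to 27 ("London"), which matches the output the test file currently expects.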

Most helpful comment

I am starting to run into similar issues with some queries, so I am sharing my thoughts (I have not had much exposure to how BILOU tagging is used in DIET or in the NLU data).

> @praneethgb Not sure I understand what you are suggesting. I know that "Amsterdam Berlin" is not a city and it is not ideal that we capture it as one entity. The problem is that if we don't merge entities with the same tag when they appear right next to each other, we would not be able to detect "San Francisco", for example; it would always be detected as "San" and "Francisco", two independent entities, which is also not ideal.

This might be a bit optimistic, but this is how I think the model should behave:
Input | Correct/expected prediction | Processed prediction (merging BILOU tags)
---|---|---
Amsterdam Berlin | [Amsterdam](U-city) [Berlin](U-city) | [Amsterdam](city) [Berlin](city)
San Francisco | [San](B-city) [Francisco](L-city) | [San Francisco](city)

> What is your idea to support both cases? Please keep in mind that not all users are using BILOU tagging.

Yes, I still get confused by BILOU tagging myself, but a mapping like the one shown above would be convenient for users.

> Also, if you add a comma in between your ingredients, everything should be extracted as expected. E.g. "my ingredients are eggs, lemon juice, and milk".

We (as developers, engineers, and researchers) can have our own guidelines, but it is sometimes frustrating for users who interact with the chatbot to have to follow a certain format. It would be better to support more ways of writing queries, e.g. both "eggs lemon juice and milk" and "eggs, lemon juice and milk", as long as doing so does not hurt the model's performance.

AMR-KELEG on 10 Aug 2020

👍2

All 10 comments

Hi @tabergma, @tmbo,

would you be able to provide your input on this issue?

praneethgb on 4 Aug 2020

Thanks for the issue, @tttthomasssss will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

sara-tagger on 5 Aug 2020

@praneethgb This is expected behaviour. For more explanation see this PR and the related forum post.

tabergma on 5 Aug 2020

Hi @tabergma,

The problem is that "Amsterdam Berlin" is not a city name.

For example, consider this use case: my ingredients are eggs(ingredients) lemon juice(ingredients) and milk(ingredients).
The ASR output was 'my ingredients are eggs lemon juice and milk.'

When the DIET model is trained for NER, it is trained to recognize the tokens as eggs: U-ingredients, lemon: B-ingredients, juice: L-ingredients, milk: U-ingredients.

After postprocessing, the result is also expected to be three ingredients: eggs, lemon juice, and milk. Instead, eggs and lemon juice are merged into one entity.

DIET removes the BILOU tags at https://github.com/RasaHQ/rasa/blob/master/rasa/nlu/classifiers/diet_classifier.py#L938

praneethgb on 5 Aug 2020

👍1
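
The ingredients example above can be sketched as a BILOU-aware decoding step. This is an illustrative sketch under stated assumptions, not the code in `diet_classifier.py`: the function `bilou_to_entities` is a hypothetical name, tokens are assumed to be separated by single spaces, and the tags are assumed to still carry their `B-`/`I-`/`L-`/`U-` prefixes at the point where spans are built.

```python
from typing import Dict, List, Optional


def bilou_to_entities(text: str, tags: List[str]) -> List[Dict]:
    """Build entity spans from per-token BILOU tags (hypothetical helper)."""
    entities: List[Dict] = []
    open_entity: Optional[Dict] = None
    pos = 0
    for word, tag in zip(text.split(" "), tags):
        start, end = pos, pos + len(word)
        pos = end + 1  # skip the separating space
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            # A unit tag is always its own entity, even next to the same label.
            entities.append({"entity": label, "start": start, "end": end})
            open_entity = None
        elif prefix == "B":
            open_entity = {"entity": label, "start": start, "end": end}
            entities.append(open_entity)
        elif prefix in ("I", "L") and open_entity and open_entity["entity"] == label:
            open_entity["end"] = end  # extend the multi-token entity
            if prefix == "L":
                open_entity = None
        else:
            # "O", or an I-/L- tag with no matching open entity, is skipped.
            open_entity = None
    for entity in entities:
        entity["value"] = text[entity["start"]:entity["end"]]
    return entities


ingredients = bilou_to_entities(
    "my ingredients are eggs lemon juice and milk",
    ["O", "O", "O", "U-ingredients", "B-ingredients", "L-ingredients", "O", "U-ingredients"],
)
print([entity["value"] for entity in ingredients])
```

Because `U-` always starts and ends an entity, "eggs" stays separate from "lemon juice", while `B-`/`L-` still joins "lemon" and "juice" into one span.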

@praneethgb Not sure I understand what you are suggesting. I know that "Amsterdam Berlin" is not a city and it is not ideal that we capture it as one entity. The problem is that if we don't merge entities with the same tag when they appear right next to each other, we would not be able to detect "San Francisco", for example; it would always be detected as "San" and "Francisco", two independent entities, which is also not ideal.
What is your idea to support both cases? Please keep in mind that not all users are using BILOU tagging.
Also, if you add a comma in between your ingredients, everything should be extracted as expected. E.g. "my ingredients are eggs, lemon juice, and milk".

tabergma on 10 Aug 2020

Yeah, I think we can update this for BILOU tagging, but I guess it will not be possible if the model is trained without BILOU tagging. @praneethgb or @AMR-KELEG, would either of you be willing to create a PR for this?

tabergma on 10 Aug 2020

👍1
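
The point about models trained without BILOU tagging can be illustrated with a tiny sketch. The helper `strip_bilou` is a hypothetical name and the tag sequences are made up for illustration: once the prefixes are gone, two neighbouring single-token cities and one two-token city produce the same label sequence, so no postprocessing step can tell them apart from the labels alone.

```python
from typing import List

BILOU_PREFIXES = ("B-", "I-", "L-", "U-")


def strip_bilou(tags: List[str]) -> List[str]:
    """Drop the BILOU prefix from each tag, leaving only the entity label."""
    return [tag[2:] if tag.startswith(BILOU_PREFIXES) else tag for tag in tags]


two_cities = ["U-city", "U-city"]  # e.g. "Amsterdam Berlin"
one_city = ["B-city", "L-city"]    # e.g. "San Francisco"

# Both collapse to the same sequence, so the boundary information is lost.
print(strip_bilou(two_cities), strip_bilou(one_city))
```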

> Yeah, I think we can update this for BILOU tagging, but I guess it will not be possible if the model is trained without BILOU tagging. @praneethgb or @AMR-KELEG, would either of you be willing to create a PR for this?

I will need to have a look first, but yes, I am willing to work on it.

AMR-KELEG on 10 Aug 2020

👍1

> Also, if you add a comma in between your ingredients, everything should be extracted as expected. E.g. "my ingredients are eggs, lemon juice, and milk".

Also, if we use voice input, commas will not be present in the input at all, because the text comes from Automatic Speech Recognition models.