Rasa: Regarding Multiple Entity Extraction

Created on 11 Sep 2017 · 3 Comments · Source: RasaHQ/rasa

@tmbo

rasa NLU version (0.9.2):

Used backend / pipeline (spacy_sklearn):

Operating system (Ubuntu 14.04):

Issue:

The type of text I have is the following:

"Modern Cafe, Texas 3823742 Date: 23/04/2013 Time: 06:23 pm Cold Coffee Amount $5 Chicken Burger Amount $10 Service Charge $0.2 Total Amount $15.2"

The information that I am interested in extracting is Time, Place, Date and Total Amount. I need help in creating and annotating the training data.

1) Should I annotate only the entities I am interested in extracting (i.e. Time, Place, Date and Total Amount), or should I also annotate the parts I don't need as an "Other" or "NULL" entity? (I am using the ner_crf extractor.)
2) I am only interested in the total amount, not the other sub-amounts in the text. How should I annotate my training dataset for that?
3) Could you please tell me what the ideal size of the training data is, in terms of number of examples, so that it starts extracting the correct values from the text?

rayush7

Most helpful comment

Sounds interesting. I think Rasa will work well, but your use case is definitely unique. I think Rasa's abilities will depend on how much the string changes from one receipt to the next. It doesn't rely on grammar rules or anything like that, so that shouldn't be a problem.

If you need any more help, let us know.

@tmbo @amn41 just FYI since this one sounds unique.

wrathagom on 11 Sep 2017

👍3

All 3 comments

I'm happy to help, but first can I ask why you're trying to use NER to extract information from these strings? They seem quite machine-parseable and are not natural language.
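To illustrate what "machine parseable" could mean here, the following is a minimal sketch that pulls the labelled fields out with regular expressions. The patterns and the `parse_receipt` helper are assumptions built from the single example string in the question; they are not part of Rasa, and real receipts will likely need more tolerant patterns. The place has no label in front of it, so this sketch does not attempt to extract it.

```python
import re

receipt = ("Modern Cafe, Texas 3823742 Date: 23/04/2013 Time: 06:23 pm "
           "Cold Coffee Amount $5 Chicken Burger Amount $10 "
           "Service Charge $0.2 Total Amount $15.2")

# Assumed patterns, derived only from the one example receipt above.
PATTERNS = {
    "date": r"Date:\s*(\d{2}/\d{2}/\d{4})",
    "time": r"Time:\s*(\d{1,2}:\d{2}\s*[ap]m)",
    # Anchoring on "Total Amount" skips the per-item "Amount $..." values.
    "total": r"Total Amount\s*(\$\d+(?:\.\d+)?)",
}

def parse_receipt(text):
    """Return the first match for each labelled field, or None if absent."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return fields

fields = parse_receipt(receipt)
print(fields)
```

A rule-based pass like this only holds up while the labels ("Date:", "Time:", "Total Amount") stay stable across receipts, which is the trade-off against a trained extractor.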

Only annotate the pieces you want to extract.
An example annotation is shown below. More information on the training data format can be found in the docs. http://rasa-nlu.readthedocs.io/en/latest/dataformat.html

{
  "text": "Modern Cafe, Texas 3823742 Date: 23/04/2013 Time: 06:23 pm Cold Coffee Amount $5 Chicken Burger Amount $10 Service Charge $0.2 Total Amount $15.2",
  "intent": "",
  "entities": [
    {
      "start": 0,
      "end": 18,
      "value": "Modern Cafe, Texas",
      "entity": "location"
    },
    {
      "start": 33,
      "end": 43,
      "value": "23/04/2013",
      "entity": "date"
    },
    {
      "start": 50,
      "end": 58,
      "value": "06:23 pm",
      "entity": "time"
    },
    {
      "start": 140,
      "end": 145,
      "value": "$15.2",
      "entity": "total"
    }
  ]
}
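Counting character offsets by hand is error-prone, so the sketch below computes them in Python. The `make_entity` helper is hypothetical (it is not a Rasa function); it assumes `start` is inclusive and `end` is exclusive, which matches the offsets in the example above.

```python
import json

def make_entity(text, value, entity, last=False):
    """Build one entity dict so that text[start:end] == value.

    `last=True` takes the final occurrence of `value`, which is handy for
    the total because it normally comes after the sub-amounts.
    """
    start = text.rfind(value) if last else text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} not found in text")
    return {"start": start, "end": start + len(value),
            "value": value, "entity": entity}

text = ("Modern Cafe, Texas 3823742 Date: 23/04/2013 Time: 06:23 pm "
        "Cold Coffee Amount $5 Chicken Burger Amount $10 "
        "Service Charge $0.2 Total Amount $15.2")

entities = [
    make_entity(text, "Modern Cafe, Texas", "location"),
    make_entity(text, "23/04/2013", "date"),
    make_entity(text, "06:23 pm", "time"),
    # Only the total is annotated; the sub-amounts are left unlabelled.
    make_entity(text, "$15.2", "total", last=True),
]

example = {"text": text, "intent": "", "entities": entities}
print(json.dumps(example, indent=2))
```

Leaving "$5", "$10" and "$0.2" unannotated is what tells the extractor that they are not the entity of interest.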

There is no fixed ideal size: you need however many examples it takes for extraction to work reliably on text the model has not seen. If you have 1000 samples, train the model on 600 of them and reserve the remaining 400 to validate against.
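The 600/400 split could be done along these lines. This is a sketch only: `split_examples` is a hypothetical helper, the placeholder examples stand in for real annotated receipts, and the fixed seed is just there to make the split repeatable.

```python
import random

def split_examples(examples, train_fraction=0.6, seed=0):
    """Shuffle a copy of the examples and split it into train / held-out."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder data; in practice these would be the annotated receipts.
examples = [{"text": f"receipt {i}", "intent": "", "entities": []}
            for i in range(1000)]

train, held_out = split_examples(examples)
print(len(train), len(held_out))
```

Shuffling before the cut matters if the receipts were collected vendor by vendor, since otherwise the held-out set could contain only layouts the model never saw during training.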

wrathagom on 11 Sep 2017

@wrathagom Thanks for clarifying my doubts. This is very helpful.

To answer your question: our team is building an expense management system. The input to our system is a receipt image (obtained from the vendor or the shop), which passes through text detection, text recognition and spell checker sub-modules. The string of text in the question above is the output of the spell checker sub-module. Now we want to identify which elements of the string correspond to the date, place, time and total amount. That is where I think named entity extraction could help.

Do you think NER is a good way to solve this problem in a generic way? Is there a better way to solve it? And is Rasa NLU better suited to natural language text than to a bunch of words and numbers put together without following any grammatical rules of the English language (like the text example in the question)?

rayush7 on 11 Sep 2017

