Rasa: Regex Entity Extractor

Created on 26 Jun 2019 · 7 Comments · Source: RasaHQ/rasa

Description of Problem:
I often see people on the forum asking for a regex entity extractor (the featurizer often doesn't help when you need reliable exact-match extraction), and I'm no exception.

example: https://forum.rasa.com/t/unable-to-use-regex-feature/11976/2

Overview of the Solution:
I created a custom component to do this, and it seems to be working well.
I wanted to get the Rasa core developers' opinion on whether this is useful and common enough to add as a built-in component. If so, I will be more than happy to contribute.
Here is the component code.

import os
import re
import warnings
from typing import Any, Dict, List, Optional, Text

# Import paths below assume the Rasa 1.x package layout;
# they may differ in other releases.
import rasa.utils.io
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.extractors import EntityExtractor
from rasa.nlu.model import Metadata
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.utils import write_json_to_file


class RegexEntityExtractor(EntityExtractor):
    # This extractor may be kind of extreme, as it takes the user's message
    # and returns the regex match as-is.
    # Confidence will be 1.0, just like Duckling.

    provides = ["entities"]

    def __init__(
        self,
        component_config: Optional[Dict[Text, Any]] = None,
        regex_features: Optional[List[Dict[Text, Text]]] = None
    ) -> None:
        super(RegexEntityExtractor, self).__init__(component_config)

        # List of {"name": ..., "pattern": ...} dicts from the training data.
        self.regex_feature = regex_features if regex_features else []

    def train(
        self, training_data: TrainingData, config: RasaNLUModelConfig, **kwargs: Any
    ) -> None:
        # Nothing is learned; the patterns are simply copied from the training data.
        self.regex_feature = training_data.regex_features

    @classmethod
    def load(
            cls,
            meta: Dict[Text, Any],
            model_dir: Optional[Text] = None,
            model_metadata: Optional[Metadata] = None,
            cached_component: Optional["RegexEntityExtractor"] = None,
            **kwargs: Any
    ) -> "RegexEntityExtractor":

        file_name = meta.get("file")

        if not file_name:
            # Nothing was persisted, so start with no patterns.
            regex_features = None
            return cls(meta, regex_features)

        # w/o string cast, mypy will tell me
        # expected "Union[str, _PathLike[str]]"
        regex_pattern_file = os.path.join(str(model_dir), file_name)
        if os.path.isfile(regex_pattern_file):
            regex_features = rasa.utils.io.read_json_file(regex_pattern_file)
        else:
            regex_features = None
            warnings.warn(
                "Failed to load regex pattern file from '{}'".format(regex_pattern_file)
            )
        return cls(meta, regex_features)

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        """Persist this component to disk for future loading."""
        if self.regex_feature:
            file_name = file_name + ".json"
            regex_feature_file = os.path.join(model_dir, file_name)
            write_json_to_file(
                regex_feature_file,
                self.regex_feature, separators=(",", ": "))
            return {"file": file_name}
        else:
            return {"file": None}

    def match_regex(self, message: Text) -> List[Dict[Text, Any]]:
        """Return one entity per pattern, for the first match of that pattern."""
        extracted = []
        for d in self.regex_feature:
            match = re.search(pattern=d['pattern'], string=message)
            if match:
                # start()/end() give the span of the match itself;
                # pos/endpos would give the bounds of the searched string.
                entity = {
                    "start": match.start(),
                    "end": match.end(),
                    "value": match.group(),
                    "confidence": 1.0,
                    "entity": d['name'],
                }
                extracted.append(entity)

        extracted = self.add_extractor_name(extracted)
        return extracted

    def process(self, message: Message, **kwargs: Any) -> None:
        """Process an incoming message."""
        extracted = self.match_regex(message.text)
        message.set(
            "entities", message.get("entities", []) + extracted, add_to_output=True
        )

naoko

👍2 🎉1
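
When entity offsets are taken from a Python regex match, it is worth knowing that `Match.start()` / `Match.end()` describe the matched substring, while `Match.pos` / `Match.endpos` describe the region of the string that was searched. The following is a minimal sketch of that difference; the sample sentence and the pattern are made up for illustration and are not part of the component.

```python
import re

# Made-up sample text and pattern, used only to illustrate the offsets.
text = "my confirmation number is ABC-1234567, thanks"
match = re.search(r"\b[A-Z]{3}-\d{7}\b", text)
assert match is not None

# Span of the matched substring: what an entity's "start"/"end" should hold.
span = (match.start(), match.end())

# Bounds of the region that was searched: for re.search() on a whole
# string this is always (0, len(text)), regardless of where the match is.
search_bounds = (match.pos, match.endpos)

print(text[span[0]:span[1]])             # the matched value
print(search_bounds == (0, len(text)))   # True
```

Slicing the message with the `start()`/`end()` span gives back the entity value, which is a quick sanity check for any offsets an extractor reports.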

All 7 comments

@naoko can I ask why you're not using the existing regex features?

akelad on 27 Jun 2019

This is different from the featurizer. The regex features are fed into the NER CRF, right?
So if you try to teach it "exactly 7 digits" using the regex featurizer and NER CRF, without a lot of examples that have some words before/after the entity (e.g. my account number is 1234567), it won't work well in my experience.

For instance, I have flight confirmation numbers that always follow the format [A-Z]{3}-\d{7}.
You can add many examples and the NER CRF will learn the pattern to some degree, but if the examples contain ABC-1234567 and you enter XYZ-9876543, it fails to recognize it. Maybe many more examples would help? Sure, but for strict formats such as flight confirmation numbers, account numbers, etc., I think it would be more straightforward to just specify a regex and have it work like Duckling. You don't need examples. It matches exactly what your regex specifies.

I personally always use word boundaries with this extractor, so maybe it should require them to be safer; otherwise an overly greedy match might make this extractor mess up everything else.

naoko on 27 Jun 2019

👍1
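
To illustrate the word-boundary point from the comment above, here is a small standalone sketch, independent of Rasa. The `extract_entities` helper and the `PATTERNS` list are hypothetical names introduced for this example; the pattern is the flight confirmation format mentioned in the comment, wrapped in `\b`. Unlike the component in the issue, this sketch uses `re.finditer`, so it reports every match rather than only the first one.

```python
import re
from typing import Any, Dict, List

# Hypothetical pattern list, shaped like name/pattern dicts.
PATTERNS = [{"name": "confirmation_number", "pattern": r"\b[A-Z]{3}-\d{7}\b"}]


def extract_entities(text: str, patterns: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Return an entity dict for every match of every pattern."""
    entities = []
    for p in patterns:
        for m in re.finditer(p["pattern"], text):
            entities.append(
                {
                    "start": m.start(),
                    "end": m.end(),
                    "value": m.group(),
                    "confidence": 1.0,
                    "entity": p["name"],
                }
            )
    return entities


# A value never seen in any training example is still extracted.
found = extract_entities("please check XYZ-9876543", PATTERNS)

# With \b, strings that merely contain the format are rejected ...
tricky = "ref XYZ-98765432 and AXYZ-9876543"
not_found = extract_entities(tricky, PATTERNS)

# ... whereas the same pattern without boundaries matches inside both.
unanchored = re.findall(r"[A-Z]{3}-\d{7}", tricky)

print(found)
print(not_found)
print(unanchored)
```

Whether boundaries should be enforced by the extractor or left to whoever writes the pattern is a design choice; the sketch only shows what changes when they are present.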

Could you not use Duckling to extract things like numbers, though? Thanks for the suggestion, we'll discuss this internally and get back to you.

akelad on 28 Jun 2019

Mostly yes, but not for unique patterns like [a-z]{3}a-\d{5}, etc. Okay, sounds good. Feel free to close this if the team decides not to proceed. I can continue using it as a custom component.

A more detailed analysis of the differences is described here: https://medium.com/@naoko.reeves/rasa-regex-entity-extraction-317f047b28b6

naoko on 29 Jun 2019

👍2

Exactly what was bothering me. What comes with Rasa by default doesn't cut it for our use case. We needed to detect UUID v4 and IMEI numbers and some other internal hardware-specific identity patterns, and I expected it to simply work, but even after providing a couple of examples, it only matches the exact examples and fails to generalize.
The custom class you provided was just the thing I was thinking of implementing to extend the out-of-the-box regex behavior. Thanks a lot.

nishant-roambee on 5 Nov 2019

👍1
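
For the identifier types mentioned in the comment above, patterns along the following lines could be used with an exact-match extractor. This is a sketch under assumptions, not the commenter's actual patterns: the UUID pattern checks the version nibble (`4`) and the variant nibble (`8`, `9`, `a` or `b`), and the IMEI pattern only checks for a standalone run of 15 digits without validating the Luhn check digit. The sample values are made up for illustration.

```python
import re

# Assumed patterns: they check the shape of the identifier only.
UUID4 = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)
IMEI = re.compile(r"\b\d{15}\b")  # 15 digits, no Luhn validation

# Made-up message containing one identifier of each kind.
text = "device 490154203237518 reported id 123e4567-e89b-42d3-a456-426614174000"

uuid_match = UUID4.search(text)
imei_match = IMEI.search(text)

# Same layout, but the version nibble is 1, so it is not a v4 UUID.
v1_match = UUID4.search("123e4567-e89b-12d3-a456-426614174000")

print(uuid_match.group() if uuid_match else None)
print(imei_match.group() if imei_match else None)
print(v1_match)
```

Because the patterns describe the format directly, no annotated examples are needed for new values to be recognized.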

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 3 Feb 2020

closed in #6214

erohmensing on 14 Oct 2020
