Rasa: Regex Entity Extractor

Created on 26 Jun 2019 · 7 Comments · Source: RasaHQ/rasa

Description of Problem:
I often see people on the forum asking for a regex entity extractor (the regex featurizer often doesn't help when you need reliable exact-match extraction), and I'm no exception.

example: https://forum.rasa.com/t/unable-to-use-regex-feature/11976/2

Overview of the Solution:
I created a custom component to do this, and it seems to be working well.
I wanted to get the Rasa core developers' opinion on whether this would be useful/common enough to add as a built-in component. If so, I would be more than happy to contribute.
Here is the component code.

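import os
import re
import warnings
from typing import Any, Dict, List, Optional, Text

# Rasa import paths as of Rasa 1.x (mid-2019); they may differ in other versions.
import rasa.utils.io
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.extractors import EntityExtractor
from rasa.nlu.model import Metadata
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.utils import write_json_to_file


class RegexEntityExtractor(EntityExtractor):
    # This extractor may be somewhat extreme: it takes the user's message
    # and returns whatever the regex matches.
    # Confidence is always 1.0, just like Duckling.

    provides = ["entities"]

    def __init__(
        self,
        component_config: Optional[Dict[Text, Any]] = None,
        regex_features: Optional[List[Dict[Text, Text]]] = None
    ) -> None:
        super(RegexEntityExtractor, self).__init__(component_config)

        # A list of {"name": ..., "pattern": ...} dicts, as in the training data.
        self.regex_feature = regex_features if regex_features else []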

    def train(
        self, training_data: TrainingData, config: RasaNLUModelConfig, **kwargs: Any
    ) -> None:

        self.regex_feature = training_data.regex_features

    @classmethod
    def load(
            cls,
            meta: Dict[Text, Any],
            model_dir: Optional[Text] = None,
            model_metadata: Optional[Metadata] = None,
            cached_component: Optional["RegexEntityExtractor"] = None,
            **kwargs: Any
    ) -> "RegexEntityExtractor":

        file_name = meta.get("file")

        if not file_name:
            regex_features = None
            return cls(meta, regex_features)

        # w/o string cast, mypy will tell me
        # expected "Union[str, _PathLike[str]]"
        regex_pattern_file = os.path.join(str(model_dir), file_name)
        if os.path.isfile(regex_pattern_file):
            regex_features = rasa.utils.io.read_json_file(regex_pattern_file)
        else:
            regex_features = None
            warnings.warn(
                "Failed to load regex pattern file from '{}'".format(regex_pattern_file)
            )
        return cls(meta, regex_features)

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        """Persist this component to disk for future loading."""
        if self.regex_feature:
            file_name = file_name + ".json"
            regex_feature_file = os.path.join(model_dir, file_name)
            write_json_to_file(
                regex_feature_file,
                self.regex_feature, separators=(",", ": "))
            return {"file": file_name}
        else:
            return {"file": None}

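    def match_regex(self, message: Text):
        """Return one entity for each configured pattern found in the message."""
        extracted = []
        for d in self.regex_feature:
            # re.search only reports the first occurrence of each pattern.
            match = re.search(pattern=d['pattern'], string=message)
            if match:
                entity = {
                    # start()/end() give the span of the matched text;
                    # pos/endpos are only the bounds of the search window.
                    "start": match.start(),
                    "end": match.end(),
                    "value": match.group(),
                    "confidence": 1.0,
                    "entity": d['name'],
                }
                extracted.append(entity)

        extracted = self.add_extractor_name(extracted)
        return extracted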

    def process(self, message: Message, **kwargs: Any) -> None:
        """Process an incoming message."""
        extracted = self.match_regex(message.text)
        message.set(
            "entities", message.get("entities", []) + extracted, add_to_output=True
        )
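A note on the character offsets an extractor like this reports: on a Python match object, `pos` and `endpos` are the bounds of the region that was searched, while `start()` and `end()` give the span of the text that actually matched. The snippet below is a minimal standalone sketch (the sample sentence is made up for illustration) showing the difference.

```python
import re

# Hypothetical sample message, used only for illustration.
text = "my account number is 1234567"
match = re.search(r"\d{7}", text)

# Bounds of the searched region: the whole string by default.
search_window = (match.pos, match.endpos)

# Span of the matched text, which is what an entity's start/end should hold.
matched_span = (match.start(), match.end())

print("search window:", search_window)
print("matched span:", matched_span)
print("matched text:", text[match.start():match.end()])
```

Slicing the message with `start()` and `end()` returns the entity value, which is a quick sanity check for any extractor that reports offsets.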

All 7 comments

@naoko can I ask why you're not using the existing regex features?

This is different from the featurizer. The regex features it produces are fed into the NER CRF, right?
So if you try to teach it "exactly 7 digits" using the regex featurizer and NER CRF, in my experience it won't work well without a lot of examples that include some words before/after the entity (e.g. my account number is 1234567).

For instance, I have a flight confirmation number that always follows the pattern [A-Z]{3}-\d{7}.
You can add many examples and the NER CRF will learn something, but if the examples contain ABC-12345678 and you enter XYZ-9876543, it fails to recognize it. Maybe with many more examples? Sure, but for strictly formatted values like flight confirmation numbers, account numbers, etc., I think it would be more straightforward to just specify a regex and have it work like Duckling. You don't need any examples. It matches exactly what your regex specifies.
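To make the "works like Duckling" idea concrete, here is a minimal standalone sketch of pattern-only extraction, independent of Rasa. The `REGEX_FEATURES` table and the `extract_entities` helper are hypothetical names for illustration; the list of name/pattern pairs mirrors the shape the component above iterates over.

```python
import re
from typing import Any, Dict, List

# Hypothetical pattern table: one name/pattern pair per entity type.
REGEX_FEATURES = [
    {"name": "confirmation_number", "pattern": r"\b[A-Z]{3}-\d{7}\b"},
]


def extract_entities(
    text: str, regex_features: List[Dict[str, str]]
) -> List[Dict[str, Any]]:
    """Return one entity dict for every match of every configured pattern."""
    entities = []
    for feature in regex_features:
        for match in re.finditer(feature["pattern"], text):
            entities.append(
                {
                    "start": match.start(),
                    "end": match.end(),
                    "value": match.group(),
                    "confidence": 1.0,
                    "entity": feature["name"],
                }
            )
    return entities


message = "my confirmation number is XYZ-9876543"
found = extract_entities(message, REGEX_FEATURES)
print(found)
```

No training examples are involved: any string that fits the pattern is extracted, and anything that does not fit is ignored.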

I personally always use word boundaries with this extractor, so maybe it should require them to be safer; otherwise a super greedy match might make this extractor mess up everything else.
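A small sketch of why the word boundary matters, using a made-up sentence: without `\b`, a 7-digit pattern also matches the first seven digits of a longer number.

```python
import re

# Hypothetical message containing a 9-digit number and a 7-digit number.
text = "order 987654321 shipped, account 1234567 charged"

# Without word boundaries, the pattern bites into the longer number.
loose = re.findall(r"\d{7}", text)

# With word boundaries, only the standalone 7-digit number matches.
strict = re.findall(r"\b\d{7}\b", text)

print("without boundaries:", loose)
print("with boundaries:", strict)
```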

Could you not use Duckling to extract things like numbers though? Thanks for the suggestion, we'll discuss this internally and get back to you.

Mostly yes, but not for unique patterns like [a-z]{3}a-\d{5} etc. Okay, sounds good. Feel free to close this if the team decides not to proceed. I can continue using it as a custom component.

A more detailed analysis of the differences is described here: https://medium.com/@naoko.reeves/rasa-regex-entity-extraction-317f047b28b6

Exactly what was bothering me. What comes with Rasa by default doesn't cut it for our use case. We needed to detect UUID v4 values, IMEI numbers, and some other internal hardware-specific identifier patterns, and I expected it to simply work, but even after providing a couple of examples, it only detects exact matches of those examples and fails to generalize.
The custom class you provided is just what I was thinking of implementing to extend the out-of-the-box regex behavior. Thanks a lot.
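For identifiers like these, the patterns could look roughly like the sketch below. These are assumptions for illustration, not the patterns used in that project: the UUID v4 pattern checks the version and variant nibbles, and the IMEI check pairs a 15-digit match with the Luhn check digit that IMEIs carry. The names `UUID_V4`, `IMEI`, `luhn_valid`, and `find_imeis` are hypothetical.

```python
import re
from typing import List

# UUID v4: version nibble is 4, variant nibble is 8, 9, a, or b.
UUID_V4 = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)

# IMEI candidates: exactly 15 digits standing on their own.
IMEI = re.compile(r"\b\d{15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_imeis(text: str) -> List[str]:
    """Return 15-digit candidates that also pass the Luhn check."""
    return [m.group() for m in IMEI.finditer(text) if luhn_valid(m.group())]


sample = "device f47ac10b-58cc-4372-a567-0e02b2c3d479 has IMEI 490154203237518"
print(UUID_V4.findall(sample))
print(find_imeis(sample))
```

The regex alone cannot express a checksum, so the Luhn step runs as a post-filter on the matches; dropping it leaves a plain 15-digit matcher.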

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

closed in #6214
