One more feature for #1971 would be support for a fuzzy matcher. Since spaCy has built-in support for vectorization, we could exploit that for fuzzy matching (cosine similarity), or use the Levenshtein distance ratio and set a threshold in the matcher while matching. I don't think this is currently possible with Matcher or PhraseMatcher.
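To make the threshold idea concrete, here is a minimal sketch of Levenshtein-ratio matching in plain Python (no spaCy API involved; the function names and the 0.8 threshold are illustrative assumptions, not anything spaCy provides):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # ensure a is the longer string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution
            ))
        prev = curr
    return prev[-1]

def ratio(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]: 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def fuzzy_match(token_text: str, target: str, threshold: float = 0.8) -> bool:
    """Hypothetical predicate a fuzzy matcher could apply per token."""
    return ratio(token_text.lower(), target.lower()) >= threshold
```

A matcher with this predicate would accept `"spacey"` against the target `"spacy"` (one edit on six characters gives a ratio of about 0.83), while rejecting unrelated tokens.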
I haven't done a lot of work on this, but a colleague of mine is working on integrating fuzzy matching into NMT, and he told me that many similarity metrics are really slow at large scale. Since some users (myself included) use spaCy to process enormous amounts of text, speed is important. One solution my colleague found was set similarity search, which, as the name implies, is fast because it compares unique values (as far as I understand). It runs on sets rather than lists, with all the benefits that sets bring, but it may give slightly less accurate estimates when matches contain many duplicate tokens.
This comment is simply to say that I am in favour of this idea, and that a multitude of similarity metrics could be built in to provide some flexibility in what is wanted.
I too vote for the idea of having fuzzy matching capability integrated into spaCy.
+1
I agree with @BramVanroy that this is probably a less practical feature than it sounds. You really want to precompute the search sets, rather than do them on-the-fly in the matcher. Once you've precomputed the similarity values, you can use extension attributes and a >= comparison in the Matcher to perform the search.
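To illustrate the precompute-then-filter approach in plain Python (spaCy itself would store the score on a custom extension attribute and match it with a pattern along the lines of `{"_": {"fuzzy_score": {">=": 0.8}}}`; the metric and names below are hypothetical placeholders):

```python
TARGET = "amsterdam"  # hypothetical search target

def similarity(a: str, b: str) -> float:
    # Placeholder metric: shared-character Jaccard overlap. In practice you
    # would precompute cosine similarity over vectors or an edit-distance ratio.
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def precompute(tokens, target=TARGET):
    # One pass over the corpus: attach a score to every token up front,
    # the way an extension attribute would hold a precomputed value.
    return [(tok, similarity(tok.lower(), target)) for tok in tokens]

def matches(scored, threshold=0.8):
    # Matching is now just a >= comparison; no similarity math at query time.
    return [tok for tok, score in scored if score >= threshold]
```

The expensive similarity computation happens once during `precompute`; the matcher's job reduces to a cheap threshold comparison, which is the design point being made above.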
I think this is a case where the implementation details strongly matter, and an API that obscures them would actually be a disservice. So I'm currently a 👎 on this.
@honnibal The set comparison is exactly how I implemented the fuzzy matcher for my case. The criteria and type of fuzzy match vary from case to case; I understood that only after implementing it myself. We can close this issue, I guess.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.