One more feature for #1971 would be support for a fuzzy matcher. Since spaCy has built-in support for vectorization, we could exploit that for fuzzy matching (cosine similarity), or use the Levenshtein distance ratio and set a threshold in the matcher while matching. I don't think this is currently possible with Matcher or PhraseMatcher.
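To make the threshold idea concrete, here is a minimal sketch of Levenshtein-ratio matching in plain Python (no spaCy API involved; the function names and the 0.8 threshold are illustrative assumptions, not anything spaCy provides):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # ensure a is the longer string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution
            ))
        prev = curr
    return prev[-1]

def ratio(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]: 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def fuzzy_match(token_text: str, target: str, threshold: float = 0.8) -> bool:
    """Hypothetical predicate a fuzzy matcher could apply per token."""
    return ratio(token_text.lower(), target.lower()) >= threshold
```

A matcher with this predicate would accept `"spacey"` against the target `"spacy"` (one edit on six characters gives a ratio of about 0.83), while rejecting unrelated tokens.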
I haven't done a lot of work on this, but a colleague of mine is working on integrating fuzzy matching into NMT, and he told me that many similarity metrics are really slow at large scale. Since some users (myself included) use spaCy to process enormous amounts of text, speed is important. One solution my colleague found was set similarity search, which, as the name implies, is fast because it compares unique values (as far as I understand). It runs on sets rather than lists, with all the benefits that sets bring, but it may give slightly less accurate estimates when matches contain many duplicate tokens.
This comment is simply to say that I am in favour of this idea, and that a multitude of similarity metrics could be built in to provide some flexibility in what is wanted.
I too vote for the idea of having fuzzy matching capability integrated into spaCy.
+1
I agree with @BramVanroy that this is probably a less practical feature than it sounds. You really want to precompute the search sets, rather than do them on-the-fly in the matcher. Once you've precomputed the similarity values, you can use extension attributes and a >= comparison in the Matcher to perform the search.
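To illustrate the precompute-then-filter approach in plain Python (spaCy itself would store the score on a custom extension attribute and match it with a pattern along the lines of `{"_": {"fuzzy_score": {">=": 0.8}}}`; the metric and names below are hypothetical placeholders):

```python
TARGET = "amsterdam"  # hypothetical search target

def similarity(a: str, b: str) -> float:
    # Placeholder metric: shared-character Jaccard overlap. In practice you
    # would precompute cosine similarity over vectors or an edit-distance ratio.
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def precompute(tokens, target=TARGET):
    # One pass over the corpus: attach a score to every token up front,
    # the way an extension attribute would hold a precomputed value.
    return [(tok, similarity(tok.lower(), target)) for tok in tokens]

def matches(scored, threshold=0.8):
    # Matching is now just a >= comparison; no similarity math at query time.
    return [tok for tok, score in scored if score >= threshold]
```

The expensive similarity computation happens once during `precompute`; the matcher's job reduces to a cheap threshold comparison, which is the design point being made above.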
I think this is a case where the implementation details strongly matter, and an API that obscures them would actually be a disservice. So I'm currently a 👎 on this.
@honnibal The set comparison is exactly how I implemented the fuzzy matcher for my case. The criteria and type of fuzzy match vary from case to case; I understood that only after implementing it myself. We can close this issue, I guess.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.