It would be very helpful to have an option to use the PhraseMatcher case-insensitively.
For example, following on the PhraseMatcher example from the docs, matching this doc returns no matches because of the lower-case "c" in "clinton":
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("Phrase", None, nlp("Hilary Clinton"))
results = matcher(nlp(u"Hilary clinton"))  # returns no matches
Yes, that's definitely a good suggestion! The best way to implement this would be to add an option for setting a different attribute that the PhraseMatcher should match on – for example, LOWER instead of ORTH. The usage could look like this:
matcher = PhraseMatcher(nlp.vocab, attr='LOWER') # string representing attribute from spacy.attrs
This would also allow a lot of other cool use cases - for example, you could pass in a Doc and match phrases with the same part-of-speech tags or dependency labels. Using it with the SHAPE could be pretty powerful, too. Like, if you're matching phone numbers or something like that, you won't have to come up with complex token patterns. Instead, you simply feed the PhraseMatcher a bunch of examples.
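For illustration, here's a rough sketch of how that could look if the proposed attr option existed (the attr values and behaviour below are assumptions about the proposal, not an existing API):

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()

# Case-insensitive matching: patterns and docs compared on their LOWER attribute
lower_matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
lower_matcher.add("Phrase", None, nlp("Hilary Clinton"))
print(lower_matcher(nlp(u"Hilary clinton")))  # would now match

# Shape-based matching, e.g. for phone numbers: "(123) 456 7890" and
# "(555) 123 4567" share the same token shapes, so one example covers both
shape_matcher = PhraseMatcher(nlp.vocab, attr='SHAPE')
shape_matcher.add("PHONE", None, nlp("(123) 456 7890"))
print(shape_matcher(nlp(u"(555) 123 4567")))  # same shapes, so it would match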
I sat down to try to implement this and, after some thought, unfortunately there's no way to make it work.
The PhraseMatcher relies on the fact that all tokens of the same type refer back to the same lexeme data. The types are indexed by the ORTH key, so that's the only key we can phrase-match over.
A quick reminder of how this works (I'd forgotten): we set flags on the Lexeme objects indicating that the word can start, end or continue a phrase. Then we run the matcher over these flag sequences. We access the flags from the token by fetching them from the token's lexeme.
For lower-case matching, we would have to look up the lexeme via the vocab, and check the flag there. I think this would be no more efficient than the Matcher.
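As a toy sketch of that flag-sequence idea (plain Python, not spaCy's actual implementation; the phrase list and helper names are made up for illustration):

# Mark which surface forms can start, continue or end any known phrase,
# then scan token sequences over those "flags" to find candidate spans.
phrases = [("Hilary", "Clinton"), ("New", "York", "City")]

starts, continues, ends = set(), set(), set()
for phrase in phrases:
    starts.add(phrase[0])
    ends.add(phrase[-1])
    continues.update(phrase[1:-1])

def candidate_spans(tokens):
    # Candidate spans would still be verified against the stored phrases.
    for i, token in enumerate(tokens):
        if token not in starts:
            continue
        j = i + 1
        while j < len(tokens) and tokens[j] in continues:
            j += 1
        if j < len(tokens) and tokens[j] in ends:
            yield (i, j + 1)

print(list(candidate_spans("I met Hilary Clinton in New York City".split())))
# [(2, 4), (5, 8)]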
Eventually, I added each pattern to the PhraseMatcher 4 times:
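(The original snippet isn't shown, but a minimal sketch of that workaround might look like this, assuming the four variants are the original, lower-case, upper-case and title-case forms:)

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)

phrase = "Hilary Clinton"
# Assumed variants: original, lower, upper and title case
for variant in {phrase, phrase.lower(), phrase.upper(), phrase.title()}:
    matcher.add("Phrase", None, nlp(variant))

print(matcher(nlp(u"hilary clinton")))   # matches the lower-case variant
print(matcher(nlp(u"Hilary clinton")))   # mixed-case forms are still missed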
@eranhirs Yes, this is probably the best solution for now – even if you're adding 4 times as many patterns, it should still be significantly more efficient than adding a single token-based pattern to the Matcher.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.