It would be very helpful to have an option to use the PhraseMatcher case-insensitively.
For example, following on the PhraseMatcher example from the docs, matching this doc returns no matches because of the lower-case "c" in "clinton":
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("Phrase", None, nlp("Hilary Clinton"))
results = matcher(nlp(u"Hilary clinton"))  # returns no matches
Yes, that's definitely a good suggestion! The best way to implement this would be to add an option for setting a different attribute that the PhraseMatcher should match on – for example, LOWER instead of ORTH. The usage could look like this:
matcher = PhraseMatcher(nlp.vocab, attr='LOWER') # string representing attribute from spacy.attrs
This would also allow a lot of other cool use cases - for example, you could pass in a Doc and match phrases with the same part-of-speech tags or dependency labels. Using it with the SHAPE could be pretty powerful, too. Like, if you're matching phone numbers or something like that, you won't have to come up with complex token patterns. Instead, you simply feed the PhraseMatcher a bunch of examples.
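For illustration, here's a rough sketch of how that could look if the proposed attr option existed (the attr values and behaviour below are assumptions about the proposal, not an existing API):

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()

# Case-insensitive matching: patterns and docs compared on their LOWER attribute
lower_matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
lower_matcher.add("Phrase", None, nlp("Hilary Clinton"))
print(lower_matcher(nlp(u"Hilary clinton")))  # would now match

# Shape-based matching, e.g. for phone numbers: "(123) 456 7890" and
# "(555) 123 4567" share the same token shapes, so one example covers both
shape_matcher = PhraseMatcher(nlp.vocab, attr='SHAPE')
shape_matcher.add("PHONE", None, nlp("(123) 456 7890"))
print(shape_matcher(nlp(u"(555) 123 4567")))  # same shapes, so it would match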
I sat down to try to implement this and, after some thought, unfortunately there's no way to make it work.
The PhraseMatcher relies on the fact that all tokens of the same type refer back to the same lexeme data. The types are indexed by the ORTH key, so that's the only key we can phrase-match over.
A quick reminder of how this works (I'd forgotten): we set flags on the Lexeme objects indicating that the word can start, end or continue a phrase. Then we run the matcher over these flag sequences. We access the flags from the token by fetching them from the token's lexeme.
For lower-case matching, we would have to look up the lexeme via the vocab, and check the flag there. I think this would be no more efficient than the Matcher.
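As a toy sketch of that flag-sequence idea (plain Python, not spaCy's actual implementation; the phrase list and helper names are made up for illustration):

# Mark which surface forms can start, continue or end any known phrase,
# then scan token sequences over those "flags" to find candidate spans.
phrases = [("Hilary", "Clinton"), ("New", "York", "City")]

starts, continues, ends = set(), set(), set()
for phrase in phrases:
    starts.add(phrase[0])
    ends.add(phrase[-1])
    continues.update(phrase[1:-1])

def candidate_spans(tokens):
    # Candidate spans would still be verified against the stored phrases.
    for i, token in enumerate(tokens):
        if token not in starts:
            continue
        j = i + 1
        while j < len(tokens) and tokens[j] in continues:
            j += 1
        if j < len(tokens) and tokens[j] in ends:
            yield (i, j + 1)

print(list(candidate_spans("I met Hilary Clinton in New York City".split())))
# [(2, 4), (5, 8)]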
Eventually, I added each pattern to the PhraseMatcher 4 times:
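(The original snippet isn't shown, but a minimal sketch of that workaround might look like this, assuming the four variants are the original, lower-case, upper-case and title-case forms:)

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)

phrase = "Hilary Clinton"
# Assumed variants: original, lower, upper and title case
for variant in {phrase, phrase.lower(), phrase.upper(), phrase.title()}:
    matcher.add("Phrase", None, nlp(variant))

print(matcher(nlp(u"hilary clinton")))   # matches the lower-case variant
print(matcher(nlp(u"Hilary clinton")))   # mixed-case forms are still missed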
@eranhirs Yes, this is probably the best solution for now – even if you're adding 4 times as many patterns, it should still be significantly more efficient than adding a single token-based pattern to the Matcher.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.