spaCy: PhraseMatcher returns only 1 match when more than one rule matches

Created on 16 Jul 2019 · 3 Comments · Source: explosion/spaCy

Hello,

We are working on a project (PatentCity) to collect inventors' locations directly from early patent documents. spaCy is undoubtedly our favorite library. So, please, let me start with a big "Thank You" to the spaCy developer community for your awesome work 💌.

Issue

We are trying to complement statistical GPE recognition with look-ups in lists of administrative entities. To do so, we use PhraseMatcher (from spacy.matcher). We also want to track which kind of entity we match (e.g. CITY, COUNTY, STATE, etc.). Hence, we have a separate rule for each kind.

Now, some entities share n-grams. E.g., 'New York' is a STATE, a COUNTY and a CITY. In these cases, the PhraseMatcher returns only 1 match: 'New York' is matched by a single rule _only_, let's say the CITY rule. It seems that the last rule added takes priority.
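To make the expected behaviour concrete, here is a minimal pure-Python sketch of a phrase look-up that reports one match per label when the same phrase is registered under several rules. It does not use spaCy and is not how PhraseMatcher is implemented; the function name `find_labeled_phrases` and the `(label, start, end)` output shape are assumptions made for illustration only.

```python
from collections import defaultdict


def find_labeled_phrases(tokens, rules):
    """Return (label, start, end) for every rule phrase found in `tokens`.

    `rules` maps a label to a list of phrases, each phrase being a list
    of tokens. Hypothetical helper, for illustration only.
    """
    # Index phrases so that a single phrase can carry several labels.
    labels_by_phrase = defaultdict(set)
    for label, phrases in rules.items():
        for phrase in phrases:
            labels_by_phrase[tuple(phrase)].add(label)

    matches = []
    for phrase, labels in labels_by_phrase.items():
        n = len(phrase)
        # Slide a window of the phrase length over the token list.
        for start in range(len(tokens) - n + 1):
            if tuple(tokens[start:start + n]) == phrase:
                # Emit one match per label sharing this phrase.
                for label in sorted(labels):
                    matches.append((label, start, start + n))
    # Order by position, then by label, for a stable result.
    return sorted(matches, key=lambda m: (m[1], m[2], m[0]))


rules = {
    "COUNTY": [["New", "York"]],
    "CITY": [["New", "York"]],
}
tokens = "I live in New York".split()
matches = find_labeled_phrases(tokens, rules)
print(matches)  # [('CITY', 3, 5), ('COUNTY', 3, 5)]
```

Both labels are reported for the same span (3, 5), which is what we would expect from the PhraseMatcher as well.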

We find this behavior potentially misleading. Is it in the spirit of the PhraseMatcher, or should it be changed?

Thanks in advance for your help!

How to reproduce the behaviour

from spacy.matcher import PhraseMatcher

import en_core_web_sm
nlp = en_core_web_sm.load()

matcher = PhraseMatcher(nlp.vocab)

# Register the same phrase under two different match IDs
matcher.add('COUNTY', None, nlp('New York'))
matcher.add('CITY', None, nlp('New York'))

matcher._docs
# Check that the PhraseMatcher was properly populated
# {14532842148348552135: (New York,), 13852145969607952771: (New York,)}  # ok

matcher(nlp('I live in New York'))
# [(13852145969607952771, 3, 5)]  # Only 1 match
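Until the matcher itself returns every match ID, one possible workaround is to keep a side table from phrase text to all of its labels and to expand each returned span after matching. The sketch below is plain Python and only an assumption about how such post-processing could look: `expand_match_labels` and `labels_by_phrase` are hypothetical names, and the token list stands in for a spaCy `Doc` (with a real `Doc`, the span text would come from `doc[start:end].text`).

```python
def expand_match_labels(matches, tokens, labels_by_phrase):
    """Turn each (match_id, start, end) into one (label, start, end) per label.

    `labels_by_phrase` maps a phrase string to the set of labels it was
    registered under. Hypothetical helper, for illustration only.
    """
    expanded = []
    for _match_id, start, end in matches:
        # Rebuild the matched text from the token span.
        text = " ".join(tokens[start:end])
        # Ignore the single match ID and look up every label for this text.
        for label in sorted(labels_by_phrase.get(text, ())):
            expanded.append((label, start, end))
    return expanded


# Side table maintained alongside the matcher rules (assumption).
labels_by_phrase = {"New York": {"COUNTY", "CITY"}}

tokens = ["I", "live", "in", "New", "York"]
# The single match reported in the reproduction above.
raw_matches = [(13852145969607952771, 3, 5)]

expanded = expand_match_labels(raw_matches, tokens, labels_by_phrase)
print(expanded)  # [('CITY', 3, 5), ('COUNTY', 3, 5)]
```

This relies on exact text equality between the matched span and the registered phrase, so it only works as long as the side table is kept in sync with the rules added to the matcher.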

Environment

  • Operating System: Mac OS Mojave
  • Python Version Used: Python 3.7.0
  • spaCy Version Used: spacy 2.1.3

Labels: bug, feat / matcher

All 3 comments

Thanks for the kind words and the detailed report 👍 I think you're correct here, and I would have also expected the matcher to return two identical matches, one for each match ID.

I've gotten to the bottom of why this is happening: it's a bug in the conversion between Matcher results and PhraseMatcher IDs. Trying to fix it!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
