Matchers appear to return incorrect match_id hashes for at least some patterns that use quantifiers. As a result, looking up the match_id in the nlp.vocab StringStore does not return the pattern ID, and instead returns one of the terms matched by the pattern. The problem can be triggered by the *, ? or + quantifiers.
Example:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# The middle token is optional: 'OP': '?' matches zero or one punctuation token
pattern = [{'LOWER': 'high'}, {'IS_PUNCT': True, 'OP': '?'}, {'LOWER': 'adrenaline'}]
matcher.add("test_pattern", None, pattern)
doc1 = nlp("This is a high-adrenaline situation.")
doc2 = nlp("This is a high adrenaline situation.")

def get_matches(doc):
    matches = matcher(doc)
    for match_id, start, end in matches:
        # Resolve the hash back to the pattern name via the StringStore
        rule_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(f"{match_id}, Rule '{rule_id}', {start}:{end}, '{span.text}'")

# Works correctly
get_matches(doc1)
# > 5651646042889419180, Rule 'test_pattern', 4:7, 'high-adrenaline'

# Returns wrong pattern ID
get_matches(doc2)
# > 15052847843637698704, Rule 'adrenaline', 4:6, 'high adrenaline'
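The symptom reported above (a match_id that resolves to a matched term rather than a pattern name) could be checked for programmatically. The following is only a rough sketch, not part of spaCy: `find_unexpected_rule_ids` is a hypothetical helper, and a plain dict stands in for `nlp.vocab.strings` so the snippet runs without a model. The two hashes are the ones printed in the report.

```python
from typing import Iterable, List, Set, Tuple

Match = Tuple[int, int, int]  # (match_id, start, end), as returned by the matcher


def find_unexpected_rule_ids(
    matches: Iterable[Match],
    strings,  # anything supporting strings[match_id] -> str, e.g. nlp.vocab.strings
    known_rules: Set[str],
) -> List[Tuple[Match, str]]:
    """Return the matches whose match_id does not resolve to a registered rule name."""
    unexpected = []
    for match in matches:
        match_id = match[0]
        rule_id = strings[match_id]  # same lookup as in get_matches
        if rule_id not in known_rules:
            unexpected.append((match, rule_id))
    return unexpected


# Stand-in for nlp.vocab.strings (assumption: hashes copied from the output above).
strings = {
    5651646042889419180: "test_pattern",
    15052847843637698704: "adrenaline",
}
matches = [(5651646042889419180, 4, 7), (15052847843637698704, 4, 6)]

bad = find_unexpected_rule_ids(matches, strings, {"test_pattern"})
for (match_id, start, end), rule_id in bad:
    print(f"Unexpected rule '{rule_id}' for match {match_id} at {start}:{end}")
```

With a real pipeline, one would presumably pass `matcher(doc)` and `nlp.vocab.strings` in place of the stand-ins, along with the set of names given to `matcher.add`.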
Thanks for the report – that's very interesting 🤔 I just tested it with the latest v2.0.x and it worked as expected there, so this might be related to some bug in the new matcher engine in v2.1.x.
Thanks for the test case! Confirmed.
Still not 100% on the root causes of this, but the fix makes the code a bit more readable, and resolves the issue.