spaCy: Wrong ID for StringStore returned by Matcher using OP quantifiers

Created on 15 Aug 2018 · 4 comments · Source: explosion/spaCy

Matchers appear to return incorrect match_id hashes for at least some patterns that use quantifiers. As a result, looking up the match_id in the nlp.vocab StringStore returns one of the terms matched by the pattern instead of the pattern ID. The bug can be triggered by the *, ? or + quantifiers.

Example:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

pattern = [{'LOWER': 'high'}, {'IS_PUNCT': True, 'OP': '?'}, {'LOWER': 'adrenaline'}]
matcher.add("test_pattern", None, pattern)

doc1 = nlp("This is a high-adrenaline situation.")
doc2 = nlp("This is a high adrenaline situation.")

def get_matches(doc):
    matches = matcher(doc)
    for match_id, start, end in matches:
        rule_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(f"{match_id}, Rule '{rule_id}', {start}:{end}, '{span.text}'")

# Works correctly
get_matches(doc1)
# > 5651646042889419180, Rule 'test_pattern', 4:7, 'high-adrenaline'

# Returns wrong pattern ID
get_matches(doc2)
# > 15052847843637698704, Rule 'adrenaline', 4:6, 'high adrenaline'
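
To catch this kind of mismatch automatically rather than by reading printed output, one option is a small check that flags any match whose ID does not resolve to a registered rule name. The sketch below is only an illustration: `find_bad_match_ids` is a hypothetical helper, not part of spaCy, and the plain dict stands in for `nlp.vocab.strings`, using the hash values copied from the output above. With spaCy you would presumably pass `matcher(doc)` and `nlp.vocab.strings` instead.

```python
def find_bad_match_ids(matches, strings, rule_names):
    """Return (match_id, resolved_name, start, end) for every match whose
    ID does not resolve to one of the registered rule names.

    Hypothetical helper, not a spaCy API. `strings` is any mapping from
    hash to string; `matches` is an iterable of (match_id, start, end).
    """
    bad = []
    for match_id, start, end in matches:
        resolved = strings[match_id]  # hash -> string lookup
        if resolved not in rule_names:
            bad.append((match_id, resolved, start, end))
    return bad


# Stand-in data copied from the report's output (an assumption for this
# sketch; a real check would use matcher(doc) and nlp.vocab.strings).
strings = {
    5651646042889419180: "test_pattern",
    15052847843637698704: "adrenaline",
}
matches_doc1 = [(5651646042889419180, 4, 7)]
matches_doc2 = [(15052847843637698704, 4, 6)]

print(find_bad_match_ids(matches_doc1, strings, {"test_pattern"}))
print(find_bad_match_ids(matches_doc2, strings, {"test_pattern"}))
```

The first call reports nothing, while the second flags the match whose ID resolves to 'adrenaline'. The lookup succeeds instead of raising an error because the returned hash is itself a valid entry in the store, which is why the bug is easy to miss.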

Environment

Operating System: Ubuntu 18.04
Python Version Used: 3.6.6
spaCy Version Used: 2.1.0a0 (spaCy-nightly)

bug feat / matcher 🌙 nightly

norrishd

Most helpful comment

Still not 100% on the root causes of this, but the fix makes the code a bit more readable, and resolves the issue.

honnibal on 15 Aug 2018

All 4 comments

Thanks for the report – that's very interesting 🤔 I just tested it with the latest v2.0.x and it worked as expected there, so this might be related to some bug in the new matcher engine in v2.1.x.