I have the following matcher:
self.__matcher.add('Subject', self.__add_subject, [
    {'LOWER': 'mit'},
    {'LOWER': 'dem'},
    {'LOWER': 'betreff'},
    {'IS_ALPHA': True, 'OP': '+'}
])
Given the following sentence:
"Ich möchte mit dem Betreff Kindergeburtstag und Noah weitermachen und frage entsprechend nach"
It returns only:
"mit dem Betreff Kindergeburtstag Noah"
If I use the quantifiers "?" or "*" it is the same. As far as I understood the description, it should return:
"mit dem Betreff Kindergeburtstag und Noah weitermachen und frage entsprechend nach"
or am I wrong? If I am right, what is going wrong here? What I want to match is a phrase made of 1 to n specified lowercase tokens, followed by any number of further tokens.
+1
Ran into a similar issue with the wildcard operators. The matching seems to be incorrect, in my case returning all combinations instead of just the greedy match.
To reproduce
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
doc = nlp('This is a test.')

matcher = Matcher(nlp.vocab)
# One or more alphabetic tokens (spaCy v2 signature: key, on_match callback, pattern)
matcher.add('FOOBAR', None, [{'IS_ALPHA': True, 'OP': '+'}])

matches = matcher(doc)
for _, start, end in matches:
    # Print the text of each matched span
    print(doc[start:end])
Output:
This
This is
is
This is a
is a
a
This is a test
is a test
a test
test
The same code in spaCy version 2.0.13 gives a different, incorrect result:
This is
is a
a test
test
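For what it's worth, the ten spans in the first output above are exactly the contiguous sub-sequences of the four alphabetic tokens, listed by end position and then by start position. The sketch below reproduces that list in plain Python. It is only an illustration of which spans satisfy the `+` pattern, not spaCy's actual matching code; `all_subspans` is a made-up helper name and the token list is typed in by hand instead of coming from a `Doc`.

```python
def all_subspans(tokens):
    """Return (start, end) for every contiguous run made only of alphabetic tokens.

    Hypothetical helper for illustration; this is not how spaCy's Matcher works
    internally, it only enumerates the same set of spans.
    """
    spans = []
    for end in range(1, len(tokens) + 1):
        for start in range(end):
            # A span qualifies if every token in it is alphabetic
            if all(tok.isalpha() for tok in tokens[start:end]):
                spans.append((start, end))
    return spans


# Tokens of 'This is a test.' written out by hand (assumed tokenization)
tokens = ["This", "is", "a", "test", "."]
spans = all_subspans(tokens)
for start, end in spans:
    print(" ".join(tokens[start:end]))
```

Seen this way, the matcher is reporting every span that satisfies the pattern, and picking the single greedy one is left to the caller.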
@JulianGerhard21
I confirmed that in version 2.1.0a1 the pattern in your example gives the set of all strings starting from "mit dem Betreff Kindergeburtstag", including the expected "mit dem Betreff Kindergeburtstag und Noah weitermachen und frage entsprechend nach".
This seems like the expected behavior as of now. See issue https://github.com/explosion/spaCy/issues/2569
So you will need to do some post-processing to get the result you want.
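A minimal sketch of what that post-processing could look like, assuming the matcher hands back `(match_id, start, end)` tuples as in the snippets above. The function name `keep_longest` and the sample match list are made up for illustration; the idea is simply to prefer the longest match and drop anything that overlaps a match already kept.

```python
def keep_longest(matches):
    """Keep the longest matches and drop any match overlapping a kept one.

    `matches` is a list of (match_id, start, end) tuples. Hypothetical helper,
    not part of spaCy.
    """
    # Longest first; ties are broken by the earliest start
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    taken = set()  # token indices already covered by a kept match
    for match_id, start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((match_id, start, end))
            taken.update(range(start, end))
    # Return the survivors in document order
    return sorted(kept, key=lambda m: m[1])


# Made-up matcher output: every sub-span of four consecutive tokens
matches = [(1, s, e) for e in range(1, 5) for s in range(e)]
print(keep_longest(matches))
```

Applied to the sentence from the original question, this should leave only the span running from "mit" to the end of the sentence, provided the matcher returns that span among its matches. Newer spaCy releases also ship `spacy.util.filter_spans`, which applies a similar longest-first rule to `Span` objects; whether it is available depends on the installed version.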
Yes, @arunbg's answer is correct. v2.1.0 will include a completely rewritten implementation of the Matcher engine that resolves problems like this. Since it's a breaking change that causes different behaviour/matches, it'll only be available in the next minor version. But you can already test it in spacy-nightly 🙂
Hey @arunbg, thanks for the investigation. Since I can't use the beta / nightly build, I had to solve it via post-processing. I figured out that it is more of a dependency-parsing problem than a job for the matcher, since the matcher only flags that a certain expected kind of phrase occurs in the sentence.
@ines Thank you too for your comment. I've read the announcement for the 2.1.0 nightly and I understood the potential "expect different behaviour" caveat for the Matcher engine, but is there any documentation about what exactly has changed? We are using the Matcher engine in production, but only in a (imho) fairly flat way: only LIKE_NUM, IS_STOP, and LOWER are used. So it would indeed be interesting to test and read about the changes.
I wish you a nice Sunday!