spaCy: tag every token from the matched sentence

Created on 22 Jul 2019 · 3 comments · Source: explosion/spaCy

Feature description

As I understand from the documentation, we can match text with rules by adding patterns to the Matcher.
For example:

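import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i want to buy an iPhone X"
)
# Match the token "iPhone" immediately followed by the token "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# spaCy v2 call style; in spaCy v3 this is matcher.add("Phone", [pattern])
matcher.add("Phone", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)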

The program prints:

Total matches found: 1
Match found: iPhone X

Now we want to tag each token in the matched span. The desired result is:

Phone Product: iPhone
Version: X

"Phone Product" and "Version" are two tag names provided by the user.

Is there a way to achieve this result?

feat / matcher usage

nadachaabani1

All 3 comments

You could either iterate over the tokens and store the results in some structure like this (very naive approach):

result = {}

for match_id, start, end in matches:
    match = doc[start:end]
    print("Match found:", match.text)
    if len(match) == 2:
        result[match.text] = {}
        result[match.text]["Phone Product"] = match[0].text
        result[match.text]["Version"] = match[1].text

print(result)

# {'iPhone X': {'Phone Product': 'iPhone', 'Version': 'X'}}

Or use spaCy's extension attributes. These let you add custom attributes to tokens and read them through the underscore namespace, e.g. token._.phone_product and token._.version.
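A minimal sketch of the extension-attribute route, under a few assumptions: the two attributes are plain boolean flags named phone_product and version, the first token of a match is taken to be the product and the second the version, and a blank English pipeline stands in for en_core_web_sm because the pattern only looks at token text. The matcher.add call uses the spaCy v3 style.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

# Custom attributes, reachable as token._.phone_product and token._.version.
# force=True lets the snippet be re-run in the same session.
Token.set_extension("phone_product", default=False, force=True)
Token.set_extension("version", default=False, force=True)

# Assumption: a blank pipeline is enough, since the pattern only checks TEXT.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# spaCy v3 call style; on v2 this would be matcher.add("Phone", None, pattern)
matcher.add("Phone", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("i want to buy an iPhone X")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    # Positional labelling: first token is the product, second the version.
    span[0]._.phone_product = True
    span[1]._.version = True

# Collect the flagged tokens under the user-facing tag names.
tagged = {}
for token in doc:
    if token._.phone_product:
        tagged["Phone Product"] = token.text
    elif token._.version:
        tagged["Version"] = token.text

for label, text in tagged.items():
    print(f"{label}: {text}")
```

Compared with the dictionary approach, the flags stay attached to the tokens themselves, so any later code that receives the doc can check token._.phone_product without re-running the matcher.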

BreakBB on 23 Jul 2019

Thanks for your advice. It was very helpful.

nadachaabani1 on 23 Jul 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.