spaCy: tag every token from the matched sentence

Created on 22 Jul 2019 · 3 Comments · Source: explosion/spaCy

Feature description

As I understand from the documentation, we can match text in a sentence using rule-based patterns added to the Matcher, for example:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i want to buy an iPhone X"
)
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# spaCy v2 signature; in spaCy v3 this becomes matcher.add("Phone", [pattern])
matcher.add("Phone", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

The program prints:

Total matches found: 1
Match found: iPhone X

Now we want to tag each token in the match, so that the result looks like this:

Phone Product: iPhone
Version: X

Phone Product and Version are two tags provided by the user.

Is there a way to achieve this result?

feat / matcher usage

Most helpful comment

You could either iterate over the tokens and store the results in some structure like this (very naive approach):

result = {}

for match_id, start, end in matches:
    match = doc[start:end]
    print("Match found:", match.text)
    if len(match) == 2:
        result[match.text] = {}
        result[match.text]["Phone Product"] = match[0].text
        result[match.text]["Version"] = match[1].text

print(result)

# {'iPhone X': {'Phone Product': 'iPhone', 'Version': 'X'}}

Or use spaCy's extension attributes. These let you attach custom attributes to tokens and read them through the underscore namespace, for example token._.phone_product and token._.version.
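
A rough sketch of the extension-attribute route, in case it helps. It assumes spaCy v3 (where Matcher.add takes a list of patterns) and uses spacy.blank("en") instead of en_core_web_sm, since the pattern only checks TEXT. The attribute name phone_tag and the tags list are made up for illustration; it stores one tag per token rather than the two separate attributes mentioned above, and assumes the tags line up with the pattern positions.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

# Hypothetical attribute name; force=True lets the script be re-run in one session.
Token.set_extension("phone_tag", default=None, force=True)

# A blank English pipeline is enough because the pattern only checks TEXT.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# spaCy v3 signature: add(key, list_of_patterns)
matcher.add("Phone", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

# User-provided tags, assumed to correspond one-to-one with the pattern positions.
tags = ["Phone Product", "Version"]

doc = nlp("i want to buy an iPhone X")
for match_id, start, end in matcher(doc):
    # Pair each matched token with the tag at the same position.
    for token, tag in zip(doc[start:end], tags):
        token._.phone_tag = tag

# Collect only the tokens that received a tag.
tagged = [(t._.phone_tag, t.text) for t in doc if t._.phone_tag is not None]
for tag, text in tagged:
    print(f"{tag}: {text}")
```

Untagged tokens keep the default None, so you can filter on token._.phone_tag anywhere later in your code without keeping a separate result dictionary around.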

Thanks for your advice. It was very helpful.
