spaCy: PhraseMatcher not matching text preceded by an @ (at sign)

Created on 20 Jan 2018 · 7 Comments · Source: explosion/spaCy

Your Environment

Operating System: Windows 10
Python Version Used: 3.5
spaCy Version Used:
Environment Information:

When a phrase is added to the PhraseMatcher, it won't match if it is preceded by an @ sign.
For example, adding sometext won't match on @sometext.

This is different from the behavior with a # hashtag.
For example, adding sometext will match on #sometext.

usage

Source

eranhirs

Most helpful comment

To be honest, I'm not sure there's a deeper reason for it. An "@" infix may indicate an email or email-like token, but I can't really think of a prefix or suffix that would require the string to stay one token. So it's likely that it just hasn't been added yet, and never really came up before.

If you want to experiment with it, you can find the global punctuation rules in lang/punctuation.py. If splitting the @ has no bad side-effects and the tests pass, we'd be happy to accept a PR for this.

ines on 21 Jan 2018

👍2

All 7 comments

The problem here is that, just like the regular Matcher, the PhraseMatcher also depends on the tokenization. It can match spans of text across multiple tokens – but not spans of text within a token. By default, the tokenizer splits #sometext into two tokens, but not @sometext:

>>> [token.text for token in nlp(u"#sometext")] 
['#', 'sometext']
>>> [token.text for token in nlp(u"@sometext")]
['@sometext']

This means that a pattern for nlp(u"sometext") will match the second token of #sometext – but not the one token @sometext.
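To make the token-boundary point concrete, here is a minimal sketch. It assumes a recent spaCy release where PhraseMatcher.add takes a list of pattern Docs, and it builds the two Docs by hand (with words copied from the tokenizer output quoted above) so that the result does not depend on the installed tokenizer rules. The names split_doc and single_doc are only illustrative.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc

# A blank pipeline is enough: only the vocab and tokenizer are needed.
nlp = spacy.blank("en")

matcher = PhraseMatcher(nlp.vocab)
matcher.add("SOMETEXT", [nlp.make_doc("sometext")])

# Token boundaries are spelled out by hand to mirror the output quoted above.
split_doc = Doc(nlp.vocab, words=["#", "sometext"], spaces=[False, False])
single_doc = Doc(nlp.vocab, words=["@sometext"], spaces=[False])

split_matches = [(start, end) for _, start, end in matcher(split_doc)]
single_matches = [(start, end) for _, start, end in matcher(single_doc)]

print(split_matches)   # [(1, 2)]: the pattern lines up with the second token
print(single_matches)  # []: the pattern only covers part of a single token
```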

As a solution, you could customise spaCy's tokenization rules to make sure the @ is always split off. You might also want to look into this matcher example that shows how to use token patterns with regular expressions, e.g. to define more fine-grained token rules.
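One possible way to split off the @ is to extend the tokenizer's prefix rules. The sketch below assumes the default English prefixes are plain regex strings and that appending "@" has no unwanted side effects on your data, which is worth checking on real text. Because it is a prefix rule, it only affects an @ at the start of a token, so strings like user@example.com are not touched by it.

```python
import spacy
from spacy.util import compile_prefix_regex

# A blank English pipeline is enough here: only the tokenizer matters.
nlp = spacy.blank("en")

# Extend the default prefix patterns with "@" so it becomes its own token.
prefixes = list(nlp.Defaults.prefixes) + [r"@"]
prefix_regex = compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

tokens = [token.text for token in nlp("@sometext")]
print(tokens)  # ['@', 'sometext']
```

With the @ split off, a PhraseMatcher pattern for sometext can line up with the second token, just as it does for #sometext.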

ines on 20 Jan 2018

👍1

By default, the tokenizer splits #sometext into two tokens, but not @sometext

Could you elaborate about the reason behind this?

eranhirs on 20 Jan 2018

ines on 21 Jan 2018

👍2

In that specific example, once you can match whatever is in the matcher rules, how do you generate one (and only one) token from that?
In the case of @something, the tokenizer will still split it into @ and something.
How can I merge the two into one token, @something?

fgadaleta on 6 Feb 2018

@fgadaleta

How can I merge the two into 1 token @something ?

Check out Span.merge! Here's a simple example:

doc = nlp(u"Hello #something")
[token.text for token in doc]  # ['Hello', '#', 'something']
span = doc[1:3]                # create a span (slice of the doc)
span.merge()                   # merge the span
[token.text for token in doc]  # ['Hello', '#something']
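Note for readers on newer versions: Span.merge was deprecated in spaCy v2.1 and is no longer available in v3, where merging goes through Doc.retokenize instead. The sketch below is a rough equivalent, not part of the original answer. It assumes the @ has already been split off by a customised tokenizer, so the Doc is built by hand to make the token boundaries explicit.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-built Doc standing in for a tokenizer that splits "@" from "something".
doc = Doc(nlp.vocab, words=["Hello", "@", "something"], spaces=[True, False, False])
before = [token.text for token in doc]

# Merge the two tokens back into one inside the retokenize context manager.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[1:3])

after = [token.text for token in doc]
print(before)  # ['Hello', '@', 'something']
print(after)   # ['Hello', '@something']
```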

ines on 6 Feb 2018

Yeah, I found it already.
Thanks a lot!

Francesco Gadaleta