spaCy: PhraseMatcher not matching text preceded by an @ (at) sign

Created on 20 Jan 2018 · 7 Comments · Source: explosion/spaCy

Your Environment

  • Operating System: Windows 10
  • Python Version Used: 3.5
  • spaCy Version Used:
  • Environment Information:

When adding any text to PhraseMatcher, it won't match if the text is preceded by an @ sign.
For example, adding sometext won't match on @sometext.

This is different from the behavior with a # hashtag.
For example, adding sometext will match on #sometext.

All 7 comments

The problem here is that, just like the regular Matcher, the PhraseMatcher also depends on the tokenization. It can match spans of text across multiple tokens – but not spans of text within a token. By default, the tokenizer splits #sometext into two tokens, but not @sometext:

>>> [token.text for token in nlp(u"#sometext")] 
['#', 'sometext']
>>> [token.text for token in nlp(u"@sometext")]
['@sometext']

This means that a pattern for nlp(u"sometext") will match the second token of #sometext – but not the one token @sometext.
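
To make the dependence on tokenization concrete, here is a minimal sketch that builds the two token sequences by hand instead of relying on the tokenizer. It assumes spaCy v3, where PhraseMatcher.add takes a list of pattern docs (older v2 releases used a different call signature), and the label "SOMETEXT" is a made-up name for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")  # blank English pipeline, no model download needed

matcher = PhraseMatcher(nlp.vocab)
# "SOMETEXT" is an arbitrary label chosen for this sketch
matcher.add("SOMETEXT", [nlp.make_doc("sometext")])

# Token sequences as shown in the output above, built by hand
split_doc = Doc(nlp.vocab, words=["#", "sometext"])
single_doc = Doc(nlp.vocab, words=["@sometext"])

split_matches = [(start, end) for _, start, end in matcher(split_doc)]
single_matches = [(start, end) for _, start, end in matcher(single_doc)]

print(split_matches)   # the pattern lines up with the token at index 1
print(single_matches)  # no token equals "sometext", so nothing matches
```

The matcher compares whole tokens, so the pattern can only line up with a token whose text is exactly sometext.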

As a solution, you could customise spaCy's tokenization rules to make sure the @ is always split off. You might also want to look into this matcher example that shows how to use token patterns with regular expressions, e.g. to define more fine-grained token rules.
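
One possible way to customise the rules is to add @ to the tokenizer's prefix patterns. The following is only a sketch, assuming a blank English pipeline and the compile_prefix_regex helper from spacy.util; it is not necessarily the change the maintainers had in mind.

```python
import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.blank("en")

# Append "@" to the default prefix patterns and rebuild the prefix regex
prefixes = list(nlp.Defaults.prefixes) + ["@"]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

tokens = [token.text for token in nlp("@sometext")]
print(tokens)  # the leading "@" is now split off as its own token
```

Because this is a prefix rule, it only applies to an @ at the start of a string being tokenized, not to an @ in the middle of one.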

By default, the tokenizer splits #sometext into two tokens, but not @sometext

Could you elaborate about the reason behind this?

To be honest, I'm not sure there's a deeper reason for it. An "@" infix may indicate an email or email-like token, but I can't really think of a prefix or suffix that would require the string to stay one token. So it's likely that it just hasn't been added yet, and never really came up before.

If you want to experiment with it, you can find the global punctuation rules in lang/punctuation.py. If splitting the @ has no bad side-effects and the tests pass, we'd be happy to accept a PR for this.

In that specific example, once you can match whatever is in the matcher rules, how do you generate one (and only one) token from that?
In the case of @something, the tokenizer will still produce @ and something.
How can I merge the two into one token, @something?

@fgadaleta

How can I merge the two into one token, @something?

Check out Span.merge! Here's a simple example:

doc = nlp(u"Hello #something")
[token.text for token in doc]  # ['Hello', '#', 'something']
span = doc[1:3]                # create a span (slice of the doc)
span.merge()                   # merge the span
[token.text for token in doc]  # ['Hello', '#something']
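
Note for readers on newer releases: Span.merge was deprecated in spaCy v2.1 and removed in v3 in favour of Doc.retokenize. A minimal sketch of the same merge with the retokenizer, assuming spaCy v3 and building the tokens by hand so the result does not depend on the tokenizer rules:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Tokens built by hand to mirror the example above;
# spaces=[True, False, False] means no space between "#" and "something"
doc = Doc(nlp.vocab, words=["Hello", "#", "something"], spaces=[True, False, False])

# Merge the span covering "#" and "something" into a single token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[1:3])

merged = [token.text for token in doc]
print(merged)
```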

Yeah, I found it already.
Thanks a lot!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
