When adding text to the PhraseMatcher, a pattern won't match if the text is preceded by an @ sign.
For example, adding sometext won't match on @sometext.
This is different from the behaviour with a # hashtag.
For example, adding sometext will match on #sometext.
The problem here is that, just like the regular Matcher, the PhraseMatcher also depends on the tokenization. It can match spans of text across multiple tokens – but not spans of text within a token. By default, the tokenizer splits #sometext into two tokens, but not @sometext:
>>> [token.text for token in nlp(u"#sometext")]
['#', 'sometext']
>>> [token.text for token in nlp(u"@sometext")]
['@sometext']
This means that a pattern for nlp(u"sometext") will match the second token of #sometext – but not the one token @sometext.
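A minimal sketch of this behaviour (using the spaCy v3 `matcher.add` signature, which takes the patterns as a list; v2 used `matcher.add(key, None, *docs)`):

```python
import spacy
from spacy.matcher import PhraseMatcher

# Blank English pipeline: uses the same default tokenization rules
nlp = spacy.blank("en")

matcher = PhraseMatcher(nlp.vocab)
matcher.add("SOMETEXT", [nlp("sometext")])

hash_doc = nlp("#sometext")  # tokenized as ['#', 'sometext']
at_doc = nlp("@sometext")    # tokenized as ['@sometext']

print(len(matcher(hash_doc)))  # 1 – 'sometext' is its own token
print(len(matcher(at_doc)))    # 0 – the pattern can't match inside '@sometext'
```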
As a solution, you could customise spaCy's tokenization rules to make sure the @ is always split off. You might also want to look into this matcher example that shows how to use token patterns with regular expressions, e.g. to define more fine-grained token rules.
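The linked example isn't reproduced here, but a token pattern using the `REGEX` operator (available in spaCy v2.1+) might look like the following sketch. The regex for an @-handle is an assumption; adjust it to your own definition:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match a single token whose text looks like an @-handle
matcher.add("HANDLE", [[{"TEXT": {"REGEX": r"^@\w+$"}}]])

doc = nlp("ping @sometext please")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # @sometext
```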
By default, the tokenizer splits #sometext into two tokens, but not @sometext
Could you elaborate about the reason behind this?
To be honest, I'm not sure there's a deeper reason for it. An "@" infix may indicate an email or email-like token, but I can't really think of a prefix or suffix that would require the string to stay one token. So it's likely that it just hasn't been added yet, and never really came up before.
If you want to experiment with it, you can find the global punctuation rules in lang/punctuation.py. If splitting the @ has no bad side-effects and the tests pass, we'd be happy to accept a PR for this.
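Instead of patching the shared rules in lang/punctuation.py, you can also experiment on a single pipeline by adding "@" to the prefix rules and recompiling the prefix regex. A sketch, assuming spaCy v2+ where `nlp.Defaults.prefixes` and `spacy.util.compile_prefix_regex` are available:

```python
import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.blank("en")

# Add "@" to the default prefix rules and recompile the prefix regex
prefixes = list(nlp.Defaults.prefixes) + ["@"]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

print([t.text for t in nlp("@sometext")])  # ['@', 'sometext']
```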
In that specific example, once the matcher rules produce a match, how do you generate one (and only one) token from it?
In the case of @something, the tokenizer will still split it into @ and something.
How can I merge the two into one token, @something?
@fgadaleta
How can I merge the two into 1 token
@something?
Check out Span.merge! Here's a simple example:
doc = nlp(u"Hello #something")
[token.text for token in doc] # ['Hello', '#', 'something']
span = doc[1:3] # create a span (slice of the doc)
span.merge() # merge the span
[token.text for token in doc] # ['Hello', '#something']
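Note that Span.merge was later deprecated (spaCy v2.1) and removed in v3 in favour of Doc.retokenize. A sketch of the same merge with the newer API:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello #something")

# Merge doc[1:3] ('#' + 'something') back into a single token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[1:3])

print([t.text for t in doc])  # ['Hello', '#something']
```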
Yeah, I found it already.
Thanks a lot!