The regex for the URL used for token_match during tokenization treats strings that are not URLs as URLs. For example, I'd expect the following text:
"This is the ticker symbol for Google, (NASDAQ:GOOG). Google\'s homepage is http://www.google.com"
to produce 'NASDAQ', ':', 'GOOG' as separate tokens while producing 'http://www.google.com' as a single token.
Using the URL regex proposed by https://gist.github.com/dperini/729294 yields better results, e.g.
token_match = re.compile(
r"^"
# protocol identifier
r"(?:(?:https?|ftp)://)"
# user:pass authentication
r"(?:\S+(?::\S*)?@)?"
r"(?:"
# IP address exclusion
# private & local networks
r"(?!(?:10|127)(?:\.\d{1,3}){3})"
r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
# IP address dotted notation octets
# excludes loopback network 0.0.0.0
# excludes reserved space >= 224.0.0.0
# excludes network & broadcast addresses
# (first & last IP address of each class)
r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
r"|"
# host name
r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
# domain name
r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
# TLD identifier
r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
r")"
# port number
r"(?::\d{2,5})?"
# resource path
r"(?:/\S*)?"
r"$"
).match
Thanks, we'll definitely patch this case.
@oroszgy : Do you have thoughts on the regex in this gist, vs. the current one?
Looks good to me. (cf. dperini column here) However I'd make the protocol identifier optional . My assumption is the things like "google.com" could appear easily in not so formal texts.
If we touch the code, I think we should
While perhaps not rigorous, I have a data point on throughput.
Processing the same full en wikipedia dump on a AWS EC2 m4.16xlarge instance:
with regex proposed above: throughput == 1,390,586 words/sec
with 1.6.0 regex: throughput == 1,379,929 words/sec
Some details:
27 individual python processes running concurrently, each multithreaded generator
@rappdw Do you mind making a PR on this? :)
@oroszgy Yes, I'll submit a PR.
@oroszgy https://github.com/explosion/spaCy/pull/879 submitted
Merged!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.