spaCy: URL regex in tokenizer_exceptions.py too broad

Created on 16 Feb 2017 · 8 comments · Source: explosion/spaCy

The URL regex used for token_match during tokenization treats strings that are not URLs as URLs. For example, I'd expect the following text:

"This is the ticker symbol for Google, (NASDAQ:GOOG). Google\'s homepage is http://www.google.com"

to produce 'NASDAQ', ':', 'GOOG' as separate tokens while producing 'http://www.google.com' as a single token.

Using the URL regex proposed by https://gist.github.com/dperini/729294 yields better results, e.g.

import re

token_match = re.compile(
    r"^"
    # protocol identifier
    r"(?:(?:https?|ftp)://)"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    r"$"
).match
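For context, here is a minimal sketch of how such a token_match callable could be wired into the tokenizer and checked against the example sentence above. It assumes the spaCy 2+ API, where spacy.blank() exists and nlp.tokenizer.token_match can be overridden directly (in 1.6.0 the pattern lives in tokenizer_exceptions.py); the expected behaviour noted in the comments is the one described in the issue.

import spacy

# Sketch assuming spaCy 2+: spacy.blank() and a writable token_match attribute.
nlp = spacy.blank("en")
nlp.tokenizer.token_match = token_match  # the compiled .match defined above

doc = nlp("This is the ticker symbol for Google, (NASDAQ:GOOG). "
          "Google's homepage is http://www.google.com")
print([token.text for token in doc])
# Per the issue, 'NASDAQ', ':', 'GOOG' should come out as separate tokens,
# while 'http://www.google.com' should stay a single token.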

Your Environment

  • Operating System: OSX & Linux
  • Python Version Used: 3.6.0
  • spaCy Version Used: 1.6.0
  • Environment Information:

All 8 comments

Thanks, we'll definitely patch this case.

@oroszgy : Do you have thoughts on the regex in this gist, vs. the current one?

Looks good to me (cf. the dperini column here). However, I'd make the protocol identifier optional, as sketched below. My assumption is that things like "google.com" could easily appear in less formal texts.
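For illustration, that change would simply make the protocol group in the regex above optional; the trailing ? is the only difference:

    # protocol identifier (optional, so bare hosts like "google.com" also match)
    r"(?:(?:https?|ftp)://)?"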

If we touch the code, I think we should

  • add new test cases based on the link above (a rough sketch follows this list)
  • check how the tokenizer's throughput changes
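A rough sketch of what such test cases might look like, using pytest against the token_match defined in the issue description; the function names and URL lists here are illustrative, not spaCy's actual test suite:

import pytest

# Hypothetical test data; a bare host like "google.com" would only match
# if the protocol group is made optional as suggested above.
URLS = [
    "http://www.google.com",
    "https://gist.github.com/dperini/729294",
    "ftp://ftp.example.org/files/data.txt",
]
NOT_URLS = [
    "NASDAQ:GOOG",
    "(NASDAQ:GOOG)",
    "http://",
]

@pytest.mark.parametrize("text", URLS)
def test_token_match_accepts_urls(text):
    assert token_match(text) is not None

@pytest.mark.parametrize("text", NOT_URLS)
def test_token_match_rejects_non_urls(text):
    assert token_match(text) is None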

While perhaps not rigorous, I have a data point on throughput.

Processing the same full English Wikipedia dump on an AWS EC2 m4.16xlarge instance:

  • with the regex proposed above: throughput == 1,390,586 words/sec
  • with the 1.6.0 regex: throughput == 1,379,929 words/sec

Some details:
27 individual Python processes running concurrently, each a multithreaded generator

@rappdw Do you mind making a PR on this? :)

@oroszgy Yes, I'll submit a PR.

Merged!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
