spaCy: URL regex in tokenizer_exceptions.py too broad

Created on 16 Feb 2017 · 8 comments · Source: explosion/spaCy

The URL regex used for token_match during tokenization treats strings that are not URLs as URLs. For example, I'd expect the following text:

"This is the ticker symbol for Google, (NASDAQ:GOOG). Google\'s homepage is http://www.google.com"

to produce 'NASDAQ', ':', 'GOOG' as separate tokens while producing 'http://www.google.com' as a single token.

Using the URL regex proposed by https://gist.github.com/dperini/729294 yields better results, e.g.

import re

token_match = re.compile(
    r"^"
    # protocol identifier
    r"(?:(?:https?|ftp)://)"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    r"$"
).match
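For context, here is a minimal sketch of how such a token_match callable could be wired into the tokenizer and checked against the example sentence above. It assumes the spaCy 2+ API, where spacy.blank() exists and nlp.tokenizer.token_match can be overridden directly (in 1.6.0 the pattern lives in tokenizer_exceptions.py); the expected behaviour noted in the comments is the one described in the issue.

import spacy

# Sketch assuming spaCy 2+: spacy.blank() and a writable token_match attribute.
nlp = spacy.blank("en")
nlp.tokenizer.token_match = token_match  # the compiled .match defined above

doc = nlp("This is the ticker symbol for Google, (NASDAQ:GOOG). "
          "Google's homepage is http://www.google.com")
print([token.text for token in doc])
# Per the issue, 'NASDAQ', ':', 'GOOG' should come out as separate tokens,
# while 'http://www.google.com' should stay a single token.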

Your Environment

  • Operating System: OSX & Linux
  • Python Version Used: 3.6.0
  • spaCy Version Used: 1.6.0
  • Environment Information:

All 8 comments

Thanks, we'll definitely patch this case.

@oroszgy : Do you have thoughts on the regex in this gist, vs. the current one?

Looks good to me (cf. the dperini column here). However, I'd make the protocol identifier optional, as sketched below. My assumption is that things like "google.com" could easily appear in less formal texts.
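For illustration, that change would simply make the protocol group in the regex above optional; the trailing ? is the only difference:

    # protocol identifier (optional, so bare hosts like "google.com" also match)
    r"(?:(?:https?|ftp)://)?"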

If we touch the code, I think we should

  • add new test cases based on the link above (a rough sketch follows this list)
  • check how the tokenizer's throughput changes
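A rough sketch of what such test cases might look like, using pytest against the token_match defined in the issue description; the function names and URL lists here are illustrative, not spaCy's actual test suite:

import pytest

# Hypothetical test data; a bare host like "google.com" would only match
# if the protocol group is made optional as suggested above.
URLS = [
    "http://www.google.com",
    "https://gist.github.com/dperini/729294",
    "ftp://ftp.example.org/files/data.txt",
]
NOT_URLS = [
    "NASDAQ:GOOG",
    "(NASDAQ:GOOG)",
    "http://",
]

@pytest.mark.parametrize("text", URLS)
def test_token_match_accepts_urls(text):
    assert token_match(text) is not None

@pytest.mark.parametrize("text", NOT_URLS)
def test_token_match_rejects_non_urls(text):
    assert token_match(text) is None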

While perhaps not rigorous, I have a data point on throughput.

Processing the same full English Wikipedia dump on an AWS EC2 m4.16xlarge instance:

  • with the regex proposed above: throughput == 1,390,586 words/sec
  • with the 1.6.0 regex: throughput == 1,379,929 words/sec

Some details:
27 individual Python processes running concurrently, each a multithreaded generator

@rappdw Do you mind making a PR on this? :)

@oroszgy Yes, I'll submit a PR.

Merged!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
