Related issues: #1504, #1623, #1371, #1265, #1235, #1184, #3002
regex.DEFAULT_VERSION is currently set globally, which is bad and can break other modules. There might also be other unintended side-effects of this (see #1623). The regex package's versioning scheme (0.1.20110315 vs. 2017.06.23) also seems to cause problems in combination with other packages that specify older versions of regex (see #1184).
We've already done some profiling, and the suffix rules definitely seem to be a bigger problem than the prefix and infix rules. The biggest issue seems to be variable-width lookbehinds, caused by character classes that actually consist of multi-character tokens (like ’’). These have to be grouped into disjunctive expressions using |, rather than character classes using [], which only require a set lookup. Variable-width lookbehinds, however, mean that the tokenizer has to keep backtracking. This is also the reason most regex implementations don't support variable-width lookarounds. Russ Cox (who wrote re2) has a very comprehensive overview of these issues.
In spaCy's case, the regex module seems to be less efficient than re. If we simplify our regular expressions, we could also have an optional dependency on re2, which is even stricter than Python's re, but much faster. It can be used via a wrapper like this one and will require the system-level dependency to be installed – but if it is, spaCy will be much faster.
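To make the lookbehind problem concrete, here's a minimal sketch (using a made-up, simplified quote class, not spaCy's actual character classes) of why multi-character tokens force a variable-width lookbehind that regex accepts but re rejects:

import regex

# Simplified stand-in for a quotes class: because it contains multi-character
# entries like '' and ``, it can't go into a [...] character class and has to
# be joined into a disjunction with |:
QUOTES = "' \" ” “ `` '' ‘ ’".strip().replace(' ', '|')

# A suffix rule that splits a period only after a closing quote then needs a
# lookbehind whose alternatives have different lengths, i.e. a variable-width
# lookbehind. The regex module compiles this, Python's re does not:
pattern = r'(?<=(?:{q}))\.'.format(q=QUOTES)
regex.compile(pattern)      # OK
# re.compile(pattern)       # re.error: look-behind requires fixed-width pattern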
We want the tokenizer to run at around 500,000 words per second.
Simplifying the tokenization rules may mean that we have to accept some regressions. This is okay, as long as it only affects very specific cases. We might also want to adopt a flag system that lets the user enable additional, advanced rules that may result in slower tokenization speed, but higher accuracy (e.g. enabling full emoji and symbols support).
We initially thought it'd be a good idea to include all alphanumeric characters in the character classes used by the global punctuation rules. The idea here was that even if you're processing English text and come across a non-English alphanumeric character, you'd still want the punctuation rules to work as expected. While this still makes sense for characters like ä or ø, we should probably re-think this approach for character sets like Hebrew, Russian etc. Instead, those languages should probably define their own rules and character classes in the respective language data – not globally.
On the plus side, our language-specific and language-independent tokenizer tests are pretty good and comprehensive. This should hopefully help a lot with debugging and testing possible solutions.
There's also the (currently experimental and thus undocumented) profile CLI command, which profiles the spaCy pipeline to find out which functions and operations take the most time. The command takes a model name or language ID, and an optional input file (JSONL with one "text" key per line). If no input is specified, the IMDB dataset provided by Thinc is used.
python -m spacy profile en
python -m spacy profile en_core_web_sm /tmp/my_input.jsonl
OK, so I've just started looking at this. It seems that _suffixes in spacy/lang/punctuation.py can be simplified to
_suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES + LIST_ICONS +
             ["'s", "'S", "’s", "’S"] +
             [r'(?<=[0-9])(?:{})'.format(UNITS),
              r'(?<=[0-9{}{}(?:{})])\.'.format(ALPHA_LOWER, r'%²\-\)\]\+', QUOTES)])
without breaking any existing tests in spacy/tests/lang or spacy/tests/tokenizer.
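For anyone who wants to benchmark this, here's a rough sketch of how a simplified suffix list can be plugged into the tokenizer (assuming spaCy's v2 API and reusing the _suffixes definition above):

from spacy.lang.en import English
from spacy.util import compile_suffix_regex

nlp = English()
# _suffixes is the simplified list above; its names (LIST_PUNCT, UNITS, QUOTES
# etc.) come from spacy.lang.char_classes
nlp.tokenizer.suffix_search = compile_suffix_regex(_suffixes).search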
It's not clear to me yet how lookbehinds can be removed completely though, as we need to match without consuming characters when doing the token splits.
There are also some apparent inconsistencies in the tests for different languages, e.g. in hu 100°C is expected to be a single token whereas in fr 4°C is expected to be 2 tokens.
Is tokenization of measurements, units etc language-specific?
I've not got much experience with NLP on non-English languages, but maybe one option is to have a really simple tokenizer and then fix up the exceptions that require further splitting?
E.g. something like
regex.split(r'(?<!\p{Lu})([. \p{P}])', text)
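Roughly, on a made-up example (just a sketch of the idea, not a proposal for the final pattern):

import regex

text = 'The U.S. economy grew, slowly.'
# Split on punctuation/whitespace, except where the character is preceded by
# an uppercase letter – so abbreviations like "U.S." stay in one piece.
parts = [p for p in regex.split(r'(?<!\p{Lu})([. \p{P}])', text)
         if p and not p.isspace()]
print(parts)   # ['The', 'U.S.', 'economy', 'grew', ',', 'slowly', '.']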
@philgooch Thanks so much, this is cool! 👍 One of the biggest problems with the lookbehinds is that they're variable-width, i.e. they don't just operate on single characters, but on multi-character strings that are then joined:
merge_chars = lambda char: char.strip().replace(' ', '|')
This is especially relevant in the units, quotes, currency symbols, hyphens etc. – and where our idea of merging tokens afterwards comes in. Instead of making the tokenizer backtrack on punctuation like ---, we could simply split on - and then merge [{ORTH: '-'}, {ORTH: '-'}, {ORTH: '-'}].
Similarly, it could also be easier and faster to split on all hyphens by default, and then merge [{IS_ALPHA: True}, {ORTH: '-'}, {IS_ALPHA: True}] back together afterwards.
If we can drop the variable-width lookbehinds and the currently overcomplicated ALPHA, ALPHA_LOWER and ALPHA_UPPER groups, we could then drop the regex dependency, in favour of a faster implementation.
Is tokenization of measurements, units etc language-specific?
Yes, some of it is – at least, the expected tokenization of the respective UD treebank.
We're also hoping that the approach of "generous splitting + merging" will make this easier, because it'll allow hu to simply add one more merge rule [{IS_DIGIT: True}, {LOWER: '°c'}], instead of having to overwrite all suffix rules. Similarly, it'd also allow users to implement any other, arbitrary tokenization standard, without having to effectively re-write the regular expressions.
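To illustrate (a sketch using a hand-built Doc to stand in for whatever the simplified tokenizer would produce, and assuming spaCy v2.1+ for the retokenizer – not the actual hu rules):

from spacy.lang.hu import Hungarian
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = Hungarian()
# Pretend the simplified tokenizer split '100°C volt' into these pieces:
doc = Doc(nlp.vocab, words=['100', '°C', 'volt'], spaces=[False, True, False])

matcher = Matcher(nlp.vocab)
matcher.add('TEMPERATURE', None, [{'IS_DIGIT': True}, {'LOWER': '°c'}])
with doc.retokenize() as retokenizer:
    for match_id, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([t.text for t in doc])   # ['100°C', 'volt']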
@ines thanks – I get it now :) So the next step is to implement a really simple tokenizer, see what breaks, and then create merge pattern Matchers to fix these.
I'll start with English only, then.
@philgooch In case it's helpful, here are some merge patterns I wrote as part of my initial experiments:
merge_patterns = {
    'NUMBERS': [
        [{'IS_DIGIT': True, 'OP': '+'}],
        [{'IS_DIGIT': True, 'OP': '+'}, {'ORTH': ','}, {'IS_DIGIT': True, 'OP': '+'}]
    ],
    'PUNCT': [
        [{'ORTH': '*'}, {'ORTH': '*'}, {'ORTH': '*'}],
        [{'ORTH': '.'}, {'ORTH': '.'}, {'ORTH': '.'}],
        [{'ORTH': '-'}, {'ORTH': '-'}, {'ORTH': '-'}],
        [{'ORTH': '-'}, {'ORTH': '-'}],
        [{'ORTH': '\''}, {'ORTH': '\''}],
        [{'ORTH': '`'}, {'ORTH': '`'}],
        [{'ORTH': '‘'}, {'ORTH': '‘'}],
        [{'ORTH': '’'}, {'ORTH': '’'}]
    ],
    'UNITS': [
        [{'LOWER': 'km'}, {'ORTH': '²'}],
        [{'LOWER': 'dm'}, {'ORTH': '²'}],
        [{'LOWER': 'cm'}, {'ORTH': '²'}],
        [{'LOWER': 'mm'}, {'ORTH': '²'}],
        [{'LOWER': 'm'}, {'ORTH': '²'}],
        [{'LOWER': 'km'}, {'ORTH': '³'}],
        [{'LOWER': 'dm'}, {'ORTH': '³'}],
        [{'LOWER': 'cm'}, {'ORTH': '³'}],
        [{'LOWER': 'mm'}, {'ORTH': '³'}],
        [{'LOWER': 'm'}, {'ORTH': '³'}],
        [{'LOWER': 'us'}, {'ORTH': '$'}],
        [{'LOWER': 'c'}, {'ORTH': '$'}],
        [{'LOWER': 'a'}, {'ORTH': '$'}]
    ]
}
I then used the following simple pipeline component to do the merging:
from spacy.matcher import Matcher
def merge_tokens(doc):
    matcher = Matcher(doc.vocab)
    for pattern_id, patterns in merge_patterns.items():
        matcher.add(pattern_id, None, *patterns)
    spans = []
    for match_id, start, end in matcher(doc):
        spans.append(doc[start:end])
    for span in spans:
        span.merge()
    return doc
For example:
from spacy.lang.en import English
nlp = English()
nlp.add_pipe(merge_tokens, name='token_merger')
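A possible variation (a sketch, assuming a recent spaCy v2.x that has doc.retokenize() and spacy.util.filter_spans): greedy patterns like {'IS_DIGIT': True, 'OP': '+'} produce overlapping matches, so it may be safer to filter the spans and do the merging inside a single retokenize() block:

from spacy.matcher import Matcher
from spacy.util import filter_spans

def merge_tokens_v2(doc):
    matcher = Matcher(doc.vocab)
    for pattern_id, patterns in merge_patterns.items():
        matcher.add(pattern_id, None, *patterns)
    # Keep only the longest non-overlapping matches before merging
    spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc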
We could potentially avoid the need for the PUNCT patterns by doing something like this, where multiple consecutive non-word/period/space characters are combined into a single token:
tokens = list(filter(None, regex.split(r'(\s+|[.]+|[^.\s\d\p{L}]+)', input_text)))
That's interesting and definitely worth experimenting with. Ultimately, it comes down to which case is more frequent: punctuation that's merged but should be split or punctuation that's split and needs to be merged. Or, phrased differently, which list of exceptions would be smaller: punctuation that should be merged or punctuation that should be split? (Intuitively, I'd lean towards the first – but my intuition might be completely off.)
Here are some examples of expected tokenization:
hello world \n\n\n → ['hello', 'world', '\n\n\n']
hello --- world → ['hello', '---', 'world']
"“Hello!”, he said." → ['"', '“', 'Hello', '!', '”', ',', 'he', 'said', '.', '"']
Great summary, I agree that the tokenization is currently too slow.
Simplifying the regex by removing the lookbehinds and switching to a faster regular expression library seems like a good idea, if there's an easy way to merge tokens afterwards.
The Matcher class seems like a good way to match the patterns to merge, but I see a few limitations that need to be addressed before it can be used as the tool to merge tokens:
a-t-il is tokenized into ['a', '-t', '-il']. The pattern to detect is [{}, {'LOWER': '-'}, {'LOWER': 't'}, {'LOWER': 'il'}] (made of 4 tokens), but we should have 3 tokens after the merge. The way the tokens are combined should therefore be configurable.
@raphael0202 About how to get the right tokenization for the euphonic "t" in French: the TOKENIZER_EXCEPTIONS seem to favor the former option (I guess @ines and @honnibal have an informed take on the matter).
@moreymat Yes, the merge should not depend on the preceding verb – that's why the first token of the pattern I wrote ([{}, {'LOWER': '-'}, {'LOWER': 't'}, {'LOWER': 'il'}]) is {}, which means "match any token".
This way we're sure there is a token before the -t-il.
edit: I misread your comment. The current implementation of the Matcher class is not based on regex patterns, so there is no lookbehind involved.
Finally moving forward on the tokenizer improvements. We'll be merging some more general tokenization performance-related issues onto here to keep things in one place.
/cc @svlandeg, who'll be working on this 🙌
@ines
Just as a quick progress update: I've been fiddling with this and decided to break this work up into 3 steps. Step 1 is to drop the regex dependency in favour of re, which involves rewriting expressions like \p{Latin} to the actual unicode ranges. I'm benchmarking this on the existing unit tests, introducing additional ones, and testing accuracy & speed using the UD eval script.
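For a sense of what that rewrite involves, a deliberately incomplete illustration (the range below is just a subset for demonstration, not the actual replacement being used):

import re

# With the regex library you can write \p{Latin}; with re, the unicode ranges
# have to be spelled out by hand:
LATIN_SUBSET = 'A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F'
assert re.match('[{}]+'.format(LATIN_SUBSET), 'naïve')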
Step 1 seems to be coming along quite nicely - models using the default rules are done with virtually no regressions. Now I'm looking into a few language-specific rules that require rewriting. While I'm doing this I'm collecting ideas for step 2 and 3 but will add those to separate PRs. This will allow us to better trace the cause of potential regressions along the way.
With regex now removed, there can be no more variable-width lookbehinds. As such, I don't see an easy fix for issue https://github.com/explosion/spaCy/issues/1235: the current suffix rule tests for a preceding number, but can't go one step further back to check whether there's a letter or not, because there might not be anything there at all (e.g. 2g) – which means it would have to be a variable-width check to prevent a case like e2g.
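To spell out the problem with a simplified stand-in for that suffix rule (not the actual rule in the codebase):

import re
import regex

# Splitting the unit off '2g' only needs a fixed-width lookbehind for a digit:
re.compile(r'(?<=[0-9])g')
# To avoid also splitting 'e2g', the rule would have to look one step further
# back for "start of string OR a non-letter", which makes the lookbehind
# variable-width. regex accepts that, re does not:
regex.compile(r'(?<=(?:^|[^a-z])[0-9])g')
# re.compile(r'(?<=(?:^|[^a-z])[0-9])g')   # error: look-behind requires fixed-width pattern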
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.