Related issues: #1504, #1623, #1371, #1265, #1235, #1184, #3002
regex.DEFAULT_VERSION is currently set globally, which is bad and can break other modules. There might also be other unintended side-effects of this (see #1623). The regex package's versioning scheme (0.1.20110315 vs. 2017.06.23) also seems to cause problems in combination with other packages that specify older versions of regex (see #1184).
We've already done some profiling, and the suffix rules definitely seem to be a bigger problem than the prefix and infix rules. The biggest issue seems to be variable-width lookbehinds, caused by character classes that actually consist of multi-character tokens (like ’’). These have to be grouped into disjunctive expressions using |, rather than character classes using [], which only require a set lookup. Variable-width lookbehinds, however, mean that the tokenizer has to keep backtracking. This is also the reason most regex implementations don't support variable-width lookarounds. Russ Cox (who wrote re2) has a very comprehensive overview of these issues.
In spaCy's case, the regex module seems to be less efficient than re. If we simplify our regular expressions, we could also have an optional dependency on re2, which is even stricter than Python's re, but much faster. It can be used via a wrapper like this one and will require the system-level dependency to be installed – but if it is, spaCy will be much faster.
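To make the lookbehind problem concrete, here's a minimal sketch (using a made-up, simplified quote class, not spaCy's actual character classes) of why multi-character tokens force a variable-width lookbehind that regex accepts but re rejects:

import regex

# Simplified stand-in for a quotes class: because it contains multi-character
# entries like '' and ``, it can't go into a [...] character class and has to
# be joined into a disjunction with |:
QUOTES = "' \" ” “ `` '' ‘ ’".strip().replace(' ', '|')

# A suffix rule that splits a period only after a closing quote then needs a
# lookbehind whose alternatives have different lengths, i.e. a variable-width
# lookbehind. The regex module compiles this, Python's re does not:
pattern = r'(?<=(?:{q}))\.'.format(q=QUOTES)
regex.compile(pattern)      # OK
# re.compile(pattern)       # re.error: look-behind requires fixed-width pattern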
We want the tokenizer to run at around 500,000 words per second.
Simplifying the tokenization rules may mean that we have to accept some regressions. This is okay, as long as it only affects very specific cases. We might also want to adopt a flag system that lets the user enable additional, advanced rules that may result in slower tokenization speed, but higher accuracy (e.g. enabling full emoji and symbols support).
We initially thought it'd be a good idea to include all alphanumeric characters in the character classes used by the global punctuation rules. The idea here was that even if you're processing English text and come across a non-English alphanumeric character, you'd still want the punctuation rules to work as expected. While this still makes sense for characters like ä or ø, we should probably re-think this approach for character sets like Hebrew, Russian etc. Instead, those languages should probably define their own rules and character classes in the respective language data – not globally.
On the plus side, our language-specific and language-independent tokenizer tests are pretty good and comprehensive. This should hopefully help a lot with debugging and testing possible solutions.
There's also the (currently experimental and thus undocumented) profile CLI command, which profiles the spaCy pipeline to find out which functions and operations take the most time. The command takes a model name or language ID, and an optional input file (JSONL with one "text" key per line). If no input is specified, the IMDB dataset provided by Thinc is used.
python -m spacy profile en
python -m spacy profile en_core_web_sm /tmp/my_input.jsonl
OK, so I've just started looking at this. It seems that _suffixes in spacy/lang/punctuation.py can be simplified to
_suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES + LIST_ICONS +
             ["'s", "'S", "’s", "’S"] +
             [r'(?<=[0-9])(?:{})'.format(UNITS),
              r'(?<=[0-9{}{}(?:{})])\.'.format(ALPHA_LOWER, r'%²\-\)\]\+', QUOTES)])
without breaking any existing tests in spacy/tests/lang or spacy/tests/tokenizer.
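For anyone who wants to benchmark this, here's a rough sketch of how a simplified suffix list can be plugged into the tokenizer (assuming spaCy's v2 API and reusing the _suffixes definition above):

from spacy.lang.en import English
from spacy.util import compile_suffix_regex

nlp = English()
# _suffixes is the simplified list above; its names (LIST_PUNCT, UNITS, QUOTES
# etc.) come from spacy.lang.char_classes
nlp.tokenizer.suffix_search = compile_suffix_regex(_suffixes).search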
It's not clear to me yet how lookbehinds can be removed completely though, as we need to match without consuming characters when doing the token splits.
There are also some apparent inconsistencies in the tests for different languages, e.g. in hu 100°C is expected to be a single token whereas in fr 4°C is expected to be 2 tokens.
Is tokenization of measurements, units etc language-specific?
I've not got much experience with NLP on non-English languages, but maybe one option is to have a really simple tokenizer and then fix up the exceptions that require further splitting?
E.g. something like
regex.split(r'(?<!\p{Lu})([. \p{P}])', text)
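Roughly, on a made-up example (just a sketch of the idea, not a proposal for the final pattern):

import regex

text = 'The U.S. economy grew, slowly.'
# Split on punctuation/whitespace, except where the character is preceded by
# an uppercase letter – so abbreviations like "U.S." stay in one piece.
parts = [p for p in regex.split(r'(?<!\p{Lu})([. \p{P}])', text)
         if p and not p.isspace()]
print(parts)   # ['The', 'U.S.', 'economy', 'grew', ',', 'slowly', '.']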
@philgooch Thanks so much, this is cool! 👍 One of the biggest problems with the lookbehinds is that they're variable-width, i.e. they don't just operate on single characters, but on multi-character strings that are then joined:
merge_chars = lambda char: char.strip().replace(' ', '|')
This is especially relevant in the units, quotes, currency symbols, hyphens etc. – and where our idea of merging tokens afterwards comes in. Instead of making the tokenizer backtrack on punctuation like ---, we could simply split on - and then merge [{ORTH: '-'}, {ORTH: '-'}, {ORTH: '-'}].
Similarly, it could also be easier and faster to split on all hyphens by default, and then merge [{IS_ALPHA: True}, {ORTH: '-'}, {IS_ALPHA: True}] back together afterwards.
If we can drop the variable-width lookbehinds and the currently overcomplicated ALPHA, ALPHA_LOWER and ALPHA_UPPER groups, we could then drop the regex dependency, in favour of a faster implementation.
Is tokenization of measurements, units etc language-specific?
Yes, some of it is – at least, the expected tokenization of the respective UD treebank.
We're also hoping that the approach of "generous splitting + merging" will make this easier, because it'll allow hu to simply add one more merge rule [{IS_DIGIT: True}, {LOWER: '°c'}], instead of having to overwrite all suffix rules. Similarly, it'd also allow users to implement any other, arbitrary tokenization standard, without having to effectively re-write the regular expressions.
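To illustrate (a sketch using a hand-built Doc to stand in for whatever the simplified tokenizer would produce, and assuming spaCy v2.1+ for the retokenizer – not the actual hu rules):

from spacy.lang.hu import Hungarian
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = Hungarian()
# Pretend the simplified tokenizer split '100°C volt' into these pieces:
doc = Doc(nlp.vocab, words=['100', '°C', 'volt'], spaces=[False, True, False])

matcher = Matcher(nlp.vocab)
matcher.add('TEMPERATURE', None, [{'IS_DIGIT': True}, {'LOWER': '°c'}])
with doc.retokenize() as retokenizer:
    for match_id, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([t.text for t in doc])   # ['100°C', 'volt']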
@ines thanks – I get it now :) So the next step is to implement a really simple tokenizer, see what breaks, and then create merge pattern Matchers to fix these.
I'll start with English only, then.
@philgooch In case it's helpful, here are some merge patterns I wrote as part of my initial experiments:
merge_patterns = {
    'NUMBERS': [
        [{'IS_DIGIT': True, 'OP': '+'}],
        [{'IS_DIGIT': True, 'OP': '+'}, {'ORTH': ','}, {'IS_DIGIT': True, 'OP': '+'}]
    ],
    'PUNCT': [
        [{'ORTH': '*'}, {'ORTH': '*'}, {'ORTH': '*'}],
        [{'ORTH': '.'}, {'ORTH': '.'}, {'ORTH': '.'}],
        [{'ORTH': '-'}, {'ORTH': '-'}, {'ORTH': '-'}],
        [{'ORTH': '-'}, {'ORTH': '-'}],
        [{'ORTH': '\''}, {'ORTH': '\''}],
        [{'ORTH': '`'}, {'ORTH': '`'}],
        [{'ORTH': '‘'}, {'ORTH': '‘'}],
        [{'ORTH': '’'}, {'ORTH': '’'}]
    ],
    'UNITS': [
        [{'LOWER': 'km'}, {'ORTH': '²'}],
        [{'LOWER': 'dm'}, {'ORTH': '²'}],
        [{'LOWER': 'cm'}, {'ORTH': '²'}],
        [{'LOWER': 'mm'}, {'ORTH': '²'}],
        [{'LOWER': 'm'}, {'ORTH': '²'}],
        [{'LOWER': 'km'}, {'ORTH': '³'}],
        [{'LOWER': 'dm'}, {'ORTH': '³'}],
        [{'LOWER': 'cm'}, {'ORTH': '³'}],
        [{'LOWER': 'mm'}, {'ORTH': '³'}],
        [{'LOWER': 'm'}, {'ORTH': '³'}],
        [{'LOWER': 'us'}, {'ORTH': '$'}],
        [{'LOWER': 'c'}, {'ORTH': '$'}],
        [{'LOWER': 'a'}, {'ORTH': '$'}]
    ]
}
I then used the following simple pipeline component to do the merging:
from spacy.matcher import Matcher
def merge_tokens(doc):
    matcher = Matcher(doc.vocab)
    for pattern_id, patterns in merge_patterns.items():
        matcher.add(pattern_id, None, *patterns)
    spans = []
    for match_id, start, end in matcher(doc):
        spans.append(doc[start:end])
    for span in spans:
        span.merge()
    return doc
For example:
from spacy.lang.en import English
nlp = English()
nlp.add_pipe(merge_tokens, name='token_merger')
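A possible variation (a sketch, assuming a recent spaCy v2.x that has doc.retokenize() and spacy.util.filter_spans): greedy patterns like {'IS_DIGIT': True, 'OP': '+'} produce overlapping matches, so it may be safer to filter the spans and do the merging inside a single retokenize() block:

from spacy.matcher import Matcher
from spacy.util import filter_spans

def merge_tokens_v2(doc):
    matcher = Matcher(doc.vocab)
    for pattern_id, patterns in merge_patterns.items():
        matcher.add(pattern_id, None, *patterns)
    # Keep only the longest non-overlapping matches before merging
    spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc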
We could potentially avoid the need for the PUNCT patterns by doing something like this, where multiple consecutive non-word/period/space characters are combined into a single token:
tokens = list(filter(None, regex.split(r'(\s+|[.]+|[^.\s\d\p{L}]+)', input_text)))
That's interesting and definitely worth experimenting with. Ultimately, it comes down to which case is more frequent: punctuation that's merged but should be split or punctuation that's split and needs to be merged. Or, phrased differently, which list of exceptions would be smaller: punctuation that should be merged or punctuation that should be split? (Intuitively, I'd lean towards the first – but my intuition might be completely off.)
Here are some examples of expected tokenization:
hello world \n\n\n → ['hello', 'world', '\n\n\n']
hello --- world → ['hello', '---', 'world']
"“Hello!”, he said." → ['"', '“', 'Hello', '!', '”', ',', 'he', 'said', '.', '"']
Great summary, I agree that the tokenization is currently too slow.
Simplifying the regex by removing the lookbehinds and switching to a faster regular expression library seems like a good idea, if there's an easy way to merge tokens afterwards.
The Matcher class seems like a good way to match the patterns to merge, but I see a few limitations that need to be addressed before it can be used as the tool to merge tokens:
a-t-il is tokenized into ['a', '-t', '-il']. The pattern to detect is [{}, {'LOWER': '-'}, {'LOWER': 't'}, {'LOWER': 'il'}] (made of 4 tokens), but we should have 3 tokens after the merge. The way the tokens are combined should therefore be configurable.
@raphael0202 About how to get the right tokenization for the euphonic "t" in French: the TOKENIZER_EXCEPTIONS seem to favor the former option (I guess @ines and @honnibal have an informed take on the matter).
@moreymat Yes, the merge should not depend on the preceding verb – that's why the first token of the pattern I wrote ([{}, {'LOWER': '-'}, {'LOWER': 't'}, {'LOWER': 'il'}]) is {}, which means "match any token".
This way we're sure there is a token before the -t-il.
edit: I misread your comment. The current implementation of the Matcher class is not based on regex patterns, so there is no lookbehind involved.
Finally moving forward on the tokenizer improvements. We'll be merging some more general tokenization performance-related issues onto here to keep things in one place.
/cc @svlandeg, who'll be working on this 🙌
@ines
Just as a quick progress update: I've been fiddling with this and decided to break this work up into 3 steps. Step 1 is to drop the regex dependency in favour of re, which involves rewriting expressions like \p{Latin} to the actual unicode ranges. I'm benchmarking this on the existing unit tests, introducing additional ones, and testing accuracy & speed using the UD eval script.
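For a sense of what that rewrite involves, a deliberately incomplete illustration (the range below is just a subset for demonstration, not the actual replacement being used):

import re

# With the regex library you can write \p{Latin}; with re, the unicode ranges
# have to be spelled out by hand:
LATIN_SUBSET = 'A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F'
assert re.match('[{}]+'.format(LATIN_SUBSET), 'naïve')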
Step 1 seems to be coming along quite nicely - models using the default rules are done with virtually no regressions. Now I'm looking into a few language-specific rules that require rewriting. While I'm doing this I'm collecting ideas for step 2 and 3 but will add those to separate PRs. This will allow us to better trace the cause of potential regressions along the way.
With regex now removed, there can be no more variable-width lookbehinds. As such, I don't see an easy fix for issue https://github.com/explosion/spaCy/issues/1235: the current suffix rule tests for a preceding number, but can't go one step further back to check whether there's a letter or not, because there might not be anything there at all (e.g. 2g) – which means it would have to be a variable-width check to prevent a case like e2g.
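To spell out the problem with a simplified stand-in for that suffix rule (not the actual rule in the codebase):

import re
import regex

# Splitting the unit off '2g' only needs a fixed-width lookbehind for a digit:
re.compile(r'(?<=[0-9])g')
# To avoid also splitting 'e2g', the rule would have to look one step further
# back for "start of string OR a non-letter", which makes the lookbehind
# variable-width. regex accepts that, re does not:
regex.compile(r'(?<=(?:^|[^a-z])[0-9])g')
# re.compile(r'(?<=(?:^|[^a-z])[0-9])g')   # error: look-behind requires fixed-width pattern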
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.