Hi,
I wanted to tokenize a dataset such as 20newsgroups and I found spaCy 2.0 to be quite slow. To be sure, I also tried spaCy 1.9, and it was twice as fast! I did some speed analysis comparing v1 and v2 by document length (in characters). It seems that v2 is more sensitive to document length, and its processing time is more volatile... Is this expected, due to some new tokenizer features or the new machinery of v2?

Thanks for the analysis! There are some open questions about this on the TODO list for spaCy 2 stable: https://github.com/explosion/spaCy/projects/4
There are a few potential problem sources that could be to blame for the regression here:

1. prefix_re
2. suffix_re
3. infix_finditer
4. token_match

The hope is that it's 1-4. 5-6 wouldn't be so bad either. If it's 7, that'll take some more work and might force some hard decisions.
We can mostly exclude 5 by setting nlp.tokenizer.vocab.lex_attr_getters = {}. This way we don't compute any of the string features. If the caching isn't working well, this will make a big difference. If it doesn't make much difference, it's unlikely to be about the lexeme caching.
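The effect of clearing lex_attr_getters can be modeled in plain Python. This is a toy sketch (hypothetical names, not spaCy's internals): each new string pays for every getter once, a working cache makes repeat strings free, and an empty getter dict removes the per-string cost entirely.

```python
# Toy model of lexeme-attribute caching (illustrative only, not spaCy internals).
def make_lexeme_factory(lex_attr_getters):
    cache = {}

    def get_lexeme(string):
        if string not in cache:
            # First sighting: compute every string feature. This is the
            # per-string cost that lex_attr_getters = {} removes.
            cache[string] = {name: getter(string)
                             for name, getter in lex_attr_getters.items()}
        return cache[string]

    return get_lexeme

getters = {"lower": str.lower, "is_digit": str.isdigit, "length": len}
get_lexeme = make_lexeme_factory(getters)
print(get_lexeme("Hello"))   # computed on first call
print(get_lexeme("Hello"))   # served from the cache

# With no getters, every lexeme is empty and lookup is pure cache traffic.
get_bare = make_lexeme_factory({})
print(get_bare("Hello"))
```

If disabling the getters closes most of the gap, the per-string computation (or a cache that misses too often) is the likely culprit.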
We can investigate 1-4 by assigning different functions to those attributes of the tokenizer. I think token_match is a very likely culprit, given the non-linearity you've identified.
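One way to isolate components 1-4 is to time each regex in isolation over the corpus's chunks. A rough sketch with plain `re` and made-up stand-in patterns (the real ones live in spaCy's language data, so treat the patterns below as placeholders):

```python
import re
import time

# Hypothetical stand-ins for the tokenizer's regex components.
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"'\.,;!?]$''')
infix_re = re.compile(r'''[-~]''')
token_match = re.compile(r'''^https?://\S+$''').match

def time_component(fn, chunks, reps=5):
    """Best-of-reps wall time for running one component over all chunks."""
    best = float("inf")
    for _ in range(reps):
        start = time.perf_counter()
        for chunk in chunks:
            fn(chunk)
        best = min(best, time.perf_counter() - start)
    return best

chunks = ['"Hello,', 'state-of-the-art', 'http://example.com', 'world!'] * 1000
for name, fn in [("prefix", prefix_re.search),
                 ("suffix", suffix_re.search),
                 ("infix", lambda s: list(infix_re.finditer(s))),
                 ("token_match", token_match)]:
    print(name, time_component(fn, chunks))
```

If one component's cost grows much faster than the others as chunk length increases, that points at the non-linearity.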
I made some more experiments, on the first 5k texts of the 20newsgroups corpus, averaged over 10 iterations. Here's my script, btw: https://gist.github.com/thomasopsomer/5b044f86b9e8f1a327e409631360cc99
| | 2.0 | 1.9 |
| --------------------- |---------:| --------:|
| Avg processing time (s) | 12.28 | 8.56 |
| Avg time per doc (s) | 0.0024 | 0.0017 |
| Avg max time per doc (s) | 0.75 | 0.59 |
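The three metrics in the table can be reproduced with a small harness along these lines (str.split stands in for the spaCy tokenizer here, so the sketch stays self-contained):

```python
import time

def benchmark(tokenize, texts, iterations=10):
    """Return avg total time, avg time per doc, and avg max per-doc time,
    averaged over several iterations, mirroring the table above."""
    totals, maxima = [], []
    for _ in range(iterations):
        per_doc = []
        for text in texts:
            start = time.perf_counter()
            tokenize(text)
            per_doc.append(time.perf_counter() - start)
        totals.append(sum(per_doc))
        maxima.append(max(per_doc))
    return {
        "avg_processing_time": sum(totals) / iterations,
        "avg_time_per_doc": sum(totals) / iterations / len(texts),
        "avg_max_time_per_doc": sum(maxima) / iterations,
    }

texts = ["a short doc", "a somewhat longer document " * 50] * 100
stats = benchmark(str.split, texts)
print(stats)
```

Tracking the max per-doc time alongside the average is what surfaces the volatility on long documents.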
I also set prefix_re, suffix_re and infix_finditer for both versions using the regexes from spaCy 1.9, and removed the exceptions with rules={}. It's not conclusive, but maybe the v2 regexes harm performance a bit...

| | 2.0 | 1.9 |
| --------------------- |---------:| --------:|
| Avg processing time (s) | 10.81 | 7.53 |
| Avg time per doc (s) | 0.0021 | 0.0015 |
| Avg max time per doc (s) | 0.70 | 0.58 |
As suggested, I tested 5. using nlp.tokenizer.vocab.lex_attr_getters. It seems some of the performance leak might be related to caching, since setting lex_attr_getters = {} decreases the time by 2s in v1.9 but by 4s in v2! (see below):

nlp.tokenizer.vocab.lex_attr_getters = {}

| | 2.0 | 1.9 |
| --------------------- |-------:| ---------:|
| Avg processing time (s) | 8.07 | 6.35 |
| Avg time per doc (s) | 0.0016 | 0.0013 |
| Avg max time per doc (s) | 0.49 | 0.47 |
I wanted to test v2 with the change from #1411, but didn't manage to build the develop branch ^^
There was a problem with the cache in the tokenizer. But even with the fix, the v2 tokenizer is still very slow. Working on this.
Is this still a known issue? It seems like the tokenizer is quite slow by default, even when called with pipe(). Should I be adding my own multiprocessing around it?
I did some experiments today to test the performance of the tokenizer only. It looks like spaCy 2.x is still somewhat slower than spaCy 1.x. Also, surprisingly, spaCy 2.x under Python 3.6 is even twice as slow as spaCy 2.x under Python 2.7. @honnibal can you help look into why performance under Python 3.6 is not so good?
py27_spacy1: 4189739 tokens, 404279.399319 WPS
py27_spacy2: 4191479 tokens, 297504.391077 WPS
py36_spacy1: 4189739 tokens, 416148.866741 WPS
py36_spacy2: 4191479 tokens, 149588.291103 WPS
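WPS figures like the ones above come from counting the tokens produced and dividing by elapsed wall time. A minimal version of that measurement (again with str.split standing in for the spaCy tokenizer, so it runs on its own):

```python
import time

def measure_wps(tokenize, texts):
    """Count tokens over the whole corpus and divide by elapsed wall time."""
    n_tokens = 0
    start = time.perf_counter()
    for text in texts:
        n_tokens += len(tokenize(text))
    elapsed = time.perf_counter() - start
    return n_tokens, n_tokens / elapsed

texts = ["one two three four five"] * 10000
n_tokens, wps = measure_wps(str.split, texts)
print("%d tokens, %f WPS" % (n_tokens, wps))
```

Running the same harness under both Python versions and both spaCy versions, with the same corpus, is what makes the four numbers above comparable.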
Environment:
Machine: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz x 8
OS: Ubuntu 16.04.1
Python version: Python 2.7.15 or Python 3.6.8 :: Anaconda
spaCy version: 1.10.1 or 2.0.18
Hi @rulai-huajunzeng, I'm currently looking into improving the compilation of regular expressions in the tokenizer, with a focus on speed. We're definitely aiming to substantially improve upon the WPS stats. Which corpus did you do the above tests on?
Merging this thread with the master thread in #1642!
@svlandeg glad to know that you are working on that. I used a personal corpus which I cannot share. It has more than 300K lines of text, and each line contains one or several sentences.