Hi,
I wanted to tokenize a dataset such as 20newsgroups and I found spaCy 2.0 to be quite slow. To be sure, I also tried spaCy 1.9, and it was twice as fast! I did some speed analysis comparing v1 and v2 by document length (in characters). It seems that v2 is more sensitive to document length, and its processing time is more volatile... Is this expected, due to some new tokenizer features or the new machinery of v2?

Thanks for the analysis! There are some open questions about this on the TODO list for spaCy 2 stable: https://github.com/explosion/spaCy/projects/4
There are a few potential problem sources that could be to blame for the regression here:

1. prefix_re
2. suffix_re
3. infix_finditer
4. token_match

The hope is that it's 1-4. 5-6 wouldn't be so bad either. If it's 7, that'll take some more work and might force some hard decisions.
We can mostly exclude 5 by setting nlp.tokenizer.vocab.lex_attr_getters = {}. This way we don't compute any of the string features. If the caching isn't working well, this will make a big difference. If it doesn't make much difference, it's unlikely to be about the lexeme caching.
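The effect of clearing lex_attr_getters can be modeled in plain Python. This is a toy sketch (hypothetical names, not spaCy's internals): each new string pays for every getter once, a working cache makes repeat strings free, and an empty getter dict removes the per-string cost entirely.

```python
# Toy model of lexeme-attribute caching (illustrative only, not spaCy internals).
def make_lexeme_factory(lex_attr_getters):
    cache = {}

    def get_lexeme(string):
        if string not in cache:
            # First sighting: compute every string feature. This is the
            # per-string cost that lex_attr_getters = {} removes.
            cache[string] = {name: getter(string)
                             for name, getter in lex_attr_getters.items()}
        return cache[string]

    return get_lexeme

getters = {"lower": str.lower, "is_digit": str.isdigit, "length": len}
get_lexeme = make_lexeme_factory(getters)
print(get_lexeme("Hello"))   # computed on first call
print(get_lexeme("Hello"))   # served from the cache

# With no getters, every lexeme is empty and lookup is pure cache traffic.
get_bare = make_lexeme_factory({})
print(get_bare("Hello"))
```

If disabling the getters closes most of the gap, the per-string computation (or a cache that misses too often) is the likely culprit.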
We can investigate 1-4 by assigning different functions to those attributes of the tokenizer. I think token_match is a very likely culprit, given the non-linearity you've identified.
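One way to isolate components 1-4 is to time each regex in isolation over the corpus's chunks. A rough sketch with plain `re` and made-up stand-in patterns (the real ones live in spaCy's language data, so treat the patterns below as placeholders):

```python
import re
import time

# Hypothetical stand-ins for the tokenizer's regex components.
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"'\.,;!?]$''')
infix_re = re.compile(r'''[-~]''')
token_match = re.compile(r'''^https?://\S+$''').match

def time_component(fn, chunks, reps=5):
    """Best-of-reps wall time for running one component over all chunks."""
    best = float("inf")
    for _ in range(reps):
        start = time.perf_counter()
        for chunk in chunks:
            fn(chunk)
        best = min(best, time.perf_counter() - start)
    return best

chunks = ['"Hello,', 'state-of-the-art', 'http://example.com', 'world!'] * 1000
for name, fn in [("prefix", prefix_re.search),
                 ("suffix", suffix_re.search),
                 ("infix", lambda s: list(infix_re.finditer(s))),
                 ("token_match", token_match)]:
    print(name, time_component(fn, chunks))
```

If one component's cost grows much faster than the others as chunk length increases, that points at the non-linearity.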
I made some more experiments, on the first 5k texts of the 20newsgroups corpus, averaged over 10 iterations. Here's my script, btw: https://gist.github.com/thomasopsomer/5b044f86b9e8f1a327e409631360cc99
| | 2.0 | 1.9 |
| --------------------- |---------:| --------:|
| Avg processing time (s) | 12.28 | 8.56 |
| Avg time per doc (s) | 0.0024 | 0.0017 |
| Avg max time per doc (s) | 0.75 | 0.59 |
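The three metrics in the table can be reproduced with a small harness along these lines (str.split stands in for the spaCy tokenizer here, so the sketch stays self-contained):

```python
import time

def benchmark(tokenize, texts, iterations=10):
    """Return avg total time, avg time per doc, and avg max per-doc time,
    averaged over several iterations, mirroring the table above."""
    totals, maxima = [], []
    for _ in range(iterations):
        per_doc = []
        for text in texts:
            start = time.perf_counter()
            tokenize(text)
            per_doc.append(time.perf_counter() - start)
        totals.append(sum(per_doc))
        maxima.append(max(per_doc))
    return {
        "avg_processing_time": sum(totals) / iterations,
        "avg_time_per_doc": sum(totals) / iterations / len(texts),
        "avg_max_time_per_doc": sum(maxima) / iterations,
    }

texts = ["a short doc", "a somewhat longer document " * 50] * 100
stats = benchmark(str.split, texts)
print(stats)
```

Tracking the max per-doc time alongside the average is what surfaces the volatility on long documents.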
I also set prefix_re, suffix_re and infix_finditer for both versions using the regexes from spaCy 1.9, and removed the exceptions with rules={}. It's not conclusive, but maybe the v2 regexes harm performance a bit...

| | 2.0 | 1.9 |
| --------------------- |---------:| --------:|
| Avg processing time (s) | 10.81 | 7.53 |
| Avg time per doc (s) | 0.0021 | 0.0015 |
| Avg max time per doc (s) | 0.70 | 0.58 |
As suggested, I tested 5. using nlp.tokenizer.vocab.lex_attr_getters. It seems some of the performance leak might be related to caching, since setting lex_attr_getters = {} decreases the time by 2s in v1.9 but by 4s in v2! (see below):

nlp.tokenizer.vocab.lex_attr_getters = {}

| | 2.0 | 1.9 |
| --------------------- |-------:| ---------:|
| Avg processing time (s) | 8.07 | 6.35 |
| Avg time per doc (s) | 0.0016 | 0.0013 |
| Avg max time per doc (s) | 0.49 | 0.47 |
I wanted to test v2 with the change from #1411, but didn't manage to build the develop branch ^^
There was a problem with the cache in the tokenizer. But even with the fix, the v2 tokenizer is still very slow. Working on this.
Is this still a known issue? It seems like the tokenizer is quite slow by default, even when called with pipe(). Should I be adding my own multiprocessing around it?
I did some experiments today to test the performance of the tokenizer only. It looks like spaCy 2.x is still somewhat slower than spaCy 1.x. Also, surprisingly, spaCy 2.x under Python 3.6 is even twice as slow as spaCy 2.x under Python 2.7. @honnibal can you help look into why performance under Python 3.6 is not so good?
py27_spacy1: 4189739 tokens, 404279.399319 WPS
py27_spacy2: 4191479 tokens, 297504.391077 WPS
py36_spacy1: 4189739 tokens, 416148.866741 WPS
py36_spacy2: 4191479 tokens, 149588.291103 WPS
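WPS figures like the ones above come from counting the tokens produced and dividing by elapsed wall time. A minimal version of that measurement (again with str.split standing in for the spaCy tokenizer, so it runs on its own):

```python
import time

def measure_wps(tokenize, texts):
    """Count tokens over the whole corpus and divide by elapsed wall time."""
    n_tokens = 0
    start = time.perf_counter()
    for text in texts:
        n_tokens += len(tokenize(text))
    elapsed = time.perf_counter() - start
    return n_tokens, n_tokens / elapsed

texts = ["one two three four five"] * 10000
n_tokens, wps = measure_wps(str.split, texts)
print("%d tokens, %f WPS" % (n_tokens, wps))
```

Running the same harness under both Python versions and both spaCy versions, with the same corpus, is what makes the four numbers above comparable.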
Environment:
Machine: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz x 8
OS: Ubuntu 16.04.1
Python version: Python 2.7.15 or Python 3.6.8 :: Anaconda
spaCy version: 1.10.1 or 2.0.18
Hi @rulai-huajunzeng, I'm currently looking into improving the compilation of regular expressions in the tokenizer, with a focus on speed. We're definitely aiming to substantially improve upon the WPS stats. Which corpus did you do the above tests on?
Merging this thread with the master thread in #1642!
@svlandeg glad to know that you are working on that. I used a personal corpus which I cannot share. It has more than 300K lines of text, and each line contains one or several sentences.