Hello.
I use Wikidata and the Matcher to parse my documents:
# head -n 10 patterns.py
from spacy.symbols import ORTH, LOWER
def add(matcher):
    matcher.add('31', None, [{LOWER: 'belgium'}],[{LOWER: 'kingdom'},{LOWER: 'of'},{LOWER: 'belgium'}],[{LOWER: 'be'}],[{LOWER: '🇧🇪'}])
    matcher.add('1', None, [{LOWER: 'universe'}],[{LOWER: 'space'}],[{LOWER: 'cosmos'}],[{LOWER: 'outer'},{LOWER: 'space'}],[{LOWER: 'universe'},{LOWER: '(class)'}],[{LOWER: 'universe,'},{LOWER: 'a'}])
    matcher.add('13', None, [{LOWER: 'triskaidekaphobia'}],[{LOWER: 'fear'},{LOWER: 'of'},{LOWER: '13'}],[{LOWER: 'scared'},{LOWER: 'of'},{LOWER: '13'}],[{LOWER: 'fear'},{LOWER: 'of'},{LOWER: '#13'}])
    matcher.add('23', None, [{LOWER: 'george'},{LOWER: 'washington'}],[{LOWER: 'father'},{LOWER: 'of'},{LOWER: 'the'},{LOWER: 'united'},{LOWER: 'states'}],[{LOWER: 'washington'}],[{LOWER: 'president'},{LOWER: 'washington'}])
    matcher.add('35', None, [{LOWER: 'denmark'}],[{LOWER: 'dk'}],[{LOWER: 'danmark'}],[{LOWER: 'dnk'}],[{LOWER: 'dek'}],[{LOWER: 'dk'}],[{LOWER: 'denmark'},{LOWER: 'proper'}],[{LOWER: 'metropolitan'},{LOWER: 'denmark'}],[{LOWER: '🇩🇰'}])
    matcher.add('44', None, [{LOWER: 'beer'}])
    matcher.add('64', None, [{LOWER: 'berlin'}],[{LOWER: 'berlin,'},{LOWER: 'germany'}])
    matcher.add('82', None, [{LOWER: 'printer'}],[{LOWER: 'computer'},{LOWER: 'printer'}])
On small pattern files (~10,000 lines) it works pretty well.
But as you may know, Wikidata has a huge number of entities (currently 26,465,195). So when I imported part of Wikidata (1,500,000 entities) and tried to load them:
import spacy
from spacy.matcher import Matcher
print("loading model")
nlp = spacy.load('en_core_web_sm')
print("loading patterns")
matcher = Matcher(nlp.vocab)
import patterns
patterns.add(matcher)
spaCy used all the RAM and hung the system completely.
So I decided to investigate the situation:
# I take only 100,000 patterns
head -n 100000 patterns.py_ > patterns.py
du -h patterns.py
# 8,7M patterns.py
# And monitor memory usage
watch -n 0.5 -d "free -h"
When I don't import the patterns, the spaCy instance needs about 100 MB.
When I do import the patterns, it takes 300 MB, which I think is too much for an 8.7 MB patterns file.
Here is the patterns file (rename .txt to .py): patterns.txt
Any help appreciated. Thanks
Hey,
The Matcher itself doesn't scale well to lots of patterns. There's the PhraseMatcher class for this, although it's currently missing from the docs, and needs to be updated for spaCy 2. There's an example for it in the examples/ folder, and if you look around on the issue tracker there should be some more discussion.
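Roughly, the usage looks something like this; it's a sketch against the spaCy 2 PhraseMatcher API (add(key, on_match, *docs)), so it may not match the examples/ script exactly:

# Sketch of the PhraseMatcher approach, assuming the spaCy 2 API.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)

# One Doc per surface form; make_doc() only tokenizes, so building a large
# number of patterns stays much cheaper than running the full pipeline.
terms = ['Belgium', 'Kingdom of Belgium', 'George Washington', 'Denmark proper']
matcher.add('WIKIDATA', None, *[nlp.make_doc(term) for term in terms])

doc = nlp('George Washington visited the Kingdom of Belgium.')
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)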
Hello, again.
I reviewed the code of PhraseMatcher. It seems it would work more efficiently (in terms of speed) on documents, but it uses the same Matcher underneath: https://github.com/explosion/spaCy/blob/develop/spacy/matcher.pyx#L438
And the point of this issue is the Matcher's overwhelming memory consumption (200 MB for 9 MB of patterns).
I ran PhraseMatcher on around 20M entities and it was great. The slowest part was the automatic entity merging, but it seems like that doesn't happen automatically anymore.
For my use case it worked thousands of times better than acora in both speed and memory consumption (memory stayed within a couple of GB).
You should definitely give PhraseMatcher a try, since your patterns seem to be simple phrases.
Hello!
Is it possible to use PhraseMatcher to match phrases by LEMMA (or at least LOWER)?
@sadovnychyi Thank you!
I'd hoped it would be possible to avoid digging into the Cython code somehow =)
@sadovnychyi @honnibal
Guys, can you give me some links to development tips & tricks for Cython? I think recompiling everything after a small change isn't the only way.
I found pyximport, but there are only toy examples, which don't clarify how to use it in a huge library.
P.S. I'll fix PhraseMatcher in spaCy 2.0 as a bonus :)
@slavaGanzin The situation with Cython is generally that pyximport won't work out of the box, because some modules need to be compiled with lang=c++. The fix is to use a .pyxbld file, as here: https://stackoverflow.com/questions/26833947/how-can-i-set-cython-compiler-flags-when-using-pyximport
My regular workflow is editing files in vim and running python setup.py build_ext --inplace to compile. This should only compile what actually needs to be compiled; if it's compiling everything every time, something's wrong.
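For reference, such a .pyxbld file (one per module, e.g. matcher.pyxbld next to matcher.pyx) looks roughly like this, following the Stack Overflow answer linked above; it's a sketch, not spaCy's exact build config:

# matcher.pyxbld -- sketch of forcing C++ compilation when the .pyx module
# is loaded through pyximport (per the linked Stack Overflow answer).
def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    return Extension(name=modname,
                     sources=[pyxfilename],
                     language='c++')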
Btw, reportedly it's possible to implement Aho-Corasick in a way that supports more patterns, and if this is done carefully it can actually be much faster than spaCy's PhraseMatcher. A user who works at Grammarly was talking about this on the Gitter chat. You might want to shop around for more Aho-Corasick implementations. If you find a good implementation in C or C++, wrapping it in Cython will likely be quite easy.
@honnibal Thank you, Matthew
@honnibal Btw, reportedly it's possible to implement Aho-Corasick
https://pypi.python.org/pypi/acora/1.8: Acora is ‘fgrep’ for Python, a fast multi-keyword text search engine.
It is based on the Aho-Corasick algorithm and an NFA-to-DFA powerset construction.
@sadovnychyi For my use case it worked thousands times better than acora in both speed and memory consumption (memory was within couple of GB).
Acora comes with both a pure Python implementation and a fast binary module written in Cython
@sadovnychyi Were you using the Cython version? If yes, I think Aho-Corasick wouldn't help in this case.
@slavaGanzin Yes, I realise acora uses Aho-Corasick. But that implementation apparently doesn't scale well to lots of patterns, and it seems possible to do better for use cases with many patterns.
I made a simple benchmark to test acora against spaCy, and the difference is huge.
For 100k phrases, spaCy used 247 MB of memory for the matcher object, while acora consumed 3 GB (!). Acora seems to be faster for the actual matching, but with such memory usage it's not helpful. This is with the Cython version of acora.
https://gist.github.com/sadovnychyi/90aa96a4dbaed71a466e82cc8ebe0a35
UPDATE:
pyahocorasick seems to be the winner, with around 161 MB of memory used. See the updated gist.
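For reference, the pyahocorasick side of the benchmark looks roughly like this (a sketch, not necessarily identical to the gist):

# Sketch of the pyahocorasick usage pattern (may differ from the gist).
# Note that Aho-Corasick matches at the character level, so unlike the
# token-based PhraseMatcher it can match inside words unless you check
# boundaries yourself.
import ahocorasick

phrases = ['kingdom of belgium', 'george washington', 'denmark']
text = 'george washington visited the kingdom of belgium'

automaton = ahocorasick.Automaton()
for idx, phrase in enumerate(phrases):
    automaton.add_word(phrase, (idx, phrase))  # payload returned on match
automaton.make_automaton()

for end_index, (idx, phrase) in automaton.iter(text):
    start_index = end_index - len(phrase) + 1
    print(start_index, end_index, phrase)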
@sadovnychyi Awesome work, thanks!
Updates to PhraseMatcher for spaCy 2: https://github.com/explosion/spaCy/pull/1343
@sadovnychyi If you get a chance, could you try this out and check that it still works for your use-case? Thanks!
Also, great work on the benchmarking!
It seems that most of the efficiency loss in the spaCy version is actually from the tokenizer. I replied with a script to investigate.
Just removing the hardcoded doc.merge(*match) should give a huge performance improvement. I didn't test your PR yet, but it seems like everything is going to work. I think at this point performance shouldn't be a concern anymore; it does much more than pyahocorasick while consuming a similar amount of resources.
How about using LEMMA instead of ORTH, as @aatimofeev mentioned before? It could be just a public attribute on the matcher object with a default value of ORTH, so you could reassign it to something else if needed.
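Something like this, purely as a hypothetical sketch of the idea (matcher.attr is a made-up name, not an existing spaCy attribute):

# Hypothetical sketch of the proposal above; matcher.attr is invented here
# just to illustrate keying PhraseMatcher patterns on LEMMA instead of ORTH.
import spacy
from spacy.attrs import LEMMA
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
matcher.attr = LEMMA  # hypothetical public attribute, defaulting to ORTH
matcher.add('13', None, nlp('fear of thirteen'))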
@sadovnychyi I mostly just wanted some sanity checking on the performance. Like, if we could get better performance by just using pyahocorasick internally, that'd be worth knowing!
I'll have a look at allowing customisation of the attribute to key the patterns. That's a good idea.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.