Hello.
I use Wikidata and the Matcher to parse my documents:
# head -n 10 patterns.py
from spacy.symbols import ORTH, LOWER
def add(matcher):
    matcher.add('31', None, [{LOWER: 'belgium'}],[{LOWER: 'kingdom'},{LOWER: 'of'},{LOWER: 'belgium'}],[{LOWER: 'be'}],[{LOWER: '🇧🇪'}])
    matcher.add('1', None, [{LOWER: 'universe'}],[{LOWER: 'space'}],[{LOWER: 'cosmos'}],[{LOWER: 'outer'},{LOWER: 'space'}],[{LOWER: 'universe'},{LOWER: '(class)'}],[{LOWER: 'universe,'},{LOWER: 'a'}])
    matcher.add('13', None, [{LOWER: 'triskaidekaphobia'}],[{LOWER: 'fear'},{LOWER: 'of'},{LOWER: '13'}],[{LOWER: 'scared'},{LOWER: 'of'},{LOWER: '13'}],[{LOWER: 'fear'},{LOWER: 'of'},{LOWER: '#13'}])
    matcher.add('23', None, [{LOWER: 'george'},{LOWER: 'washington'}],[{LOWER: 'father'},{LOWER: 'of'},{LOWER: 'the'},{LOWER: 'united'},{LOWER: 'states'}],[{LOWER: 'washington'}],[{LOWER: 'president'},{LOWER: 'washington'}])
    matcher.add('35', None, [{LOWER: 'denmark'}],[{LOWER: 'dk'}],[{LOWER: 'danmark'}],[{LOWER: 'dnk'}],[{LOWER: 'dek'}],[{LOWER: 'dk'}],[{LOWER: 'denmark'},{LOWER: 'proper'}],[{LOWER: 'metropolitan'},{LOWER: 'denmark'}],[{LOWER: '🇩🇰'}])
    matcher.add('44', None, [{LOWER: 'beer'}])
    matcher.add('64', None, [{LOWER: 'berlin'}],[{LOWER: 'berlin,'},{LOWER: 'germany'}])
    matcher.add('82', None, [{LOWER: 'printer'}],[{LOWER: 'computer'},{LOWER: 'printer'}])
On small pattern files (~10,000 lines) it works pretty well.
But as you may know, Wikidata has a huge number of entities (currently 26,465,195). So when I imported part of Wikidata (1,500,000 entities) and tried to load them:
import spacy
from spacy.matcher import Matcher
print("loading model")
nlp = spacy.load('en_core_web_sm')
print("loading patterns")
matcher = Matcher(nlp.vocab)
import patterns
patterns.add(matcher)
spaCy used all the RAM and hung the system completely.
So I decided to investigate the situation:
# I take only 100,000 patterns
head -n 100000 patterns.py_ > patterns.py
du -h patterns.py
# 8,7M patterns.py
# And monitor memory usage
watch -n 0.5 -d "free -h"
When I don't import the patterns, the spaCy instance needs about 100 MB.
When I do import the patterns, it takes 300 MB, which I think is too much for an 8.7 MB patterns file.
Here is the patterns file (rename .txt to .py): patterns.txt
Any help appreciated. Thanks
Hey,
The Matcher itself doesn't scale well to lots of patterns. There's the PhraseMatcher class for this, although it's currently missing from the docs, and needs to be updated for spaCy 2. There's an example for it in the examples/ folder, and if you look around on the issue tracker there should be some more discussion.
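Roughly, the usage looks something like this; it's a sketch against the spaCy 2 PhraseMatcher API (add(key, on_match, *docs)), so it may not match the examples/ script exactly:

# Sketch of the PhraseMatcher approach, assuming the spaCy 2 API.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)

# One Doc per surface form; make_doc() only tokenizes, so building a large
# number of patterns stays much cheaper than running the full pipeline.
terms = ['Belgium', 'Kingdom of Belgium', 'George Washington', 'Denmark proper']
matcher.add('WIKIDATA', None, *[nlp.make_doc(term) for term in terms])

doc = nlp('George Washington visited the Kingdom of Belgium.')
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)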
Hello, again.
I reviewed the code of PhraseMatcher. It seems it would work more efficiently (in terms of speed) on documents, but it uses the same Matcher underneath: https://github.com/explosion/spaCy/blob/develop/spacy/matcher.pyx#L438
And the point of this issue is the Matcher's overwhelming memory consumption (200 MB for 9 MB of patterns).
I ran PhraseMatcher on around 20M entities and it was great. The slowest part was the automatic entity merging, but it seems like that doesn't happen automatically anymore.
For my use case it worked thousands of times better than acora in both speed and memory consumption (memory stayed within a couple of GB).
You should definitely give PhraseMatcher a try, since your patterns seem to be simple phrases.
Hello!
Is it possible to use PhraseMatcher to match phrases by LEMMA (or at least LOWER)?
@sadovnychyi Thank you!
I'd hoped it would be possible to avoid digging into the Cython code somehow =)
@sadovnychyi @honnibal
Guys, can you give me some links to development tips & tricks for Cython? I think recompiling everything after a small change isn't the only way.
I found pyximport, but there are only toy examples, which don't clarify how to use it in a huge library.
P.S. I'll fix PhraseMatcher in spaCy 2.0 as a bonus :)
@slavaGanzin The situation with Cython is generally that pyximport won't work out of the box, because some modules need to be compiled with lang=c++. The fix is to use a .pyxbld file, as here: https://stackoverflow.com/questions/26833947/how-can-i-set-cython-compiler-flags-when-using-pyximport
My regular workflow is editing files in vim and running python setup.py build_ext --inplace to compile. This should only compile what actually needs to be compiled; if it's compiling everything every time, something's wrong.
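For reference, such a .pyxbld file (one per module, e.g. matcher.pyxbld next to matcher.pyx) looks roughly like this, following the Stack Overflow answer linked above; it's a sketch, not spaCy's exact build config:

# matcher.pyxbld -- sketch of forcing C++ compilation when the .pyx module
# is loaded through pyximport (per the linked Stack Overflow answer).
def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    return Extension(name=modname,
                     sources=[pyxfilename],
                     language='c++')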
Btw, reportedly it's possible to implement Aho-Corasick in a way that supports more patterns, and if this is done carefully it can actually be much faster than spaCy's PhraseMatcher. A user who works at Grammarly was talking about this on the Gitter chat. You might want to shop around for more Aho-Corasick implementations. If you find a good implementation in C or C++, wrapping it in Cython will likely be quite easy.
@honnibal Thank you, Matthew
@honnibal Btw, reportedly it's possible to implement Aho-Corasick
https://pypi.python.org/pypi/acora/1.8: Acora is ‘fgrep’ for Python, a fast multi-keyword text search engine.
It is based on the Aho-Corasick algorithm and an NFA-to-DFA powerset construction.
@sadovnychyi For my use case it worked thousands times better than acora in both speed and memory consumption (memory was within couple of GB).
Acora comes with both a pure Python implementation and a fast binary module written in Cython
@sadovnychyi Were you using the Cython version? If yes, I think Aho-Corasick wouldn't help in this case.
@slavaGanzin Yes, I realise acora uses Aho-Corasick. But that implementation apparently doesn't scale well to lots of patterns, and it seems possible to do better for use cases with many patterns.
I made a simple benchmark to test acora against spaCy, and the difference is huge.
For 100k phrases, spaCy used 247 MB of memory for the matcher object, while acora consumed 3 GB (!). Acora seems to be faster for the actual matching, but with such memory usage it's not helpful. This is with the Cython version of acora.
https://gist.github.com/sadovnychyi/90aa96a4dbaed71a466e82cc8ebe0a35
UPDATE:
pyahocorasick seems to be the winner, with around 161 MB of memory used. See the updated gist.
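For reference, the pyahocorasick side of the benchmark looks roughly like this (a sketch, not necessarily identical to the gist):

# Sketch of the pyahocorasick usage pattern (may differ from the gist).
# Note that Aho-Corasick matches at the character level, so unlike the
# token-based PhraseMatcher it can match inside words unless you check
# boundaries yourself.
import ahocorasick

phrases = ['kingdom of belgium', 'george washington', 'denmark']
text = 'george washington visited the kingdom of belgium'

automaton = ahocorasick.Automaton()
for idx, phrase in enumerate(phrases):
    automaton.add_word(phrase, (idx, phrase))  # payload returned on match
automaton.make_automaton()

for end_index, (idx, phrase) in automaton.iter(text):
    start_index = end_index - len(phrase) + 1
    print(start_index, end_index, phrase)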
@sadovnychyi Awesome work, thanks!
Updates to PhraseMatcher for spaCy 2: https://github.com/explosion/spaCy/pull/1343
@sadovnychyi If you get a chance, could you try this out and check that it still works for your use-case? Thanks!
Also, great work on the benchmarking!
It seems that most of the efficiency loss in the spaCy version is actually from the tokenizer. I replied with a script to investigate.
Just removing the hardcoded doc.merge(*match) should give a huge performance improvement. I didn't test your PR yet, but it seems like everything is going to work. I think at this point performance shouldn't be a concern anymore; it does much more than pyahocorasick while consuming a similar amount of resources.
How about using LEMMA instead of ORTH, as @aatimofeev mentioned before? It could be just a public attribute on the matcher object with a default value of ORTH, so you could reassign it to something else if needed.
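Something like this, purely as a hypothetical sketch of the idea (matcher.attr is a made-up name, not an existing spaCy attribute):

# Hypothetical sketch of the proposal above; matcher.attr is invented here
# just to illustrate keying PhraseMatcher patterns on LEMMA instead of ORTH.
import spacy
from spacy.attrs import LEMMA
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
matcher.attr = LEMMA  # hypothetical public attribute, defaulting to ORTH
matcher.add('13', None, nlp('fear of thirteen'))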
@sadovnychyi I mostly just wanted some sanity checking on the performance. Like, if we could get better performance by just using pyahocorasick internally, that'd be worth knowing!
I'll have a look at allowing customisation of the attribute to key the patterns. That's a good idea.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.