Hi, I am using Python 3.6.2 with spaCy 1.9.0 and the English en_core_web_sm model, version 1.2.0. Right now, loading it with nlp = spacy.load('en') takes about 1 GB of memory on my machine. Is this expected? I am surprised that a 50 MB model takes 1 GB of memory when loaded. I only need it for sentence segmentation, so I probably only need the tokenizer and the dependency parser. Is there a way to load only the dependency parser and tokenizer into memory? I've tried a few approaches:
from spacy.en import English
nlp = English(tagger=False, entity=False)
This seems like a reasonable way of doing it, yet it still uses more than 900 MB of memory. When I also added parser=False, the memory consumption dropped to 300 MB, but then the dependency parser is no longer loaded.
Is there a way to load the dependency parser and tokenizer separately, so I can do sentence segmentation with a smaller memory footprint?
If you do pip install spacy-nightly, you'll be able to use spaCy 2, which has neural network models that require much less memory (should be under 200 MB).
There's also a new sentence segmenter component that doesn't use the dependency parse. This will reduce memory requirements to under 100 MB:
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline.pyx
Usage:
import spacy
from spacy.pipeline import SentenceSegmenter
nlp = spacy.blank('en') # Loads no statistical models
nlp.pipeline.append(SentenceSegmenter(nlp.vocab))
The default SBD logic at the moment is super basic (just split on ., !, ?). You can replace it by passing a strategy function on initialization, or by assigning a function to nlp.pipeline[-1].strategy.
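For example, a custom strategy is just a function that takes a Doc and yields sentence spans. Here's a minimal sketch assuming the nightly API shown above (split_on_newlines is a hypothetical strategy written for illustration, not something that ships with spaCy):

import spacy
from spacy.pipeline import SentenceSegmenter

def split_on_newlines(doc):
    # Hypothetical strategy: treat each newline-delimited chunk as a sentence
    start = 0
    for i, token in enumerate(doc):
        if token.text == '\n' and i > start:
            yield doc[start:i]
            start = i + 1
    if start < len(doc):
        yield doc[start:len(doc)]

nlp = spacy.blank('en')
nlp.pipeline.append(SentenceSegmenter(nlp.vocab, strategy=split_on_newlines))
doc = nlp('First sentence\nSecond sentence')
print([sent.text for sent in doc.sents])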
If you want to use the dependency parser etc., in spaCy 2 this doesn't require much memory, and accuracy is much better. It's slower though, especially if you parse documents one by one instead of using nlp.pipe(). In my tests it's about 3x slower than the 1.9 model, but other users have reported up to 10x slower. I'm still investigating this. Obviously it'll get faster in future :).
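For reference, streaming documents through nlp.pipe() looks roughly like this (a sketch; the model name and batch_size are just illustrative values):

import spacy

nlp = spacy.load('en')
texts = ['This is one document. It has two sentences.',
         'This is another document.']

# Process texts as a batched stream rather than one Doc at a time
for doc in nlp.pipe(texts, batch_size=1000):
    print([sent.text for sent in doc.sents])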
Would be great to have something like this:
nlp = spacy.load('en', vectors=False)
or
nlp = English(vectors=False)
to disable loading word vectors and lower the memory footprint.
https://spacy.io/docs/usage/language-processing-pipeline
You can set add_vectors=False. This is missing from v2 just now though.
In spaCy 2, models with word vectors now use the vectors as important features of the tagger, parser, entity recognizer and other pipeline components. An add_vectors=False setting would therefore be a misfeature: you would never want to set it, because if you did, you would also have to disable all the pipeline components that rely on the vectors.
Instead, no-vector models from the _sm family can be installed and used by name. These models require very little memory.
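For example, with a released spaCy 2 model (assuming the standard en_core_web_sm package name):

# Install the small English model, which ships without word vectors:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('This is one sentence. This is another one.')
print([sent.text for sent in doc.sents])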