spaCy: How to load only the dependency parser and tokenizer to reduce memory consumption?

Created on 7 Sep 2017 · 5 comments · Source: explosion/spaCy

Hi, I am using Python 3.6.2 with spaCy 1.9.0 and the English en_core_web_sm model, version 1.2.0. Right now, loading with nlp = spacy.load('en') takes 1GB of memory on my machine. Is this expected? I am surprised that a 50MB model takes 1GB of memory when loaded. I only need spaCy for sentence segmentation, so I probably only need the tokenizer and the dependency parser. Is there a way to load just those two components into memory? I've tried a few approaches:

from spacy.en import English
nlp = English(tagger=False, entity=False)

This seems like a reasonable way of doing it, yet it still uses more than 900MB of memory. When I also passed parser=False, memory consumption dropped to 300MB, but then the dependency parser is no longer loaded.

Is there a way to load the dependency parser and tokenizer on their own, so I can do sentence segmentation with a smaller memory footprint?
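For context, the end goal is just this (a minimal sketch of the spaCy 1.x usage I'm after, with made-up sample text):

from spacy.en import English

# Keep the parser (needed for doc.sents), drop the tagger and NER.
nlp = English(tagger=False, entity=False)

doc = nlp('This is one sentence. This is another one.')
for sent in doc.sents:  # sentence boundaries come from the dependency parse
    print(sent.text)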

usage


All 5 comments

If you do pip install spacy-nightly, you'll be able to use spaCy 2, which has neural network models that require much less memory (should be under 200MB).

There's also a new sentence segmenter component that doesn't use the dependency parse. This will reduce memory requirements to <100MB:

https://github.com/explosion/spaCy/blob/develop/spacy/pipeline.pyx

Usage:

import spacy
from spacy.pipeline import SentenceSegmenter

nlp = spacy.blank('en')  # creates a blank pipeline; loads no statistical models
nlp.pipeline.append(SentenceSegmenter(nlp.vocab))  # rule-based sentence boundary detection
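A hedged usage sketch to go with this, assuming the segmenter exposes its boundaries through doc.sents as the component in the linked pipeline.pyx does:

doc = nlp("This is one sentence. Here is another!")
for sent in doc.sents:  # boundaries set by the SentenceSegmenter
    print(sent.text)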

The default SBD logic at the moment is super basic (it just splits on ., !, and ?). You can replace it by passing a strategy function on initialization, or by assigning a function to nlp.pipeline[-1].strategy.
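For example, here is a sketch of a custom strategy, assuming (per the linked pipeline.pyx) that a strategy is a callable that takes a Doc and yields sentence Spans; the newline handling is my own addition:

def split_on_punct_or_newline(doc):
    # Yield a Span whenever we pass sentence-final punctuation or a newline token.
    start = 0
    for i, token in enumerate(doc):
        if token.text in ('.', '!', '?') or token.text.startswith('\n'):
            yield doc[start:i + 1]
            start = i + 1
    if start < len(doc):
        yield doc[start:]  # any trailing tokens form the final sentence

nlp.pipeline[-1].strategy = split_on_punct_or_newline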

If you want to use the dependency parser etc., in spaCy 2 this doesn't require much memory, and accuracy is much better. It's slower, though, especially if you parse documents one by one instead of using nlp.pipe(). In my tests it's about 3x slower than the 1.9 model, but other users have reported up to 10x slower. I'm still investigating this. Obviously it'll get faster in future :).
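For reference, a minimal nlp.pipe() sketch (the texts and batch size here are illustrative):

texts = ["First document.", "Second document.", "Third document."]
for doc in nlp.pipe(texts, batch_size=1000):  # process texts in batches rather than one call each
    print([sent.text for sent in doc.sents])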

Would be great to have something like this:

nlp = spacy.load('en', vectors=False)

or

nlp = English(vectors=False)

To disable loading word vectors and lower the memory footprint.

https://spacy.io/docs/usage/language-processing-pipeline

You can set add_vectors=False. This is missing from v2 just now though.
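If I'm reading the linked docs right, that would look something like this in spaCy 1.x (a hedged, untested sketch; add_vectors as a load-time override is my assumption from the comment above):

import spacy

# Assumed spaCy 1.x override: skip attaching word vectors to the vocab.
nlp = spacy.load('en', add_vectors=False)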

In spaCy 2, models with word vectors now use the vectors as important features of the tagger, parser, entity recognizer and other pipeline components. An add_vectors=False setting would therefore be a misfeature: you would never want to set this, because if you did, you should disable all the pipeline components too.

Instead, the no-vector models from the _sm family can be installed and used by name. These models require very little memory.
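For example:

import spacy

# The _sm models ship without word vectors, so they stay small in memory.
nlp = spacy.load('en_core_web_sm')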

