When using gensim.models.fasttext.FastText, the actual memory usage is much higher (>2x) than predicted by FastText.estimate_memory.
My usage scenario is to generate 300-dimensional word embeddings using SkipGram training with window size 8. My corpus has ~55,000,000 documents with ~4,144,457 word types across ~20,000,000,000 tokens. The machine has 16GB of memory, ~15GB of which is available to the Gensim process, plus 16GB of swap space.
The estimated memory usage is ~11.2GB (see below), which is identical to the size estimated for the Word2Vec model with the same parameters. Training with Word2Vec works flawlessly and uses almost exactly as much memory as estimated.
It seems that FastText does not implement its own estimate_memory method, but inherits it from the Word2Vec class, yielding unreliable values as can be seen below. The critical section where the most memory is used seems to be this part in FastText.init_ngrams:
all_ngrams = []
for w, v in self.wv.vocab.items():
    self.wv.ngrams_word[w] = compute_ngrams(w, self.min_n, self.max_n)
    all_ngrams += self.wv.ngrams_word[w]
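For context on why `all_ngrams` grows so quickly: `compute_ngrams` extracts every character n-gram of a word after wrapping it in fastText-style `<`/`>` boundary markers. A rough sketch of its behavior (my reading of the gensim 3.x semantics, using fastText's default `min_n=3`, `max_n=6`):

```python
def compute_ngrams(word, min_n, max_n):
    # Sketch of gensim's compute_ngrams: wrap the word in fastText-style
    # boundary markers, then emit every substring of length min_n..max_n.
    extended = '<' + word + '>'
    ngrams = []
    for n in range(min_n, min(len(extended), max_n) + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

# "memory" -> "<memory>" (8 chars) yields 6 + 5 + 4 + 3 = 18 ngrams
print(len(compute_ngrams('memory', 3, 6)))  # 18
```

With ~4.1 million word types, `all_ngrams` ends up holding one Python string object per (word, ngram) occurrence, easily hundreds of millions of objects, which `estimate_memory` does not account for at all.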
from gensim.models import fasttext
model = fasttext.FastText(size=300, sg=1, window=8, min_count=50, workers=8, iter=5)
# Word frequencies loaded from a finite state transducer on disk, i.e. no memory usage
freqs = load_frequencies()
vocab_size = sum(1 for typ, cnt in freqs.items() if cnt >= 50)
model.estimate_memory(vocab_size=vocab_size, report=True)
# { 'syn0': 4973348400,
# 'syn1neg': 4973348400,
# 'vocab': 2072228500,
# 'total': 12018925300 }
# I.e. ~11.2GB, well within the available memory
model.build_vocab_from_freq(freqs, corpus_count=54878750)
# Memory usage is at ~7GB now, identical to Word2Vec
model.init_ngrams()
# ... Killed by OOM killer after swap space has run out
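For reference, the reported estimate can be reproduced by hand. It is exactly the Word2Vec accounting: two float32 matrices plus a rough per-word vocab overhead (the 500 bytes per word is, as far as I can tell, the heuristic constant Word2Vec uses), with no term whatsoever for the ngram vectors or the ngram bookkeeping above:

```python
vocab_size = 4144457       # word types with count >= 50, from the report above
vector_size = 300
syn0 = vocab_size * vector_size * 4      # float32 input vectors
syn1neg = vocab_size * vector_size * 4   # float32 negative-sampling weights
vocab = vocab_size * 500                 # rough per-word dictionary overhead
total = syn0 + syn1neg + vocab
print(syn0, syn1neg, vocab, total)
# 4973348400 4973348400 2072228500 12018925300
```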
Expected result: training finishes without running out of memory.
Actual result: runs out of memory.
Linux-4.10.0-28-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.13.3
SciPy 1.0.0
gensim 3.2.0
FAST_VERSION (fasttext) 1
FAST_VERSION (word2vec) 1
Thanks for the report @jbaiter!
The problem happens because this method isn't overridden in the FastText subclass (but it should be). @manneshiva can you fix this (looks simple)?
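To make the needed override concrete, here is a minimal sketch (not gensim's actual implementation) of a FastText-aware estimate: it keeps the Word2Vec terms and adds the ngram bucket matrix that FastText also allocates. The 500-bytes-per-word constant and the default bucket count are assumptions taken from the numbers in this thread and fastText's defaults:

```python
def estimate_memory_fasttext(vocab_size, vector_size=300, buckets=2000000):
    # Sketch only: extend the Word2Vec-style estimate with the hashed
    # ngram vector matrix that FastText additionally allocates.
    report = {
        'syn0': vocab_size * vector_size * 4,      # float32 word vectors
        'syn1neg': vocab_size * vector_size * 4,   # negative-sampling weights
        'vocab': vocab_size * 500,                 # rough per-word overhead
        'syn0_ngrams': buckets * vector_size * 4,  # hashed ngram vectors
    }
    report['total'] = sum(report.values())
    return report

report = estimate_memory_fasttext(4144457)
```

Even this undercounts the transient Python-object overhead of building per-word ngram lists, but it at least surfaces the extra ~2.4GB bucket matrix.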
I attempted to implement it here: https://github.com/jbaiter/gensim/commit/4a3bbcaeb5652c59aa3a0da666dc2053757641a0
However, I think that implementing the method is only one step. There's a lot of opportunity to reduce the memory overhead of the current implementation:
all_ngrams in the above code snippet uses a lot of memory, but is such a huge temporary data structure really necessary? I started some naive performance optimizations in a branch, but I don't think I have a complete enough picture of the implementation yet to be confident in those changes.
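To illustrate the point: since the big list is (as far as I can tell) only used to derive the set of distinct ngrams, a streaming set keeps memory proportional to the number of distinct ngrams rather than to the total number of (word, ngram) occurrences. A sketch of the idea, not the actual gensim change:

```python
def compute_ngrams(word, min_n, max_n):
    # character n-grams with fastText-style boundary markers
    extended = '<' + word + '>'
    return [extended[i:i + n]
            for n in range(min_n, min(len(extended), max_n) + 1)
            for i in range(len(extended) - n + 1)]

def distinct_ngrams(vocab_words, min_n=3, max_n=6):
    # Streaming replacement for the all_ngrams list: keep each distinct
    # ngram once instead of accumulating every (word, ngram) occurrence.
    seen = set()
    for word in vocab_words:
        seen.update(compute_ngrams(word, min_n, max_n))
    return seen
```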
@jbaiter I like both your proposals. all_ngrams is not really needed and can indeed be a huge temporary memory overhead. I also agree with discarding word->[ngrams] mappings and calculating the ngrams on-the-fly. Considering that you have Cythonised the compute_ngrams function, this shouldn't have a drastic effect on the performance of the model in terms of time. I have gone through your code and it looks good to me except for a couple of minor issues:
1. You might have missed syn0_vocab and ngrams in estimate_memory (also in the unittest).
2. Not sure where you would be using compute_num_ngrams.

@jayantj any comments?
You might have missed syn0_vocab and ngrams in estimate_memory (also in unittest).
Thank you, I'll try to put together a working pull request over the weekend :-)
Not sure where you would be using compute_num_ngrams.
That function is indeed no longer used; I wrote it for an earlier version of the memory estimation.
Not sure if this is directly related, but I am trying to load a pre-trained fastText model (3.9G .bin, 1.6G .vec) on a Google Colab VM (12 GB memory limit) and the model eats over 11 GB of RAM (a warning is displayed), after which I can do nothing with it (any call to the model kills the runtime). Using Gensim 3.3.
model = FastText.load_fasttext_format('wiki.ar')
Question: how much memory (roughly) would it take to load that model?
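A back-of-the-envelope answer for the raw vector matrices, assuming the pretrained wiki models' usual settings of 2 million ngram buckets and 300-dimensional float32 vectors (the vocab size below is a placeholder assumption, not the real wiki.ar count; check len(model.wv.vocab) after loading):

```python
dim = 300
bytes_per_float = 4        # float32
buckets = 2000000          # fastText default used for the pretrained wiki models
vocab_size = 600000        # placeholder assumption, not the real wiki.ar count
ngram_matrix = buckets * dim * bytes_per_float     # ~2.4 GB
vocab_matrix = vocab_size * dim * bytes_per_float  # ~0.7 GB
print((ngram_matrix + vocab_matrix) / 1024**3)     # roughly 2.9 GiB of matrices
```

The matrices are only part of the story, though: on the evidence in this thread, gensim 3.3 also builds per-word ngram lists and temporary copies while loading, which can push actual usage to several times the matrix size.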
@menshikh-iv @jbaiter @manneshiva what's the status here? It looks like a rather critical feature/bug.
@piskvorky I got stuck implementing my optimizations, since I became unsure about the bucketing mechanism used for the ngrams. Specifically, as I understand it, with bucketing every ngram should have an embedding in some bucket, even if that ngram never actually occurred in the original corpus.
It would be great if someone more familiar with the code base could look over my changes and offer some guidance/critiques. Should I open a PR for that, even if the code as is currently does not pass the tests?
See my changes here:
https://github.com/RaRe-Technologies/gensim/compare/develop...jbaiter:fasttext-optimization
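On the bucketing question: every ngram is hashed into one of the `bucket` slots, so an unseen ngram still maps to some (shared, possibly colliding) vector; no explicit table of "ngrams that occurred" is strictly required for lookups. A sketch of the FNV-1a-style hash the fastText reference implementation uses (glossing over its signed-char cast, which only matters for non-ASCII bytes):

```python
def ft_hash(ngram):
    # FNV-1a-style hash, as in the fastText reference implementation
    # (modulo signed-char details that only affect non-ASCII bytes).
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

num_buckets = 2000000
bucket = ft_hash('<mem') % num_buckets  # row index into the shared ngram matrix
```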
@jbaiter of course, feel free to open PR, we'll help you (with tests too)
CC: @manneshiva
I submitted my WIP PR here: https://github.com/RaRe-Technologies/gensim/pull/1916
I tried the case above on Gensim 3.4 and it worked. Great work, thank you all.