When using gensim.models.fasttext.FastText, the actual memory usage is much higher (>2x) than predicted by FastText.estimate_memory.
My usage scenario is to generate 300-dimensional word embeddings using SkipGram training with window size 8. My corpus has ~55,000,000 documents with ~4,144,457 word types across ~20,000,000,000 tokens. The machine has 16GB of memory, ~15GB of which is available to the Gensim process, plus 16GB of swap space.
The estimated memory usage is ~11.2GB (see below), which is identical to the size estimated for the Word2Vec model with the same parameters. Training with Word2Vec works flawlessly and uses almost exactly as much memory as estimated.
It seems that FastText does not implement its own estimate_memory method, but inherits it from the Word2Vec class, yielding unreliable values as can be seen below. The critical section where the most memory is used seems to be this part in FastText.init_ngrams:
all_ngrams = []
for w, v in self.wv.vocab.items():
    self.wv.ngrams_word[w] = compute_ngrams(w, self.min_n, self.max_n)
    all_ngrams += self.wv.ngrams_word[w]
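For context on why `all_ngrams` grows so quickly: `compute_ngrams` extracts every character n-gram of a word after wrapping it in fastText-style `<`/`>` boundary markers. A rough sketch of its behavior (my reading of the gensim 3.x semantics, using fastText's default `min_n=3`, `max_n=6`):

```python
def compute_ngrams(word, min_n, max_n):
    # Sketch of gensim's compute_ngrams: wrap the word in fastText-style
    # boundary markers, then emit every substring of length min_n..max_n.
    extended = '<' + word + '>'
    ngrams = []
    for n in range(min_n, min(len(extended), max_n) + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

# "memory" -> "<memory>" (8 chars) yields 6 + 5 + 4 + 3 = 18 ngrams
print(len(compute_ngrams('memory', 3, 6)))  # 18
```

With ~4.1 million word types, `all_ngrams` ends up holding one Python string object per (word, ngram) occurrence, easily hundreds of millions of objects, which `estimate_memory` does not account for at all.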
from gensim.models import fasttext
model = fasttext.FastText(size=300, sg=1, window=8, min_count=50, workers=8, iter=5)
# Word frequencies loaded from a finite state transducer on disk, i.e. no memory usage
freqs = load_frequencies()
vocab_size = sum(1 for typ, cnt in freqs.items() if cnt >= 50)
model.estimate_memory(vocab_size=vocab_size, report=True)
# { 'syn0': 4973348400,
# 'syn1neg': 4973348400,
# 'vocab': 2072228500,
# 'total': 12018925300 }
# I.e. ~11.2GB, well within the available memory
model.build_vocab_from_freq(freqs, corpus_count=54878750)
# Memory usage is at ~7GB now, identical to Word2Vec
model.init_ngrams()
# ... Killed by OOM killer after swap space has run out
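For reference, the reported estimate can be reproduced by hand. It is exactly the Word2Vec accounting: two float32 matrices plus a rough per-word vocab overhead (the 500 bytes per word is, as far as I can tell, the heuristic constant Word2Vec uses), with no term whatsoever for the ngram vectors or the ngram bookkeeping above:

```python
vocab_size = 4144457       # word types with count >= 50, from the report above
vector_size = 300
syn0 = vocab_size * vector_size * 4      # float32 input vectors
syn1neg = vocab_size * vector_size * 4   # float32 negative-sampling weights
vocab = vocab_size * 500                 # rough per-word dictionary overhead
total = syn0 + syn1neg + vocab
print(syn0, syn1neg, vocab, total)
# 4973348400 4973348400 2072228500 12018925300
```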
Expected result: training finishes without running out of memory.
Actual result: runs out of memory.
Linux-4.10.0-28-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.13.3
SciPy 1.0.0
gensim 3.2.0
FAST_VERSION (fasttext) 1
FAST_VERSION (word2vec) 1
Thanks for the report @jbaiter!
The problem happens because this method isn't overridden in the FastText subclass (but it should be). @manneshiva can you fix this (looks simple)?
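To make the needed override concrete, here is a minimal sketch (not gensim's actual implementation) of a FastText-aware estimate: it keeps the Word2Vec terms and adds the ngram bucket matrix that FastText also allocates. The 500-bytes-per-word constant and the default bucket count are assumptions taken from the numbers in this thread and fastText's defaults:

```python
def estimate_memory_fasttext(vocab_size, vector_size=300, buckets=2000000):
    # Sketch only: extend the Word2Vec-style estimate with the hashed
    # ngram vector matrix that FastText additionally allocates.
    report = {
        'syn0': vocab_size * vector_size * 4,      # float32 word vectors
        'syn1neg': vocab_size * vector_size * 4,   # negative-sampling weights
        'vocab': vocab_size * 500,                 # rough per-word overhead
        'syn0_ngrams': buckets * vector_size * 4,  # hashed ngram vectors
    }
    report['total'] = sum(report.values())
    return report

report = estimate_memory_fasttext(4144457)
```

Even this undercounts the transient Python-object overhead of building per-word ngram lists, but it at least surfaces the extra ~2.4GB bucket matrix.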
I attempted to implement it here: https://github.com/jbaiter/gensim/commit/4a3bbcaeb5652c59aa3a0da666dc2053757641a0
However, I think that implementing the method is only one step. There's a lot of opportunity to reduce the memory overhead of the current implementation:
all_ngrams in the above code snippet uses a lot of memory, but is such a huge temporary data structure really necessary? I started some naive performance optimizations in a branch, but I don't think I have a complete enough picture of the implementation yet to be confident in those changes.
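To illustrate the point: since the big list is (as far as I can tell) only used to derive the set of distinct ngrams, a streaming set keeps memory proportional to the number of distinct ngrams rather than to the total number of (word, ngram) occurrences. A sketch of the idea, not the actual gensim change:

```python
def compute_ngrams(word, min_n, max_n):
    # character n-grams with fastText-style boundary markers
    extended = '<' + word + '>'
    return [extended[i:i + n]
            for n in range(min_n, min(len(extended), max_n) + 1)
            for i in range(len(extended) - n + 1)]

def distinct_ngrams(vocab_words, min_n=3, max_n=6):
    # Streaming replacement for the all_ngrams list: keep each distinct
    # ngram once instead of accumulating every (word, ngram) occurrence.
    seen = set()
    for word in vocab_words:
        seen.update(compute_ngrams(word, min_n, max_n))
    return seen
```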
@jbaiter I like both your proposals. all_ngrams is not really needed and can indeed be a huge temporary memory overhead. I also agree with discarding word->[ngrams] mappings and calculating the ngrams on-the-fly. Considering that you have Cythonised the compute_ngrams function, this shouldn't have a drastic effect on the performance of the model in terms of time. I have gone through your code and it looks good to me except for a couple of minor issues:
1. You might have missed syn0_vocab and ngrams in estimate_memory (also in the unittest).
2. Not sure where you would be using compute_num_ngrams.

@jayantj any comments?
You might have missed syn0_vocab and ngrams in estimate_memory (also in unittest).
Thank you, I'll try to put together a working pull request over the weekend :-)
Not sure where you would be using compute_num_ngrams.
That function is indeed no longer used; I wrote it for an earlier version of the memory estimation.
Not sure if this is directly related, but I am trying to load a pre-trained fastText model (3.9G .bin, 1.6G .vec) on a Google Colab VM (12 GB memory limit) and the model eats over 11 GB of RAM (a warning is displayed), after which I can do nothing with it (any call to the model kills the runtime). Using Gensim 3.3.
model = FastText.load_fasttext_format('wiki.ar')
Question: how much memory (roughly) would it take to load that model?
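A back-of-the-envelope answer for the raw vector matrices, assuming the pretrained wiki models' usual settings of 2 million ngram buckets and 300-dimensional float32 vectors (the vocab size below is a placeholder assumption, not the real wiki.ar count; check len(model.wv.vocab) after loading):

```python
dim = 300
bytes_per_float = 4        # float32
buckets = 2000000          # fastText default used for the pretrained wiki models
vocab_size = 600000        # placeholder assumption, not the real wiki.ar count
ngram_matrix = buckets * dim * bytes_per_float     # ~2.4 GB
vocab_matrix = vocab_size * dim * bytes_per_float  # ~0.7 GB
print((ngram_matrix + vocab_matrix) / 1024**3)     # roughly 2.9 GiB of matrices
```

The matrices are only part of the story, though: on the evidence in this thread, gensim 3.3 also builds per-word ngram lists and temporary copies while loading, which can push actual usage to several times the matrix size.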
@menshikh-iv @jbaiter @manneshiva what's the status here? It looks like a rather critical feature/bug.
@piskvorky I got stuck implementing my optimizations, since I became unsure about the bucketing mechanism used for the ngrams. Specifically, as I understand it, with bucketing every ngram should have an embedding in some bucket, even if that ngram never actually occurred in the original corpus.
It would be great if someone more familiar with the code base could look over my changes and offer some guidance/critiques. Should I open a PR for that, even if the code as is currently does not pass the tests?
See my changes here:
https://github.com/RaRe-Technologies/gensim/compare/develop...jbaiter:fasttext-optimization
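On the bucketing question: every ngram is hashed into one of the `bucket` slots, so an unseen ngram still maps to some (shared, possibly colliding) vector; no explicit table of "ngrams that occurred" is strictly required for lookups. A sketch of the FNV-1a-style hash the fastText reference implementation uses (glossing over its signed-char cast, which only matters for non-ASCII bytes):

```python
def ft_hash(ngram):
    # FNV-1a-style hash, as in the fastText reference implementation
    # (modulo signed-char details that only affect non-ASCII bytes).
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

num_buckets = 2000000
bucket = ft_hash('<mem') % num_buckets  # row index into the shared ngram matrix
```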
@jbaiter of course, feel free to open PR, we'll help you (with tests too)
CC: @manneshiva
I submitted my WIP PR here: https://github.com/RaRe-Technologies/gensim/pull/1916
I tried the case above on Gensim 3.4 and it worked. Great work, thank you all.