Consider reading just the .bin file, as suggested in https://github.com/RaRe-Technologies/gensim/issues/814#issuecomment-289464725
Compare to C++ code in https://github.com/salestock/fastText.py/blob/77bdf69ee97a7446e314b342b129c5d46e9e4e29/fasttext/fasttext.pyx#L143
@tmylk I would like to work on this
@prakhar2b that would be great
My thoughts about loading the FastText model using only the .bin file -
The .bin file contains all information about the model. The .vec file was used mainly to be able to reuse code from KeyedVectors; switching to use only the .bin file would require some architectural changes. Since loading the .vec file is a significant bottleneck, it would make sense to load using the .bin file only.

I was looking to use fastText in Python, and the gensim wrapper seems to be considerably slower than salestock/fasttext.py. The gensim wrapper takes around 35 seconds to load a moderately large (~200K vocab) pre-trained model (.bin: 1 GB, .vec: 0.2 GB), while fasttext.py takes just over a second to load the same model. Posting the cProfile stats for both loads below:
Gensim Wrapper

Fasttext.py

As seen from the first figure, loading the .bin file alone (by reading it in Python) would not bring the load time anywhere close to the salestock/fasttext.py load time. Achieving comparable load times would require reading in C/C++ (like salestock).
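For reference, the fixed-size header of a .bin file is simple enough to read with the struct module alone. This is only a sketch based on my reading of fastText's Args serialization: the field names and order are my assumption, not gensim's loader, and newer .bin files additionally start with a magic number and version int32 that this sketch ignores.

```python
import struct
from io import BytesIO

# Assumed field order: twelve little-endian int32 hyperparameters,
# followed by one float64 sampling threshold.
ARG_FIELDS = (
    "dim", "ws", "epoch", "min_count", "neg", "word_ngrams",
    "loss", "model", "bucket", "minn", "maxn", "lr_update_rate",
)

def read_model_args(fin):
    """Read the fixed-size hyperparameter block from an open .bin stream."""
    args = {name: struct.unpack("<i", fin.read(4))[0] for name in ARG_FIELDS}
    args["t"], = struct.unpack("<d", fin.read(8))
    return args

# Demo on a synthetic header (no real model file needed):
header = struct.pack("<12i", 300, 5, 5, 5, 5, 1, 1, 1, 2000000, 3, 6, 100)
header += struct.pack("<d", 1e-4)
args = read_model_args(BytesIO(header))
print(args["dim"], args["bucket"])
```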
Hi @manneshiva , thanks for the profiling report.
It could be possible to reduce the time required for loading the .bin file further by optimizing the bottlenecks (we could rewrite them in Cython if required), but we should definitely get rid of the call to load_word2vec_format.
Does the profiling report also include what exactly is slow inside the load_vectors function? That would be helpful in determining what the next steps should be.
@jayantj
Here is the profiling report on load_vectors.

The main bottleneck seems to be the ft_hash function. We could also look at ways to optimize compute_ngrams (~3.2 s, compared to 11 s taken by ft_hash). Would getting rid of load_word2vec_format and Cythonizing ft_hash and compute_ngrams be sufficient? I am interested in working on this issue.
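For context on what these two hot functions compute, here is a plain-Python sketch (not gensim's actual implementation; names are mine). ft_hash is 32-bit FNV-1a over the word's UTF-8 bytes, with fastText's quirk of treating each byte as a signed char before widening, and compute_ngrams enumerates all character ngrams of the word wrapped in angle brackets:

```python
def ft_hash(bytez):
    """32-bit FNV-1a, with fastText's quirk of signing each byte first."""
    h = 2166136261
    for b in bytez:
        b_signed = b - 256 if b >= 128 else b       # emulate C's (int8_t) cast
        h = (h ^ (b_signed & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def compute_ngrams(word, min_n=3, max_n=6):
    """All character ngrams of the word wrapped in '<' and '>'."""
    extended = f"<{word}>"
    return [
        extended[i:i + n]
        for n in range(min_n, min(len(extended), max_n) + 1)
        for i in range(len(extended) - n + 1)
    ]

print(ft_hash("cat".encode("utf-8")) % 2_000_000)  # bucket index of a full word
print(compute_ngrams("cat"))
```

Every one of these hash calls happens once per ngram per vocab word at load time, which is why they dominate the profile.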
@manneshiva Your help will be very appreciated here as it's a very needed feature. Cython rather than C/C++ is preferred. Happy to mentor it via our incubator programme. CC @menshikh-iv
Optimizations of init_ngrams & compute_ngrams & ft_hash short of cythonization might also yield a big speedup. (Avoid the seterr toggling every ngram; avoid reconstructing the constants; maybe finding overflow-oblivious approach; maybe inlining ft_hash; restructuring the loops to avoid the += appends and set(); maybe even reusing available prefix ngram hashes; etc.)
(Separately: the comment for init_ngram() mentioning vectors being discarded to 'save space', and the building of the ngrams-to-indexes dict, seems a little fishy to me. Does the native fasttext code do that too? I would think the usual case is that very few of the hashed-index vectors are completely unused – or else the chosen bucket was too large – and the space overhead of the mapping dict could be significant.)
@gojomo Those ideas sound good to me.
About the unused vectors being discarded - the native FastText code doesn't do that. The default value for bucket size is quite large (2 mil) and according to the original comment from the PR for the fastText wrapper - https://github.com/RaRe-Technologies/gensim/pull/847#issuecomment-269330468 - 1.6mil of them are unused even with a mid-sized corpus (text8).
Some more detailed memory profiling could be useful here - to determine the exact gains from dropping the unused vectors, and the space overhead of the mapping dict.
IMO text8 is a tiny corpus - it'd be most relevant to see the bucket-load in actual vector sets, like those FB pretrained & released.
Also, even though when we do local training, we may know for sure that an ngram's vector slot has never been trained, it's theoretically plausible that in a file-being read, even slots for ngrams that never appear in the declared vocabulary might have been trained. (Maybe, the vocab-for-distrib was trimmed after training on a larger set. Not sure any implementation does this, but if modeling out-of-vocab words was my project priority, I'd consider such a strategy, so that even very-obscure words/ngrams get at least a little coverage. An interesting thing to check could be: in FB's pretrained vectors, do any slots which seem to have no ngrams mapped to them ever exceed the magnitude of only-ever-randomly-initialized vectors?)
Definitely agree best course is to evaluate the optimization with actual memory profiling/sizing-analysis.
I agree, that would be an interesting experiment to do (checking if vectors with no ngrams mapped are within the max-range of randomly initialized vectors).
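That experiment could be sketched roughly as below, here on synthetic data (all names and the init scale are assumptions; with a real model, the matrix and the set of used slots would come from the loaded .bin). The idea: a slot whose vector norm exceeds what random initialization alone can produce has probably been trained.

```python
import numpy as np

rng = np.random.default_rng(0)
bucket, dim = 10000, 100

# Hypothetical stand-ins: with a real model these would be the loaded ngram
# matrix and the set of hash slots that in-vocab ngrams actually map to.
# The init scale (uniform in +/- 1/dim) is an assumption for illustration.
syn0_ngrams = rng.uniform(-1.0 / dim, 1.0 / dim, (bucket, dim))
used = rng.choice(bucket, size=3000, replace=False)
syn0_ngrams[used] += rng.normal(0, 0.5, (len(used), dim))  # pretend: trained

unused = np.setdiff1d(np.arange(bucket), used)
# No purely random vector can exceed this norm under the init above:
max_random_norm = np.sqrt(dim) / dim
suspicious = np.linalg.norm(syn0_ngrams[unused], axis=1) > max_random_norm
print(f"{suspicious.sum()} of {len(unused)} unmapped slots look trained")
```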
This discussion about keeping/discarding vectors also directly affects the work on improving loading times - in case the impact (memory-wise) of retaining all vectors is not significant, and we decide to keep all vectors, the loading time automatically improves tremendously, since the majority of time spent right now is in computing hashes for all ngrams at load time. Any optimization with ft_hash/compute_ngrams wouldn't be as worthwhile if we do decide to keep all vectors.
In that case, I'd recommend that the next steps be memory profiling for different vocab/bucket sizes to determine whether keeping all vectors is a good idea.
@gojomo @jayantj
The table below summarizes the memory profiling results on the pre-trained fastText English vectors with the default bucket size (2 million), Text9 (default bucket size), and Text9 (retrained with bucket size 1 million):

We have 2 options to solve this issue: either discard unused ngram vectors at load time (the current behavior), or keep all of them and optionally re-train (no option for this in the present code). It looks like the second option is better, provided the concerns mentioned aren't too big a problem. Would like to hear your thoughts on this.
@manneshiva Thanks for the analysis. How exactly has the "Memory saved by only selecting ngrams of vocab words" been calculated?
I'm curious because the difference in the number of unused vectors for wiki.simple and Text9 (with bucket size 2mil) seems low (1.4mil vs 1.16mil), but the difference in memory saved is very high (1100 MB vs 440 MB).
Maybe initially we could load all ngrams and not trim any of them (will speed up load time), and add an optional method to drop the unused ngrams for someone who is looking to save memory. (somewhat similarly to init_sims(replace=True) in word2vec)
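The opt-in method could look something like this sketch (the function name, the hash2index return value, and the standalone signature are all hypothetical, for illustration only; gensim's real attributes and API may differ):

```python
import numpy as np

def trim_unused_ngram_vectors(vectors_ngrams, used_indices):
    """Drop ngram rows no vocab word maps to; return (matrix, old->new map).

    Hypothetical helper sketching the proposed opt-in trim.
    """
    used = sorted(set(int(i) for i in used_indices))
    hash2index = {old: new for new, old in enumerate(used)}
    return vectors_ngrams.take(used, axis=0), hash2index

# Toy demo: 6 buckets, only rows 1 and 4 are referenced by vocab ngrams.
vectors = np.arange(12.0).reshape(6, 2)
trimmed, mapping = trim_unused_ngram_vectors(vectors, [4, 1, 1])
```

Like init_sims(replace=True), this would be destructive and opt-in; a freshly loaded model would keep all buckets by default.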
@gojomo what do you think?
@jayantj
How exactly has the "Memory saved by only selecting ngrams of vocab words" been calculated?
I looked at the memory taken by self.wv.syn0_all before and after reassigning it with self.wv.syn0_all.take(ngram_indices, axis=0).
I'm curious because the difference in the number of of unused vectors for wiki.simple and Text9 (with bucket size 2mil) seems low (1.4mil vs 1.16mil), but the difference in memory saved is very high (1100 MB vs 440 MB).
This is because of the difference in embedding size. The words are 300 dimensional in case of wiki.simple while they are 100 dimensional for Text9(should have mentioned this, apologies for the confusion).
Text9 is still pretty small, and 100-dimensions (despite being the default) may not be a representative size for serious users. The effects on the pre-trained FB FastText vectors (of Wikipedia in many languages) may be more representative of what real users will experience.
(Are you sure the memory accounting is counting the cost of the dictionary mapping ngrams to remembered, rather than hash-discovered, slots? It's probably only a few tens-of-MB but not sure where it'd appear in the current analysis.)
I'm not sure saving all ngrams will slow loading time - doesn't the load code right now do more to precalculate the ngrams & do extra discarding?
It wouldn't be too hard to add bucket-configurability, or logging/reporting of an (approximate) count of unique ngrams, to help choose optimal bucket size. But also, these savings don't yet seem so big, for serious users.
Hi all,
Can someone please summarize the current state of this? I posted an issue with regards to the bucket in this code and I think it is directly related to your discussion here about optimizing the code.
The tradeoff here is speed v/s memory.
For relatively small vocab sizes (~200k), the steady-state memory usage is 1.1 GB lower than it would be if we chose to keep all ngram vectors as is. (for 300-d vectors). This is at the cost of significantly increased loading time.
Conversely, for large vocab sizes (like for Wikipedia models), we don't reduce memory usage, while also causing much higher load times. (as @gojomo rightly pointed out)
In case the common use case is indeed loading large models, it might make sense to store ngram vectors as is, without trying to discard any unused ones.
@piskvorky @menshikh-iv @manneshiva what do you think?
I feel we should give the user an option -- discard_unused_ngrams to save memory, which by default could be False. Since the memory saved for small vocab models is significant (owing to the smaller number of used ngrams), this should be helpful for a user trying to load a small vocab model with limited RAM.
The only concern here, as pointed out by @gojomo in this comment: can we confirm that the ngrams of the final vocab are the only ngrams trained? We shouldn't be discarding trained ngrams that do not appear among the ngrams of in-vocab words. It is highly unlikely that this is the case, though. @jayantj comments?
Hi,
I'm just taking a look at this myself also, just posting some metrics here.
Loading a 1.2 GB .bin file using the Python wrapper in the original repository takes 2.078 seconds, compared with gensim's load_fasttext_format taking 68.171 seconds.
The vast majority of time is in init_ngrams_post_load.
```
64526433 function calls (64513338 primitive calls) in 68.171 seconds

Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   45.658   45.658   62.476   62.476 fasttext.py:857(init_ngrams_post_load)
 16220806    7.847    0.000    7.847    0.000 {built-in method numpy.core.multiarray.array}
 31281478    4.118    0.000    4.118    0.000 {built-in method gensim.models._utils_any2vec.ft_hash}
  1159878    3.742    0.000    3.742    0.000 {built-in method gensim.models._utils_any2vec.compute_ngrams}
        1    2.429    2.429    4.705    4.705 fasttext.py:618(_load_dict)
  6100747    0.892    0.000    0.892    0.000 {method 'read' of '_io.BufferedReader' objects}
        1    0.588    0.588    0.588    0.588 {method 'take' of 'numpy.ndarray' objects}
   579939    0.494    0.000    0.616    0.000 keyedvectors.py:97(__init__)
        1    0.411    0.411    0.411    0.411 {built-in method numpy.core.multiarray.fromfile}
3713959/3713525    0.401    0.000    0.401    0.000 {built-in method builtins.len}
```
(Gensim Version: 3.4.0)
Thanks @DomHudson, I checked myself and confirm that init_ngrams_post_load takes most of the time (~90%).
Note: the original Python wrapper re-uses the C++ function for loading: https://github.com/facebookresearch/fastText/blob/master/python/fastText/pybind/fasttext_pybind.cc#L142
Profiling:
- python2.7
- gensim==3.5.0
- wiki.en.bin from https://fasttext.cc/docs/en/pretrained-vectors.html

Content of t.py:
```
from gensim.models import FastText
import logging

logging.basicConfig(level=logging.INFO)

m = FastText.load_fasttext_format("wiki.en.bin")
```
Run it as `python -m cProfile -o p.prof t.py`.
Result attached; init_ngrams_post_load definitely consumes much of the time, and this should be improved.

New measurement:
- python3.6
- gensim==3.7.0
- logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(filename)s:%(lineno)s - %(message)s')

```
m = FastText.load_fasttext_format("cc.ru.300.bin")
```
Run as `python -m cProfile -o lff.prof <fname.py>`.

So, the slowest method is adjust_vectors (because of _ft_hash):
https://github.com/RaRe-Technologies/gensim/blob/a864e0247dda421d8e4de280fa91f86f474a5691/gensim/models/keyedvectors.py#L2230-L2249

Notes:
```
n_calls: 15474957
cumtime: 645.1 sec
percall: 4.169e-05 sec
```
We need to speed it up at least 10x to get a 1-minute load time; this will be hard, I guess.
Ideas:
- full_model=False, controlling how many matrices we load: 1 or 2. Reason: FB models are really large, and if the user doesn't want to continue training, there is no reason to load the full FB model.
- Get rid of hash2index completely, related #2329.
- model.wv.vocab duplicates memory & always needs to be maintained. If it's possible to use references instead of copies, this will be the simplest and best solution.

@menshikh-iv why not make _compute_ngrams generate a list of bytes? Call .encode only once, in _compute_ngrams. There is one big drawback: _compute_ngrams would have to iterate over unicode characters, but generate bytes.
@horpto yes, I agree, this is one of the ideas, though need to test properly.
Optimization attempt #1. Idea (thanks @mpenkov): use caching (LRU in our case).
Apply caching (very dirty patch, I know, but it shows the main idea):
```diff
ivan@h3:~/ft/gensim$ git diff
diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index d9dad1cc..ad10f542 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -2237,7 +2237,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         if self.bucket == 0:
             return

-        hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+        hash_fn = lru_cache(maxsize=100000)(_ft_hash) if self.compatible_hash else _ft_hash_broken

         for w, v in self.vocab.items():
             word_vec = np.copy(self.vectors_vocab[v.index])
@@ -2247,8 +2247,9 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
                 word_vec += self.vectors_ngrams[ngram_index]
             word_vec /= len(ngrams) + 1
             self.vectors[v.index] = word_vec
+            print("@@@", hash_fn.cache_info())

-
+from functools import lru_cache
 def _process_fasttext_vocab(iterable, min_n, max_n, num_buckets, hash_fn, hash2index):
```
Result
Loading time for cc.ru.300.bin with the LRU patch shown above:
| Cache type | Time | Speed | Hits | Misses | Hits % |
|---|---|---|---|---|---|
| No cache (current) | 12m17s | 1x | - | - | - |
| LRU (100k) | 6m14s | 1.97x | 8224568 | 7250389 | 53% |
| LRU (200k) | 5m2s | 2.44x | 9914071 | 5560886 | 64% |
| LRU (500k) | 3m36s | 3.41x | 11787735 | 3687222 | 76% |
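Why a cache helps so much: ngrams repeat heavily across a vocabulary, so the same hash gets recomputed once per word that contains the ngram. A self-contained toy demo of this effect (using Python's built-in hash as a stand-in for _ft_hash):

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def cached_hash(ngram):
    # Stand-in for _ft_hash: any pure function benefits identically, because
    # common ngrams recur across many vocab words.
    return hash(ngram) & 0xFFFFFFFF

vocab = ["walking", "talking", "making", "baking"]
for word in vocab:
    extended = f"<{word}>"
    for n in range(3, 7):
        for i in range(len(extended) - n + 1):
            cached_hash(extended[i:i + n])

info = cached_hash.cache_info()
print(info)  # later words hit the cache on shared ngrams like "king>"
```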
@horpto @menshikh-iv
why don't make _compute_ngrams generate list of bytes ? Only one time make .encode in _compute_ngrams. There is one big drawback that _compute_ngrams should iterate by unicode characters, but generate bytes.
I think it's an interesting idea, but there's a problem: there's no guarantee that the length of the string before and after encoding will be the same, because a single character can be encoded to multiple bytes. This means that, implementing the idea naively, we may get different ngrams.
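The mismatch is easy to demonstrate (minimal example, mine):

```python
# A single character may occupy several bytes in UTF-8, so slicing bytes and
# slicing characters produce different "ngrams" for non-ASCII words:
word = "où"                       # 2 characters
encoded = word.encode("utf-8")    # 3 bytes: 'ù' is 0xC3 0xB9
print(len(word), len(encoded))
print(encoded[:2])                # cuts 'ù' in half
print(word[:2].encode("utf-8"))   # the intact character sequence
```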
At this stage, the thing to do is go read the FB implementation and documentation. We need to figure out how they're dealing with this and decide what to do based on compatibility.
I have a similar proposal based on @horpto's idea: why don't we store the vocabulary as bytes instead of strings? That would save us the cost of encoding everything each time we compute ngrams. Again, this depends if we can do this is a way that is compatible with FB, but I thought that it's worth documenting this idea here.
@mpenkov My idea is actually the same (I understood this later) as the fb/fastText implementation of _compute_ngrams (see https://github.com/facebookresearch/fastText/blob/7842495a4d64c7a3bb4339d45d6e64321d002ed8/src/dictionary.cc#L172). That's first. Second, I didn't express myself clearly about unicode strings, but I did write about this big drawback. It would make the code more complex (but maybe we can get access to the inner str buffer?). Third, according to the code, everywhere we call _compute_ngrams we then call hash(ngram) for all ngrams, but the ngrams themselves are not used. So maybe make a _compute_ngram_hashes?
@horpto @mpenkov I made a small benchmark (but I feel that I'm wrong somewhere):
```
from six import PY2

import numpy as np
cimport numpy as np


cdef _byte_to_int_py3(b):
    return b

cdef _byte_to_int_py2(b):
    return ord(b)

_byte_to_int = _byte_to_int_py2 if PY2 else _byte_to_int_py3


cpdef ft_hash_current(unicode string):
    cdef unsigned int h = 2166136261
    for c in string.encode("utf-8"):
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))
    return h

cpdef ft_hash_new(bytes string):
    cdef unsigned int h = 2166136261
    for c in string:  # no more encode here, bytes as input
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))
    return h

cpdef compute_ngrams_current(word, unsigned int min_n, unsigned int max_n):
    cdef unicode extended_word = f'<{word}>'
    ngrams = []
    for ngram_length in range(min_n, min(len(extended_word), max_n) + 1):
        for i in range(0, len(extended_word) - ngram_length + 1):
            ngrams.append(extended_word[i:i + ngram_length])
    return ngrams

cpdef compute_ngrams_new(word, unsigned int min_n, unsigned int max_n):
    cdef unicode extended_word = f'<{word}>'
    ngrams = []
    for ngram_length in range(min_n, min(len(extended_word), max_n) + 1):
        for i in range(0, len(extended_word) - ngram_length + 1):
            ngrams.append(extended_word[i:i + ngram_length])
    return (" ".join(ngrams)).encode("utf-8").split()  # encode all ngrams at once
```
We have 2 pairs of functions:
1. compute_ngrams_current + ft_hash_current (current master)
2. compute_ngrams_new + ft_hash_new: the idea with a single encode per word (not per ngram)

The benchmark looks really simple:
```
import gensim.downloader as api
import itertools as it

words = tuple(it.chain.from_iterable(api.load("text8")))
assert len(words) == 17005207  # long enough
words = words[:100000]

def benchmark(words, ngram_func, hash_func):
    for w in words:
        arr = [hash_func(ngram) % 10000 for ngram in ngram_func(w, 3, 6)]
```
And the result is:
```
%time benchmark(words, ngram_func=compute_ngrams_current, hash_func=ft_hash_current)
"""
CPU times: user 36.3 s, sys: 417 ms, total: 36.7 s
Wall time: 34.5 s
"""

%time benchmark(words, ngram_func=compute_ngrams_new, hash_func=ft_hash_new)
"""
CPU times: user 37.6 s, sys: 405 ms, total: 38 s
Wall time: 35.6 s
"""
```
The new variant is even a bit slower than the current one (I guess the reason is the join + split), but I'm surprised by the result (I still think I'm doing something wrong =/).
In addition, I ported a piece of the ngram-generation function from the FB implementation, based on bytes:
```
cpdef compute_ngrams_awesome(word, unsigned int min_n, unsigned int max_n):
    cdef bytes _word = f'<{word}>'.encode("utf-8")
    cdef int _wlen = len(_word)
    cdef int j, i, n

    ngrams = []
    for i in range(_wlen):
        ngram = []
        if _word[i] & 0xC0 == 0x80:  # it's not a first byte of an actual character
            continue
        j, n = i, 1
        while j < _wlen and n <= max_n:
            ngram.append(_word[j])
            j += 1
            while j < _wlen and (_word[j] & 0xC0) == 0x80:
                ngram.append(_word[j])
                j += 1
            if n >= min_n and not (n == 1 and (i == 0 or j == _wlen)):
                ngrams.append(bytes(ngram))
            n += 1
    return ngrams
```
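For readers without a Cython toolchain, here is my pure-Python rendering of the same byte walk (same logic as the port above, names mine): the continuation-byte test (b & 0xC0 == 0x80) keeps every ngram boundary on a UTF-8 character boundary, so the byte ngrams should equal the character ngrams after encoding.

```python
def compute_byte_ngrams(word, min_n=3, max_n=6):
    """Byte-level ngrams that respect UTF-8 character boundaries."""
    data = f"<{word}>".encode("utf-8")
    wlen = len(data)
    ngrams = []
    for i in range(wlen):
        if data[i] & 0xC0 == 0x80:      # continuation byte: not a char start
            continue
        ngram = bytearray()
        j, n = i, 1
        while j < wlen and n <= max_n:
            ngram.append(data[j])
            j += 1
            while j < wlen and data[j] & 0xC0 == 0x80:
                ngram.append(data[j])   # swallow the rest of the character
                j += 1
            if n >= min_n and not (n == 1 and (i == 0 or j == wlen)):
                ngrams.append(bytes(ngram))
            n += 1
    return ngrams

print(compute_byte_ngrams("où"))
```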
But this still doesn't help much, hmm..
Apologies if this has been addressed, but is there any reason we can't just use some of the original C++ with suitable pybind11 bindings to access the loaded data? Is the assumption that the overhead from moving the data from C++ to Python would mitigate any C++ speed improvements?
@DomHudson
is there any reason we can't just use some of the original C++ with suitable pybind11 bindings to access the loaded data?
Yes, we can't, because of pybind11.

@horpto @mpenkov according to my experiment (computing byte-based character ngrams + no encoding on the hash function side), there is no difference from the current implementation (same 12m17s); it looks like encoding wasn't the issue in this case. Any other ideas, guys?
I guess I've solved this issue: I rewrote the hash function.
```
from libc.stdint cimport uint32_t, int8_t
from libc.string cimport strcpy, strlen
from libc.stdlib cimport malloc


cdef char* c_call_returning_a_c_string(bytes string):
    cdef char* c_string = <char *> malloc((len(string) + 1) * sizeof(char))
    if not c_string:
        raise MemoryError()
    strcpy(c_string, string)
    return c_string


cpdef ft_hash_ff(bytes string):
    cdef uint32_t h = 2166136261
    cdef char* ss = c_call_returning_a_c_string(string)
    for i in range(strlen(ss)):
        h = h ^ <uint32_t>(<int8_t>ss[i])  # attention - I dropped 'ord' from py2, not sure about correctness
        h = h * 16777619
    return h
```
Now loading takes 1m1s instead of 12m17s
@mpenkov please review (ft_hash_ff + bytes-based ngrams).

upd: @horpto simplified my version of the code (got rid of the malloc & copying); it looks simpler and better:
```
cpdef ft_hash_ff2(bytes string):
    cdef uint32_t h = 2166136261
    cdef char ch
    for ch in string:
        h = h ^ <uint32_t>(<int8_t>ch)  # attention - I dropped 'ord' from py2, not sure about correctness
        h = h * 16777619
    return h
```
CC @mpenkov