gensim 🚀 - Improve FastText loading times

@tmylk I would like to work on this

prakhar2b on 5 Apr 2017

👍6

@prakhar2b that would be great

tmylk on 3 May 2017

My thoughts about loading the FastText model using only the .bin file -

Definitely doable, the .bin file contains all information about the model
The .vec file was used mainly to be able to reuse code from KeyedVectors, changing to use simply the .bin file would require some architectural changes.
The next step should be clearly benchmarking the FastText loading times for gensim and fastText.py for small and large models, and then profiling the gensim method. In case loading the .vec file is a significant bottleneck, it would make sense to load using the .bin file only.

jayantj on 12 May 2017

I was looking to use fasttext in python and the gensim wrapper seems to be considerably slower than salestock/fasttext.py . The gensim wrapper takes around 35seconds to load a moderately large(~200K vocab) pre trained model (.bin-1Gb, .vec-0.2gb) while the fasttext.py takes just above a second to load the same model. Posting the cProfile stats for both the runs/loads below:
Gensim Wrapper

Fasttext.py

As seen from the first figure, loading the .bin file alone(by reading in python) would not bring the load time anywhere close to the salestock/fasttext.py load time. Achieving comparable load times would require reading in C/C++(like salestock).

manneshiva on 25 May 2017

Hi @manneshiva , thanks for the profiling report.

It could be possible to reduce the time required for loading the .bin file further by optimizing the bottlenecks (we could rewrite them in Cython if required), but we should definitely get rid of the call to load_word2vec_format.

Does the profiling report also include what exactly is slow inside the load_vectors function? That would be helpful in determining what the next steps should be.

jayantj on 25 May 2017

@jayantj
Here is the profiling report on load_vectors.

The main bottleneck seems to be the ft_hash function. We could also look at ways to optimize compute_ngrams(takes ~3.2secs compared to 11secs taken by ft_hash). Would getting rid of load_word2vec_format and Cythonizing ft_hash and compute_ngrams be sufficient? I am interested to work on this issue.

manneshiva on 26 May 2017

👍1

@manneshiva Your help will be very appreciated here as it's a very needed feature. Cython rather than C/C++ is preferred. Happy to mentor it via our incubator programme. CC @menshikh-iv

tmylk on 30 May 2017

Optimizations of init_ngrams & compute_ngrams & ft_hash short of cythonization might also yield a big speedup. (Avoid the seterr toggling every ngram; avoid reconstructing the constants; maybe finding overflow-oblivious approach; maybe inlining ft_hash; restructuring the loops to avoid the += appends and set(); maybe even reusing available prefix ngram hashes; etc.)

(Separately: the comment for init_ngram() mentioning vectors being discarded to 'save space', and the building of the ngrams-to-indexes dict, seems a little fishy to me. Does the native fasttext code do that too? I would think the usual case is that very few of the hashed-index vectors are completely unused – or else the chosen bucket was too large – and the space overhead of the mapping dict could be significant.)

gojomo on 30 May 2017

@gojomo Those ideas sound good to me.

About the unused vectors being discarded - the native FastText code doesn't do that. The default value for bucket size is quite large (2 mil) and according to the original comment from the PR for the fastText wrapper - https://github.com/RaRe-Technologies/gensim/pull/847#issuecomment-269330468 - 1.6mil of them are unused even with a mid-sized corpus (text8).

Some more detailed memory profiling could be useful here - to determine the exact gains from dropping the unused vectors, and the space overhead of the mapping dict.

jayantj on 31 May 2017

IMO text8 is a tiny corpus - it'd be most relevant to see the bucket-load in actual vector sets, like those FB pretrained & released.

Also, even though when we do local training, we may know for sure that an ngram's vector slot has never been trained, it's theoretically plausible that in a file-being read, even slots for ngrams that never appear in the declared vocabulary might have been trained. (Maybe, the vocab-for-distrib was trimmed after training on a larger set. Not sure any implementation does this, but if modeling out-of-vocab words was my project priority, I'd consider such a strategy, so that even very-obscure words/ngrams get at least a little coverage. An interesting thing to check could be: in FB's pretrained vectors, do any slots which seem to have no ngrams mapped to them ever exceed the magnitude of only-ever-randomly-initialized vectors?)

Definitely agree best course is to evaluate the optimization with actual memory profiling/sizing-analysis.

gojomo on 1 Jun 2017

I agree, that would be an interesting experiment to do (checking if vectors with no ngrams mapped are within the max-range of randomly initialized vectors).

This discussion about keeping/discarding vectors also directly affects the work on improving loading times - in case the impact (memory-wise) of retaining all vectors is not significant, and we decide to keep all vectors, the loading time automatically improves tremendously, since the majority of time spent right now is in computing hashes for all ngrams at load time. Any optimization with ft_hash/compute_ngrams wouldn't be as worthwhile if we do decide to keep all vectors.

In that case, I'd recommend that the next steps be memory profiling for different vocab/bucket sizes to determine whether keeping all vectors is a good idea.

jayantj on 2 Jun 2017

@gojomo @jayantj
The table below summarizes the memory profiling results on pre-trained fasttext english vectors with default bucket size(2Million), Text9(with default bucket size) and Text9(retrained with bucket size 1Million):

We have 2 options to solve this issue:

Loading from .bin alone and trimming ngrams :
- memory profiling shows significantly lesser memory consumption for default/large bucket size
- would require optimization of relevant functions
Loading all ngrams :
- would considerably boost the load time without any optimization.
- only makes sense when user is allowed to set bucket size when calling train(no option in present code)
- no way to for user to decide the ideal bucket size
- a very small bucket size could lead to hash collisions which could affect the quality of vectors for OOV words(not sure about this point)

It looks like the second option is better provided the concerns mentioned aren't too big a problem. Would like to hear your thoughts on this.

manneshiva on 7 Jun 2017

👍1

@manneshiva Thanks for the analysis. How exactly has the "Memory saved by only selecting ngrams of vocab words" been calculated?
I'm curious because the difference in the number of of unused vectors for wiki.simple and Text9 (with bucket size 2mil) seems low (1.4mil vs 1.16mil), but the difference in memory saved is very high (1100 MB vs 440 MB).

Maybe initially we could load all ngrams and not trim any of them (will speed up load time), and add an optional method to drop the unused ngrams for someone who is looking to save memory. (somewhat similarly to init_sims(replace=True) in word2vec)

@gojomo what do you think?

jayantj on 9 Jun 2017

@jayantj

How exactly has the "Memory saved by only selecting ngrams of vocab words" been calculated?

I looked at the memory taken by self.wv.syn0_all before and after reassigning it with self.wv.syn0_all.take(ngram_indices, axis=0).

I'm curious because the difference in the number of of unused vectors for wiki.simple and Text9 (with bucket size 2mil) seems low (1.4mil vs 1.16mil), but the difference in memory saved is very high (1100 MB vs 440 MB).

This is because of the difference in embedding size. The words are 300 dimensional in case of wiki.simple while they are 100 dimensional for Text9(should have mentioned this, apologies for the confusion).

manneshiva on 9 Jun 2017

Text9 is still pretty small, and 100-dimensions (despite being the default) may not be a representative size for serious users. The effects on the pre-trained FB FastText vectors (of Wikipedia in many languages) may be more representative of what real users will experience.

(Are you sure the memory accounting is counting the cost of the dictionary mapping ngrams to remembered, rather than hash-discovered, slots? It's probably only a few tens-of-MB but not sure where it'd appear in the current analysis.)

I'm not sure saving all ngrams will slow loading time - doesn't the load code right now do more to precalculate the ngrams & do extra discarding?

It wouldn't be too hard to add bucket-configurability, or logging/reporting of an (approximate) count of unique ngrams, to help choose optimal bucket size. But also, these savings don't yet seem so big, for serious users.

gojomo on 13 Jun 2017

Hi all,
Can someone please summarize the current state of this? I posted an issue with regards to the bucket in this code and I think it is directly related to your discussion here about optimizing the code.

https://github.com/RaRe-Technologies/gensim/issues/1779

saroufimc1 on 11 Dec 2017

👍1

The tradeoff here is speed v/s memory.

For relatively small vocab sizes (~200k), the steady-state memory usage is 1.1 GB lower than it would be if we chose to keep all ngram vectors as is. (for 300-d vectors). This is at the cost of significantly increased loading time.

Conversely, for large vocab sizes (like for Wikipedia models), we don't reduce memory usage, while also causing much higher load times. (as @gojomo rightly pointed out)

In case the common use case is indeed loading large models, it might make sense to store ngram vectors as is, without trying to discard any unused ones.

@piskvorky @menshikh-iv @manneshiva what do you think?

jayantj on 12 Dec 2017

👍1

I feel we should give the user an option -- discard_unused_ngrams to save memory, which by default could be False. Since the memory saved for small vocab models is significant (owing to a fewer number of total used ngrmas), this should be helpful for a user trying to load a small vocab model with limited RAM.

Only concern here, as pointed by @gojomo in this comment -- can we confirm that ngrams for the final vocab are the only ngrams trained. We shouldn't be discarding trained ngrams that do not appear in the ngrams of in-vocab words. It is highly unlikely that this is the case. @jayantj comments?

manneshiva on 12 Dec 2017

Hi,
I'm just taking a look at this myself also, just posting some metrics here.
Loading a 1.2gb .bin file using the python wrapper in the original repository takes 2.078 seconds compared with gensim's load_fasttext_format taking 68.171 seconds.

The vast majority of time is in init_ngrams_post_load.

         64526433 function calls (64513338 primitive calls) in 68.171 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   45.658   45.658   62.476   62.476 fasttext.py:857(init_ngrams_post_load)
 16220806    7.847    0.000    7.847    0.000 {built-in method numpy.core.multiarray.array}
 31281478    4.118    0.000    4.118    0.000 {built-in method gensim.models._utils_any2vec.ft_hash}
  1159878    3.742    0.000    3.742    0.000 {built-in method gensim.models._utils_any2vec.compute_ngrams}
        1    2.429    2.429    4.705    4.705 fasttext.py:618(_load_dict)
  6100747    0.892    0.000    0.892    0.000 {method 'read' of '_io.BufferedReader' objects}
        1    0.588    0.588    0.588    0.588 {method 'take' of 'numpy.ndarray' objects}
   579939    0.494    0.000    0.616    0.000 keyedvectors.py:97(__init__)
        1    0.411    0.411    0.411    0.411 {built-in method numpy.core.multiarray.fromfile}
3713959/3713525    0.401    0.000    0.401    0.000 {built-in method builtins.len}

(Gensim Version: 3.4.0)

DomHudson on 6 May 2018

Thanks @DomHudson, I checked myself and confrim long time of init_ngrams_post_load (~90%).
Noe: original python wrapper re-use c++ function for loading: https://github.com/facebookresearch/fastText/blob/master/python/fastText/pybind/fasttext_pybind.cc#L142

Profiling:

python2.7
gensim==3.5.0
wiki.en.bin from https://fasttext.cc/docs/en/pretrained-vectors.html

# content of t.py
from gensim.models import FastText
import logging


logging.basicConfig(level=logging.INFO)


m = FastText.load_fasttext_format("wiki.en.bin")

run it as python -m cProfile -o p.prof t.py

result in attach, init_ngrams_post_load definitely consume much time and this should be improved

p.zip

menshikh-iv on 30 Jul 2018

New measurement

python3.6
gensim==3.7.0
model: https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ru.300.bin.gz
code
```python
from gensim.models import FastText
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(filename)s:%(lineno)s - %(message)s')
m = FastText.load_fasttext_format("cc.ru.300.bin")
```

profiling: python -m cProfile -o lff.prof <fname.py>

Raw cProfile outpt (zip)

So, slowest method is adjust_vectors (by _ft_hash reason)
https://github.com/RaRe-Technologies/gensim/blob/a864e0247dda421d8e4de280fa91f86f474a5691/gensim/models/keyedvectors.py#L2230-L2249

Notes:

I guess, reason in ".encoding" call in hash function for each ngram (we can't see that by cython reason). More details:
```
gensim.models._utils_any2vec.ft_hash

n_calls: 15474957
cumtime: 645.1 sec
precall: 4.169e-05 sec
```
We need to speedup it at least 10x to have 1 minute load time, this will be hard I guess

I think we need special flag (full_model=False, that controls, how much matricies we load: 1 or 2. Reason: FB models really large and if user don't want to continue an training - no reason to load full FBmodel
Get rid hash2index completely, related #2329
Re-think how we work with vectors at all, are we really need things like model.wv.vocab that duplicates a memory & need to maintain always. If this possible to use references instead of copies - this will be the simplest and best solution

menshikh-iv on 18 Jan 2019

@menshikh-iv why don't make _compute_ngrams generate list of bytes ? Only one time make .encode in _compute_ngrams. There is one big drawback that _compute_ngrams should iterate by unicode characters, but generate bytes.

horpto on 18 Jan 2019

👍1

@horpto yes, I agree, this is one of the ideas, though need to test properly.

menshikh-iv on 18 Jan 2019

Optimization attempt №1, Idea (thanks @mpenkov) - use caching (LRU in our case)

Apply caching (very dirty patch, I know, but shows main idea)

ivan@h3:~/ft/gensim$ git diff
diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index d9dad1cc..ad10f542 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -2237,7 +2237,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         if self.bucket == 0:
             return

-        hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+        hash_fn = lru_cache(maxsize=100000)(_ft_hash) if self.compatible_hash else _ft_hash_broken

         for w, v in self.vocab.items():
             word_vec = np.copy(self.vectors_vocab[v.index])
@@ -2247,8 +2247,9 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
                 word_vec += self.vectors_ngrams[ngram_index]
             word_vec /= len(ngrams) + 1
             self.vectors[v.index] = word_vec
+        print("@@@", hash_fn.cache_info())

-
+from functools import lru_cache
 def _process_fasttext_vocab(iterable, min_n, max_n, num_buckets, hash_fn, hash2index):

Result

Loading time for cc.ru.300.bin with LRU patch (described below)

| Cache type | Time | Speed |Hits | Misses | Hits % |
|------------|------|------|---------|---|--------|
| No cache (current) | 12m17s | 1x|- | - | - |
| LRU (100k) | 6m14s | 1.97x |8224568 | 7250389 | 53% |
| LRU (200k) | 5m2s | 2.44x|9914071 | 5560886| 64% |
| LRU (500k) | 3m36s|3.41x | 11787735 | 3687222 | 76% |

menshikh-iv on 18 Jan 2019

@horpto @menshikh-iv

why don't make _compute_ngrams generate list of bytes ? Only one time make .encode in _compute_ngrams. There is one big drawback that _compute_ngrams should iterate by unicode characters, but generate bytes.

I think that it's an interesting idea, but there's a problem: there's no guarantee that the length of the string before and after encoding will be the same, because a single character can be encoded to multiple bytes. This means that implementing the idea, we may get different ngrams.

At this stage, the thing to do is go read the FB implementation and documentation. We need to figure out how they're dealing with this and decide what to do based on compatibility.

I have a similar proposal based on @horpto's idea: why don't we store the vocabulary as bytes instead of strings? That would save us the cost of encoding everything each time we compute ngrams. Again, this depends if we can do this is a way that is compatible with FB, but I thought that it's worth documenting this idea here.

mpenkov on 19 Jan 2019

@mpenkov My idea actually is the same (I understand it later) to fb/fasttext implementation of _compute_ngrams (see https://github.com/facebookresearch/fastText/blob/7842495a4d64c7a3bb4339d45d6e64321d002ed8/src/dictionary.cc#L172). It's first. Second, I don't clearly expressed about unicode strings but I've written about this big drawback. It must make code a more complex (but maybe we can get access to inner str buffer?). Third, according to the code, actually everywhere we are calling _compute_ngrams and then hash(ngram) for all ngrams but itself ngram-s are not using. So maybe make an _computer_ngram_hashes ?

horpto on 19 Jan 2019

👍2

@horpto @mpenkov I made small benchmark (but I feel that I'm wrong somewhere)

from six import PY2
import numpy as np
cimport numpy as np


cdef _byte_to_int_py3(b):
    return b

cdef _byte_to_int_py2(b):
    return ord(b)

_byte_to_int = _byte_to_int_py2 if PY2 else _byte_to_int_py3

cpdef ft_hash_current(unicode string):
    cdef unsigned int h = 2166136261
    for c in string.encode("utf-8"):
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))
    return h

cpdef ft_hash_new(bytes string):
    cdef unsigned int h = 2166136261
    for c in string:  # no more encode here, bytes as input
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))
    return h

cpdef compute_ngrams_current(word, unsigned int min_n, unsigned int max_n):
    cdef unicode extended_word = f'<{word}>'
    ngrams = []
    for ngram_length in range(min_n, min(len(extended_word), max_n) + 1):
        for i in range(0, len(extended_word) - ngram_length + 1):
            ngrams.append(extended_word[i:i + ngram_length])
    return ngrams


cpdef compute_ngrams_new(word, unsigned int min_n, unsigned int max_n):
    cdef unicode extended_word = f'<{word}>'
    ngrams = []
    for ngram_length in range(min_n, min(len(extended_word), max_n) + 1):
        for i in range(0, len(extended_word) - ngram_length + 1):
            ngrams.append(extended_word[i:i + ngram_length])
    return (" ".join(ngrams)).encode("utf-8").split()  # make encode here for all ngrams in one moment

We have 2 pairs of func:

compute_ngrams_current + ft_hash_current (current master)
compute_ngrams_new + ft_hash_new - idea with single encoding for word (not an each ngram)

Benchmark looks really simle

import gensim.downloader as api
import itertools as it

words = tuple(it.chain.from_iterable(api.load("text8")))
assert len(words) == 17005207  # long enough

words = words[:100000]


def benchmark(words, ngram_func, hash_func):
    for w in words:
        arr = [hash_func(ngram) % 10000 for ngram in ngram_func(w, 3, 6)]

And result is

%time benchmark(words, ngram_func=compute_ngrams_current, hash_func=ft_hash_current)
"""
CPU times: user 36.3 s, sys: 417 ms, total: 36.7 s
Wall time: 34.5 s
"""

%time benchmark(words, ngram_func=compute_ngrams_new, hash_func=ft_hash_new)
"""
CPU times: user 37.6 s, sys: 405 ms, total: 38 s
Wall time: 35.6 s
"""

New variant even a bit slower than current one (I guess reason in join + split), but I'm surprised with result (still think that I do something wrong =/)

menshikh-iv on 19 Jan 2019

In addition, I ported piece of function from FB impl of ngram generation based on bytes

cpdef compute_ngrams_awesome(word, unsigned int min_n, unsigned int max_n):
    cdef bytes _word = f'<{word}>'.encode("utf-8")
    cdef int _wlen = len(_word)
    cdef int j, i, n

    ngrams = []
    for i in range(_wlen): 
        ngram = []

        if _word[i] & 0xC0 == 0x80:  # it's not a first byte of actual character
            continue

        j, n = i, 1
        while j < _wlen and n <= max_n:
            ngram.append(_word[j])
            j += 1
            while j < _wlen and (_word[j] & 0xC0) == 0x80:
                ngram.append(_word[j])
                j += 1

            if n >= min_n and not (n == 1 and (i == 0 or j == _wlen)):
                ngrams.append(bytes(ngram))

            n += 1

    return ngrams

But this is still not help much, hmm..

menshikh-iv on 19 Jan 2019

Apologies if this has been addressed, but is there any reason we can't just use some of the original C++ with suitable pybind11 bindings to access the loaded data? Is the assumption that the overhead from moving the data from C++ to Python would mitigate any C++ speed improvements?

DomHudson on 19 Jan 2019

@DomHudson

is there any reason we can't just use some of the original C++ with suitable pybind11 bindings to access the loaded data?

Yes, we can't, because

we have our own implementation and need to fill it from FB binary format, we don't want to include FB C++ implementation to python package, especially for this operation
we don't want to use pybind11

menshikh-iv on 19 Jan 2019

@horpto @mpenkov according to my experiment (calculate byte-based character ngram + no encoding on hash function side) - no difference with current implementation (same 12m17s) , looks like encoding wasn't an issue in the current case. Any other ideas, guys?

menshikh-iv on 19 Jan 2019

I guess I solve this issue: I re-write hash function

from libc.stdint cimport uint32_t, int8_t
from libc.string cimport strcpy, strlen
from libc.stdlib cimport malloc

cdef char* c_call_returning_a_c_string(bytes string):
    cdef char* c_string = <char *> malloc((len(string) + 1) * sizeof(char))
    if not c_string:
        raise MemoryError()
    strcpy(c_string, string)
    return c_string


cpdef ft_hash_ff(bytes string):
    cdef uint32_t h = 2166136261
    cdef char* ss = c_call_returning_a_c_string(string)

    for i in range(strlen(ss)):
        h = h ^ <uint32_t>(<int8_t>ss[i])  # attention - I drop 'ord' from py2, not sure about correctenss
        h = h * 16777619
    return h

Now loading takes 1m1s instead of 12m17s

@mpenkov please

reproduce my last attempt (with ft_hash_ff + bytes-based ngrams)
test it properly (I'm not sure about correctness)
apply in #2340

menshikh-iv on 19 Jan 2019

👍3

upd: @horpto simplified my code version (get rid malloc & copying), looks simpler and better

cpdef ft_hash_ff2(bytes string):
    cdef uint32_t h = 2166136261
    cdef char ch

    for ch in string:
        h = h ^ <uint32_t>(<int8_t>ch) # attention - I drop 'ord' from py2, not sure about correctenss
        h = h * 16777619
    return h

CC @mpenkov

menshikh-iv on 20 Jan 2019

Gensim: Improve FastText loading times

Most helpful comment

All 33 comments

Related issues