Gensim: Speed up word2vec / fasttext model loading

Created on 21 Oct 2019 · 17 comments · Source: RaRe-Technologies/gensim

Loading a large word2vec model with load_word2vec_format(binary=True) is slow. Users complain that loading the "standard" models published by Facebook is too slow, and it also affects the speed of our own tests and tutorial autogeneration.

Some numbers:

%time gensim.models.keyedvectors.KeyedVectors.load_word2vec_format('./word2vec-google-news-300.gz', binary=True)
2019-10-21 22:24:08,326 : INFO : loading projection weights from ./word2vec-google-news-300.gz
2019-10-21 22:26:54,620 : INFO : loaded (3000000, 300) matrix from ./word2vec-google-news-300.gz
CPU times: user 2min 42s, sys: 3.64 s, total: 2min 46s
Wall time: 2min 46s

The I/O part itself, i.e. just reading the bytes from the file without any interpretation, takes about 30 seconds:

%time full = io.BytesIO(smart_open.open('./word2vec-google-news-300.gz', 'rb').read())
CPU times: user 20.9 s, sys: 8.13 s, total: 29.1 s
Wall time: 31.9 s

…which means our parsing code takes up the majority of the load_word2vec_format time. Ideally, we shouldn't need much more than those 30 seconds (= the raw I/O speed) for the full load_word2vec_format(binary=True). Nearly 3 minutes is too much.
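For anyone picking this up: a quick cProfile run confirms where the time goes. A minimal sketch (the pstats dumps posted in the comments below use this same "Ordered by: internal time" format):

import cProfile
import pstats

from gensim.models.keyedvectors import KeyedVectors

cProfile.run(
    "KeyedVectors.load_word2vec_format('./word2vec-google-news-300.gz', binary=True)",
    'load.prof',
)
# sort by internal time and show the 25 most expensive functions
pstats.Stats('load.prof').sort_stats('tottime').print_stats(25)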

Task: Optimize load_word2vec_format, especially the binary=True branch. The code seems to live here:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L369

Labels: Hacktoberfest, help wanted, impact HIGH, performance, reach MEDIUM, testing

All 17 comments

Hi, I am interested in solving this issue. However, I was unable to reproduce what you got. Specifically, I ran the following code block (based on what you linked):

from gensim.models.keyedvectors import Vocab
from gensim import utils
import numpy as np
import io
import smart_open
import time

tit_file_read = time.time()
fin = utils.open("~/Downloads/GoogleNews-vectors-negative300.bin", 'rb')
tat_file_read = time.time()

header = utils.to_unicode(fin.readline(), 'utf-8')
vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
binary_len = np.dtype(np.float32).itemsize * vector_size

word2vec = {}

tit_for = time.time()
for i in range(vocab_size):
    # mixed text and binary: read text first, then binary
    word = []
    while True:
        ch = fin.read(1)  # Python uses I/O buffering internally
        if ch == b' ':
            break
        if ch == b'':
            raise EOFError("unexpected end of input; is count incorrect or file otherwise damaged?")
        if ch != b'\n':  # ignore newlines in front of words (some binary files have)
            word.append(ch)
    word = utils.to_unicode(b''.join(word), encoding='utf-8')
    with utils.ignore_deprecation_warning():
        # TODO use frombuffer or something similar
        weights = np.fromstring(fin.read(binary_len), dtype=np.float32)
    word2vec[word] = weights

tat_for = time.time()
print("File Read", tat_file_read-tit_file_read)
print("For Loop", tat_for-tit_for)

gives the report:

File Read 0.00021314620971679688
For Loop 99.58815383911133

If you change the file reader to this:

fin = io.BytesIO(smart_open.open('~/Downloads/GoogleNews-vectors-negative300.bin', 'rb').read())

I get this report:

File Read 7.0926032066345215
For Loop 166.45309591293335

So, what we have in the library seems fine to me.
Am I missing something here?

That timing is strange. It is not possible that parsing from an in-memory buffer is 1.6x slower than reading and parsing from a zipped file from disk. Maybe you ran out of RAM and you're swapping? How much free RAM do you have?

In any case, on your machine, we want to get closer to the raw I/O speed of 7 seconds, instead of 100 seconds. The parsing overhead is too much. The overhead is even worse on your machine than on mine, because you have a faster disk.

A good first step would be to line-profile the code, to see where the bottlenecks are.
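For example, with the line_profiler package. A minimal sketch, assuming the module layout of the develop branch linked above (utils_any2vec._load_word2vec_format is the private function behind the public loader):

from line_profiler import LineProfiler

from gensim.models import utils_any2vec
from gensim.models.keyedvectors import KeyedVectors

lp = LineProfiler()
lp.add_function(utils_any2vec._load_word2vec_format)  # trace this function line by line
lp.enable_by_count()
KeyedVectors.load_word2vec_format('./word2vec-google-news-300.gz', binary=True)
lp.disable_by_count()
lp.print_stats()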

Ping @piyush-kgp are you able to profile? This would be an awesome ticket to resolve, with great impact.

Hey, sorry, I have been busy lately. Maybe someone else could take this up.

Hello @piskvorky, Hello @mpenkov

I created pull request #2671 addressing this issue.

Measurements on my laptop on standard Google news vectors:

Reading file to bytes 24.5 seconds

Reading file to KeyedVectors (Before) 127.8 seconds
Reading file to KeyedVectors (After) 44.3 seconds

No magic, but I believe a solid improvement.

I am curious what you think.

I used the following benchmark:

import timeit

import gensim.models.keyedvectors
import gensim.utils

def read_plain(fname):
    with gensim.utils.open(fname, "rb") as fin:
        res = fin.read()
    return res

W2V_FNAME = './word2vec-google-news-300.gz'

print("Reading file to bytes")
t = timeit.default_timer()
x = read_plain(W2V_FNAME)
delta_t = timeit.default_timer() - t
print("Time for the operation %.1f seconds\n" % delta_t)

print("Reading file to KeyedVectors")
t = timeit.default_timer()
x = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format(W2V_FNAME, binary=True)
delta_t = timeit.default_timer() - t
print("Time for the operation %.1f seconds\n" % delta_t)

print("Number of words in the obtained model  = %d" %len(x.vocab))

Thanks @lopusz. Can you share the profiling numbers? Plus a high-level description of your "attack vector": what optimization choices you made, and why.

Let's go over that before delving into the code nitty-gritty. Cheers.

Hello Radim,

Thanks for your quick reaction. Indeed, I should probably have been more descriptive in the ticket.

A helicopter view of the approach:

As you know, a w2v binary file consists of blocks, each containing a word, a space, and a vector (a block of float32 values).

The original approach reads each block character by character until a space is hit. On every read, the character is appended to a list. Finally, join is called on the list of characters to obtain the word, and a bulk read of the vector takes place. Obviously, this has quite a lot of overhead per word.

My take is to read one chunk (default size 100 KB) of bytes from the file, parse as many complete word-space-vector blocks as the chunk contains, then read and append the next chunk, repeating until the file is exhausted. Parsing is done with minimal copying and bookkeeping. The patch is fairly short. Today I tried to make it even more readable, so perhaps it can speak better than I can.
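To make the idea concrete, here is a minimal sketch of such a chunked loop (the helper name and the exact tail carry-over are illustrative; the PR is the reference implementation):

import numpy as np

def iter_binary_vectors(fin, vocab_size, vector_size, chunk_size=100 * 1024):
    # fin must be positioned just past the header line that yielded
    # vocab_size and vector_size
    binary_len = np.dtype(np.float32).itemsize * vector_size
    buf = b''
    parsed = 0
    while parsed < vocab_size:
        chunk = fin.read(chunk_size)
        if not chunk:
            raise EOFError("unexpected end of input; is the count correct?")
        buf += chunk
        start = 0
        while parsed < vocab_size:
            space = buf.find(b' ', start)
            # stop once the buffer no longer holds a complete word-space-vector block
            if space == -1 or len(buf) - (space + 1) < binary_len:
                break
            word = buf[start:space].lstrip(b'\n').decode('utf-8')
            # read-only view into the buffer; the caller copies it into the
            # result matrix anyway
            vector = np.frombuffer(buf[space + 1:space + 1 + binary_len], dtype=np.float32)
            yield word, vector
            parsed += 1
            start = space + 1 + binary_len
        buf = buf[start:]  # carry the incomplete tail over to the next chunk

Building the KeyedVectors result and casting to the target dtype are omitted here; in the PR the equivalent logic lives in utils_any2vec.py.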

Profiles:

Original (truncated below 0.5 sec tottime):

         351516849 function calls (351056029 primitive calls) in 187.918 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   27.915   27.915  187.293  187.293 utils_any2vec.py:149(_load_word2vec_format)
 47258510   19.851    0.000   78.680    0.000 gzip.py:271(read)
47703404/47258549   18.706    0.000   43.615    0.000 {method 'read' of '_io.BufferedReader' objects}
   444856   16.212    0.000   16.212    0.000 {method 'decompress' of 'zlib.Decompress' objects}
  3000000   12.010    0.000   12.010    0.000 {built-in method numpy.fromstring}
  3000000   11.692    0.000   14.636    0.000 utils_any2vec.py:206(add_word)
 47258511   10.521    0.000   15.215    0.000 _compression.py:12(_check_not_closed)
  3000005    6.663    0.000   19.612    0.000 warnings.py:119(filterwarnings)
  6000000    5.730    0.000   32.572    0.000 utils.py:1479(ignore_deprecation_warning)
 47258513    4.693    0.000    4.693    0.000 gzip.py:298(closed)
  3000013    4.177    0.000    7.640    0.000 warnings.py:159(_add_filter)
 44336046    3.302    0.000    3.302    0.000 {method 'append' of 'list' objects}
  3000001    3.259    0.000    3.964    0.000 utils.py:348(any2unicode)
   444857    3.008    0.000    3.008    0.000 {built-in method zlib.crc32}
  3000000    2.980    0.000    3.304    0.000 warnings.py:449(__enter__)
  3000001    2.946    0.000    2.946    0.000 {method 'astype' of 'numpy.ndarray' objects}
  3000000    2.648    0.000    3.269    0.000 contextlib.py:59(__init__)
  3000009    2.337    0.000    2.337    0.000 {method 'remove' of 'list' objects}
  3000000    2.318    0.000    7.694    0.000 contextlib.py:85(__exit__)
  6000665    2.249    0.000    2.314    0.000 re.py:286(_compile)
  3000000    2.181    0.000    2.466    0.000 warnings.py:468(__exit__)
   444856    1.917    0.000   26.821    0.000 _compression.py:66(readinto)
  3000000    1.898    0.000    1.898    0.000 {method 'join' of 'bytes' objects}
15027404/15027403    1.851    0.000    1.851    0.000 {built-in method builtins.isinstance}
  6000027    1.779    0.000   34.351    0.000 {built-in method builtins.next}
  3000000    1.526    0.000    2.204    0.000 keyedvectors.py:203(__init__)
  6000199    1.524    0.000    3.837    0.000 re.py:231(compile)
  3000000    1.460    0.000    1.460    0.000 warnings.py:428(__init__)
   444856    1.411    0.000   24.581    0.000 gzip.py:438(read)
  3000000    1.366    0.000    4.635    0.000 contextlib.py:157(helper)
  3000000    0.956    0.000   29.930    0.000 contextlib.py:79(__enter__)
   444893    0.902    0.000    2.814    0.000 gzip.py:80(read)
  9000013    0.875    0.000    0.875    0.000 {built-in method _warnings._filters_mutated}
  3000018    0.861    0.000    0.861    0.000 {method 'insert' of 'list' objects}
5714868/5714048    0.723    0.000    0.723    0.000 {built-in method builtins.len}
  3000596    0.679    0.000    0.679    0.000 {method 'update' of 'dict' objects}
   444856    0.626    0.000    3.690    0.000 gzip.py:489(_add_read_data)
  3011200    0.626    0.000    0.626    0.000 {built-in method builtins.getattr}

After optimization (truncated below 0.1 sec) tottime:

         37251884 function calls (36986852 primitive calls) in 45.248 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   249066   14.593    0.000   14.593    0.000 {method 'decompress' of 'zlib.Decompress' objects}
  3000000    8.391    0.000   11.215    0.000 utils_any2vec.py:189(__add_word_to_result)
    35589    4.894    0.000   23.696    0.001 utils_any2vec.py:213(__add_words_from_binary_chunk_to_result)
   249068    2.842    0.000    2.842    0.000 {built-in method zlib.crc32}
  3000000    2.767    0.000    2.767    0.000 {built-in method numpy.frombuffer}
  3000000    1.790    0.000    2.285    0.000 keyedvectors.py:203(__init__)
  3000001    1.756    0.000    1.756    0.000 {method 'astype' of 'numpy.ndarray' objects}
284695/35628    1.188    0.000   20.557    0.001 {method 'read' of '_io.BufferedReader' objects}
  3000009    1.134    0.000    1.134    0.000 {method 'decode' of 'bytes' objects}
  3000000    1.038    0.000    1.300    0.000 utils_any2vec.py:207(__remove_initial_new_line)
   249067    0.667    0.000   20.270    0.000 _compression.py:66(readinto)
7468957/7468137    0.657    0.000    0.657    0.000 {built-in method builtins.len}
  3035588    0.629    0.000    0.629    0.000 {method 'find' of 'bytes' objects}
   249067    0.601    0.000   19.475    0.000 gzip.py:438(read)
  3000596    0.495    0.000    0.495    0.000 {method 'update' of 'dict' objects}
        1    0.314    0.314   44.665   44.665 utils_any2vec.py:149(_load_word2vec_format)
  3077536    0.260    0.000    0.260    0.000 {method 'append' of 'list' objects}
   249066    0.225    0.000    3.088    0.000 gzip.py:489(_add_read_data)
   249106    0.219    0.000    1.120    0.000 gzip.py:80(read)

I hope this clears up the details ;)

Best regards,
Michał

cc @piskvorky

Thanks again. I meant a line profile of that one function, _load_word2vec_format.

I'm still curious because my understanding was that read() should be doing buffering already, internally. But since your extra buffering on top improves performance so much, I guess it's not. Some of my assumptions must have been wrong.

And by the way, in connection with your PR, which already shows great improvements: we're shipping automatically compiled versions of Gensim too, for {Windows, Mac, Linux} x {various Python versions}. So if you feel any one function / loop / algo is a major bottleneck because of pure Python, feel free to unleash Cython on it. Or at least let us know and we will :)

Hi Radim, I believe you are right: the reader does buffer internally.

To me it seems the cost is more about the sheer number of calls (one per letter, across a 3_000_000-word vocab!).

If you look at the attached profiles, the _io.BufferedReader.read call was executed 47_000_000+ times in the unoptimized case (the call occurs once per letter of each word, plus it seems there are some recursive calls). This took 43.615 sec.

In the optimized case there are "only" 284_000 calls, taking 20.55 sec.

I think there is some constant per-call cost, which blows up if you make a call once per letter.
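For illustration, a toy micro-benchmark of that per-call cost, independent of gensim (BytesIO stands in for the gzip stream; numbers vary by machine):

import io
import timeit

data = b'x' * 10_000_000  # 10 MB of dummy bytes

def byte_at_a_time():
    fin = io.BytesIO(data)
    while fin.read(1):  # ten million read calls
        pass

def chunked():
    fin = io.BytesIO(data)
    while fin.read(100 * 1024):  # ~100 read calls
        pass

print('read(1): %.2f s' % timeit.timeit(byte_at_a_time, number=1))
print('chunked: %.2f s' % timeit.timeit(chunked, number=1))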

There are also some other places where precious wall time leaks; e.g., it seems that almost 7 seconds go to suppressing deprecation warnings ;)
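That one is also easy to reproduce in isolation; a toy sketch (not gensim code) comparing per-item suppression with a single suppression context hoisted around the loop:

import timeit
import warnings

def noisy():
    warnings.warn('legacy API', DeprecationWarning)

def per_item(n=200_000):
    # registers and tears down a warnings filter once per item
    for _ in range(n):
        with warnings.catch_warnings():
            warnings.simplefilter('ignore', DeprecationWarning)
            noisy()

def hoisted(n=200_000):
    # one filter registration for the whole loop
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', DeprecationWarning)
        for _ in range(n):
            noisy()

print('per item: %.2f s' % timeit.timeit(per_item, number=1))
print('hoisted: %.2f s' % timeit.timeit(hoisted, number=1))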

Anyway, a line profile would for sure be interesting. I am on it.

Here comes the line profile of the original _load_word2vec_format. Indeed, a lot goes to read, and the deprecation warnings are visible too ;)

Total time: 473.53 s                                                            
Function: _load_word2vec_format at line 148                                     

Line #      Hits         Time  Per Hit   % Time  Line Contents                  
==============================================================                  
   148                                           @profile                       
   149                                           def _load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
   150                                                                     limit=None, datatype=REAL):
   186         1          9.0      9.0      0.0      from gensim.models.keyedvectors import Vocab
   187         1          1.0      1.0      0.0      counts = None              
   188         1          1.0      1.0      0.0      if fvocab is not None:     
   189                                                   logger.info("loading word counts from %s", fvocab)
   190                                                   counts = {}            
   191                                                   with utils.open(fvocab, 'rb') as fin:
   192                                                       for line in fin:   
   193                                                           word, count = utils.to_unicode(line, errors=unicode_errors).strip().split()
   194                                                           counts[word] = int(count)
   195                                                                          
   196         1         10.0     10.0      0.0      logger.info("loading projection weights from %s", fname)
   197         1        197.0    197.0      0.0      with utils.open(fname, 'rb') as fin:
   198         1        171.0    171.0      0.0          header = utils.to_unicode(fin.readline(), encoding=encoding)
   199         1          7.0      7.0      0.0          vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
   200         1          1.0      1.0      0.0          if limit:              
   201                                                       vocab_size = min(vocab_size, limit)
   202         1         15.0     15.0      0.0          result = cls(vector_size)
   203         1          1.0      1.0      0.0          result.vector_size = vector_size
   204         1         15.0     15.0      0.0          result.vectors = zeros((vocab_size, vector_size), dtype=datatype)
   205                                                                          
   206         1          1.0      1.0      0.0          def add_word(word, weights):
   207                                                       word_id = len(result.vocab)
   208                                                       if word in result.vocab:
   209                                                           logger.warning("duplicate word '%s' in %s, ignoring all but first", word, fname)
   210                                                           return         
   211                                                       if counts is None: 
   212                                                           # most common scenario: no vocab file given. just make up some bogus counts, in descending order
   213                                                           result.vocab[word] = Vocab(index=word_id, count=vocab_size - word_id)
   214                                                       elif word in counts:
   215                                                           # use count from the vocab file
   216                                                           result.vocab[word] = Vocab(index=word_id, count=counts[word])
   217                                                       else:              
   218                                                           # vocab file given, but word is missing -- set count to None (TODO: or raise?)
   219                                                           logger.warning("vocabulary file is incomplete: '%s' is missing", word)
   220                                                           result.vocab[word] = Vocab(index=word_id, count=None)
   221                                                       result.vectors[word_id] = weights
   222                                                       result.index2word.append(word)
   223                                                                          
   224         1          1.0      1.0      0.0          if binary:             
   225         1          3.0      3.0      0.0              binary_len = dtype(REAL).itemsize * vector_size
   226   3000001    3592876.0      1.2      0.8              for _ in range(vocab_size):
   227                                                           # mixed text and binary: read text first, then binary
   228   3000000    3362574.0      1.1      0.7                  word = []      
   229   3000000    3189548.0      1.1      0.7                  while True:    
   230  44258510  116027581.0      2.6     24.5                      ch = fin.read(1)  # Python uses I/O buffering internally
   231  44258510   46767715.0      1.1      9.9                      if ch == b' ':
   232   3000000    3226081.0      1.1      0.7                          break  
   233  41258510   43162191.0      1.0      9.1                      if ch == b'':
   234                                                                   raise EOFError("unexpected end of input; is count incorrect or file otherwise damaged?")
   235  41258510   42878365.0      1.0      9.1                      if ch != b'\n':  # ignore newlines in front of words (some binary files have)
   236  41258510   46056608.0      1.1      9.7                          word.append(ch)
   237   3000000   12026255.0      4.0      2.5                  word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
   238   3000000   58346305.0     19.4     12.3                  with utils.ignore_deprecation_warning():
   239                                                               # TODO use frombuffer or something similar
   240   3000000   72620750.0     24.2     15.3                      weights = fromstring(fin.read(binary_len), dtype=REAL).astype(datatype)
   241   3000000   22272438.0      7.4      4.7                  add_word(word, weights)
   242                                                   else:                  
   243                                                       for line_no in range(vocab_size):
   244                                                           line = fin.readline()
   245                                                           if line == b'':
   246                                                               raise EOFError("unexpected end of input; is count incorrect or file otherwise damaged?")
   247                                                           parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
   248                                                           if len(parts) != vector_size + 1:
   249                                                               raise ValueError("invalid vector on line %s (is this really the text format?)" % line_no)
   250                                                           word, weights = parts[0], [datatype(x) for x in parts[1:]]
   251                                                           add_word(word, weights)
   252         1          3.0      3.0      0.0      if result.vectors.shape[0] != len(result.vocab):
   253                                                   logger.info(           
   254                                                       "duplicate words detected, shrinking matrix size from %i to %i",
   255                                                       result.vectors.shape[0], len(result.vocab)
   256                                                   )                      
   257                                                   result.vectors = ascontiguousarray(result.vectors[: len(result.vocab)])
   258         1          2.0      2.0      0.0      assert (len(result.vocab), vector_size) == result.vectors.shape
   259                                                                          
   260         1          9.0      9.0      0.0      logger.info("loaded %s matrix from %s", result.vectors.shape, fname)
   261         1          1.0      1.0      0.0      return result              

Nice! That super slow utils.ignore_deprecation_warning() is unexpected and surprising. CC @mpenkov.

I look forward to reviewing your PR. Do you see much potential for C (Cython) level optimizations left in the loading?

As far as Cython goes, I did two small tweaks yesterday (converted a custom function to a bytes.find call, and removed one unnecessary slicing operation). This gave the current version a nice 8 s boost, so now the comparison on my laptop reads:

Reading file to bytes 24.5 seconds
Reading file to KeyedVectors (Before) 127.8 seconds
Reading file to KeyedVectors (After) 36.3 seconds

This is not that far from the plain reading & decompression "baseline".

I do not see any more very easy wins even with Cython, but one pair of eyes may be easily fooled in such cases ;)

@lopusz for the record: what was your final benchmark timing, using the merged PR? (on the same dataset and HW as your timings above, apples-to-apples)

@piskvorky This is actually a very good question, leading to quite a sweet story ;)

The benchmark was done before the refactoring.

Of course I did not rerun the benchmark after the clean-up (it was really just a minor clean-up, right?).

And of course when I did rerun the benchmark, prompted by your comment, it turned out to run in ~39 seconds. OK, still not bad, but slightly slower than the 36-37ish I was getting before the refactoring.

I got curious why, and it turned out to be easy to spot. The culprit was the import inside the innermost function (remember, 3_000_000 calls) ;)

Being 2 s slower is not a tragedy, but since it is easy to fix, I propose merging the slightly updated version.

Lesson learned: there are no innocuous clean-ups when it comes to performance, and always be profiling ;)
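The fix in miniature, as a toy sketch (collections.OrderedDict stands in for the Vocab import; timings are illustrative):

import timeit

def lookup_per_call():
    # the import statement re-runs on every call: a sys.modules lookup
    # plus attribute binding, ~3_000_000 times in the loader's case
    from collections import OrderedDict
    return OrderedDict

from collections import OrderedDict  # hoisted: executed exactly once

def lookup_hoisted():
    return OrderedDict

print('per call: %.2f s' % timeit.timeit(lookup_per_call, number=3_000_000))
print('hoisted: %.2f s' % timeit.timeit(lookup_hoisted, number=3_000_000))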

PR #2682

Thanks! It makes me wonder how much we're losing due to Python (as opposed to Cython) here. You know, 3,000,000x "nothing" adds up quickly…

But if the theoretical maximum is 24s, and we're at 37s now, it's likely diminishing returns. 50% overhead (in pure Python) is actually much better than I hoped for or expected.
