Loading a large word2vec model with load_word2vec_format(binary=True) is slow. Users complain that loading the "standard" published models, such as the Google News vectors, is too slow, and it also affects the speed of our own tests and tutorial autogeneration.
Some numbers:
%time gensim.models.keyedvectors.KeyedVectors.load_word2vec_format('./word2vec-google-news-300.gz', binary=True)
2019-10-21 22:24:08,326 : INFO : loading projection weights from ./word2vec-google-news-300.gz
2019-10-21 22:26:54,620 : INFO : loaded (3000000, 300) matrix from ./word2vec-google-news-300.gz
CPU times: user 2min 42s, sys: 3.64 s, total: 2min 46s
Wall time: 2min 46s
The I/O part itself, i.e. only loading the bytes from the file without any interpretation, takes about 30 seconds:
%time full = io.BytesIO(smart_open.open('./word2vec-google-news-300.gz', 'rb').read())
CPU times: user 20.9 s, sys: 8.13 s, total: 29.1 s
Wall time: 31.9 s
…which means our parsing code takes up the majority of the load_word2vec_format time. Ideally, the full load_word2vec_format(binary=True) shouldn't need much more than those 30 seconds (= the raw I/O speed). Nearly 3 minutes is too much.
Task: Optimize load_word2vec_format, especially the binary=True branch. The code seems to live here:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L369
Hi, I am interested in solving this issue. However, I was unable to reproduce what you got. Specifically, the following code block (from what you linked):
from gensim.models.keyedvectors import Vocab
from gensim import utils
import numpy as np
import io
import smart_open
import time

tit_file_read = time.time()
fin = utils.open("~/Downloads/GoogleNews-vectors-negative300.bin", 'rb')
tat_file_read = time.time()

header = utils.to_unicode(fin.readline(), 'utf-8')
vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
binary_len = np.dtype(np.float32).itemsize * vector_size
word2vec = {}

tit_for = time.time()
for i in range(vocab_size):
    # mixed text and binary: read text first, then binary
    word = []
    while True:
        ch = fin.read(1)  # Python uses I/O buffering internally
        if ch == b' ':
            break
        if ch == b'':
            raise EOFError("unexpected end of input; is count incorrect or file otherwise damaged?")
        if ch != b'\n':  # ignore newlines in front of words (some binary files have them)
            word.append(ch)
    word = utils.to_unicode(b''.join(word), encoding='utf-8')
    with utils.ignore_deprecation_warning():
        # TODO use frombuffer or something similar
        weights = np.fromstring(fin.read(binary_len), dtype=np.float32)
    word2vec[word] = weights
tat_for = time.time()

print("File Read", tat_file_read - tit_file_read)
print("For Loop", tat_for - tit_for)
gives the report:
File Read 0.00021314620971679688
For Loop 99.58815383911133
If you change the file reader to this:
fin = io.BytesIO(smart_open.open('~/Downloads/GoogleNews-vectors-negative300.bin', 'rb').read())
I get this report:
File Read 7.0926032066345215
For Loop 166.45309591293335
So, what we have in the library seems fine to me.
Am I missing something here?
That timing is strange. It is not possible that parsing from an in-memory buffer is 1.6x slower than reading and parsing a zipped file from disk. Maybe you ran out of RAM and you're swapping? How much free RAM do you have?
In any case, on your machine, we want to get closer to the raw I/O speed of 7 seconds, instead of 100 seconds. The parsing overhead is too much. The overhead is even worse on your machine than on mine, because you have a faster disk.
A good first step would be to line-profile the code, to see where the bottlenecks are.
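For reference, one possible setup with the line_profiler package (just a suggestion of tooling, with module paths as in the gensim 3.x source linked above; kernprof with an @profile decorator works equally well):
from line_profiler import LineProfiler

from gensim.models.keyedvectors import KeyedVectors
from gensim.models.utils_any2vec import _load_word2vec_format

profiler = LineProfiler()
# wrap the function we suspect; the wrapper records per-line hits and timings
profiled_load = profiler(_load_word2vec_format)
profiled_load(KeyedVectors, './word2vec-google-news-300.gz', binary=True)
profiler.print_stats()  # per-line hit counts and timings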
Ping @piyush-kgp are you able to profile? This would be an awesome ticket to resolve, with great impact.
Hey, sorry, I have been busy lately. Maybe someone else could take this up.
Hello @piskvorky, Hello @mpenkov
I created pull request #2671 improving on the issue.
Measurements on my laptop on standard Google news vectors:
Reading file to bytes 24.5 seconds
Reading file to KeyedVectors (Before) 127.8 seconds
Reading file to KeyedVectors (After) 44.3 seconds
No magic, but I believe it's a solid improvement. I am curious what you think.
I used the following benchmark:
import timeit

import numpy as np

import gensim.models.keyedvectors
import gensim.utils


def read_plain(fname):
    with gensim.utils.open(fname, "rb") as fin:
        res = fin.read()
    return res


W2V_FNAME = './word2vec-google-news-300.gz'

print("Reading file to bytes")
t = timeit.default_timer()
x = read_plain(W2V_FNAME)
delta_t = timeit.default_timer() - t
print("Time for the operation %.1f seconds\n" % delta_t)

print("Reading file to KeyedVectors")
t = timeit.default_timer()
x = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format(W2V_FNAME, binary=True)
delta_t = timeit.default_timer() - t
print("Time for the operation %.1f seconds\n" % delta_t)
print("Number of words in the obtained model = %d" % len(x.vocab))
Thanks @lopusz. Can you share the profiling numbers? Plus a high-level description of your "attack vector": what optimization choices you made, and why.
Let's go over that before delving into the code nitty-gritty. Cheers.
Hello Radim,
Thanks for your quick reaction. Indeed, I should probably be more descriptive in the ticket.
A helicopter view of the approach:
As you know, a w2v binary file consists of blocks, each containing a word, a space, and a vector (a block of float32s).
The original approach reads each block character by character until the space: on every read, the character is appended to a list; finally, join is called on the list of characters to obtain the word, and then a bulk read of the vector takes place. Obviously, this has quite a lot of overhead per word.
My take is to read one chunk (default size 100 KB) of bytes from the file, then parse as many complete word-space-vector blocks as the chunk contains, then read and append the next chunk, repeating until the file is exhausted. Parsing is done with minimal copying and bookkeeping. The patch is fairly short. Today I tried to make it even more readable, so it can perhaps speak better than I can.
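To make the idea concrete, here is a minimal, self-contained sketch of the chunked parsing scheme (illustrative only — the helper names are made up and the actual PR code differs in the details):
import numpy as np


def parse_complete_blocks(chunk, vector_size, binary_len, result):
    # Parse as many complete word-space-vector blocks as `chunk` holds;
    # return the unparsed tail, to be prepended to the next chunk.
    start = 0
    while True:
        space_pos = chunk.find(b' ', start)       # end of the word
        vector_end = space_pos + 1 + binary_len   # end of the float32 block
        if space_pos == -1 or vector_end > len(chunk):
            return chunk[start:]                  # incomplete block, keep for next round
        word = chunk[start:space_pos].lstrip(b'\n').decode('utf-8')
        # copy, so the parsed vector does not pin the whole chunk in memory
        vector = np.frombuffer(chunk, dtype=np.float32,
                               count=vector_size, offset=space_pos + 1).copy()
        result[word] = vector
        start = vector_end


def load_binary_chunked(fin, vocab_size, vector_size, chunk_size=100 * 1024):
    # vocab_size and vector_size come from the header line, parsed beforehand
    binary_len = np.dtype(np.float32).itemsize * vector_size
    result, tail = {}, b''
    while len(result) < vocab_size:
        chunk = fin.read(chunk_size)
        if not chunk:
            raise EOFError("unexpected end of input; is the count correct?")
        tail = parse_complete_blocks(tail + chunk, vector_size, binary_len, result)
    return result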
Profiles:
Original (truncated below 0.5 sec tottime):
351516849 function calls (351056029 primitive calls) in 187.918 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 27.915 27.915 187.293 187.293 utils_any2vec.py:149(_load_word2vec_format)
47258510 19.851 0.000 78.680 0.000 gzip.py:271(read)
47703404/47258549 18.706 0.000 43.615 0.000 {method 'read' of '_io.BufferedReader' objects}
444856 16.212 0.000 16.212 0.000 {method 'decompress' of 'zlib.Decompress' objects}
3000000 12.010 0.000 12.010 0.000 {built-in method numpy.fromstring}
3000000 11.692 0.000 14.636 0.000 utils_any2vec.py:206(add_word)
47258511 10.521 0.000 15.215 0.000 _compression.py:12(_check_not_closed)
3000005 6.663 0.000 19.612 0.000 warnings.py:119(filterwarnings)
6000000 5.730 0.000 32.572 0.000 utils.py:1479(ignore_deprecation_warning)
47258513 4.693 0.000 4.693 0.000 gzip.py:298(closed)
3000013 4.177 0.000 7.640 0.000 warnings.py:159(_add_filter)
44336046 3.302 0.000 3.302 0.000 {method 'append' of 'list' objects}
3000001 3.259 0.000 3.964 0.000 utils.py:348(any2unicode)
444857 3.008 0.000 3.008 0.000 {built-in method zlib.crc32}
3000000 2.980 0.000 3.304 0.000 warnings.py:449(__enter__)
3000001 2.946 0.000 2.946 0.000 {method 'astype' of 'numpy.ndarray' objects}
3000000 2.648 0.000 3.269 0.000 contextlib.py:59(__init__)
3000009 2.337 0.000 2.337 0.000 {method 'remove' of 'list' objects}
3000000 2.318 0.000 7.694 0.000 contextlib.py:85(__exit__)
6000665 2.249 0.000 2.314 0.000 re.py:286(_compile)
3000000 2.181 0.000 2.466 0.000 warnings.py:468(__exit__)
444856 1.917 0.000 26.821 0.000 _compression.py:66(readinto)
3000000 1.898 0.000 1.898 0.000 {method 'join' of 'bytes' objects}
15027404/15027403 1.851 0.000 1.851 0.000 {built-in method builtins.isinstance}
6000027 1.779 0.000 34.351 0.000 {built-in method builtins.next}
3000000 1.526 0.000 2.204 0.000 keyedvectors.py:203(__init__)
6000199 1.524 0.000 3.837 0.000 re.py:231(compile)
3000000 1.460 0.000 1.460 0.000 warnings.py:428(__init__)
444856 1.411 0.000 24.581 0.000 gzip.py:438(read)
3000000 1.366 0.000 4.635 0.000 contextlib.py:157(helper)
3000000 0.956 0.000 29.930 0.000 contextlib.py:79(__enter__)
444893 0.902 0.000 2.814 0.000 gzip.py:80(read)
9000013 0.875 0.000 0.875 0.000 {built-in method _warnings._filters_mutated}
3000018 0.861 0.000 0.861 0.000 {method 'insert' of 'list' objects}
5714868/5714048 0.723 0.000 0.723 0.000 {built-in method builtins.len}
3000596 0.679 0.000 0.679 0.000 {method 'update' of 'dict' objects}
444856 0.626 0.000 3.690 0.000 gzip.py:489(_add_read_data)
3011200 0.626 0.000 0.626 0.000 {built-in method builtins.getattr}
After optimization (truncated below 0.1 sec tottime):
37251884 function calls (36986852 primitive calls) in 45.248 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
249066 14.593 0.000 14.593 0.000 {method 'decompress' of 'zlib.Decompress' objects}
3000000 8.391 0.000 11.215 0.000 utils_any2vec.py:189(__add_word_to_result)
35589 4.894 0.000 23.696 0.001 utils_any2vec.py:213(__add_words_from_binary_chunk_to_result)
249068 2.842 0.000 2.842 0.000 {built-in method zlib.crc32}
3000000 2.767 0.000 2.767 0.000 {built-in method numpy.frombuffer}
3000000 1.790 0.000 2.285 0.000 keyedvectors.py:203(__init__)
3000001 1.756 0.000 1.756 0.000 {method 'astype' of 'numpy.ndarray' objects}
284695/35628 1.188 0.000 20.557 0.001 {method 'read' of '_io.BufferedReader' objects}
3000009 1.134 0.000 1.134 0.000 {method 'decode' of 'bytes' objects}
3000000 1.038 0.000 1.300 0.000 utils_any2vec.py:207(__remove_initial_new_line)
249067 0.667 0.000 20.270 0.000 _compression.py:66(readinto)
7468957/7468137 0.657 0.000 0.657 0.000 {built-in method builtins.len}
3035588 0.629 0.000 0.629 0.000 {method 'find' of 'bytes' objects}
249067 0.601 0.000 19.475 0.000 gzip.py:438(read)
3000596 0.495 0.000 0.495 0.000 {method 'update' of 'dict' objects}
1 0.314 0.314 44.665 44.665 utils_any2vec.py:149(_load_word2vec_format)
3077536 0.260 0.000 0.260 0.000 {method 'append' of 'list' objects}
249066 0.225 0.000 3.088 0.000 gzip.py:489(_add_read_data)
249106 0.219 0.000 1.120 0.000 gzip.py:80(read)
I hope this clears up the details ;)
Best regards,
Michał
cc @piskvorky
Thanks again. I meant a line-profile of that one function, _load_word2vec_format.
I'm still curious, because my understanding was that read() should already be doing buffering internally. But since your extra buffering on top improves performance so much, I guess it isn't. Some of my assumptions must have been wrong.
And by the way, in connection with your PR, which already shows great improvements: we're shipping automatically compiled versions of Gensim too, for {Windows, Mac, Linux} x {various Python versions}. So if you feel any one function / loop / algo is a major bottleneck because of pure Python, feel free to unleash Cython on it. Or at least let us know and we will :)
Hi Radim, I believe you are right: the reader does buffer internally.
It seems to me the problem is more about the sheer number of calls (one call per letter, across a 3,000,000-word vocab!).
If you look at the attached profiles, the _io.BufferedReader.read call was executed 47,000,000+ times in the unoptimized case (one call per letter of each word, plus what appear to be some recursive calls). This took 43.615 sec.
In the optimized case there are "only" 284,000+ calls, taking 20.55 sec.
I think there is some constant per-call cost, which blows up if you make one call per letter.
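To see that per-call cost in isolation, a toy micro-benchmark (independent of gensim and of any disk I/O):
import io
import timeit

data = b'x' * 10_000_000  # ~10 MB of dummy bytes, entirely in memory


def one_byte_at_a_time():
    buf = io.BytesIO(data)
    while buf.read(1):           # one Python-level call per byte
        pass


def in_chunks(chunk_size=100 * 1024):
    buf = io.BytesIO(data)
    while buf.read(chunk_size):  # ~100 calls in total
        pass


print("read(1):      %.2f s" % timeit.timeit(one_byte_at_a_time, number=1))
print("read(100 KB): %.4f s" % timeit.timeit(in_chunks, number=1))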
There are also some other places where precious wall time leaks; e.g., it seems that almost 7 seconds go to suppressing deprecation warnings ;)
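For illustration, a runnable toy version of the fix (synthetic data, not the PR code): np.frombuffer is the non-deprecated replacement for np.fromstring, so the per-iteration suppression context can disappear entirely.
import io

import numpy as np

vector_size, vocab_size = 300, 1000
binary_len = np.dtype(np.float32).itemsize * vector_size
# synthetic stand-in for the vector portion of a w2v binary file
fin = io.BytesIO(np.zeros(vocab_size * vector_size, dtype=np.float32).tobytes())

for _ in range(vocab_size):
    # no DeprecationWarning raised here, hence no ignore_deprecation_warning() needed
    weights = np.frombuffer(fin.read(binary_len), dtype=np.float32)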
Anyway, a line profile would for sure be interesting. I am on it.
Here comes the line profile of the original _load_word2vec_format. Indeed, a lot goes to read; the deprecation warnings are visible too ;)
Total time: 473.53 s
Function: _load_word2vec_format at line 148
Line # Hits Time Per Hit % Time Line Contents
==============================================================
148 @profile
149 def _load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
150 limit=None, datatype=REAL):
186 1 9.0 9.0 0.0 from gensim.models.keyedvectors import Vocab
187 1 1.0 1.0 0.0 counts = None
188 1 1.0 1.0 0.0 if fvocab is not None:
189 logger.info("loading word counts from %s", fvocab)
190 counts = {}
191 with utils.open(fvocab, 'rb') as fin:
192 for line in fin:
193 word, count = utils.to_unicode(line, errors=unicode_errors).strip().split()
194 counts[word] = int(count)
195
196 1 10.0 10.0 0.0 logger.info("loading projection weights from %s", fname)
197 1 197.0 197.0 0.0 with utils.open(fname, 'rb') as fin:
198 1 171.0 171.0 0.0 header = utils.to_unicode(fin.readline(), encoding=encoding)
199 1 7.0 7.0 0.0 vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
200 1 1.0 1.0 0.0 if limit:
201 vocab_size = min(vocab_size, limit)
202 1 15.0 15.0 0.0 result = cls(vector_size)
203 1 1.0 1.0 0.0 result.vector_size = vector_size
204 1 15.0 15.0 0.0 result.vectors = zeros((vocab_size, vector_size), dtype=datatype)
205
206 1 1.0 1.0 0.0 def add_word(word, weights):
207 word_id = len(result.vocab)
208 if word in result.vocab:
209 logger.warning("duplicate word '%s' in %s, ignoring all but first", word, fname)
210 return
211 if counts is None:
212 # most common scenario: no vocab file given. just make up some bogus counts, in descending order
213 result.vocab[word] = Vocab(index=word_id, count=vocab_size - word_id)
214 elif word in counts:
215 # use count from the vocab file
216 result.vocab[word] = Vocab(index=word_id, count=counts[word])
217 else:
218 # vocab file given, but word is missing -- set count to None (TODO: or raise?)
219 logger.warning("vocabulary file is incomplete: '%s' is missing", word)
220 result.vocab[word] = Vocab(index=word_id, count=None)
221 result.vectors[word_id] = weights
222 result.index2word.append(word)
223
224 1 1.0 1.0 0.0 if binary:
225 1 3.0 3.0 0.0 binary_len = dtype(REAL).itemsize * vector_size
226 3000001 3592876.0 1.2 0.8 for _ in range(vocab_size):
227 # mixed text and binary: read text first, then binary
228 3000000 3362574.0 1.1 0.7 word = []
229 3000000 3189548.0 1.1 0.7 while True:
230 44258510 116027581.0 2.6 24.5 ch = fin.read(1) # Python uses I/O buffering internally
231 44258510 46767715.0 1.1 9.9 if ch == b' ':
232 3000000 3226081.0 1.1 0.7 break
233 41258510 43162191.0 1.0 9.1 if ch == b'':
234 raise EOFError("unexpected end of input; is count incorrect or file otherwise damaged?")
235 41258510 42878365.0 1.0 9.1 if ch != b'\n': # ignore newlines in front of words (some binary files have)
236 41258510 46056608.0 1.1 9.7 word.append(ch)
237 3000000 12026255.0 4.0 2.5 word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
238 3000000 58346305.0 19.4 12.3 with utils.ignore_deprecation_warning():
239 # TODO use frombuffer or something similar
240 3000000 72620750.0 24.2 15.3 weights = fromstring(fin.read(binary_len), dtype=REAL).astype(datatype)
241 3000000 22272438.0 7.4 4.7 add_word(word, weights)
242 else:
243 for line_no in range(vocab_size):
244 line = fin.readline()
245 if line == b'':
246 raise EOFError("unexpected end of input; is count incorrect or file otherwise damaged?")
247 parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
248 if len(parts) != vector_size + 1:
249 raise ValueError("invalid vector on line %s (is this really the text format?)" % line_no)
250 word, weights = parts[0], [datatype(x) for x in parts[1:]]
251 add_word(word, weights)
252 1 3.0 3.0 0.0 if result.vectors.shape[0] != len(result.vocab):
253 logger.info(
254 "duplicate words detected, shrinking matrix size from %i to %i",
255 result.vectors.shape[0], len(result.vocab)
256 )
257 result.vectors = ascontiguousarray(result.vectors[: len(result.vocab)])
258 1 2.0 2.0 0.0 assert (len(result.vocab), vector_size) == result.vectors.shape
259
260 1 9.0 9.0 0.0 logger.info("loaded %s matrix from %s", result.vectors.shape, fname)
261 1 1.0 1.0 0.0 return result
Nice! That super slow utils.ignore_deprecation_warning() is quite unexpected. CC @mpenkov.
I look forward to reviewing your PR. Do you see much potential for C (Cython) level optimizations left in the loading?
As far as Cython goes: I did two small tweaks yesterday (converted a custom function to a bytes.find call, and removed one unnecessary slicing). It gave the current version a nice 8 s boost. So now the comparison on my laptop reads:
Reading file to bytes 24.5 seconds
Reading file to KeyedVectors (Before) 127.8 seconds
Reading file to KeyedVectors (After) 36.3 seconds
This is not that far from the plain reading & decompression "baseline".
I do not see any more easy wins, even with Cython, but a single pair of eyes is easily fooled in such cases ;)
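(For flavor, a toy comparison behind the bytes.find tweak — a hand-rolled scan versus the built-in, not the actual PR code:)
import timeit

data = b'x' * 999 + b' '


def manual_scan():
    # roughly what a hand-rolled "find the space" helper does, byte by byte
    for i, b in enumerate(data):
        if b == 0x20:
            return i


def builtin_find():
    return data.find(b' ')  # a single C-level call


print(timeit.timeit(manual_scan, number=10_000))
print(timeit.timeit(builtin_find, number=10_000))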
@lopusz for the record: what was your final benchmark timing, using the merged PR? (on the same dataset and HW as your timings above, apples-to-apples)
@piskvorky This is actually a very good question, leading to quite a sweet story ;)
The benchmark was done before refactoring.
Of course I did not re-run the benchmark after the clean-up (it was really just a minor clean-up, right?).
And of course, when I did re-run it, prompted by your comment, it turned out to take ~39 seconds. Still not bad, but slightly slower than the 36–37ish seconds I was getting before the refactoring.
I got curious why, and it turned out to be easy to spot: the culprit was an import inside the innermost function (remember, 3,000,000 calls!) ;)
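A toy demonstration of that cost (not the PR code): an import statement executed millions of times is cheap per call, but not free.
import timeit

from gensim.models.keyedvectors import Vocab  # module-level import, done once


def lookup_with_inner_import():
    # re-runs the import machinery on every call (a cached sys.modules
    # lookup plus a local rebind -- cheap, but it adds up)
    from gensim.models.keyedvectors import Vocab
    return Vocab


def lookup_with_outer_import():
    return Vocab


print(timeit.timeit(lookup_with_inner_import, number=3_000_000))
print(timeit.timeit(lookup_with_outer_import, number=3_000_000))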
Being 2 seconds slower is not a tragedy, but since it is easy to fix, I propose merging the slightly updated version.
Lesson learned: there are no innocuous clean-ups when it comes to performance, and always be profiling ;)
PR #2682
Thanks! It makes me wonder how much we're losing due to Python (as opposed to Cython) here. You know, 3,000,000x "nothing" adds up quickly…
But if the theoretical minimum is 24 s and we're at 37 s now, it's likely diminishing returns. 50% overhead (in pure Python) is actually much better than I hoped for or expected.