A simple way of recreating:

    from numpy import zeros, uint32

    class Dummy(object):
        count = 2

    class Issue(object):
        def __init__(self):
            reps = 10000000
            self.index2word = [i for i in xrange(reps)]
            self.vocab = {i: Dummy() for i in xrange(reps)}

        def make_cum_table(self, power=0.75, domain=2**31 - 1):
            vocab_size = len(self.index2word)
            self.cum_table = zeros(vocab_size, dtype=uint32)
            train_words_pow = float(sum([self.vocab[word].count**power for word in self.vocab]))
            cumulative = 0.0
            for word_index in range(vocab_size):
                cumulative += self.vocab[self.index2word[word_index]].count**power / train_words_pow
                self.cum_table[word_index] = round(cumulative * domain)
            if len(self.cum_table) > 0:
                assert self.cum_table[-1] == domain

    def main():
        issue = Issue()
        issue.make_cum_table()

    if __name__ == "__main__":
        main()
On my computer the assertion in the last line of make_cum_table fails: self.cum_table[-1] is 2147483646 instead of the expected 2147483647. The cause is accumulated numerical error in the computation of cumulative.
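The effect is easy to see in isolation. The snippet below is a minimal standalone illustration (not gensim code) of how repeated float additions drift away from the exact total:

```python
# Ten additions of 0.1 already miss the exact total, because 0.1 is not
# exactly representable in binary floating point. Millions of additions,
# as in make_cum_table, can drift by enough to shift the final rounded value.
cumulative = 0.0
for _ in range(10):
    cumulative += 0.1
print(repr(cumulative))  # 0.9999999999999999, not 1.0
```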
A simple way to fix this is to replace:

    for word_index in range(vocab_size):
        cumulative += self.vocab[self.index2word[word_index]].count**power / train_words_pow
        self.cum_table[word_index] = round(cumulative * domain)

with:

    for word_index in range(vocab_size):
        cumulative += self.vocab[self.index2word[word_index]].count**power
        self.cum_table[word_index] = round(cumulative / train_words_pow * domain)
Thanks for reporting it and suggesting the fix. Confirmed: the issue reproduces and the fix works.
Leaving PR creation until tomorrow; it is a good beginner exercise for our sprint at PyCon India tomorrow.
Hi, I ran into this bug using the most recent gensim. I'm a new user, so I may not be doing this correctly, but I wanted to report that I hit this issue with the latest version.
python version: Python 2.7.10 :: Anaconda 2.3.0 (64-bit)
gensim version: gensim==0.13.4.1
I am using a large dataset of xml files and not preprocessing them so it made the word count huge.
Here is the last log I got:
2017-01-16 15:17:33,507 : INFO : collected 1210080506 word types and 799989 unique tags from a corpus of 800000 examples and 48220055130 words
2017-01-16 15:17:33,507 : INFO : Loading a fresh vocabulary
2017-01-16 18:11:20,085 : INFO : min_count=5 retains 63125605 unique words (5% of original 1210080506, drops 1146954901)
2017-01-16 18:11:20,085 : INFO : min_count=5 leaves 46792286761 word corpus (97% of original 48220055130, drops 1427768369)
2017-01-16 18:16:01,980 : INFO : deleting the raw counts dictionary of 1210080506 items
2017-01-17 04:09:13,357 : INFO : sample=0.001 downsamples 52 most-common words
2017-01-17 04:09:13,357 : INFO : downsampling leaves estimated 36302969642 word corpus (77.6% of prior 46792286761)
2017-01-17 04:09:13,358 : INFO : estimated required memory for 63125605 words and 300 dimensions: 184184239100 bytes
Error Message:
Traceback (most recent call last):
  File "d2v.py", line 34, in
    model = Doc2Vec(documents, size=300, window=10, workers=multiprocessing.cpu_count(), iter=1, max_vocab_size=175 * 1000000000)
  File "/home/harley_d_swick/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 624, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/home/harley_d_swick/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 535, in build_vocab
    self.finalize_vocab(update=update)  # build tables & arrays
  File "/home/harley_d_swick/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 696, in finalize_vocab
    self.make_cum_table()
  File "/home/harley_d_swick/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 493, in make_cum_table
    assert self.cum_table[-1] == domain
AssertionError
Reopening as it appears the prior fix wasn't enough.
We should probably change the assertion to dump the mismatched values, to confirm if recurrences are the same sort of accumulated-imprecision off-by-one.
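A hypothetical sketch of such a value-dumping assertion; the check_cum_table helper and the tiny arrays below are made up for illustration and are not gensim's actual code:

```python
import numpy as np

def check_cum_table(cum_table, domain):
    # Same invariant as gensim's make_cum_table, but the failure
    # message dumps the mismatched values for diagnosis.
    if len(cum_table) > 0:
        last = int(cum_table[-1])
        assert last == domain, (
            "cum_table[-1] == %d != domain == %d (off by %d)"
            % (last, domain, last - domain)
        )

check_cum_table(np.array([5, 10], dtype=np.uint32), 10)  # passes silently
try:
    check_cum_table(np.array([5, 9], dtype=np.uint32), 10)
except AssertionError as err:
    print(err)  # cum_table[-1] == 9 != domain == 10 (off by -1)
```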
We may want to see if we can reproduce this with an artificially generated raw vocab having a similarly large (63 million+) number of unique tokens and a realistic frequency distribution, to have quick-triggering test cases that don't require a real corpus scan, or even a real-corpus-derived raw vocab.
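For illustration, one way to synthesize such a vocab is to draw Zipf-distributed counts and run the same arithmetic as make_cum_table over them. The seed, vocab size, and exponent below are arbitrary choices; a real stress test would scale vocab_size toward 63 million:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 100_000                     # scale up toward 63M+ for a real stress test
counts = rng.zipf(1.2, size=vocab_size)  # heavy-tailed, Zipf-like token frequencies

# Same arithmetic as make_cum_table, applied to the synthetic counts.
power, domain = 0.75, 2**31 - 1
pows = [float(c)**power for c in counts.tolist()]
train_words_pow = float(sum(pows))       # normaliser, same iteration order as below
cumulative = 0.0
last = 0
for p in pows:
    cumulative += p
    last = round(cumulative / train_words_pow * domain)
print(last == domain)  # holds here because both sums use the same order
```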
The same problem for me:
assert self.cum_table[-1] == domain
AssertionError
It would be helpful to verify if cumulative is exactly equal to train_words_pow right before the assertion is thrown. Try printing the repr(cumulative) and repr(train_words_pow) after the for loop.
What might be happening is that in:

    sum([self.vocab[word].count**power for word in self.vocab])

iteration goes in the order of self.vocab, while in:

    for word_index in range(vocab_size):
        cumulative += self.vocab[self.index2word[word_index]].count**power
        self.cum_table[word_index] = round(cumulative / train_words_pow * domain)

it is the order of self.index2word.
Summing floats in a different order may yield different results. If that is the case, then a fix would be to replace:

    train_words_pow = float(sum([self.vocab[word].count**power for word in self.vocab]))

with:

    train_words_pow = float(sum([self.vocab[self.index2word[word_index]].count**power for word_index in range(vocab_size)]))

Or maybe a safer bet would be to get rid of the built-in sum altogether in favor of a loop, because I'm not sure whether sum is allowed to apply a float-specific summation, or whether it can promote the values along the way for higher precision.
This is all assuming that cumulative != train_words_pow.
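The order sensitivity itself is easy to demonstrate with a standalone illustration (not gensim code):

```python
# The same three values summed in two different orders give different
# results: 1.0 is absorbed when added to 1e16 first, because the spacing
# between adjacent floats at 1e16 is 2.0.
a = (1e16 + 1.0) - 1e16
b = (1e16 - 1e16) + 1.0
print(a, b)  # 0.0 1.0
```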
Let's try both solutions:

1. summing in the order of index2word
2. fsum, which avoids rounding-off errors

Hoping this will be fixed in this Sunday's coding sprint in Brazil.
The second option will (likely) not help, since the reason (I suppose) the bug exists is that we calculate two floating point values: train_words_pow and cumulative that are supposed to be equal at the end of the for loop in two different ways. So I think it's not so much about the accuracy of either of them, but more about making sure they are calculated the same way.
The following code was used to verify that the problem was indeed the order of summation:

    domain = 10**13
    count = 100000000
    train_words_pow = float(sum([(i % 10000000)**0.75 for i in reversed(xrange(count))]))
    cumulative = 0.0
    for i in xrange(count):
        cumulative += (i % 10000000)**0.75
    vfinal = round(cumulative / train_words_pow * domain)
    assert vfinal == domain
Changing sum to fsum didn't fix the issue because, as noted by @piotder, it's probably the order that matters. Timing a plain loop against the built-in sum showed that the difference in speed is negligible, and which one is faster varies from run to run:

    import time
    count = 10000000
    cumulative = 0.0

    # test sum
    t0 = time.time()
    train_words_pow = float(sum([((i % 10) + 2)**0.75 for i in xrange(count)]))
    t1 = time.time()
    total_sum = t1 - t0

    # test loop
    t0 = time.time()
    for i in xrange(count):
        cumulative += ((i % 10) + 2)**0.75
    t1 = time.time()
    total_loop = t1 - t0
So I will implement @piotder's suggestion, replacing:

    train_words_pow = float(sum([self.wv.vocab[word].count**power for word in self.wv.vocab]))

with:

    train_words_pow = 0.0
    for word_index in xrange(vocab_size):
        train_words_pow += self.wv.vocab[self.wv.index2word[word_index]].count**power
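A standalone sketch (with synthetic counts) of why matching the iteration order works: when the normaliser is accumulated by the exact same sequence of additions as the cumulative loop, the two totals are bit-for-bit identical, so the final ratio is exactly 1.0 and the last table entry lands exactly on domain:

```python
import random

random.seed(0)
counts = [random.randint(1, 10_000) for _ in range(100_000)]
power, domain = 0.75, 2**31 - 1

# Normaliser accumulated in the SAME order as the cumulative loop below.
train_words_pow = 0.0
for c in counts:
    train_words_pow += c**power

cumulative = 0.0
last = 0
for c in counts:
    cumulative += c**power
    last = round(cumulative / train_words_pow * domain)

print(cumulative == train_words_pow)  # True: identical sequence of additions
print(last == domain)                 # True: final ratio is exactly 1.0
```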
There's a report of a recurrence even after the fix: https://groups.google.com/d/msg/gensim/f2AO-wJZexs/qY_1YNQDDwAJ
Never mind; reporter was using an ancient version of gensim.