A simple way of recreating:

    from numpy import zeros, uint32

    class Dummy(object):
        count = 2

    class Issue(object):
        def __init__(self):
            reps = 10000000
            self.index2word = [i for i in xrange(reps)]
            self.vocab = {i: Dummy() for i in xrange(reps)}

        def make_cum_table(self, power=0.75, domain=2**31 - 1):
            vocab_size = len(self.index2word)
            self.cum_table = zeros(vocab_size, dtype=uint32)
            train_words_pow = float(sum([self.vocab[word].count**power for word in self.vocab]))
            cumulative = 0.0
            for word_index in range(vocab_size):
                cumulative += self.vocab[self.index2word[word_index]].count**power / train_words_pow
                self.cum_table[word_index] = round(cumulative * domain)
            if len(self.cum_table) > 0:
                assert self.cum_table[-1] == domain

    def main():
        issue = Issue()
        issue.make_cum_table()

    if __name__ == "__main__":
        main()
On my computer the assertion in the last line of make_cum_table fails: self.cum_table[-1] is 2147483646 instead of the expected 2147483647. The cause is accumulated numerical error in the computation of cumulative.
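The effect is easy to see in isolation. The snippet below is a minimal standalone illustration (not gensim code) of how repeated float additions drift away from the exact total:

```python
# Ten additions of 0.1 already miss the exact total, because 0.1 is not
# exactly representable in binary floating point. Millions of additions,
# as in make_cum_table, can drift by enough to shift the final rounded value.
cumulative = 0.0
for _ in range(10):
    cumulative += 0.1
print(repr(cumulative))  # 0.9999999999999999, not 1.0
```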
A simple way to fix this is to replace:

    for word_index in range(vocab_size):
        cumulative += self.vocab[self.index2word[word_index]].count**power / train_words_pow
        self.cum_table[word_index] = round(cumulative * domain)

with:

    for word_index in range(vocab_size):
        cumulative += self.vocab[self.index2word[word_index]].count**power
        self.cum_table[word_index] = round(cumulative / train_words_pow * domain)
Thanks for reporting it and suggesting the fix. Confirmed: the issue reproduces and the fix works.
Leaving PR creation until tomorrow; it is a good beginner exercise for our sprint at PyCon India tomorrow.
Hi, I ran into this bug using the most recent gensim. I'm a new user, so I may not be doing this correctly, but I wanted to report that I hit this issue with the latest version.
python version: Python 2.7.10 :: Anaconda 2.3.0 (64-bit)
gensim version: gensim==0.13.4.1
I am using a large dataset of xml files and not preprocessing them so it made the word count huge.
Here is the last log I got:
2017-01-16 15:17:33,507 : INFO : collected 1210080506 word types and 799989 unique tags from a corpus of 800000 examples and 48220055130 words
2017-01-16 15:17:33,507 : INFO : Loading a fresh vocabulary
2017-01-16 18:11:20,085 : INFO : min_count=5 retains 63125605 unique words (5% of original 1210080506, drops 1146954901)
2017-01-16 18:11:20,085 : INFO : min_count=5 leaves 46792286761 word corpus (97% of original 48220055130, drops 1427768369)
2017-01-16 18:16:01,980 : INFO : deleting the raw counts dictionary of 1210080506 items
2017-01-17 04:09:13,357 : INFO : sample=0.001 downsamples 52 most-common words
2017-01-17 04:09:13,357 : INFO : downsampling leaves estimated 36302969642 word corpus (77.6% of prior 46792286761)
2017-01-17 04:09:13,358 : INFO : estimated required memory for 63125605 words and 300 dimensions: 184184239100 bytes
Error Message:
Traceback (most recent call last):
  File "d2v.py", line 34, in
    model = Doc2Vec(documents, size=300, window=10, workers=multiprocessing.cpu_count(), iter=1, max_vocab_size=175 * 1000000000)
  File "/home/harley_d_swick/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 624, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/home/harley_d_swick/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 535, in build_vocab
    self.finalize_vocab(update=update)  # build tables & arrays
  File "/home/harley_d_swick/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 696, in finalize_vocab
    self.make_cum_table()
  File "/home/harley_d_swick/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 493, in make_cum_table
    assert self.cum_table[-1] == domain
AssertionError
Reopening as it appears the prior fix wasn't enough.
We should probably change the assertion to dump the mismatched values, to confirm if recurrences are the same sort of accumulated-imprecision off-by-one.
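A hypothetical sketch of such a value-dumping assertion; the check_cum_table helper and the tiny arrays below are made up for illustration and are not gensim's actual code:

```python
import numpy as np

def check_cum_table(cum_table, domain):
    # Same invariant as gensim's make_cum_table, but the failure
    # message dumps the mismatched values for diagnosis.
    if len(cum_table) > 0:
        last = int(cum_table[-1])
        assert last == domain, (
            "cum_table[-1] == %d != domain == %d (off by %d)"
            % (last, domain, last - domain)
        )

check_cum_table(np.array([5, 10], dtype=np.uint32), 10)  # passes silently
try:
    check_cum_table(np.array([5, 9], dtype=np.uint32), 10)
except AssertionError as err:
    print(err)  # cum_table[-1] == 9 != domain == 10 (off by -1)
```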
We may want to see if we can reproduce this with an artificially generated raw vocab having a similarly large (63 million+) number of unique tokens and a realistic frequency distribution, to have quick-triggering test cases that don't require a real corpus scan, or even a real-corpus-derived raw vocab.
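For illustration, one way to synthesize such a vocab is to draw Zipf-distributed counts and run the same arithmetic as make_cum_table over them. The seed, vocab size, and exponent below are arbitrary choices; a real stress test would scale vocab_size toward 63 million:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 100_000                     # scale up toward 63M+ for a real stress test
counts = rng.zipf(1.2, size=vocab_size)  # heavy-tailed, Zipf-like token frequencies

# Same arithmetic as make_cum_table, applied to the synthetic counts.
power, domain = 0.75, 2**31 - 1
pows = [float(c)**power for c in counts.tolist()]
train_words_pow = float(sum(pows))       # normaliser, same iteration order as below
cumulative = 0.0
last = 0
for p in pows:
    cumulative += p
    last = round(cumulative / train_words_pow * domain)
print(last == domain)  # holds here because both sums use the same order
```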
The same problem for me:
assert self.cum_table[-1] == domain
AssertionError
It would be helpful to verify if cumulative is exactly equal to train_words_pow right before the assertion is thrown. Try printing the repr(cumulative) and repr(train_words_pow) after the for loop.
What might be happening is that in:

    sum([self.vocab[word].count**power for word in self.vocab])

iteration goes in the order of self.vocab, while in:

    for word_index in range(vocab_size):
        cumulative += self.vocab[self.index2word[word_index]].count**power
        self.cum_table[word_index] = round(cumulative / train_words_pow * domain)

it is the order of self.index2word.
Summing floats in a different order may yield different results. If that is the case, then a fix would be to replace:

    train_words_pow = float(sum([self.vocab[word].count**power for word in self.vocab]))

with:

    train_words_pow = float(sum([self.vocab[self.index2word[word_index]].count**power for word_index in range(vocab_size)]))

Or maybe a safer bet would be to get rid of the built-in sum altogether in favor of a loop, because I'm not sure whether sum is allowed to apply a float-specific summation, or whether it can promote the values along the way for higher precision.
This is all assuming that cumulative != train_words_pow.
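The order sensitivity itself is easy to demonstrate with a standalone illustration (not gensim code):

```python
# The same three values summed in two different orders give different
# results: 1.0 is absorbed when added to 1e16 first, because the spacing
# between adjacent floats at 1e16 is 2.0.
a = (1e16 + 1.0) - 1e16
b = (1e16 - 1e16) + 1.0
print(a, b)  # 0.0 1.0
```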
Let's try both solutions:

1. summing in the order of index2word
2. fsum, which avoids rounding-off errors

Hoping this will be fixed in this Sunday's coding sprint in Brazil.
The second option will (likely) not help, since the reason (I suppose) the bug exists is that we calculate two floating point values: train_words_pow and cumulative that are supposed to be equal at the end of the for loop in two different ways. So I think it's not so much about the accuracy of either of them, but more about making sure they are calculated the same way.
The following code was used to verify that the problem was indeed the order of summation:

    domain = 10**13
    count = 100000000
    train_words_pow = float(sum([(i % 10000000)**0.75 for i in reversed(xrange(count))]))
    cumulative = 0.0
    for i in xrange(count):
        cumulative += (i % 10000000)**0.75
    vfinal = round(cumulative / train_words_pow * domain)
    assert vfinal == domain
Changing sum to fsum didn't fix the issue because, as noted by @piotder, it's probably the order that matters. Timing a plain loop against the built-in sum showed that the difference in speed is negligible, and which one is faster varies from run to run:

    import time
    count = 10000000
    cumulative = 0.0

    # test sum
    t0 = time.time()
    train_words_pow = float(sum([((i % 10) + 2)**0.75 for i in xrange(count)]))
    t1 = time.time()
    total_sum = t1 - t0

    # test loop
    t0 = time.time()
    for i in xrange(count):
        cumulative += ((i % 10) + 2)**0.75
    t1 = time.time()
    total_loop = t1 - t0
So I will implement @piotder's suggestion, replacing:

    train_words_pow = float(sum([self.wv.vocab[word].count**power for word in self.wv.vocab]))

with:

    train_words_pow = 0.0
    for word_index in xrange(vocab_size):
        train_words_pow += self.wv.vocab[self.wv.index2word[word_index]].count**power
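A standalone sketch (with synthetic counts) of why matching the iteration order works: when the normaliser is accumulated by the exact same sequence of additions as the cumulative loop, the two totals are bit-for-bit identical, so the final ratio is exactly 1.0 and the last table entry lands exactly on domain:

```python
import random

random.seed(0)
counts = [random.randint(1, 10_000) for _ in range(100_000)]
power, domain = 0.75, 2**31 - 1

# Normaliser accumulated in the SAME order as the cumulative loop below.
train_words_pow = 0.0
for c in counts:
    train_words_pow += c**power

cumulative = 0.0
last = 0
for c in counts:
    cumulative += c**power
    last = round(cumulative / train_words_pow * domain)

print(cumulative == train_words_pow)  # True: identical sequence of additions
print(last == domain)                 # True: final ratio is exactly 1.0
```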
There's a report of a recurrence even after the fix: https://groups.google.com/d/msg/gensim/f2AO-wJZexs/qY_1YNQDDwAJ
Never mind; reporter was using an ancient version of gensim.