While loading LSI vectors (4411273 documents, 500 terms) stored in Matrix Market format, gensim raised an OverflowError.
In [14]: gensim.__version__
Out[14]: '3.4.0'
In [6]: corpus = MmCorpus(datapath('lsi_vectors.mm.gz'))
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-6-9e22aec68b3c> in <module>()
----> 1 corpus = MmCorpus(datapath('lsi_vectors.mm.gz'))
/Users/aukauk/anaconda3.6/anaconda3/envs/py27/lib/python2.7/site-packages/gensim/corpora/mmcorpus.pyc in __init__(self, fname)
62 # avoid calling super(), too confusing
63 IndexedCorpus.__init__(self, fname)
---> 64 matutils.MmReader.__init__(self, fname)
65
66 def __iter__(self):
/Users/aukauk/anaconda3.6/anaconda3/envs/py27/lib/python2.7/site-packages/gensim/corpora/_mmreader.pyx in gensim.corpora._mmreader.MmReader.__init__()
58 logger.info("initializing cython corpus reader from %s", input)
59 self.input, self.transposed = input, transposed
---> 60 with utils.open_file(self.input) as lines:
61 try:
62 header = utils.to_unicode(next(lines)).strip()
/Users/aukauk/anaconda3.6/anaconda3/envs/py27/lib/python2.7/site-packages/gensim/corpora/_mmreader.pyx in gensim.corpora._mmreader.MmReader.__init__()
73 line = utils.to_unicode(line)
74 if not line.startswith('%'):
---> 75 self.num_docs, self.num_terms, self.num_nnz = (int(x) for x in line.split())
76 if not self.transposed:
77 self.num_docs, self.num_terms = self.num_terms, self.num_docs
OverflowError: value too large to convert to int
cat lsi_vectors.mm |wc -l
2213498865
head lsi_vectors.mm
%%MatrixMarket matrix coordinate real general
4427006 500 2213498863
1 1 0.3913027376444812
1 2 -0.07658791716226626
1 3 -0.020870794080588395
1 4 0.2145833024464887
1 5 0.16483779845897858
1 6 -0.05127146459864627
1 7 0.007765814982918945
1 8 -0.01817635794795088
Hello @KartikTaskhuman, you are right, 2213498865 is definitely too large for a signed int (though it would fit in an unsigned one). Thanks for the report!
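For reference, the range check behind that statement can be sketched in a few lines of Python. This is only an illustration: it assumes a 32-bit C int (typical, but platform-dependent), and the helper name `fits` is hypothetical, not part of gensim.

```python
# Entry count taken from the header line of lsi_vectors.mm shown above.
num_nnz = 2213498863

# Assumed widths: 32-bit int, 64-bit long long (typical, not guaranteed).
INT32_MAX = 2**31 - 1    # largest signed 32-bit value
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit value
INT64_MAX = 2**63 - 1    # largest signed 64-bit value


def fits(value, max_value):
    """Hypothetical helper: True if a non-negative value does not exceed max_value."""
    return 0 <= value <= max_value


print(fits(num_nnz, INT32_MAX))   # False: overflows a signed 32-bit int
print(fits(num_nnz, UINT32_MAX))  # True: would fit in an unsigned 32-bit int
print(fits(num_nnz, INT64_MAX))   # True: fits comfortably in 64 bits
```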
@arlenk can you have a look, please?
Sorry, I made the silly mistake of using ints for all the numeric types.
The PR version uses long longs for everything. Presumably this is large enough (a long long is typically a 64-bit value).
I didn't use unsigned values because "previd" is set to -1 before the loop body and is then compared against docid, so I didn't want to open up the possibility of signed vs. unsigned comparison bugs.
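The signed vs. unsigned concern can be illustrated with a small ctypes sketch. It is not the reader's actual code: the variable names are made up for the example, the `<` comparison is chosen only for illustration (the thread does not show the real comparison), and a 64-bit long long is assumed.

```python
import ctypes

# A -1 sentinel stored in a signed 64-bit type keeps its value.
previd_signed = ctypes.c_longlong(-1).value

# The same sentinel stored in an unsigned 64-bit type wraps around
# modulo 2**64 (ctypes performs no overflow checking).
previd_unsigned = ctypes.c_ulonglong(-1).value

docid = 0  # first document id, assumed here for illustration

print(previd_signed < docid)         # True: -1 sorts before every valid id
print(previd_unsigned < docid)       # False: the sentinel became a huge value
print(previd_unsigned == 2**64 - 1)  # True
```

With a signed type the sentinel stays below every valid id, which is why keeping the values signed avoids this class of bug.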