Gensim: Integer overflow during `FastText` training with `corpus_file`

Created on 5 Nov 2018 · 5 comments · Source: RaRe-Technologies/gensim

Description


```python
model = FastText(corpus_file="sentences_norm.txt.gz", workers=14, iter=5, size=200, sg=1, hs=1)
```

with the following sizes

```
2018-11-05 16:57:52,809 : INFO : collected 6532860 word types from a corpus of 4728738902 raw words and 238627116 sentences
2018-11-05 16:57:52,809 : INFO : Loading a fresh vocabulary
2018-11-05 16:58:00,788 : INFO : effective_min_count=5 retains 1887156 unique words (28% of original 6532860, drops 4645704)
2018-11-05 16:58:00,788 : INFO : effective_min_count=5 leaves 4721157112 word corpus (99% of original 4728738902, drops 7581790)
2018-11-05 16:58:07,437 : INFO : deleting the raw counts dictionary of 6532860 items
2018-11-05 16:58:07,615 : INFO : sample=0.001 downsamples 26 most-common words
2018-11-05 16:58:07,615 : INFO : downsampling leaves estimated 3749158657 word corpus (79.4% of prior 4721157112)
2018-11-05 16:58:11,281 : INFO : constructing a huffman tree from 1887156 words
2018-11-05 16:59:36,077 : INFO : built huffman tree with maximum node depth 30
2018-11-05 17:00:17,300 : INFO : estimated required memory for 1887156 words, 1929637 buckets and 200 dimensions: 7871448352 bytes
2018-11-05 17:00:17,398 : INFO : resetting layer weights
2018-11-05 17:01:43,333 : INFO : Total number of ngrams is 1929637
2018-11-05 17:02:11,990 : INFO : training model with 14 workers on 1887156 vocabulary and 200 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
```

yields

```
Exception in thread Thread-2120:
Traceback (most recent call last):
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 175, in _worker_loop_corpusfile
    total_examples=total_examples, total_words=total_words, **kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/fasttext.py", line 561, in _do_train_epoch
    total_examples, total_words, work, neu1)
  File "gensim/models/fasttext_corpusfile.pyx", line 126, in gensim.models.fasttext_corpusfile.train_epoch_sg
OverflowError: value too large to convert to int
```

on all workers. The `sg` and `hs` parameters appear unrelated to this; the same error occurs without them.

### Steps to reproduce
`model = FastText(corpus_file="sentences_norm.txt.gz", workers=14, iter=5, size=200)`

#### Expected Results
The model trains to completion.

#### Actual Results
Exception thrown, no further output. 

```
Traceback (most recent call last):
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 175, in _worker_loop_corpusfile
    total_examples=total_examples, total_words=total_words, **kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/fasttext.py", line 561, in _do_train_epoch
    total_examples, total_words, work, neu1)
  File "gensim/models/fasttext_corpusfile.pyx", line 126, in gensim.models.fasttext_corpusfile.train_epoch_sg
OverflowError: value too large to convert to int
```
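For context, the raw word count reported in the log above (4,728,738,902) needs more than 32 bits, which is why converting it to a C `int` in the `corpus_file` Cython path overflows. A quick sanity check in plain Python (no gensim required):

```python
# Raw word count reported in the training log above.
total_words = 4_728_738_902

# Maximum value of a signed 32-bit C int.
INT32_MAX = 2**31 - 1  # 2147483647

# The count does not fit, so converting it to a C int overflows.
print(total_words > INT32_MAX)          # True
print(total_words.bit_length())         # 33 bits needed
```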

Versions

```
Python 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16)
[GCC 7.3.0]
NumPy 1.15.3
SciPy 1.1.0
gensim 3.6.0
```

On Ubuntu 16.04

Edit: training seems to work fine when passing in a `LineSentence` object instead of `corpus_file`.
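`LineSentence` streams one whitespace-tokenized sentence per line through Python rather than handing the file to the Cython `corpus_file` fast path, which is presumably why it sidesteps the overflow. A rough stdlib-only sketch of that behaviour (the `line_sentences` helper below is a hypothetical stand-in, not gensim's actual implementation):

```python
import gzip
import os
import tempfile

def line_sentences(path):
    """Yield one whitespace-tokenized sentence per line, gzip-aware."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.split()

# Demo on a tiny gzipped corpus.
path = os.path.join(tempfile.mkdtemp(), "sentences_norm.txt.gz")
with gzip.open(path, "wt", encoding="utf-8") as fh:
    fh.write("first sentence\nsecond one\n")

print(list(line_sentences(path)))  # [['first', 'sentence'], ['second', 'one']]
```

An iterable like this (or the real `LineSentence`) can be passed as `sentences=` to `FastText`, avoiding the `corpus_file` code path entirely, at the cost of the slower Python streaming loop.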

Labels: bug, difficulty easy, fasttext

All 5 comments

I see a similar error in Doc2Vec. I can verify that `total_words` is larger than a 32-bit integer. There's no easy workaround, since training on a `corpus_file` throws a different exception if `total_words` isn't present.

```
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/Volumes/Backblaze_MacEx1TB50506065/cs221/project/CS221/venv/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 175, in _worker_loop_corpusfile
    total_examples=total_examples, total_words=total_words, **kwargs)
  File "/Volumes/Backblaze_MacEx1TB50506065/cs221/project/CS221/venv/lib/python3.5/site-packages/gensim/models/doc2vec.py", line 686, in _do_train_epoch
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
  File "gensim/models/doc2vec_corpusfile.pyx", line 280, in gensim.models.doc2vec_corpusfile.d2v_train_epoch_dm
```

Thanks for the report @joelkuiper!

@menshikh-iv Since this is tagged "easy", I'm guessing the fix is to replace the int declaration here with something like a long?

@mpenkov yes, something like this (int -> longest_int_type for all variables that can be "too large") in all *_corpusfile.pyx files

I am experiencing this same bug when training Word2Vec with a large corpus. A pull request for this bug has been open here for a couple of months. Could you please fix this one? Thanks.
