I can run Doc2Vec with the `documents` parameter without problems, but when I set the `corpus_file` parameter I get the following exception:
2019-05-20 15:39:47,719 INFO collecting all words and their counts
2019-05-20 15:39:47,719 WARNING this function is deprecated, use smart_open.open instead
2019-05-20 15:39:47,722 INFO PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2019-05-20 15:39:47,880 INFO PROGRESS: at example #10000, processed 166529 words (1052030/s), 12757 word types, 10000 tags
2019-05-20 15:39:47,994 INFO PROGRESS: at example #20000, processed 333710 words (1475171/s), 18969 word types, 20000 tags
2019-05-20 15:39:48,110 INFO PROGRESS: at example #30000, processed 497372 words (1413199/s), 23460 word types, 30000 tags
2019-05-20 15:39:48,232 INFO PROGRESS: at example #40000, processed 664500 words (1368097/s), 27517 word types, 40000 tags
2019-05-20 15:39:48,356 INFO PROGRESS: at example #50000, processed 831204 words (1351422/s), 31439 word types, 50000 tags
2019-05-20 15:39:48,474 INFO PROGRESS: at example #60000, processed 999270 words (1425773/s), 34971 word types, 60000 tags
2019-05-20 15:39:48,595 INFO PROGRESS: at example #70000, processed 1166581 words (1381434/s), 38029 word types, 70000 tags
2019-05-20 15:39:48,709 INFO PROGRESS: at example #80000, processed 1331917 words (1450900/s), 40998 word types, 80000 tags
2019-05-20 15:39:48,850 INFO PROGRESS: at example #90000, processed 1500483 words (1199681/s), 43763 word types, 90000 tags
2019-05-20 15:39:48,968 INFO PROGRESS: at example #100000, processed 1668223 words (1425485/s), 46284 word types, 100000 tags
2019-05-20 15:39:49,100 INFO PROGRESS: at example #110000, processed 1832073 words (1246089/s), 48598 word types, 110000 tags
2019-05-20 15:39:49,236 INFO PROGRESS: at example #120000, processed 1999604 words (1235473/s), 50894 word types, 120000 tags
2019-05-20 15:39:49,360 INFO PROGRESS: at example #130000, processed 2166458 words (1342288/s), 53063 word types, 130000 tags
2019-05-20 15:39:49,485 INFO PROGRESS: at example #140000, processed 2332035 words (1324079/s), 55163 word types, 140000 tags
2019-05-20 15:39:49,605 INFO PROGRESS: at example #150000, processed 2498794 words (1402007/s), 57258 word types, 150000 tags
2019-05-20 15:39:49,720 INFO PROGRESS: at example #160000, processed 2664993 words (1448274/s), 59151 word types, 160000 tags
2019-05-20 15:39:49,851 INFO PROGRESS: at example #170000, processed 2833858 words (1287963/s), 61142 word types, 170000 tags
2019-05-20 15:39:49,977 INFO PROGRESS: at example #180000, processed 3001865 words (1343356/s), 62966 word types, 180000 tags
2019-05-20 15:39:50,106 INFO PROGRESS: at example #190000, processed 3169739 words (1300939/s), 64843 word types, 190000 tags
2019-05-20 15:39:50,232 INFO collected 66527 word types and 200000 unique tags from a corpus of 200000 examples and 3335643 words
2019-05-20 15:39:50,233 INFO Loading a fresh vocabulary
2019-05-20 15:39:50,349 INFO effective_min_count=3 retains 26695 unique words (40% of original 66527, drops 39832)
2019-05-20 15:39:50,349 INFO effective_min_count=3 leaves 3286045 word corpus (98% of original 3335643, drops 49598)
2019-05-20 15:39:50,446 INFO deleting the raw counts dictionary of 66527 items
2019-05-20 15:39:50,448 INFO sample=0.001 downsamples 35 most-common words
2019-05-20 15:39:50,448 INFO downsampling leaves estimated 3004103 word corpus (91.4% of prior 3286045)
2019-05-20 15:39:50,547 INFO estimated required memory for 26695 words and 300 dimensions: 317415500 bytes
2019-05-20 15:39:50,547 INFO resetting layer weights
2019-05-20 15:39:54,690 WARNING this function is deprecated, use smart_open.open instead
2019-05-20 15:39:54,837 INFO training model with 4 workers on 26695 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
Traceback (most recent call last):
File "run.py", line 76, in <module>
run_ner_process()
File "run.py", line 57, in run_ner_process
doc2vec_test.doc2vec_create3()
File "C:\_XProject\similarity\doc2vec_test.py", line 48, in doc2vec_create3
model = Doc2Vec(corpus_file=file_name, dm=1, vector_size=300, window=5, min_count=3, workers=multiprocessing.cpu_count())
File "C:\_XProject\_envx\lib\site-packages\gensim\models\doc2vec.py", line 620, in __init__
end_alpha=self.min_alpha, callbacks=callbacks)
File "C:\_XProject\_envx\lib\site-packages\gensim\models\doc2vec.py", line 814, in train
queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks, **kwargs)
File "C:\_XProject\_envx\lib\site-packages\gensim\models\base_any2vec.py", line 1081, in train
**kwargs)
File "C:\_XProject\_envx\lib\site-packages\gensim\models\base_any2vec.py", line 556, in train
corpus_file, cur_epoch=cur_epoch, total_examples=total_examples, total_words=total_words, **kwargs)
File "C:\_XProject\_envx\lib\site-packages\gensim\models\base_any2vec.py", line 405, in _train_epoch_corpusfile
from gensim.models.word2vec_corpusfile import CythonVocab
ModuleNotFoundError: No module named 'gensim.models.word2vec_corpusfile'
~~~
import os
import multiprocessing

from gensim.models.doc2vec import Doc2Vec
from gensim.utils import save_as_line_sentence

TEMP_FOLDER = ""

common_texts = ["Hırvatistan,diğer,şehir,hiçbir,aday,salt,zafer,el,et".split(","),
                "seçim,toplam,46 bin 325,aday,yarış".split(","),
                "AB,milletvekili,Makedonya,üyelik,müzakere,başla,çağrı".split(",")]

# The real corpus has 200,000 lines in total.
save_as_line_sentence(common_texts, os.path.join(TEMP_FOLDER, 'corpus.txt'))

file_name = os.path.join(TEMP_FOLDER, 'corpus.txt')
model = Doc2Vec(corpus_file=file_name, dm=0, vector_size=10, window=5, min_count=3,
                workers=multiprocessing.cpu_count())
#model = Doc2Vec(corpus_file=file_name, epochs=5, vector_size=300, workers=multiprocessing.cpu_count())
model.save(os.path.join(TEMP_FOLDER, 'doc2vec_dm_1.model'))
~~~
Content of the corpus.txt file after `save_as_line_sentence` (200,000 lines of tokenized sentences):
~~~
Hırvat yerel seçim ikinci tur kal
sonuç başta büyük şehir ol üzere açık fark bir zafer el et göster
seçim gözlemci seçim gün sandık merkez kampanya yasak del bil
...
~~~
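For reference, the line-sentence format that `corpus_file` consumes is just one document per line with space-separated tokens. A stdlib-only sketch that produces equivalent output (the `write_line_sentence` helper name is my own, not a gensim API):

```python
import os
import tempfile

def write_line_sentence(docs, path):
    """Write docs in gensim's line-sentence format: one document per
    line, tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in docs:
            f.write(" ".join(tokens) + "\n")

docs = [
    ["Hırvat", "yerel", "seçim", "ikinci", "tur", "kal"],
    ["seçim", "toplam", "aday", "yarış"],
]
path = os.path.join(tempfile.gettempdir(), "corpus_demo.txt")
write_line_sentence(docs, path)
```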
Please provide the output of:
>>> import platform; print(platform.platform())
Windows-10-10.0.16299-SP0
>>> import sys; print("Python", sys.version)
Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.16.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.2.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.7.3
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION -1
>>>
Thank you for your report.
Could you please make sure your reproducible example is minimal and complete? It looks like there are some definitions missing, and the indentation is wrong. Ideally, it'd be something we can just copy-paste into an interpreter to reproduce the problem.
@mpenkov I have updated the example. But I guess because there are only a few sentences, I am getting "RuntimeError: you must first build vocabulary before training the model".
When I try with the original corpus of 200,000 sentences I see the same problem I stated above.
I am also facing the same error:
~~~
model_gensim.train(
    corpus_file=corpus_file, epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
)
~~~
ModuleNotFoundError: No module named 'gensim.models.word2vec_corpusfile'
~~~
ModuleNotFoundError                       Traceback (most recent call last)
      1 model_gensim.train(
      2     corpus_file=corpus_file, epochs=model_gensim.epochs,
----> 3     total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
      4 )

c:\users\chamars\appdata\local\programs\python\python37\lib\site-packages\gensim\models\fasttext.py in train(self, sentences, corpus_file, total_examples, total_words, epochs, start_alpha, end_alpha, word_count, queue_factor, report_delay, callbacks, **kwargs)
    920     sentences=sentences, corpus_file=corpus_file, total_examples=total_examples, total_words=total_words,
    921     epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
--> 922     queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks)
    923 self.wv.adjust_vectors()
    924

c:\users\chamars\appdata\local\programs\python\python37\lib\site-packages\gensim\models\base_any2vec.py in train(self, sentences, corpus_file, total_examples, total_words, epochs, start_alpha, end_alpha, word_count, queue_factor, report_delay, compute_loss, callbacks, **kwargs)
   1079     total_words=total_words, epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
   1080     queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks,
-> 1081     **kwargs)
   1082
   1083 def _get_job_params(self, cur_epoch):

c:\users\chamars\appdata\local\programs\python\python37\lib\site-packages\gensim\models\base_any2vec.py in train(self, data_iterable, corpus_file, epochs, total_examples, total_words, queue_factor, report_delay, callbacks, **kwargs)
    554 else:
    555     trained_word_count_epoch, raw_word_count_epoch, job_tally_epoch = self._train_epoch_corpusfile(
--> 556         corpus_file, cur_epoch=cur_epoch, total_examples=total_examples, total_words=total_words, **kwargs)
    557
    558 trained_word_count += trained_word_count_epoch

c:\users\chamars\appdata\local\programs\python\python37\lib\site-packages\gensim\models\base_any2vec.py in _train_epoch_corpusfile(self, corpus_file, cur_epoch, total_examples, total_words, **kwargs)
    403     raise ValueError("total_words must be provided alongside corpus_file argument.")
    404
--> 405 from gensim.models.word2vec_corpusfile import CythonVocab
    406 from gensim.models.fasttext import FastText
    407 cython_vocab = CythonVocab(self.wv, hs=self.hs, fasttext=isinstance(self, FastText))

ModuleNotFoundError: No module named 'gensim.models.word2vec_corpusfile'
~~~
@mehmetilker @sivanrchamarthy Thank you both for providing sample code, but it doesn't help me to reproduce the problem, because your sample does not run.
Could you please provide a minimum reproducible example?
@mpenkov
I have attached the corpus file. You can run the following code with the attached corpus.
But I should add that the following code works with gensim 3.8.1; I see the same exception as above with 3.7.3.
As there is no problem with the new gensim version, you can ignore this issue, or try with 3.7.3.
~~~
import logging
import os
import multiprocessing

from gensim.models.doc2vec import Doc2Vec

logger = logging.getLogger('gensim')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.DEBUG)  # DEBUG to print gensim's progress output

TEMP_FOLDER = ""

file_name = os.path.join(TEMP_FOLDER, 'corpus.txt')
model = Doc2Vec(corpus_file=file_name, dm=0, vector_size=10, window=5, min_count=3,
                workers=multiprocessing.cpu_count())
model.save(os.path.join(TEMP_FOLDER, 'doc2vec_dm_1.model'))
~~~
@sivanrchamarthy Does upgrading to the latest version of gensim solve your problem?
@mehmetilker Thank you for the reproducible sample and the clarification.
Your report says FAST_VERSION -1. This means all compiled (fast) algorithms are missing in your installation of Gensim, including for corpus_file. That explains the error.
The fix would be to install Gensim properly, including the optimized compiled version. If you're on Windows, there are now binary wheels in 3.8.1 – can you try that @mehmetilker ?
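For anyone hitting this, a quick way to guard against the slow-path installation before relying on `corpus_file` — a minimal sketch (the `has_fast_gensim` helper is my own; it only reads the `FAST_VERSION` flag gensim already exposes):

```python
def has_fast_gensim():
    """Return True if gensim's compiled (Cython) training routines are
    importable. FAST_VERSION == -1 means only the pure-Python fallback
    was installed, and the corpus_file code path will not work."""
    try:
        from gensim.models import word2vec
    except ImportError:
        return False  # gensim is not installed at all
    return getattr(word2vec, "FAST_VERSION", -1) >= 0
```

If this returns `False`, reinstall gensim from a binary wheel before using `corpus_file`.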
@piskvorky
I had the problem in 3.7.3. As I wrote, it already works with 3.8.1.
So I assume the error I had in version 3.7.3 was due to FAST_VERSION -1.
I guess exceptions/problems related to FAST_VERSION -1 need a better error message.
Alright, thanks.