I can run Doc2Vec with the `documents` parameter without problems, but when I set the `corpus_file` parameter I get the following exception:
2019-05-20 15:39:47,719 INFO collecting all words and their counts
2019-05-20 15:39:47,719 WARNING this function is deprecated, use smart_open.open instead
2019-05-20 15:39:47,722 INFO PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2019-05-20 15:39:47,880 INFO PROGRESS: at example #10000, processed 166529 words (1052030/s), 12757 word types, 10000 tags
2019-05-20 15:39:47,994 INFO PROGRESS: at example #20000, processed 333710 words (1475171/s), 18969 word types, 20000 tags
2019-05-20 15:39:48,110 INFO PROGRESS: at example #30000, processed 497372 words (1413199/s), 23460 word types, 30000 tags
2019-05-20 15:39:48,232 INFO PROGRESS: at example #40000, processed 664500 words (1368097/s), 27517 word types, 40000 tags
2019-05-20 15:39:48,356 INFO PROGRESS: at example #50000, processed 831204 words (1351422/s), 31439 word types, 50000 tags
2019-05-20 15:39:48,474 INFO PROGRESS: at example #60000, processed 999270 words (1425773/s), 34971 word types, 60000 tags
2019-05-20 15:39:48,595 INFO PROGRESS: at example #70000, processed 1166581 words (1381434/s), 38029 word types, 70000 tags
2019-05-20 15:39:48,709 INFO PROGRESS: at example #80000, processed 1331917 words (1450900/s), 40998 word types, 80000 tags
2019-05-20 15:39:48,850 INFO PROGRESS: at example #90000, processed 1500483 words (1199681/s), 43763 word types, 90000 tags
2019-05-20 15:39:48,968 INFO PROGRESS: at example #100000, processed 1668223 words (1425485/s), 46284 word types, 100000 tags
2019-05-20 15:39:49,100 INFO PROGRESS: at example #110000, processed 1832073 words (1246089/s), 48598 word types, 110000 tags
2019-05-20 15:39:49,236 INFO PROGRESS: at example #120000, processed 1999604 words (1235473/s), 50894 word types, 120000 tags
2019-05-20 15:39:49,360 INFO PROGRESS: at example #130000, processed 2166458 words (1342288/s), 53063 word types, 130000 tags
2019-05-20 15:39:49,485 INFO PROGRESS: at example #140000, processed 2332035 words (1324079/s), 55163 word types, 140000 tags
2019-05-20 15:39:49,605 INFO PROGRESS: at example #150000, processed 2498794 words (1402007/s), 57258 word types, 150000 tags
2019-05-20 15:39:49,720 INFO PROGRESS: at example #160000, processed 2664993 words (1448274/s), 59151 word types, 160000 tags
2019-05-20 15:39:49,851 INFO PROGRESS: at example #170000, processed 2833858 words (1287963/s), 61142 word types, 170000 tags
2019-05-20 15:39:49,977 INFO PROGRESS: at example #180000, processed 3001865 words (1343356/s), 62966 word types, 180000 tags
2019-05-20 15:39:50,106 INFO PROGRESS: at example #190000, processed 3169739 words (1300939/s), 64843 word types, 190000 tags
2019-05-20 15:39:50,232 INFO collected 66527 word types and 200000 unique tags from a corpus of 200000 examples and 3335643 words
2019-05-20 15:39:50,233 INFO Loading a fresh vocabulary
2019-05-20 15:39:50,349 INFO effective_min_count=3 retains 26695 unique words (40% of original 66527, drops 39832)
2019-05-20 15:39:50,349 INFO effective_min_count=3 leaves 3286045 word corpus (98% of original 3335643, drops 49598)
2019-05-20 15:39:50,446 INFO deleting the raw counts dictionary of 66527 items
2019-05-20 15:39:50,448 INFO sample=0.001 downsamples 35 most-common words
2019-05-20 15:39:50,448 INFO downsampling leaves estimated 3004103 word corpus (91.4% of prior 3286045)
2019-05-20 15:39:50,547 INFO estimated required memory for 26695 words and 300 dimensions: 317415500 bytes
2019-05-20 15:39:50,547 INFO resetting layer weights
2019-05-20 15:39:54,690 WARNING this function is deprecated, use smart_open.open instead
2019-05-20 15:39:54,837 INFO training model with 4 workers on 26695 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
Traceback (most recent call last):
File "run.py", line 76, in <module>
run_ner_process()
File "run.py", line 57, in run_ner_process
doc2vec_test.doc2vec_create3()
File "C:\_XProject\similarity\doc2vec_test.py", line 48, in doc2vec_create3
model = Doc2Vec(corpus_file=file_name, dm=1, vector_size=300, window=5, min_count=3, workers=multiprocessing.cpu_count())
File "C:\_XProject\_envx\lib\site-packages\gensim\models\doc2vec.py", line 620, in __init__
end_alpha=self.min_alpha, callbacks=callbacks)
File "C:\_XProject\_envx\lib\site-packages\gensim\models\doc2vec.py", line 814, in train
queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks, **kwargs)
File "C:\_XProject\_envx\lib\site-packages\gensim\models\base_any2vec.py", line 1081, in train
**kwargs)
File "C:\_XProject\_envx\lib\site-packages\gensim\models\base_any2vec.py", line 556, in train
corpus_file, cur_epoch=cur_epoch, total_examples=total_examples, total_words=total_words, **kwargs)
File "C:\_XProject\_envx\lib\site-packages\gensim\models\base_any2vec.py", line 405, in _train_epoch_corpusfile
from gensim.models.word2vec_corpusfile import CythonVocab
ModuleNotFoundError: No module named 'gensim.models.word2vec_corpusfile'
~~~
import os
import multiprocessing

from gensim.models.doc2vec import Doc2Vec
from gensim.utils import save_as_line_sentence

TEMP_FOLDER = ""

common_texts = ["Hırvatistan,diğer,şehir,hiçbir,aday,salt,zafer,el,et".split(","),
                "seçim,toplam,46 bin 325,aday,yarış".split(","),
                "AB,milletvekili,Makedonya,üyelik,müzakere,başla,çağrı".split(",")]

# The real corpus has 200,000 lines in total.
save_as_line_sentence(common_texts, os.path.join(TEMP_FOLDER, 'corpus.txt'))

file_name = os.path.join(TEMP_FOLDER, 'corpus.txt')
model = Doc2Vec(corpus_file=file_name, dm=0, vector_size=10, window=5, min_count=3,
                workers=multiprocessing.cpu_count())
#model = Doc2Vec(corpus_file=file_name, epochs=5, vector_size=300, workers=multiprocessing.cpu_count())
model.save(os.path.join(TEMP_FOLDER, 'doc2vec_dm_1.model'))
~~~
Content of the corpus.txt file after `save_as_line_sentence` (200,000 lines of tokenized sentences):
~~~
Hırvat yerel seçim ikinci tur kal
sonuç başta büyük şehir ol üzere açık fark bir zafer el et göster
seçim gözlemci seçim gün sandık merkez kampanya yasak del bil
...
~~~
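For reference, the line-sentence format that `corpus_file` consumes is just one document per line with space-separated tokens. A stdlib-only sketch that produces equivalent output (the `write_line_sentence` helper name is my own, not a gensim API):

```python
import os
import tempfile

def write_line_sentence(docs, path):
    """Write docs in gensim's line-sentence format: one document per
    line, tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in docs:
            f.write(" ".join(tokens) + "\n")

docs = [
    ["Hırvat", "yerel", "seçim", "ikinci", "tur", "kal"],
    ["seçim", "toplam", "aday", "yarış"],
]
path = os.path.join(tempfile.gettempdir(), "corpus_demo.txt")
write_line_sentence(docs, path)
```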
Please provide the output of:
>>> import platform; print(platform.platform())
Windows-10-10.0.16299-SP0
>>> import sys; print("Python", sys.version)
Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.16.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.2.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.7.3
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION -1
>>>
Thank you for your report.
Could you please make sure your reproducible example is minimal and complete? It looks like there are some definitions missing, and the indentation is wrong. Ideally, it'd be something we can just copy-paste into an interpreter to reproduce the problem.
@mpenkov I have updated the example. But I guess because there are only a few sentences, I am getting "RuntimeError: you must first build vocabulary before training the model".
When I try with the original corpus of 200,000 sentences I see the same problem I stated above.
I am also facing the same error:
~~~
model_gensim.train(
    corpus_file=corpus_file, epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
)
~~~
ModuleNotFoundError: No module named 'gensim.models.word2vec_corpusfile'
~~~
ModuleNotFoundError                       Traceback (most recent call last)
      1 model_gensim.train(
      2     corpus_file=corpus_file, epochs=model_gensim.epochs,
----> 3     total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
      4 )

c:\users\chamars\appdata\local\programs\python\python37\lib\site-packages\gensim\models\fasttext.py in train(self, sentences, corpus_file, total_examples, total_words, epochs, start_alpha, end_alpha, word_count, queue_factor, report_delay, callbacks, **kwargs)
    920     sentences=sentences, corpus_file=corpus_file, total_examples=total_examples, total_words=total_words,
    921     epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
--> 922     queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks)
    923 self.wv.adjust_vectors()
    924

c:\users\chamars\appdata\local\programs\python\python37\lib\site-packages\gensim\models\base_any2vec.py in train(self, sentences, corpus_file, total_examples, total_words, epochs, start_alpha, end_alpha, word_count, queue_factor, report_delay, compute_loss, callbacks, **kwargs)
   1079     total_words=total_words, epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
   1080     queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks,
-> 1081     **kwargs)
   1082
   1083 def _get_job_params(self, cur_epoch):

c:\users\chamars\appdata\local\programs\python\python37\lib\site-packages\gensim\models\base_any2vec.py in train(self, data_iterable, corpus_file, epochs, total_examples, total_words, queue_factor, report_delay, callbacks, **kwargs)
    554 else:
    555     trained_word_count_epoch, raw_word_count_epoch, job_tally_epoch = self._train_epoch_corpusfile(
--> 556         corpus_file, cur_epoch=cur_epoch, total_examples=total_examples, total_words=total_words, **kwargs)
    557
    558 trained_word_count += trained_word_count_epoch

c:\users\chamars\appdata\local\programs\python\python37\lib\site-packages\gensim\models\base_any2vec.py in _train_epoch_corpusfile(self, corpus_file, cur_epoch, total_examples, total_words, **kwargs)
    403     raise ValueError("total_words must be provided alongside corpus_file argument.")
    404
--> 405 from gensim.models.word2vec_corpusfile import CythonVocab
    406 from gensim.models.fasttext import FastText
    407 cython_vocab = CythonVocab(self.wv, hs=self.hs, fasttext=isinstance(self, FastText))

ModuleNotFoundError: No module named 'gensim.models.word2vec_corpusfile'
~~~
@mehmetilker @sivanrchamarthy Thank you both for providing sample code, but it doesn't help me to reproduce the problem, because your sample does not run.
Could you please provide a minimum reproducible example?
@mpenkov
I have attached the corpus file. You can run the following code with the attached corpus.
But I should add that the following code works with gensim 3.8.1; I see the same exception as above with 3.7.3.
As there is no problem with the new gensim version, you can ignore this issue, or try with 3.7.3.
~~~
import logging
import os
import multiprocessing

from gensim.models.doc2vec import Doc2Vec

logger = logging.getLogger('gensim')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.DEBUG)  # DEBUG to print gensim's progress output

TEMP_FOLDER = ""

file_name = os.path.join(TEMP_FOLDER, 'corpus.txt')
model = Doc2Vec(corpus_file=file_name, dm=0, vector_size=10, window=5, min_count=3,
                workers=multiprocessing.cpu_count())
model.save(os.path.join(TEMP_FOLDER, 'doc2vec_dm_1.model'))
~~~
@sivanrchamarthy Does upgrading to the latest version of gensim solve your problem?
@mehmetilker Thank you for the reproducible sample and the clarification.
Your report says FAST_VERSION -1. This means all compiled (fast) algorithms are missing in your installation of Gensim, including for corpus_file. That explains the error.
The fix would be to install Gensim properly, including the optimized compiled version. If you're on Windows, there are now binary wheels in 3.8.1 – can you try that @mehmetilker ?
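For anyone hitting this, a quick way to guard against the slow-path installation before relying on `corpus_file` — a minimal sketch (the `has_fast_gensim` helper is my own; it only reads the `FAST_VERSION` flag gensim already exposes):

```python
def has_fast_gensim():
    """Return True if gensim's compiled (Cython) training routines are
    importable. FAST_VERSION == -1 means only the pure-Python fallback
    was installed, and the corpus_file code path will not work."""
    try:
        from gensim.models import word2vec
    except ImportError:
        return False  # gensim is not installed at all
    return getattr(word2vec, "FAST_VERSION", -1) >= 0
```

If this returns `False`, reinstall gensim from a binary wheel before using `corpus_file`.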
@piskvorky
I had the problem in 3.7.3. As I wrote, it already works with 3.8.1.
So I assume the error I had in version 3.7.3 was due to FAST_VERSION -1.
I guess exceptions/problems related to FAST_VERSION -1 need a better error message.
Alright, thanks.