What are you trying to achieve?
:- To successfully store a corpus in MM format.
What is the expected result?
:- It should store without any error.
What are you seeing instead?
:- It raises a MemoryError at the end of saving the corpus.
2019-09-04 09:10:31,905 : INFO : PROGRESS: saving document #50014000
2019-09-04 09:10:32,156 : INFO : PROGRESS: saving document #50015000
2019-09-04 09:10:32,467 : INFO : PROGRESS: saving document #50016000
2019-09-04 09:10:32,738 : INFO : PROGRESS: saving document #50017000
2019-09-04 09:10:33,068 : INFO : PROGRESS: saving document #50018000
2019-09-04 09:10:33,336 : INFO : PROGRESS: saving document #50019000
2019-09-04 09:10:33,623 : INFO : PROGRESS: saving document #50020000
2019-09-04 09:10:33,925 : INFO : PROGRESS: saving document #50021000
2019-09-04 09:10:34,186 : INFO : PROGRESS: saving document #50022000
2019-09-04 09:10:34,478 : INFO : PROGRESS: saving document #50023000
2019-09-04 09:10:34,730 : INFO : PROGRESS: saving document #50024000
2019-09-04 09:10:34,985 : INFO : PROGRESS: saving document #50025000
Traceback (most recent call last):
  File "email_data_experiment.py", line 23, in <module>
    MmCorpus.serialize(fname=corpus_path, corpus=corpus, metadata=True, id2word=corpus.dictionary)
  File "C:\Users\koradg\AppData\Local\Programs\Python\Python36\lib\site-packages\gensim\corpora\indexedcorpus.py", line 123, in serialize
    offsets = serializer.save_corpus(fname, corpus, id2word, **kwargs)
  File "C:\Users\koradg\AppData\Local\Programs\Python\Python36\lib\site-packages\gensim\corpora\mmcorpus.py", line 125, in save_corpus
    fname, corpus, num_terms=num_terms, index=True, progress_cnt=progress_cnt, metadata=metadata
  File "C:\Users\koradg\AppData\Local\Programs\Python\Python36\lib\site-packages\gensim\matutils.py", line 1391, in write_corpus
    utils.pickle(docno2metadata, fname + '.metadata.cpickle')
  File "C:\Users\koradg\AppData\Local\Programs\Python\Python36\lib\site-packages\gensim\utils.py", line 1364, in pickle
    _pickle.dump(obj, fout, protocol=protocol)
MemoryError
from gensim.corpora import Dictionary, MmCorpus
from gensim.corpora.textcorpus import TextDirectoryCorpus
corpus = TextDirectoryCorpus(input=data_path, metadata=True, max_depth=2, lines_are_documents=True)
Dictionary.save_as_text(corpus.dictionary, fname_or_handle=dictionary_path)
MmCorpus.serialize(fname=corpus_path, corpus=corpus, metadata=True, id2word=corpus.dictionary)
The error occurs on the last line, the MmCorpus.serialize call.
Please provide the output of:
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
Windows-10-10.0.17763-SP0
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)]
NumPy 1.17.0
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 0
The error seems related to the optional metadata parameter, which stores an additional file by pickling, which is what is failing here.
The metadata documentation says:
metadata (bool, optional) – If True - ensure that serialize will write out article titles to a pickle file.
which I don't understand (what "titles"?). It may be some new addition, along with the labels parameter (what "class labels"?)
CC @mpenkov @gojomo any idea why this API is here? The docs are opaque.
@gauravkoradiya Why do you need this parameter, what do you expect out of it?
I’d guess the 'docno2metadata' object would require pickle-serialization beyond its effective size-limit (likely 2GB or 4GB). To test this theory, I’d try to access/create that same object directly, & try to pickle it, to see if it triggers the same error.
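A minimal sketch of that test, using only the standard library. It assumes the object is a plain dict mapping each document number to a small metadata tuple; the real shape inside gensim may differ. The helper name `try_pickle` and the value of `num_docs` are made up for illustration.

```python
import os
import pickle
import tempfile


def try_pickle(obj, path, protocol=pickle.HIGHEST_PROTOCOL):
    """Pickle obj to path; return True on success, False on a memory or size error."""
    try:
        with open(path, "wb") as fout:
            pickle.dump(obj, fout, protocol=protocol)
    except (MemoryError, OverflowError) as err:
        print("pickling failed:", type(err).__name__)
        return False
    return True


# Assumed shape: docno -> (lineno,). Kept tiny here; raise num_docs towards
# the ~50 million documents seen in the log to approximate the failing case.
num_docs = 1000
docno2metadata = {docno: (docno,) for docno in range(num_docs)}

path = os.path.join(tempfile.mkdtemp(), "docno2metadata.pkl")
ok = try_pickle(docno2metadata, path)
print("pickled OK:", ok)
```

If the large case fails here as well, the problem can be reproduced without gensim; trying different `protocol` values may also show whether the limit depends on the pickle protocol.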
I need the document line from that.
If there's a specific bit of per-document metadata you need to retain, as a workaround, you may need to write your own parallel routine to save/reload that data separately.
That is indeed the recommended way. Unless there's a good reason not to, I'm in favour of even removing the existing metadata (and labels?) parameters. Their API seems out of place in Gensim, as well as under-documented.
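One way such a parallel routine could look, as a rough sketch rather than anything gensim provides: a wrapper that takes a stream of (document, metadata) pairs, passes only the documents on, and writes each metadata item to a line-delimited JSON file as it goes, so nothing accumulates in memory. The names `MetadataSidecar` and `load_metadata` are hypothetical, and the sample pairs below are invented.

```python
import json
import os
import tempfile


class MetadataSidecar:
    """Yield documents only; write each document's metadata as one JSON line."""

    def __init__(self, pairs, metadata_path):
        self.pairs = pairs
        self.metadata_path = metadata_path

    def __iter__(self):
        # The file is rewritten on every pass, so line N always matches document N.
        with open(self.metadata_path, "w", encoding="utf-8") as fout:
            for doc, meta in self.pairs:
                fout.write(json.dumps(meta) + "\n")
                yield doc


def load_metadata(metadata_path):
    """Stream the metadata back, one item per document, in document order."""
    with open(metadata_path, encoding="utf-8") as fin:
        for line in fin:
            yield json.loads(line)


# Invented sample data: bag-of-words documents paired with (lineno,) metadata.
pairs = [([(0, 1), (2, 1)], (0,)), ([(1, 2)], (1,))]
meta_path = os.path.join(tempfile.mkdtemp(), "corpus.metadata.jsonl")

docs = list(MetadataSidecar(pairs, meta_path))
metadata = list(load_metadata(meta_path))
print(docs)
print(metadata)  # JSON has no tuples, so each item comes back as a list

# Untested assumption for the case in this issue: a corpus built with
# metadata=True yields (bow, metadata) pairs, so it could be wrapped as
# MmCorpus.serialize(corpus_path, MetadataSidecar(corpus, meta_path),
#                    id2word=corpus.dictionary)
# leaving serialize's own metadata argument at its default.
```

Because each metadata item is written as soon as its document is consumed, the sidecar file grows with the corpus while memory use stays flat.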
@piskvorky @gojomo Is it appropriate to close this? It looks like there are more legitimate alternatives.
I tried to solve this but could not find a solution. Please reopen if anyone has a proper solution.