Gensim: Memory error while processing with mm.serialize() method

Created on 4 Sep 2019  ·  7 Comments  ·  Source: RaRe-Technologies/gensim

Problem description

What are you trying to achieve?
:- To successfully store a corpus in MM format.

What is the expected result?
:- It should store without any error.

What are you seeing instead?
:- It raises a MemoryError at the end of saving the corpus.

2019-09-04 09:10:31,905 : INFO : PROGRESS: saving document #50014000
2019-09-04 09:10:32,156 : INFO : PROGRESS: saving document #50015000
2019-09-04 09:10:32,467 : INFO : PROGRESS: saving document #50016000
2019-09-04 09:10:32,738 : INFO : PROGRESS: saving document #50017000
2019-09-04 09:10:33,068 : INFO : PROGRESS: saving document #50018000
2019-09-04 09:10:33,336 : INFO : PROGRESS: saving document #50019000
2019-09-04 09:10:33,623 : INFO : PROGRESS: saving document #50020000
2019-09-04 09:10:33,925 : INFO : PROGRESS: saving document #50021000
2019-09-04 09:10:34,186 : INFO : PROGRESS: saving document #50022000
2019-09-04 09:10:34,478 : INFO : PROGRESS: saving document #50023000
2019-09-04 09:10:34,730 : INFO : PROGRESS: saving document #50024000
2019-09-04 09:10:34,985 : INFO : PROGRESS: saving document #50025000
Traceback (most recent call last):
  File "email_data_experiment.py", line 23, in <module>
    MmCorpus.serialize(fname=corpus_path, corpus=corpus, metadata=True, id2word=corpus.dictionary)
  File "C:\Users\koradg\AppData\Local\Programs\Python\Python36\lib\site-packages\gensim\corpora\indexedcorpus.py", line 123, in serialize
    offsets = serializer.save_corpus(fname, corpus, id2word, **kwargs)
  File "C:\Users\koradg\AppData\Local\Programs\Python\Python36\lib\site-packages\gensim\corpora\mmcorpus.py", line 125, in save_corpus
    fname, corpus, num_terms=num_terms, index=True, progress_cnt=progress_cnt, metadata=metadata
  File "C:\Users\koradg\AppData\Local\Programs\Python\Python36\lib\site-packages\gensim\matutils.py", line 1391, in write_corpus
    utils.pickle(docno2metadata, fname + '.metadata.cpickle')
  File "C:\Users\koradg\AppData\Local\Programs\Python\Python36\lib\site-packages\gensim\utils.py", line 1364, in pickle
    _pickle.dump(obj, fout, protocol=protocol)
MemoryError

Steps/code/corpus to reproduce

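        from gensim.corpora import Dictionary, MmCorpus
        from gensim.corpora.textcorpus import TextDirectoryCorpus

        # data_path, dictionary_path and corpus_path are defined elsewhere in the script
        corpus = TextDirectoryCorpus(input=data_path, metadata=True, max_depth=2, lines_are_documents=True)
        Dictionary.save_as_text(corpus.dictionary, fname_or_handle=dictionary_path)
        MmCorpus.serialize(fname=corpus_path, corpus=corpus, metadata=True, id2word=corpus.dictionary)

The error is raised by the last line, the MmCorpus.serialize call.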

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

Windows-10-10.0.17763-SP0
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)]
NumPy 1.17.0
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 0

Labels: bug, documentation, impact HIGH, reach LOW

All 7 comments

The error seems related to the optional metadata parameter, which stores an additional file by pickling, which is what is failing here.

The metadata documentation says:

metadata (bool, optional) – If True - ensure that serialize will write out article titles to a pickle file.

which I don't understand (what "titles"?). It may be some new addition, along with the labels parameter (what "class labels"?)

CC @mpenkov @gojomo any idea why this API is here? The docs are opaque.

@gauravkoradiya Why do you need this parameter, what do you expect out of it?

I’d guess that pickling the 'docno2metadata' object pushes it beyond pickle's effective size limit (likely 2GB or 4GB). To test this theory, I’d try to access or create that same object directly and pickle it, to see if it triggers the same error.
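
One way to try this without rerunning the whole serialization is to pickle a scaled-down stand-in for the mapping and extrapolate. The sketch below is only an illustration under stated assumptions: `build_docno2metadata` and `pickled_size` are hypothetical helpers, the `(line_number,)` tuple is a guess at what each metadata entry looks like, and protocol 2 is assumed to be what gensim 3.8 passes to `pickle.dump`.

```python
import pickle


def build_docno2metadata(num_docs):
    """Hypothetical stand-in for the docno -> metadata mapping built during serialization.

    Assumes each metadata entry is a small tuple such as (line_number,).
    """
    return {docno: (docno,) for docno in range(num_docs)}


def pickled_size(obj, protocol=2):
    """Return the number of bytes `obj` occupies once pickled."""
    return len(pickle.dumps(obj, protocol=protocol))


if __name__ == "__main__":
    sample_docs = 100_000      # small enough to pickle comfortably
    target_docs = 50_025_000   # last document number reported in the log above
    sample_bytes = pickled_size(build_docno2metadata(sample_docs))
    # Linear extrapolation from the sample to the full corpus (a rough estimate only).
    estimate = sample_bytes * target_docs / sample_docs
    print("sample: %d bytes, rough estimate for full mapping: %.1f GB"
          % (sample_bytes, estimate / 1e9))
```

The extrapolation only approximates the size of the pickled file. The pickler's own in-memory bookkeeping may need noticeably more RAM than that, so a MemoryError could appear well before any file-size limit is reached.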

I need the document line from that.

If there's a specific bit of per-document metadata you need to retain, as a workaround, you may need to write your own parallel routine to save/reload that data separately.

That is indeed the recommended way. Unless there's a good reason not to, I'm in favour of even removing the existing metadata (and labels?) parameters. Their API seems out of place in Gensim, as well as under-documented.
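
A minimal sketch of such a parallel routine, assuming the corpus yields `(document, metadata)` pairs when built with `metadata=True`. The names `strip_and_save_metadata` and `iter_metadata` and the `.jsonl` sidecar file are hypothetical, not part of gensim. Metadata is streamed to disk one line per document, so it never has to be held in memory or pickled as a single object.

```python
import json
import os
import tempfile


def strip_and_save_metadata(corpus_with_metadata, metadata_path):
    """Yield only the documents, writing each document's metadata to a sidecar file.

    Assumes `corpus_with_metadata` yields (document, metadata) pairs.
    One JSON value is written per line, so line i belongs to document i.
    Note that JSON stores tuples as lists.
    """
    with open(metadata_path, "w", encoding="utf-8") as fout:
        for document, metadata in corpus_with_metadata:
            fout.write(json.dumps(metadata) + "\n")
            yield document


def iter_metadata(metadata_path):
    """Stream the saved metadata back, one entry per document, in corpus order."""
    with open(metadata_path, encoding="utf-8") as fin:
        for line in fin:
            yield json.loads(line)


if __name__ == "__main__":
    # Toy stand-in for a corpus: (bag-of-words, (line_number,)) pairs.
    toy_corpus = [([(0, 1), (2, 3)], (0,)), ([(1, 2)], (1,))]
    path = os.path.join(tempfile.mkdtemp(), "toy.metadata.jsonl")
    documents = list(strip_and_save_metadata(toy_corpus, path))
    print(documents)
    print(list(iter_metadata(path)))
```

With gensim, the generator could presumably be passed as the `corpus` argument to `MmCorpus.serialize` with `metadata=False`. That relies on the writer iterating over the corpus only once, which is an assumption worth checking against the installed version.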

@piskvorky @gojomo Is it appropriate to close this? It looks like there are more legitimate alternatives.

I tried to solve this but could not find a solution. Please reopen if anyone has a proper solution.
