I'm using the LDA implementation from Gensim and I wanted to use my estimated LDA model and corpus in the LDAvis tool.
The tutorial on taking a Gensim corpus and LDA model into pyLDAvis is really helpful (link http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb#topic=0&lambda=1&term=), but I'm having issues with my implementation.
I use the memory-friendly implementation of my corpus and don't store it in memory, which I think may be the root of my problem.
Does anyone know how I can implement pyLDAvis.gensim.prepare on a streaming corpus? I get the following error:
import pyLDAvis.gensim as gensimvis
vis_data = gensimvis.prepare(ldamodel, mycorpus, mycorpus.dictionary)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    vis_data = gensimvis.prepare(ldamod, corpus, corpus.dictionary)
  File "//anaconda/lib/python2.7/site-packages/pyLDAvis/gensim.py", line 97, in prepare
    opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
  File "//anaconda/lib/python2.7/site-packages/pyLDAvis/gensim.py", line 33, in _extract_data
    assert doc_lengths.shape[0] == len(corpus), 'Document lengths and corpus have different sizes {} != {}'.format(doc_lengths.shape[0], len(corpus))
TypeError: object of type 'MyCorpus' has no len()
This is probably better served on the gensim mailing list -- many more people will have a chance to read your question (and answer it).
I think the first thing you can try is to implement a __len__ method on your corpus, if possible, or serialize it to one of our internal formats first. An example implementation might be: https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/textcorpus.py#L100
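A minimal sketch of the first suggestion, assuming a one-document-per-line text file and a dictionary object built elsewhere (the class name, file path, and caching attribute here are illustrative, not part of any library):

```python
class MyCorpus(object):
    """Streaming corpus: one document per line; the file is never loaded whole into memory."""

    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary  # e.g. a gensim corpora.Dictionary
        self._length = None           # cached document count

    def __iter__(self):
        # stream documents one at a time as bag-of-words vectors
        with open(self.path) as f:
            for line in f:
                yield self.dictionary.doc2bow(line.lower().split())

    def __len__(self):
        # pyLDAvis calls len(corpus); count lines in one streaming pass and cache the result
        if self._length is None:
            with open(self.path) as f:
                self._length = sum(1 for _ in f)
        return self._length
```

The extra pass over the file to count documents only happens once, so the corpus stays memory-friendly while still satisfying the `len(corpus)` assertion in pyLDAvis.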
Figured out a simple solution.
Just save the corpus as a serialized corpus and use the serialized version in the call to pyLDAvis.
Example:
import gensim
import pyLDAvis
import pyLDAvis.gensim as gensimvis
from gensim import corpora, models, similarities

class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

# `dictionary`, `lda`, `path` and `outpth` are defined earlier in my script
corpus = MyCorpus()
corpora.MmCorpus.serialize(path + 'SerializedCorpus.mm', corpus)
SerializedCorpus = corpora.MmCorpus(path + 'SerializedCorpus.mm')
vis_data = gensimvis.prepare(lda, SerializedCorpus, dictionary)
pyLDAvis.save_html(vis_data, outpth + 'LDA_Visualization.html')