I'm using the LDA implementation from Gensim and I wanted to use my estimated LDA model and corpus in the LDAvis tool.
The tutorial on taking a Gensim corpus and LDA model into pyLDAvis is really helpful (link http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb#topic=0&lambda=1&term=), but I'm having issues with my implementation.
I use the memory-friendly implementation of my corpus and don't store it in memory, which I think may be the root of my problem.
Does anyone know how I can implement pyLDAvis.gensim.prepare on a streaming corpus? I get the following error:
import pyLDAvis.gensim as gensimvis
vis_data = gensimvis.prepare(ldamodel, mycorpus, mycorpus.dictionary)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    vis_data = gensimvis.prepare(ldamod, corpus, corpus.dictionary)
  File "//anaconda/lib/python2.7/site-packages/pyLDAvis/gensim.py", line 97, in prepare
    opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
  File "//anaconda/lib/python2.7/site-packages/pyLDAvis/gensim.py", line 33, in _extract_data
    assert doc_lengths.shape[0] == len(corpus), 'Document lengths and corpus have different sizes {} != {}'.format(doc_lengths.shape[0], len(corpus))
TypeError: object of type 'MyCorpus' has no len()
This is probably better served on the gensim mailing list -- many more people will have a chance to read your question (and answer it).
I think the first thing you can try is to implement a __len__ method on your corpus, if possible, or serialize it to one of our internal formats first. An example implementation might be: https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/textcorpus.py#L100
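A minimal sketch of the first suggestion, assuming a one-document-per-line text file and a dictionary object built elsewhere (the class name, file path, and caching attribute here are illustrative, not part of any library):

```python
class MyCorpus(object):
    """Streaming corpus: one document per line; the file is never loaded whole into memory."""

    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary  # e.g. a gensim corpora.Dictionary
        self._length = None           # cached document count

    def __iter__(self):
        # stream documents one at a time as bag-of-words vectors
        with open(self.path) as f:
            for line in f:
                yield self.dictionary.doc2bow(line.lower().split())

    def __len__(self):
        # pyLDAvis calls len(corpus); count lines in one streaming pass and cache the result
        if self._length is None:
            with open(self.path) as f:
                self._length = sum(1 for _ in f)
        return self._length
```

The extra pass over the file to count documents only happens once, so the corpus stays memory-friendly while still satisfying the `len(corpus)` assertion in pyLDAvis.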
Figured out a simple solution.
Just save the corpus as a serialized corpus and use the serialized version in the call to pyLDAvis.
Example:
import gensim
import pyLDAvis
import pyLDAvis.gensim as gensimvis
from gensim import corpora, models, similarities

class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

# `dictionary`, `lda`, `path` and `outpth` are defined earlier in my script
corpus = MyCorpus()
corpora.MmCorpus.serialize(path + 'SerializedCorpus.mm', corpus)
SerializedCorpus = corpora.MmCorpus(path + 'SerializedCorpus.mm')
vis_data = gensimvis.prepare(lda, SerializedCorpus, dictionary)
pyLDAvis.save_html(vis_data, outpth + 'LDA_Visualization.html')