Gensim: LDA: Increasing perplexity with increased no. of topics on small documents

Created on 18 May 2016 · 10 comments · Source: RaRe-Technologies/gensim

From the mailing list: here, here, here, and here.

In theory, a model with more topics is more expressive, so it should fit the data better. However, the value reported by `log_perplexity` is a variational bound on the perplexity, not the exact perplexity.
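As a sketch of how the reported number relates to perplexity (the bound value below is assumed for illustration, not real gensim output): `log_perplexity` returns a per-word variational lower bound in bits, and the perplexity is recovered as 2 raised to the negated bound, so a looser bound inflates the reported perplexity.

```
# Sketch with an assumed bound value, not real gensim output:
# gensim's log_perplexity() returns a per-word variational lower bound
# on the log2 likelihood, so "perplexity" is recovered as 2 ** (-bound).
bound = -10.5                # hypothetical per-word bound, in bits
perplexity = 2 ** (-bound)   # a looser (more negative) bound -> larger value
print(round(perplexity, 2))
```

Because this is only a lower bound on the likelihood, a model with more topics can report a *higher* perplexity purely because its bound is looser, not because it fits worse.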

Would like to get to the bottom of this. Does anyone have a corpus and code to reproduce?

Compare behaviour of gensim, VW, sklearn, Mallet and other implementations as number of topics increases.

Labels: bug, difficulty easy, testing

All 10 comments

@tmylk I would like to take a stab at the issue. How do I start?

@tmylk I will do this.

@shubham0420 @souravsingh have you made any progress?

I never received any guidance on how to proceed with the task. I can still take it on if needed.

Ok @souravsingh, let's try :+1:

Has anyone looked into this more closely? There's definitely something weird about the perplexity results.

Here's a sample that compares LdaModel's and VW's perplexity calculations. VW shows steadily decreasing perplexity while LdaModel's increases rapidly as the number of topics goes up, with the same model and test data:

```
topics   VW perplexity   LdaModel perplexity
    10     1748074.208              8948.300
    50     1748053.610             14224.587
   100     1748046.340             25370.445
```

And here's the code:

```
import gzip
from cytoolz import take

from gensim.models.wrappers import LdaVowpalWabbit
from gensim.models.wrappers.ldavowpalwabbit import vwmodel2ldamodel
from gensim.matutils import Sparse2Corpus

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

import gensim.downloader as api
corpus = api.load('text8')

V = CountVectorizer(analyzer=str.split)
X = V.fit_transform(take(50000, gzip.open(corpus.fn, mode='rt')))
Xtrain, Xtest = train_test_split(X, test_size=0.1, shuffle=True)

# invert the vectorizer's vocabulary: id -> token
vocab = dict((v, k) for (k, v) in V.vocabulary_.items())
train_corpus = Sparse2Corpus(Xtrain, documents_columns=False)
test_corpus = Sparse2Corpus(Xtest, documents_columns=False)

vwpath = '/usr/local/bin/vw'
models = [LdaVowpalWabbit(vwpath, train_corpus, id2word=vocab, num_topics=k)
          for k in [10, 50, 100]]

# log_perplexity returns a per-word bound, so perplexity = 2 ** -bound
for m in models:
    print('%3d %10.3f %10.3f' % (m.num_topics,
                                 2 ** -m.log_perplexity(test_corpus),
                                 2 ** -vwmodel2ldamodel(m).log_perplexity(test_corpus)))
```

Thanks for the example @rmalouf :+1: It looks really weird; something definitely goes wrong here, especially since the conversion copies the topic matrices, so the two models should be identical. IMO the two perplexity values should be identical (or have I missed something?)

Looking at vwmodel2ldamodel more closely, I think this is two separate problems. In creating a new LdaModel object, it sets expElogbeta, but that's not what's used by log_perplexity, get_topics etc. So, the LdaVowpalWabbit -> LdaModel conversion isn't happening correctly.

But, it's still also true that LdaModel's perplexity scores increase as the number of topics increases, so it looks like there's something not right there as well.
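To make the `expElogbeta` point concrete, here is a minimal stand-in sketch; `FakeLdaModel`, `FakeLdaState`, and `vw_topics` are hypothetical toy classes, not gensim's actual implementation. The idea it illustrates: if the conversion assigns the converted matrix only to `expElogbeta` while `get_topics()` (and the perplexity bound) derive topics from the model state, the converted topics are never actually picked up.

```
import numpy as np

# Hypothetical converted topic matrix from a VW model (2 topics, 4 words)
vw_topics = np.array([[0.5, 0.1, 0.1, 0.3],
                      [0.1, 0.5, 0.3, 0.1]])

class FakeLdaState:
    # stand-in for gensim's LdaState: holds the sufficient statistics
    def __init__(self, sstats):
        self.sstats = sstats

class FakeLdaModel:
    def __init__(self, num_topics, num_terms):
        # freshly initialized state -- what get_topics()/the bound read
        self.state = FakeLdaState(np.ones((num_topics, num_terms)))
        self.expElogbeta = None

    def get_topics(self):
        # topics come from the state, not from expElogbeta
        s = self.state.sstats
        return s / s.sum(axis=1, keepdims=True)

lda = FakeLdaModel(2, 4)

# The broken conversion: expElogbeta is set, but the state is untouched,
# so get_topics() still returns the uninformative initial topics.
lda.expElogbeta = vw_topics
print(np.allclose(lda.get_topics(), vw_topics))   # False

# Sketch of the kind of fix described above: write into the state instead.
lda.state.sstats = vw_topics.copy()
print(np.allclose(lda.get_topics(), vw_topics))   # True
```

This is only a toy model of the mismatch; in gensim the relevant objects are `LdaModel.state` and its `sstats`, and the real fix would need to keep the derived quantities in sync as well.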

I'm facing the same issue with LdaModel and LdaMulticore. Has there been any update on resolving this open issue?

I have the same problem, both LdaModel and LdaMulticore give weird perplexity values.

