Gensim: LDA: Increasing perplexity with increased no. of topics on small documents

Created on 18 May 2016 · 10 comments · Source: RaRe-Technologies/gensim

From the mailing list: here, here, here, and here.

In theory, a model with more topics is more expressive, so it should fit the data better. However, the value reported by `log_perplexity` is a variational bound on the perplexity, not the exact perplexity.
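As a sketch of how the reported number relates to perplexity (the bound value below is assumed for illustration, not real gensim output): `log_perplexity` returns a per-word variational lower bound in bits, and the perplexity is recovered as 2 raised to the negated bound, so a looser bound inflates the reported perplexity.

```
# Sketch with an assumed bound value, not real gensim output:
# gensim's log_perplexity() returns a per-word variational lower bound
# on the log2 likelihood, so "perplexity" is recovered as 2 ** (-bound).
bound = -10.5                # hypothetical per-word bound, in bits
perplexity = 2 ** (-bound)   # a looser (more negative) bound -> larger value
print(round(perplexity, 2))
```

Because this is only a lower bound on the likelihood, a model with more topics can report a *higher* perplexity purely because its bound is looser, not because it fits worse.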

Would like to get to the bottom of this. Does anyone have a corpus and code to reproduce?

Compare behaviour of gensim, VW, sklearn, Mallet and other implementations as number of topics increases.

Labels: bug, difficulty easy, testing

All 10 comments

@tmylk I would like to take a stab at the issue. How do I start?

@tmylk I will do this.

@shubham0420 @souravsingh have you made any progress?

I never received any guidance on how to proceed with the task. I can still take it on if needed.

Ok @souravsingh, let's try :+1:

Has anyone looked into this more closely? There's definitely something weird about the perplexity results.

Here's a sample that compares LdaModel's and VW's perplexity calculations. VW shows steadily decreasing perplexity while LdaModel's increases rapidly as the number of topics goes up, with the same model and test data:

```
topics   VW perplexity   LdaModel perplexity
    10     1748074.208              8948.300
    50     1748053.610             14224.587
   100     1748046.340             25370.445
```

And here's the code:

```
import gzip
from cytoolz import take

from gensim.models.wrappers import LdaVowpalWabbit
from gensim.models.wrappers.ldavowpalwabbit import vwmodel2ldamodel
from gensim.matutils import Sparse2Corpus

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

import gensim.downloader as api
corpus = api.load('text8')

V = CountVectorizer(analyzer=str.split)
X = V.fit_transform(take(50000, gzip.open(corpus.fn, mode='rt')))
Xtrain, Xtest = train_test_split(X, test_size=0.1, shuffle=True)

# invert the vectorizer's vocabulary: id -> token
vocab = dict((v, k) for (k, v) in V.vocabulary_.items())
train_corpus = Sparse2Corpus(Xtrain, documents_columns=False)
test_corpus = Sparse2Corpus(Xtest, documents_columns=False)

vwpath = '/usr/local/bin/vw'
models = [LdaVowpalWabbit(vwpath, train_corpus, id2word=vocab, num_topics=k)
          for k in [10, 50, 100]]

# log_perplexity returns a per-word bound, so perplexity = 2 ** -bound
for m in models:
    print('%3d %10.3f %10.3f' % (m.num_topics,
                                 2 ** -m.log_perplexity(test_corpus),
                                 2 ** -vwmodel2ldamodel(m).log_perplexity(test_corpus)))
```

Thanks for the example @rmalouf :+1: It looks really weird; something definitely goes wrong here, especially since the conversion copies the topic matrices, so the two models should be identical. IMO the two perplexity values should be identical (or have I missed something?)

Looking at vwmodel2ldamodel more closely, I think this is two separate problems. In creating a new LdaModel object, it sets expElogbeta, but that's not what's used by log_perplexity, get_topics etc. So, the LdaVowpalWabbit -> LdaModel conversion isn't happening correctly.

But, it's still also true that LdaModel's perplexity scores increase as the number of topics increases, so it looks like there's something not right there as well.
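To make the `expElogbeta` point concrete, here is a minimal stand-in sketch; `FakeLdaModel`, `FakeLdaState`, and `vw_topics` are hypothetical toy classes, not gensim's actual implementation. The idea it illustrates: if the conversion assigns the converted matrix only to `expElogbeta` while `get_topics()` (and the perplexity bound) derive topics from the model state, the converted topics are never actually picked up.

```
import numpy as np

# Hypothetical converted topic matrix from a VW model (2 topics, 4 words)
vw_topics = np.array([[0.5, 0.1, 0.1, 0.3],
                      [0.1, 0.5, 0.3, 0.1]])

class FakeLdaState:
    # stand-in for gensim's LdaState: holds the sufficient statistics
    def __init__(self, sstats):
        self.sstats = sstats

class FakeLdaModel:
    def __init__(self, num_topics, num_terms):
        # freshly initialized state -- what get_topics()/the bound read
        self.state = FakeLdaState(np.ones((num_topics, num_terms)))
        self.expElogbeta = None

    def get_topics(self):
        # topics come from the state, not from expElogbeta
        s = self.state.sstats
        return s / s.sum(axis=1, keepdims=True)

lda = FakeLdaModel(2, 4)

# The broken conversion: expElogbeta is set, but the state is untouched,
# so get_topics() still returns the uninformative initial topics.
lda.expElogbeta = vw_topics
print(np.allclose(lda.get_topics(), vw_topics))   # False

# Sketch of the kind of fix described above: write into the state instead.
lda.state.sstats = vw_topics.copy()
print(np.allclose(lda.get_topics(), vw_topics))   # True
```

This is only a toy model of the mismatch; in gensim the relevant objects are `LdaModel.state` and its `sstats`, and the real fix would need to keep the derived quantities in sync as well.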

I'm facing the same issue with LdaModel and LdaMulticore. Has there been any update on resolving this open issue?

I have the same problem, both LdaModel and LdaMulticore give weird perplexity values.

