From the mailing list here and here and here and here
In theory, a model with more topics is more expressive, so it should fit the data better. However, the reported perplexity is a variational bound on the true perplexity, not its exact value.
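For context on that bound: gensim's `log_perplexity` returns a per-word lower bound on the log2 likelihood, and the corresponding perplexity is obtained by negating and exponentiating with base 2. A minimal sketch with a made-up bound value (the `-9.5` is illustrative only):

```python
# Hypothetical per-word variational lower bound on the log2 likelihood,
# of the kind returned by gensim's LdaModel.log_perplexity.
per_word_bound = -9.5  # illustrative value, not from a real model

# Perplexity corresponding to the bound. Because the bound is a *lower*
# bound on the likelihood, this is an *upper*-bound estimate of the
# true perplexity.
perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))  # prints 724.1
```

So a looser bound at higher topic counts could, in principle, inflate the reported perplexity even if the true perplexity improves.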
Would like to get to the bottom of this. Does anyone have a corpus and code to reproduce?
Compare behaviour of gensim, VW, sklearn, Mallet and other implementations as number of topics increases.
@tmylk I would like to take a stab at the issue. How do I start?
@tmylk I will do this.
@shubham0420 @souravsingh have you any progress?
No guidance was given on how to proceed with the task. I can still take it on if needed.
Ok @souravsingh, let's try :+1:
Has anyone looked into this more closely? There's definitely something weird about the perplexity results.
Here's a sample that compares LdaModel's and VW's perplexity calculations. With the same model and test data, VW shows steadily decreasing perplexity while LdaModel shows it rapidly increasing as the number of topics goes up:
```
topics           VW    LdaModel
    10  1748074.208    8948.300
    50  1748053.610   14224.587
   100  1748046.340   25370.445
```
And here's the code:
```python
import gzip
from cytoolz import take
from gensim.models.wrappers import LdaVowpalWabbit
from gensim.models.wrappers.ldavowpalwabbit import vwmodel2ldamodel
from gensim.matutils import Sparse2Corpus
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import gensim.downloader as api
corpus = api.load('text8')
V = CountVectorizer(analyzer=str.split)
X = V.fit_transform(take(50000,gzip.open(corpus.fn,mode='rt')))
Xtrain, Xtest = train_test_split(X, test_size=0.1, shuffle=True)
vocab = dict((v,k) for (k,v) in V.vocabulary_.items())
train_corpus = Sparse2Corpus(Xtrain, documents_columns=False)
test_corpus = Sparse2Corpus(Xtest, documents_columns=False)
vwpath = '/usr/local/bin/vw'
models = [ LdaVowpalWabbit(vwpath, train_corpus, id2word=vocab, num_topics=k)
for k in [10, 50, 100] ]
for m in models:
    print('%3d %10.3f %10.3f' % (m.num_topics,
                                 2 ** -m.log_perplexity(test_corpus),
                                 2 ** -vwmodel2ldamodel(m).log_perplexity(test_corpus)))
```
Thanks for the example @rmalouf :+1: This looks really weird; something is definitely going wrong here, especially since the two models should have identical matrices after conversion.
IMO the results should be identical (or am I missing something?)
Looking at `vwmodel2ldamodel` more closely, I think there are two separate problems. In creating the new `LdaModel` object, it sets `expElogbeta`, but that's not what's used by `log_perplexity`, `get_topics`, etc. So the `LdaVowpalWabbit` -> `LdaModel` conversion isn't happening correctly.
But it's also still true that `LdaModel`'s perplexity scores increase as the number of topics increases, so it looks like something isn't right there either.
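To illustrate the first problem, here's a minimal numpy sketch (not gensim's actual code) of the two topic-word arrays an `LdaModel` carries: the variational sufficient statistics on the model's state, and the derived `expElogbeta`. Since methods like `get_topics()` and `log_perplexity()` read from the state, a conversion that overwrites only `expElogbeta` leaves the model internally inconsistent:

```python
import numpy as np

rng = np.random.default_rng(0)

sstats = rng.random((2, 5))     # stand-in for the old topic-word state statistics
vw_topics = rng.random((2, 5))  # stand-in for topics imported from the VW model

# Buggy conversion: replace only the derived array, not the state.
expElogbeta = vw_topics / vw_topics.sum(axis=1, keepdims=True)

# What get_topics()-style code would still return: the normalized
# OLD statistics, untouched by the "conversion".
topics_from_state = sstats / sstats.sum(axis=1, keepdims=True)

# The two disagree, so any perplexity computed from the state
# ignores the imported VW topics entirely.
print(np.allclose(topics_from_state, expElogbeta))  # prints False
```

Presumably the fix is to write the converted topics into the state's statistics and re-derive `expElogbeta` from there, rather than setting `expElogbeta` directly.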
I'm facing the same issue with both LdaModel and LdaMulticore. Has there been any update on resolving this?
I have the same problem, both LdaModel and LdaMulticore give weird perplexity values.