For my internship, I'm trying to evaluate the quality of different LDA models using both perplexity and coherence. I generate each model in a loop. However, I run into a problem when the loop reaches parameter_value=15: the coherence score is stored as nan for the models with 15, 20, 25 and 30 topics. I tried fixing this by changing the parameters in .LdaModel(), but that only pushes the warning to later models: instead of getting it for parameter_value=15, I get it for parameter_value=30.
Can someone please help me?
starting pass for parameter_value = 30.000
Elapsed time: 1.6870347789972584
Perplexity score: -13.63168019880968
C:\Users\straw\Anaconda3\lib\site-packages\gensim\topic_coherence\direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
m_lr_i = np.log(numerator / denominator)
C:\Users\straw\Anaconda3\lib\site-packages\gensim\topic_coherence\indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
Coherence Score: nan
import timeit
from collections import defaultdict

import gensim

grid_flt = defaultdict(list)
# numbers of topics to try
parameter_list = [2, 5, 10, 15, 20, 25, 30]
for parameter_value in parameter_list:
    print("starting pass for parameter_value = %.3f" % parameter_value)
    start_time = timeit.default_timer()
    # train the model
    ldamodel_train_flt = gensim.models.ldamodel.LdaModel(corpus=doc_term_matrix_train_flt,
                                                         id2word=dictionary_train_flt,
                                                         num_topics=parameter_value,
                                                         passes=25, per_word_topics=True)
    # show elapsed time for the model
    elapsed = timeit.default_timer() - start_time
    print("Elapsed time: %s" % elapsed)
    # compute perplexity on the held-out test corpus
    perplex = ldamodel_train_flt.log_perplexity(doc_term_matrix_test_flt)
    print("Perplexity score: %s" % perplex)
    grid_flt[parameter_value].append(perplex)
    # compute the c_v coherence score on the test texts
    coherence_model_lda = gensim.models.coherencemodel.CoherenceModel(model=ldamodel_train_flt,
                                                                      texts=list_of_docs_flt_test,
                                                                      dictionary=dictionary_train_flt,
                                                                      coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print("Coherence Score: %s" % coherence_lda)
    grid_flt[parameter_value].append(coherence_lda)
Windows-10-10.0.17134-SP0
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION -1
@Kaotic-Kiwi Can you please explain what happened here?
I am facing the exact same issue while using the LdaMallet wrapper. Could you please provide a solution?
My function creates multiple models and stores their coherence scores in a list:
import gensim
from gensim.models import CoherenceModel

def compute_coherence_score(dictionary, corpus, texts, limit, start, step):
    """Compute the coherence score for different numbers of topics."""
    coherence_scores, model_list = [], []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path=mallet_path, corpus=corpus,
                                                 id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherence_model = CoherenceModel(model=model, texts=texts,
                                         dictionary=dictionary, coherence='c_v')
        coherence_scores.append(coherence_model.get_coherence())
    return model_list, coherence_scores
Function call:
model_list, coherence_scores = compute_coherence_score(dictionary=id2word, texts=data_words,
                                                       corpus=corpus, limit=100, start=50, step=10)
print(model_list)
print(coherence_scores)
Error message: the resulting coherence_scores = [nan, nan, nan, nan] (all NaN values), with this warning:
/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
m_lr_i = np.log(numerator / denominator)
/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
@Kaotic-Kiwi @mpenkov Could you please let us know if there is any solution for this?
I'm getting a similar error using the default gensim LDA implementation.
I did notice that certain combinations of num_topics (in LDA) and topn (in CoherenceModel) let the coherence calculation go through; for example, with 30 topics and topn=2 the calculation succeeds (see the sketch below).
Any thoughts? Perhaps this is a numerical stability issue?
PS: interestingly, with the other window methods 'c_uci' and 'c_npmi' I get inf instead of nan.
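A quick way to probe that observation is to vary topn, the number of top words per topic that CoherenceModel considers. This is a hypothetical sketch; model, texts, and dictionary stand for your own trained model, tokenized documents, and gensim dictionary:

from gensim.models.coherencemodel import CoherenceModel

# Check which topn values survive the c_v calculation for a given model.
for topn in (2, 5, 10, 20):
    cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                        coherence='c_v', topn=topn)
    print("topn=%d -> coherence=%s" % (topn, cm.get_coherence()))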
I am getting exactly the same error as @Kaotic-Kiwi when I try to calculate the coherence with c_v. Could somebody please help us or reopen the issue?
I'm wondering if this comes from adding EPSILON to the numerator rather than the denominator, at lines 202-203 in topic_coherence/direct_confirmation_measure.py:
numerator = (co_occur_count / num_docs) + EPSILON
denominator = (w_prime_count / num_docs) * (w_star_count / num_docs)
m_lr_i = np.log(numerator / denominator)
Adding +EPSILON to the denominator removes both the warning and the NaN coherence result for me.
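For reference, the change being suggested would look like this (a sketch of a patch to the quoted lines, not the shipped gensim code):

# Also adding EPSILON to the denominator prevents the divide-by-zero that
# occurs when one of the two top words never appears in the reference texts
# (so its count, and hence the product of marginal probabilities, is zero).
numerator = (co_occur_count / num_docs) + EPSILON
denominator = (w_prime_count / num_docs) * (w_star_count / num_docs) + EPSILON
m_lr_i = np.log(numerator / denominator)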
@HaukeT @aschoenauer-sebag topic_coherence is a contributed module and its quality may be iffy.
If you're able to fix the issue and open a clean, clear PR, that'd be great.
Hello,
I wrote some different code and it seems to do the job for me. Apparently, using the corpus parameter instead of the dictionary parameter doesn't produce any errors. I think coherence='c_v' doesn't like being called with the dictionary parameter; I don't quite understand why.
import gensim

def LdaPipeline(train_set, test_set, k):
    dictionary = gensim.corpora.Dictionary(train_set)
    corpus_train = [dictionary.doc2bow(doc) for doc in train_set]
    corpus_test = [dictionary.doc2bow(doc) for doc in test_set]
    # LDA
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus_train, id2word=dictionary,
                                                num_topics=k, passes=30, alpha='auto')
    # Perplexity on the held-out test corpus
    perplexity = lda_model.log_perplexity(corpus_test)
    # Coherence, passing corpus rather than dictionary
    coherence_model = gensim.models.coherencemodel.CoherenceModel(model=lda_model, corpus=corpus_train,
                                                                  texts=train_set, coherence='c_v')
    coherence = coherence_model.get_coherence()
    return [perplexity, coherence]
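Example call, assuming train_docs and test_docs are your own lists of tokenized documents:

# Evaluate a 10-topic model on a held-out test set.
perplexity, coherence = LdaPipeline(train_docs, test_docs, k=10)
print("Perplexity: %s, Coherence: %s" % (perplexity, coherence))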
Thank you for reopening the issue and for the replies. The workaround from @Kaotic-Kiwi to use only the corpus parameter and avoid the dictionary parameter did not work for my data. I will try to find an error in my data.
In my case, this error happens when I pass my prior eta to the model. My eta is a numpy.ndarray with shape (num_topics, num_terms). I initialize eta with the value 1/(num_topics) and transfer a prior into the top-n rows.
E.g., with 3 topics, where the first row is my prior:
[[18, 63, 52, 5, 0, 145],
 [1/3, 1/3, 1/3, 1/3, 1/3, 1/3],
 [1/3, 1/3, 1/3, 1/3, 1/3, 1/3]]
The more prior rows I transfer, the more NaN values the topic coherence calculation produces (e.g. 30 topics, 10 rows transferred).
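To make that concrete, here is a hypothetical sketch of the eta construction described above; corpus and dictionary are assumed to exist, and num_terms must match the vocabulary size:

import numpy as np
import gensim

num_topics, num_terms = 3, 6
# Fill eta with 1/num_topics, then overwrite the first row with a term-count prior.
eta = np.full((num_topics, num_terms), 1.0 / num_topics)
eta[0] = [18, 63, 52, 5, 0, 145]  # note the zero entry, as in the example above

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=num_topics, eta=eta)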
It does not work for me either. I am using LdaMallet, and using the corpus parameter instead of the dictionary parameter as per @Kaotic-Kiwi's advice did not help to solve the issue, unfortunately.
I get this error when switching to the corpus parameter:
text_analysis.py in _ids_to_words(ids, dictionary)
55
56 """
---> 57 if not dictionary.id2token: # may not be initialized in the standard gensim.corpora.Dictionary
58 setattr(dictionary, 'id2token', {v: k for k, v in dictionary.token2id.items()})
59
AttributeError: 'dict' object has no attribute 'id2token'
Using u_mass solves the issue, although this is a different metric.
coherencemodel = CoherenceModel(model=model, texts=docs, corpus=corpus, coherence='u_mass')
@kdubovikov What is the full traceback?
I wonder if the dictionary in the code you show is allowed to be a plain dict, or must be gensim.corpora.Dictionary.
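Judging from the traceback above, _ids_to_words reads .id2token and .token2id, so it expects a gensim.corpora.Dictionary and a plain dict will fail. A hedged sketch of a conversion, assuming id2word is a plain {id: token} mapping:

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Wrap the plain dict in a real Dictionary; id2token is then built lazily
# from token2id inside the coherence pipeline.
d = Dictionary()
d.token2id = {token: idx for idx, token in id2word.items()}
coherencemodel = CoherenceModel(model=model, texts=docs, dictionary=d, coherence='c_v')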