For my internship, I'm trying to evaluate the quality of different LDA models using both perplexity and coherence. I generate each model in a loop. However, I run into a problem when the loop reaches parameter_value=15: the coherence score is stored as nan for the models with 15, 20, 25 and 30 topics. I tried fixing this by changing the parameters in .LdaModel(), but that only pushes the warning to later models: instead of getting it for parameter_value=15, I get it for parameter_value=30.
Can someone please help me?
starting pass for parameter_value = 30.000
Elapsed time: 1.6870347789972584
Perplexity score: -13.63168019880968
C:\Users\straw\Anaconda3\lib\site-packages\gensim\topic_coherence\direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
m_lr_i = np.log(numerator / denominator)
C:\Users\straw\Anaconda3\lib\site-packages\gensim\topic_coherence\indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
Coherence Score: nan
import timeit
from collections import defaultdict

import gensim

grid_flt = defaultdict(list)
# numbers of topics to try
parameter_list = [2, 5, 10, 15, 20, 25, 30]
for parameter_value in parameter_list:
    print("starting pass for parameter_value = %.3f" % parameter_value)
    start_time = timeit.default_timer()
    # train the model
    ldamodel_train_flt = gensim.models.ldamodel.LdaModel(corpus=doc_term_matrix_train_flt,
                                                         id2word=dictionary_train_flt,
                                                         num_topics=parameter_value,
                                                         passes=25, per_word_topics=True)
    # show elapsed time for the model
    elapsed = timeit.default_timer() - start_time
    print("Elapsed time: %s" % elapsed)
    # compute perplexity on the held-out test corpus
    perplex = ldamodel_train_flt.log_perplexity(doc_term_matrix_test_flt)
    print("Perplexity score: %s" % perplex)
    grid_flt[parameter_value].append(perplex)
    # compute the c_v coherence score on the test texts
    coherence_model_lda = gensim.models.coherencemodel.CoherenceModel(model=ldamodel_train_flt,
                                                                      texts=list_of_docs_flt_test,
                                                                      dictionary=dictionary_train_flt,
                                                                      coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print("Coherence Score: %s" % coherence_lda)
    grid_flt[parameter_value].append(coherence_lda)
Windows-10-10.0.17134-SP0
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION -1
@Kaotic-Kiwi Can you please explain what happened here?
I am facing the exact same issue while using the LdaMallet wrapper. Could you please provide a solution?
My function creates multiple models and stores their coherence scores in a list:
import gensim
from gensim.models import CoherenceModel

def compute_coherence_score(dictionary, corpus, texts, limit, start, step):
    """Compute the coherence score for different numbers of topics."""
    coherence_scores, model_list = [], []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path=mallet_path, corpus=corpus,
                                                 id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherence_model = CoherenceModel(model=model, texts=texts,
                                         dictionary=dictionary, coherence='c_v')
        coherence_scores.append(coherence_model.get_coherence())
    return model_list, coherence_scores
Function call:
model_list, coherence_scores = compute_coherence_score(dictionary=id2word, texts=data_words,
                                                       corpus=corpus, limit=100, start=50, step=10)
print(model_list)
print(coherence_scores)
Error message: the resulting coherence_scores = [nan, nan, nan, nan] (all NaN values), with this warning:
/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
m_lr_i = np.log(numerator / denominator)
/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
@Kaotic-Kiwi @mpenkov Could you please let us know if there is any solution for this?
I'm getting a similar error using the default gensim LDA implementation.
I did notice that certain combinations of num_topics (in LDA) and topn (in CoherenceModel) let the coherence calculation go through; for example, with 30 topics and topn=2 the calculation succeeds (see the sketch below).
Any thoughts? Perhaps this is a numerical stability issue?
PS: interestingly, with the other window methods 'c_uci' and 'c_npmi' I get inf instead of nan.
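A quick way to probe that observation is to vary topn, the number of top words per topic that CoherenceModel considers. This is a hypothetical sketch; model, texts, and dictionary stand for your own trained model, tokenized documents, and gensim dictionary:

from gensim.models.coherencemodel import CoherenceModel

# Check which topn values survive the c_v calculation for a given model.
for topn in (2, 5, 10, 20):
    cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                        coherence='c_v', topn=topn)
    print("topn=%d -> coherence=%s" % (topn, cm.get_coherence()))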
I am getting exactly the same error as @Kaotic-Kiwi when I try to calculate the coherence with c_v. Could somebody please help us or reopen the issue?
I'm wondering if this comes from adding EPSILON to the numerator rather than the denominator, at lines 202-203 in topic_coherence/direct_confirmation_measure.py:
numerator = (co_occur_count / num_docs) + EPSILON
denominator = (w_prime_count / num_docs) * (w_star_count / num_docs)
m_lr_i = np.log(numerator / denominator)
Adding +EPSILON to the denominator removes both the warning and the NaN coherence result for me.
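For reference, the change being suggested would look like this (a sketch of a patch to the quoted lines, not the shipped gensim code):

# Also adding EPSILON to the denominator prevents the divide-by-zero that
# occurs when one of the two top words never appears in the reference texts
# (so its count, and hence the product of marginal probabilities, is zero).
numerator = (co_occur_count / num_docs) + EPSILON
denominator = (w_prime_count / num_docs) * (w_star_count / num_docs) + EPSILON
m_lr_i = np.log(numerator / denominator)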
@HaukeT @aschoenauer-sebag topic_coherence is a contributed module and its quality may be iffy.
If you're able to fix the issue and open a clean, clear PR, that'd be great.
Hello,
I wrote some different code and it seems to do the job for me. Apparently, using the corpus parameter instead of the dictionary parameter doesn't produce any errors. I think coherence='c_v' doesn't like being called with the dictionary parameter; I don't quite understand why.
import gensim

def LdaPipeline(train_set, test_set, k):
    dictionary = gensim.corpora.Dictionary(train_set)
    corpus_train = [dictionary.doc2bow(doc) for doc in train_set]
    corpus_test = [dictionary.doc2bow(doc) for doc in test_set]
    # LDA
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus_train, id2word=dictionary,
                                                num_topics=k, passes=30, alpha='auto')
    # Perplexity on the held-out test corpus
    perplexity = lda_model.log_perplexity(corpus_test)
    # Coherence, passing corpus rather than dictionary
    coherence_model = gensim.models.coherencemodel.CoherenceModel(model=lda_model, corpus=corpus_train,
                                                                  texts=train_set, coherence='c_v')
    coherence = coherence_model.get_coherence()
    return [perplexity, coherence]
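Example call, assuming train_docs and test_docs are your own lists of tokenized documents:

# Evaluate a 10-topic model on a held-out test set.
perplexity, coherence = LdaPipeline(train_docs, test_docs, k=10)
print("Perplexity: %s, Coherence: %s" % (perplexity, coherence))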
Thank you for reopening the issue and for the replies. The workaround from @Kaotic-Kiwi to use only the corpus parameter and avoid the dictionary parameter did not work for my data. I will try to find an error in my data.
In my case, this error happens when I pass my prior eta to the model. My eta is a numpy.ndarray with shape (num_topics, num_terms). I initialize eta with the value 1/(num_topics) and transfer a prior into the top-n rows.
E.g., with 3 topics, where the first row is my prior:
[[18, 63, 52, 5, 0, 145],
 [1/3, 1/3, 1/3, 1/3, 1/3, 1/3],
 [1/3, 1/3, 1/3, 1/3, 1/3, 1/3]]
The more prior rows I transfer, the more NaN values the topic coherence calculation produces (e.g. 30 topics, 10 rows transferred).
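To make that concrete, here is a hypothetical sketch of the eta construction described above; corpus and dictionary are assumed to exist, and num_terms must match the vocabulary size:

import numpy as np
import gensim

num_topics, num_terms = 3, 6
# Fill eta with 1/num_topics, then overwrite the first row with a term-count prior.
eta = np.full((num_topics, num_terms), 1.0 / num_topics)
eta[0] = [18, 63, 52, 5, 0, 145]  # note the zero entry, as in the example above

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=num_topics, eta=eta)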
It does not work for me either. I am using LdaMallet, and using the corpus parameter instead of the dictionary parameter as per @Kaotic-Kiwi's advice did not help to solve the issue, unfortunately.
I get this error when switching to the corpus parameter:
text_analysis.py in _ids_to_words(ids, dictionary)
55
56 """
---> 57 if not dictionary.id2token: # may not be initialized in the standard gensim.corpora.Dictionary
58 setattr(dictionary, 'id2token', {v: k for k, v in dictionary.token2id.items()})
59
AttributeError: 'dict' object has no attribute 'id2token'
Using u_mass solves the issue, although this is a different metric.
coherencemodel = CoherenceModel(model=model, texts=docs, corpus=corpus, coherence='u_mass')
@kdubovikov What is the full traceback?
I wonder if the dictionary in the code you show is allowed to be a plain dict, or must be gensim.corpora.Dictionary.
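Judging from the traceback above, _ids_to_words reads .id2token and .token2id, so it expects a gensim.corpora.Dictionary and a plain dict will fail. A hedged sketch of a conversion, assuming id2word is a plain {id: token} mapping:

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Wrap the plain dict in a real Dictionary; id2token is then built lazily
# from token2id inside the coherence pipeline.
d = Dictionary()
d.token2id = {token: idx for idx, token in id2word.items()}
coherencemodel = CoherenceModel(model=model, texts=docs, dictionary=d, coherence='c_v')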