Gensim: Unreasonable Query Result

Created on 23 Nov 2019 · 10 Comments · Source: RaRe-Technologies/gensim

Problem description

  **The query result does not seem correct: the document identical to the query is ranked below the other document. The code is self-explanatory. Thank you!**

Steps/code/corpus to reproduce

from gensim.summarization.bm25 import BM25

text1 = "A constellation is a group of stars that are considered to form imaginary outlines or meaningful patterns on the celestial sphere."
text2 = "The 88 modern constellations are formally defined regions of the sky together covering the entire celestial sphere."
text = [text1, text2]

# Tokenize naively on single spaces; case and punctuation are left untouched.
corpus = [text1.split(" "), text2.split(" ")]
print(f'corpus: {corpus}')

# Query with the tokens of the second document, so document 1 is expected to rank first.
query = text2.split(" ")

bm25 = BM25(corpus)
scores = bm25.get_scores(query)
# Pair each score with its document index and sort best-first.
scores = [(s, i) for i, s in enumerate(scores)]
scores.sort(key=lambda t: t[0], reverse=True)
print(f'scores:         {scores}')

for s, idx in scores:
    print(f'{s}\t{idx}: {text[idx]}')

Output:

-0.3601521710456333     0: A constellation is a group of stars that are considered to form imaginary outlines or meaningful patterns on the celestial sphere.
-0.44989406787023367    1: The 88 modern constellations are formally defined regions of the sky together covering the entire celestial sphere.

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

Output:

macOS-10.14.6-x86_64-i386-64bit
Python 3.8.0 (default, Nov  6 2019, 15:49:01)
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.17.4
SciPy 1.3.3
gensim 3.8.1
FAST_VERSION 0

Source

JiaqiLiu

All 10 comments

@Witiko could you please have a look?

piskvorky on 23 Nov 2019

@JiaqiLiu Your issue seems to be due to the inverse document frequency term in the BM25 formula. Since the corpus is so small (only 2 documents), most words have zero or negative inverse document frequency. Can you try using a larger corpus and see if the results become more reasonable?
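A rough illustration of why corpus size matters, assuming the classic BM25 IDF form log((N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number of documents containing the word. The helper name `bm25_idf` is hypothetical and is not part of gensim, and the 100-document corpus is made up for the comparison:

```python
import math


def bm25_idf(num_docs, doc_freq):
    """Classic BM25 IDF, log((N - n + 0.5) / (n + 0.5)). Hypothetical helper."""
    return math.log(num_docs - doc_freq + 0.5) - math.log(doc_freq + 0.5)


# Two-document corpus, as in the report.
print(bm25_idf(2, 1))  # word occurs in one document: zero
print(bm25_idf(2, 2))  # word occurs in both documents: negative

# Hypothetical corpus of 100 documents.
print(bm25_idf(100, 1))   # rare word: positive
print(bm25_idf(100, 60))  # word in more than half the documents: negative
```

With only two documents no word can receive a positive IDF under this formula, so no query term can contribute positively to a score.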

Witiko on 24 Nov 2019

Hi Witiko, thank you for your reply. You are right: the result is correct when I try a larger corpus.
Thank you!

JiaqiLiu on 24 Nov 2019

Can you summarize what the issue is (was)? Why should a small corpus give incorrect / useless results because of inverse document frequencies? Sounds fishy to me.

piskvorky on 24 Nov 2019

@piskvorky BM25 is a fairly simple model, but still sufficiently complex to surprise. The IDF term in the BM25 formula is negative when a word occurs in more than half the corpus. In general, this penalizes stopwords. However, in this example, the IDF for all words is either zero (a word occurs in one document) or negative (a word occurs in both documents). Hence the negative scores.
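The arithmetic can be retraced with a minimal pure-Python sketch that does not call gensim. It assumes k1 = 1.5, b = 0.75, and that negative IDFs are replaced by 0.25 times the average IDF. These constants and the flooring step are assumptions about the gensim 3.8 implementation; they are not stated in this thread. All names in the sketch are illustrative.

```python
import math
from collections import Counter

# Assumed constants (not confirmed in this thread).
K1, B, EPSILON = 1.5, 0.75, 0.25

text1 = "A constellation is a group of stars that are considered to form imaginary outlines or meaningful patterns on the celestial sphere."
text2 = "The 88 modern constellations are formally defined regions of the sky together covering the entire celestial sphere."
corpus = [text1.split(" "), text2.split(" ")]
query = text2.split(" ")

num_docs = len(corpus)
avg_len = sum(len(doc) for doc in corpus) / num_docs
term_freqs = [Counter(doc) for doc in corpus]
# Document frequency: in how many documents each word occurs.
doc_freq = Counter(word for tf in term_freqs for word in tf)

raw_idf = {
    word: math.log(num_docs - n + 0.5) - math.log(n + 0.5)
    for word, n in doc_freq.items()
}
average_idf = sum(raw_idf.values()) / len(raw_idf)

# Assumed flooring step: negative IDFs become EPSILON * average_idf,
# which is still negative when the average itself is negative.
floor = EPSILON * average_idf
idf = {word: (floor if value < 0 else value) for word, value in raw_idf.items()}


def score(query_tokens, index):
    """BM25 score of the document at `index`; repeated query tokens count repeatedly."""
    tf = term_freqs[index]
    norm = K1 * (1 - B + B * len(corpus[index]) / avg_len)
    return sum(
        idf[word] * tf[word] * (K1 + 1) / (tf[word] + norm)
        for word in query_tokens
        if word in tf
    )


scores = [score(query, i) for i in range(num_docs)]
print("average_idf:", average_idf)
print("scores:", scores)
```

Under these assumptions the sketch yields scores that agree with the reported output. Only the five words shared by both documents carry a nonzero (negative) weight, and the second document matches them more strongly, so it ends up with the lower score.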

Witiko on 24 Nov 2019

Understood, but is that a desirable result?

If the formula breaks down for some inputs, isn't it better to raise an error? Or is such a score "expected" and interpretable?

piskvorky on 24 Nov 2019

We can issue a warning in the BM25 constructor when the average IDF over all words is negative (self.average_idf < 0), i.e. when the corpus does not obey Zipf's law.

>>> bm25.average_idf
-0.2514746738178282

In general, the user can break the assumptions of BM25 in several ways, but I worry that raising an exception when we receive garbage input might break existing systems.
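A minimal sketch of such a check, written in pure Python rather than against the gensim code base. The function names `average_bm25_idf` and `warn_if_negative_average_idf` and the warning text are hypothetical, and the IDF form log((N - n + 0.5) / (n + 0.5)) is assumed:

```python
import math
import warnings
from collections import Counter


def average_bm25_idf(corpus):
    """Mean BM25 IDF over the vocabulary of a tokenized corpus (hypothetical helper)."""
    num_docs = len(corpus)
    # Count each word once per document it occurs in.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    idfs = [
        math.log(num_docs - n + 0.5) - math.log(n + 0.5)
        for n in doc_freq.values()
    ]
    return sum(idfs) / len(idfs)


def warn_if_negative_average_idf(corpus):
    """Warn, rather than raise, so that existing callers keep working."""
    average_idf = average_bm25_idf(corpus)
    if average_idf < 0:
        warnings.warn(
            "Average IDF is negative (%.4f); BM25 scores for this corpus "
            "may be unintuitive." % average_idf,
            UserWarning,
        )
    return average_idf


text1 = "A constellation is a group of stars that are considered to form imaginary outlines or meaningful patterns on the celestial sphere."
text2 = "The 88 modern constellations are formally defined regions of the sky together covering the entire celestial sphere."
print(warn_if_negative_average_idf([text1.split(" "), text2.split(" ")]))
```

For the two-sentence corpus from the report this emits the warning and returns approximately the value shown above. A corpus in which most words occur in fewer than half of the documents passes silently.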

Witiko on 24 Nov 2019

OK. So my understanding is this "works as expected by the BM25 formula", but in this case, "formula expectation != user expectation".

I agree with adding the warning wherever the input seems malformed: a negative average IDF, and possibly other cases. @JiaqiLiu can you open a PR?

piskvorky on 24 Nov 2019

I opened PR #2687 for this issue.

Witiko on 26 Nov 2019

Thanks a lot, @Witiko.

piskvorky on 26 Nov 2019
