**The query result does not seem correct. The code is self-explanatory. Thank you!**
from gensim.summarization.bm25 import BM25, get_bm25_weights
text1 = "A constellation is a group of stars that are considered to form imaginary outlines or meaningful patterns on the celestial sphere."
text2 = "The 88 modern constellations are formally defined regions of the sky together covering the entire celestial sphere."
text = [text1, text2]
corpus = [text1.split(" "), text2.split(" ")]
print(f'corpus: {corpus}')
query = text2.split(" ")
bm25 = BM25(corpus)
scores = bm25.get_scores(query)
scores = [(s, i) for i, s in enumerate(scores)]
scores.sort(key=lambda t: t[0], reverse=True)
print(f'scores: {scores}')
for s, idx in scores:
    print(f'{s}\t{idx}: {text[idx]}')
Output:
-0.3601521710456333 0: A constellation is a group of stars that are considered to form imaginary outlines or meaningful patterns on the celestial sphere.
-0.44989406787023367 1: The 88 modern constellations are formally defined regions of the sky together covering the entire celestial sphere.
Please provide the output of:
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
Output:
macOS-10.14.6-x86_64-i386-64bit
Python 3.8.0 (default, Nov 6 2019, 15:49:01)
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.17.4
SciPy 1.3.3
gensim 3.8.1
FAST_VERSION 0
@Witiko could you please have a look?
@JiaqiLiu Your issue seems to be due to the inverse document frequency term in the BM25 formula. Since the corpus is so small (only 2 documents), most words have zero or negative inverse document frequency. Can you try using a larger corpus and see if the results become more reasonable?
Hi Witiko, thank you for your reply. You are right: the results are correct with a larger corpus.
Thank you!
Can you summarize what the issue is (was)? Why should a small corpus give incorrect / useless results because of inverse document frequencies? Sounds fishy to me.
@piskvorky BM25 is a fairly simple model, but still sufficiently complex to surprise. The IDF term in the BM25 formula is negative when a word occurs in more than half the corpus. In general, this penalizes stopwords. However, in this example, the IDF for all words is either zero (a word occurs in one document) or negative (a word occurs in both documents). Hence the negative scores.
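To make the arithmetic concrete, here is a minimal sketch of the IDF term, assuming the classic BM25 form log((N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number of documents containing the word. The helper name `bm25_idf` is hypothetical and not part of gensim, whose implementation may apply further adjustments on top of this term.

```python
import math


def bm25_idf(doc_freq, corpus_size):
    """Classic BM25 IDF: log((N - n + 0.5) / (n + 0.5)).

    Hypothetical helper for illustration only; not a gensim function.
    """
    return math.log((corpus_size - doc_freq + 0.5) / (doc_freq + 0.5))


# Two-document corpus, as in the report above.
idf_one_doc = bm25_idf(1, 2)    # word occurs in one document: log(1.5 / 1.5) = 0
idf_both_docs = bm25_idf(2, 2)  # word occurs in both documents: log(0.5 / 2.5) < 0

# In a larger corpus, a rare word gets a positive IDF.
idf_rare = bm25_idf(1, 100)

print(idf_one_doc, idf_both_docs, idf_rare)
```

With only two documents no word can have a positive IDF under this formula, so every query term contributes zero or a negative amount to the score.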
Understood, but is that a desirable result?
If the formula breaks down for some inputs, isn't it better to raise an error? Or is such a score "expected" and interpretable?
We can issue a warning in the BM25 constructor when the average IDF over all words is negative (self.average_idf < 0), i.e. when the corpus does not obey Zipf's law.
>>> bm25.average_idf
-0.2514746738178282
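A rough sketch of what such a check could look like, written as standalone code rather than as a patch to gensim. The function names `average_idf` and `warn_if_negative_average_idf` are hypothetical, the IDF is assumed to be the classic log((N - n + 0.5) / (n + 0.5)) averaged over distinct tokens, and the warning text is only a placeholder.

```python
import math
import warnings


def average_idf(corpus):
    """Mean BM25 IDF over the distinct tokens of a tokenized corpus.

    Hypothetical helper; assumes the classic IDF form
    log((N - n + 0.5) / (n + 0.5)).
    """
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for word in set(doc):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    if not doc_freq:
        return 0.0  # empty corpus: nothing to average
    idfs = [math.log((n_docs - n + 0.5) / (n + 0.5)) for n in doc_freq.values()]
    return sum(idfs) / len(idfs)


def warn_if_negative_average_idf(corpus):
    """Warn (rather than raise) when the average IDF is negative."""
    avg = average_idf(corpus)
    if avg < 0:
        warnings.warn(
            "Average IDF is negative (%.4f); the corpus may be too small "
            "or too repetitive for BM25 scores to be meaningful." % avg
        )
    return avg


text1 = "A constellation is a group of stars that are considered to form imaginary outlines or meaningful patterns on the celestial sphere."
text2 = "The 88 modern constellations are formally defined regions of the sky together covering the entire celestial sphere."
corpus = [text1.split(" "), text2.split(" ")]
avg = warn_if_negative_average_idf(corpus)  # emits a UserWarning for this corpus
```

A warning keeps existing pipelines running while still flagging input for which the scores are hard to interpret.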
In general, the user can break the assumptions of BM25 in several ways, but I worry that raising an exception when we receive garbage input might break existing systems.
OK. So my understanding is this "works as expected by the BM25 formula", but in this case, "formula expectation != user expectation".
I agree with adding the warning wherever the input seems malformed: negative average IDF, and possibly other cases? @JiaqiLiu can you open a PR?
I opened PR #2687 for this issue.
Thanks a lot, @Witiko.