Several information retrieval tasks use a few common evaluation metrics, including mean average precision (MAP) [1] and recall@k, in addition to what is already supported (e.g. ERR, nDCG, MRR). Sometimes the geometric MAP (GMAP) variant is used; if it's an easy option to add (like how nDCG is an option on DCG), we should add it as well. These are standard measures in many TREC and related tasks (e.g. MS MARCO). In particular, reranking tasks use recall@k to tune the base query that is input to a reranker (e.g. tuning BM25 or RM3 parameters).
[1] "GMAP is the geometric mean of per-topic average precision, in contrast with MAP which is the arithmetic mean"
Pinging @elastic/es-search (:Search/Ranking)
Potentially a duplicate of https://github.com/elastic/elasticsearch/issues/29653
@joshdevins
By mean average precision, do you mean the one described in the Stanford IR course (https://web.stanford.edu/class/cs276/handouts/EvaluationNew-handout-1-per.pdf)? The book itself does not appear to cover MAP. 🤷‍♂️
I've added it here https://github.com/sbourke/elasticsearch/commit/652ff119fb15e7569e92e1e78ad7093dfadad038#diff-5fb623709353794e709a58f45104baec
I've added recall as well.
https://github.com/sbourke/elasticsearch/commit/652ff119fb15e7569e92e1e78ad7093dfadad038
If that's what you're generally looking for, please let me know and I'll clean up the code. The tests are based on the precision tests.
Hang tight, I've also got a branch with some other things cleaned up in it. Let's sync up after I have a PR in.
@sbourke I've got this draft PR going, but it doesn't have MAP in it yet: https://github.com/elastic/elasticsearch/pull/52577
@sbourke thanks for your interest in contributing! Perhaps we can first work to integrate @joshdevins's PR to add recall@k (#52577), then you could follow up with a PR to add MAP. Feel free to add any suggestions to the recall@k PR.
It's generally nice to separate out changes into small PRs, so I think it's fine to add each metric separately. It would also be great to get @cbuescher's thoughts on the proposed metrics to make sure we're happy to add them.
@sbourke I think that MAP definition matches my understanding. I'm looking at the TREC definition and I believe it's the same.
@joshdevins Your explicit confusion matrix is much nicer than what I was doing. I'll look at the code changes more closely today. Do you have GMAP as well, or should I do that?
Have a look at the PR (https://github.com/elastic/elasticsearch/pull/52577) again — it's ready to merge so you can use it as a basis for the next change set if you want.
Your explicit confusion matrix is much nicer than what I was doing.
I think from the ML perspective it's typical of how we evaluate and calculate metrics. We decided to remove it for now, though, as it introduced a bit of unnecessary indirection in the code. We might put it back later, after the MAP implementation. See the related discussion in the PR.
After removing the confusion matrix, I normalized the variables and the way metrics are calculated in PrecisionAtK and RecallAtK so that they are based on the logical meaning of the metric components. For example, instead of counting truePositives, we count relevantDocsRetrieved. This makes the codebase uniform and consistent with the MetricDetail we return.
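To illustrate the kind of naming I mean, here's a simplified sketch of a recall@k evaluation step. This is not the actual RecallAtK code; the class and method names are illustrative only.

```java
// Simplified sketch of recall@k, using names that reflect the logical meaning
// of the metric components rather than confusion-matrix terms (truePositives, etc.).
import java.util.List;
import java.util.Set;

class RecallAtKSketch {

    private final int k;

    RecallAtKSketch(int k) {
        this.k = k;
    }

    /**
     * @param retrievedDocIds ranked doc ids returned by the query
     * @param relevantDocIds  all doc ids judged relevant for this query
     * @return recall@k = relevantDocsRetrieved / relevantDocsTotal
     */
    double evaluate(List<String> retrievedDocIds, Set<String> relevantDocIds) {
        int relevantDocsTotal = relevantDocIds.size();
        if (relevantDocsTotal == 0) {
            // no judged relevant docs: recall is undefined, report 0 here
            return 0.0;
        }
        long relevantDocsRetrieved = retrievedDocIds.stream()
            .limit(k)
            .filter(relevantDocIds::contains)
            .count();
        return (double) relevantDocsRetrieved / relevantDocsTotal;
    }
}
```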
Do you have GMAP as well, or should I do that?
I haven't done anything for (G)MAP yet so you are welcome to contribute if you want. Let me know if you are still interested in doing a feature branch for that work. If you haven't already, have a look at CONTRIBUTING.md for some details on how we take contributions through PRs.
We should be able to implement GMAP as an option on the MAP metric, much as the DCG metric provides the normalize option to get nDCG. The change for GMAP would just be how the per-query average-precision values are combined in the combine function, since the default is just the arithmetic mean (which works fine for MAP).
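Roughly what I have in mind for that combine step, as a hedged sketch: the actual metric interfaces in the rank eval module may differ, and the class name and flag here are illustrative only.

```java
// Sketch of combining per-query average precision scores into MAP or GMAP.
// GMAP is just an option that changes this combine step, mirroring how nDCG
// is a "normalize" option on DCG.
import java.util.Collection;

class MeanAveragePrecisionSketch {

    private final boolean geometric; // false -> MAP, true -> GMAP

    MeanAveragePrecisionSketch(boolean geometric) {
        this.geometric = geometric;
    }

    double combine(Collection<Double> perQueryAveragePrecision) {
        if (perQueryAveragePrecision.isEmpty()) {
            return 0.0;
        }
        if (geometric == false) {
            // MAP: arithmetic mean of per-query AP
            return perQueryAveragePrecision.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        }
        // GMAP: geometric mean of per-query AP, computed as exp(mean(log(ap))).
        // A small epsilon guards against log(0) for queries with AP == 0,
        // which is a common convention in TREC-style GMAP.
        final double epsilon = 1e-5;
        double meanLog = perQueryAveragePrecision.stream()
            .mapToDouble(ap -> Math.log(Math.max(ap, epsilon)))
            .average()
            .orElse(0.0);
        return Math.exp(meanLog);
    }
}
```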