Faiss: Failure to reproduce paper's recall on IVF Product quantization

Created on 27 Nov 2019  ·  6Comments  ·  Source: facebookresearch/faiss

I am trying to reproduce results the paper: “Product quantization for nearest neighbor search”, Jégou & al., PAMI 2011.
The paper reports that IVFADC for sift 1 million docs dataset with 128 dims with k'=1024, m=8, w=8 produces recall of around 0.9 (Fig 7).

Running ann-benchmarks on the sift dataset with nprobe=8

ncentroids = 1024
m = 8 
code_size = 8 
self.coarse_quantizer = faiss.IndexFlatL2(X.shape[1])
index = faiss.IndexIVFPQ(self.coarse_quantizer, X.shape[1], ncentroids, m, code_size)

produces recall of 0.372.
FaissIVFPQ(n_list=1024, n_probe=8) 0.372 4493.587

What would be the way to achieve recall as reported by the paper?

question

Most helpful comment

@mdouze Thank you so much for your clarification, I have missed this.
This definition of recall looks to be a very relaxed definition, and not sure how much it is really useful in practice.

All 6 comments

Here is code that reproduces the result:
repro_IVF_paper.ipynb
It gives 87.3% recall. There is probably an error in your eval code.

Thanks so much for your reply. I will investigate further.

@mdouze I am not clear about the way you calculate recall:
recall_at_100 = (I[:, :100] == gt[:, :1]).sum() / float(xq.shape[0])

For every query, we have a ground truth -- 100 expected documents to find. So, we need to compare these 100 expected docs with 100 documents found during the search, calculate how many documents are in common (intersection), and this will be the recall for this query. We do the same for all queries to get the average recall among 10K queries.

But in your calculation, you only check if 1st expected document is present in 100 documents found during the search. That's not a traditional evaluation of recall. Do you have references when this definition of the recall is used?

So, using your code in repro_IVF_paper.ipynb, and calculating the recall in the traditional way:

total_matches = 0
for q in range(xq.shape[0]):
    matches = [id for id in I[q] if id in gt[q]]
    total_matches += len(matches)
recall_at_100 = (total_matches / float(xq.shape[0]))/ 100
print("recall@100: %.3f" % recall_at_100);

I am getting the result of:
recall@100: 0.436 or 43.6%

Check the definition of recall@100 in the paper:

The search quality is measured with recall@R, i.e., the
proportion of query vectors for which the nearest neighbor
is ranked in the first R positions. This measure indicates
the fraction of queries for which the nearest neighbor is
retrieved correctly, if a short-list of R vectors is verified
using Euclidean distances. Furthermore, the curve obtained by
varying R corresponds to the distribution function of the ranks,
and the point R=1 corresponds to the “precision” measure used
in [9] to evaluate ANN methods.

@mdouze Thank you so much for your clarification, I have missed this.
This definition of recall looks to be a very relaxed definition, and not sure how much it is really useful in practice.

no activity, closing.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

mylyu picture mylyu  ·  3Comments

0DF0Arc picture 0DF0Arc  ·  3Comments

maozezhong picture maozezhong  ·  3Comments

linghuang picture linghuang  ·  3Comments

danny1984 picture danny1984  ·  3Comments