Faiss: Recall of IndexIVFPQ when nprobe is the same as nlist

Created on 16 Nov 2017  路  3Comments  路  Source: facebookresearch/faiss

Hi there,

I have been running IndexIVFPQ, and I have observed that when nprobe (the number of inverted lists to probe) is equal to nlist (the number of inverted lists created by coarse quantizer), the recall is sometimes not 1.

My understanding is that when nprobe equals to nlist, FAISS would probe every single inverted list and do an exhaustive search in each list. Since this search process would compare the query with every data point in the index, which essentially becomes an exhaustive search, I expect the recall should be 1.
How come recall is not 1 when nprobe equals to nList ?

I've observed this effect on SIFT10K, SIFT1M datasets, as well as my own dataset. One quick way to reproduce the result is to use SIFT10K data with the parameter dim = 128, nList = 100, m = 16, nbits = 8, nprobe = 100. In my experiment, R@1 is 0.72 and R@10 is 0.97.

question

Most helpful comment

With IndexIVFPQ, this is totally expected: this index makes has two sources approximations:

  • one depends on nprobe, as you mention it controls to how many items the query is compared
  • the other stems from the product quantization: IVFPQ is a compressed-domain search, in which indexed descriptors are approximated, not stored. This allows us to scale to billion-sized datasets on one machine, but this means that we rely on approximate distances.

If you want exact search, use IndexFlat or IndexGPUFlat. If you are targeting a high-accuracy regime (but not 100%), you can use IndexIVFFlat or IMI (or GPU counterparts).

If you want to keep compressed-domain index,

All 3 comments

With IndexIVFPQ, this is totally expected: this index makes has two sources approximations:

  • one depends on nprobe, as you mention it controls to how many items the query is compared
  • the other stems from the product quantization: IVFPQ is a compressed-domain search, in which indexed descriptors are approximated, not stored. This allows us to scale to billion-sized datasets on one machine, but this means that we rely on approximate distances.

If you want exact search, use IndexFlat or IndexGPUFlat. If you are targeting a high-accuracy regime (but not 100%), you can use IndexIVFFlat or IMI (or GPU counterparts).

If you want to keep compressed-domain index,

Hi Herv茅,

Thanks for your reply! That makes sense. The distance evaluation in PQ happens between a query and product quantization descriptor of a data point, not in between a query and the exact data point. It can happen that though point x has a closer distance to a query q than point y, point x's pq code is farther away to query q than point y's pq code. Intuitively, it seems there is no guarantee of reaching 100% recall when distance is calculated using approximated distances.

Yes, exactly.

Was this page helpful?
0 / 5 - 0 ratings