Faiss: negative distance returned in IndexFlatL2 search query

Created on 28 Dec 2017  路  12Comments  路  Source: facebookresearch/faiss

Dear all, does anyone know why the following code could return negative entries for D? I am calculating the L2-nearest neighbors of CIFAR images, for which I assume IndexFlatL2 should return non-negative distances (and 0 for exact match).

index = faiss.IndexFlatL2(d)
index.add(Data)
D, I = index.search(Data, 10)

Some notes:
1) The most problematic case seems to be when some feature dimensions have significantly larger variance than the others, in which case the negative entries can be quite large (say around -10^3).
2) I get the same problem using either CPU or GPU.
3) I get correct results running the tutorial code 1-Flat.py.

Thanks!

Most helpful comment

faiss.cvar.distance_compute_blas_threshold = 40

All 12 comments

Hi,
Distances for large batches (Data.shape[0] > 20) are computed with d(x, y) = ||x||^2 + ||y||^2 - 2 *
Therefore if the magnitude of x and y is very different, roundoff errors may output negative values.
If you reduce the batch size, it switches back to the explicit formulation that necessarily outputs positive distances.
More generally, it is not advised to use vectors with large differences in magnitude: they cause numerical instability and distances are often not significant.

No activity. Closing.

Hi, I'm quite new to nearest neighbor techniques and just tried faiss on my data (80 dimensional log magnitude mel spectrogram frames). I was surprised to see negative distances. @mdouze, what exactly you are recommending when you say "it is not advised to use vectors with large differences in magnitude"? A lot of datasets will have big differences in magnitude between different vectors, and you can't necessarily change the dataset (although in my case maybe some transformation, like undoing the log, would improve things, but I don't know any methodology for finding appropriate transformations). Reducing the batch size to below 20 is always a possibility, but I guess it will hurt performance, and it sounds like you are saying it won't actually help accuracy?

The problem is that if you have a query vector x and two database vectors y_1 and y_2, where ||x|| >> ||y_1|| and ||x|| >> ||y_2|| then there will be accuracy losses because computations are performed with 32-bit float precision.

For example, in 1D, float-32 => 24 bits mantissa => epsilon = 1/16M, so if there is a factor 16M between the magnitudes of x and y_i then || x - y_1|| = || x- y_2|| = ||x||, so y_1 and y_2 will be indistinguishable. Of course this is an extreme case, but any relative difference M in magnitude does incur a loss of precision that of log2(M) bits.

In the current version of faiss, you can switch between the two implementations by adjusting

distance_compute_blas_threshold

that is set to 20 by default.

Is it possible to set distance_compute_bias_threshold from python? If so, how exactly?

faiss.cvar.distance_compute_blas_threshold = 40

I met the same problem when I tried to generate simple data for testing. I used [1, 1], [2, 2], .. [N, N] where N = 10^5 as the database, and compared the result between Flat and IVF4096, Flat. I wanted to use Flat as ground truth but it turned out IVF4096, Flat was the accurate one. Now that I read the post, it may be caused by the magnitude issue. I think it may be better if you mention this in the wiki somewhere so beginners like us are less confused.

setting the distance_compute_blas_threshold (to value bigger than both my DB size and my query size, using faiss.cvar.distance_compute_blas_threshold = 400000 just after import faiss) doesn't seem to be doing anything with the GpuIndexFlatL2, does this work only for the CPU indexes and GPU always uses the x^2 + y^2 - 2xy trick, or am I doing something wrong?

It is only for CPU Faiss. GPU always uses cblas I believe. @wickedfoo ?

GPU always does the -2xy via cublas.

However, L2 distance computation should be prevented from going negative as of the last release I believe.

Yes indeed, for both CPU and GPU.

Perhaps this was fixed for index.search(Data, 10), but I get negative distances for index.range_search. Here's an example:

lims, D, I = index.range_search(data, 555)
>>> min(D)
-3145728.0
>>> max(D)
512.0

Is this numerical overflow?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

wwmmqq picture wwmmqq  路  3Comments

maozezhong picture maozezhong  路  3Comments

xxllp picture xxllp  路  3Comments

cherryPotter picture cherryPotter  路  3Comments

0DF0Arc picture 0DF0Arc  路  3Comments