Faiss: negative distance returned in IndexFlatL2 search query

Created on 28 Dec 2017 · 12Comments · Source: facebookresearch/faiss

Dear all, does anyone know why the following code could return negative entries for D? I am calculating the L2-nearest neighbors of CIFAR images, for which I assume IndexFlatL2 should return non-negative distances (and 0 for exact match).

index = faiss.IndexFlatL2(d)
index.add(Data)
D, I = index.search(Data, 10)

Some notes:
1) The most problematic case seems to be when some feature dimensions have significantly larger variance than the others, in which case the negative entries can be quite large (say around -10^3).
2) I get the same problem using either CPU or GPU.
3) I get correct results running the tutorial code 1-Flat.py.

Thanks!

Source

hongyi-zhang

👍1

Most helpful comment

faiss.cvar.distance_compute_blas_threshold = 40

mdouze on 4 Jun 2018

🎉2 👍2

All 12 comments

Hi,
Distances for large batches (Data.shape[0] > 20) are computed with d(x, y) = ||x||^2 + ||y||^2 - 2 *
Therefore if the magnitude of x and y is very different, roundoff errors may output negative values.
If you reduce the batch size, it switches back to the explicit formulation that necessarily outputs positive distances.
More generally, it is not advised to use vectors with large differences in magnitude: they cause numerical instability and distances are often not significant.

mdouze on 3 Jan 2018

No activity. Closing.

mdouze on 16 Jan 2018

Hi, I'm quite new to nearest neighbor techniques and just tried faiss on my data (80 dimensional log magnitude mel spectrogram frames). I was surprised to see negative distances. @mdouze, what exactly you are recommending when you say "it is not advised to use vectors with large differences in magnitude"? A lot of datasets will have big differences in magnitude between different vectors, and you can't necessarily change the dataset (although in my case maybe some transformation, like undoing the log, would improve things, but I don't know any methodology for finding appropriate transformations). Reducing the batch size to below 20 is always a possibility, but I guess it will hurt performance, and it sounds like you are saying it won't actually help accuracy?

greaber on 31 May 2018

The problem is that if you have a query vector x and two database vectors y_1 and y_2, where ||x|| >> ||y_1|| and ||x|| >> ||y_2|| then there will be accuracy losses because computations are performed with 32-bit float precision.

For example, in 1D, float-32 => 24 bits mantissa => epsilon = 1/16M, so if there is a factor 16M between the magnitudes of x and y_i then || x - y_1|| = || x- y_2|| = ||x||, so y_1 and y_2 will be indistinguishable. Of course this is an extreme case, but any relative difference M in magnitude does incur a loss of precision that of log2(M) bits.

In the current version of faiss, you can switch between the two implementations by adjusting

distance_compute_blas_threshold

that is set to 20 by default.

mdouze on 1 Jun 2018

Is it possible to set distance_compute_bias_threshold from python? If so, how exactly?

greaber on 4 Jun 2018

faiss.cvar.distance_compute_blas_threshold = 40

mdouze on 4 Jun 2018

🎉2 👍2

I met the same problem when I tried to generate simple data for testing. I used [1, 1], [2, 2], .. [N, N] where N = 10^5 as the database, and compared the result between Flat and IVF4096, Flat. I wanted to use Flat as ground truth but it turned out IVF4096, Flat was the accurate one. Now that I read the post, it may be caused by the magnitude issue. I think it may be better if you mention this in the wiki somewhere so beginners like us are less confused.

hsiaoma on 5 Jun 2018

setting the distance_compute_blas_threshold (to value bigger than both my DB size and my query size, using faiss.cvar.distance_compute_blas_threshold = 400000 just after import faiss) doesn't seem to be doing anything with the GpuIndexFlatL2, does this work only for the CPU indexes and GPU always uses the x^2 + y^2 - 2xy trick, or am I doing something wrong?

serycjon on 27 Feb 2019

It is only for CPU Faiss. GPU always uses cblas I believe. @wickedfoo ?

mdouze on 27 Feb 2019

GPU always does the -2xy via cublas.

However, L2 distance computation should be prevented from going negative as of the last release I believe.

wickedfoo on 3 Mar 2019

👍1

Yes indeed, for both CPU and GPU.

mdouze on 4 Mar 2019

Perhaps this was fixed for index.search(Data, 10), but I get negative distances for index.range_search. Here's an example:

lims, D, I = index.range_search(data, 555)
>>> min(D)
-3145728.0
>>> max(D)
512.0

Is this numerical overflow?

joshim5 on 26 Jun 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Is there any whl file of faiss which can be installed by pip ?

wwmmqq · 3Comments

error when run faiss.Kmeans

maozezhong · 3Comments

make py error

xxllp · 3Comments

How can we build index/search based on cosine similarity

cherryPotter · 3Comments

Can Index be shard in CPU?

0DF0Arc · 3Comments