Faiss: Regarding the IndexFlatIP

Created on 28 Feb 2020 · 10 Comments · Source: facebookresearch/faiss

Summary

Hi, may I please know how I can get cosine similarities, not cosine distances, while searching for similar documents? I've used IndexFlatIP as the index, as it gives the inner product.

distances, indices = index.search(query_vectors, k)

Running on:

[x] CPU
[ ] GPU

Interface:

[ ] C++
[x] Python

help wanted

Source

MaheshChandrra

All 10 comments

When I try to do a search, I get the values below:

results = index.search(query_vector, 10)
print(results)  # prints distances and the ids of similar vectors

(array([[267.5353 , 234.20415, 227.57852, 226.83115, 225.78455, 220.038  ,
         218.0101 , 217.20752, 217.03021, 215.2745 , 215.01762, 214.11276,
         213.06128, 212.98251, 212.56494, 210.98376, 210.3661 , 209.87708,
         209.74539, 209.55539]], dtype=float32),
 array([[  3205711,   5535941,   5639730,   5572735,   5803736,   5819228,
           5692490,   2974726,  11847732,   3104495,   2989770,   5845608,
           3132981, 127403668, 127401208,   5728888,   5799607,   5799609,
           5669756,   5579338]]))

Can someone please help me understand the distances I received in the output above (distances, ids)? How do I get cosine similarity in the range of 0 to 1?

MaheshChandrra on 9 Mar 2020

You need to normalize your query vectors and the search space vectors. Something like this should do.

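import faiss
import numpy as np

num_vectors = 1000000
vector_dim = 1024
# Faiss expects float32 input; np.random.rand returns float64
vectors = np.random.rand(num_vectors, vector_dim).astype('float32')

# sample index code
quantizer = faiss.IndexFlatIP(vector_dim)
nlist = int(np.sqrt(num_vectors))  # number of clusters
index = faiss.IndexIVFFlat(quantizer, vector_dim, nlist, faiss.METRIC_INNER_PRODUCT)
train_vectors = vectors[:num_vectors // 2].copy()  # train on the first half
faiss.normalize_L2(train_vectors)  # normalizes in place
index.train(train_vectors)
faiss.normalize_L2(vectors)
index.add(vectors)
# index creation done

# let's search
query_vector = np.random.rand(10, vector_dim).astype('float32')
faiss.normalize_L2(query_vector)
D, I = index.search(query_vector, 100)

print(D)  # inner products of unit vectors, i.e. cosine similarities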

Please note: faiss.normalize_L2() modifies the input array in place. No copy is created, hence it returns None. If you want to keep the original vectors, you need to make a copy yourself before calling faiss.normalize_L2().
Hope this helps.
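
To see why normalization turns the inner product into cosine similarity, here is a minimal NumPy-only sketch. It does not call Faiss; the toy data and the helper name l2_normalized are made up for illustration, and unlike faiss.normalize_L2() the helper returns a copy instead of working in place.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 5 database vectors and 1 query, 8 dimensions each.
database = rng.random((5, 8), dtype=np.float32)
query = rng.random((1, 8), dtype=np.float32)

# Raw inner products depend on vector length, so they are not bounded by 1.
raw_scores = query @ database.T

def l2_normalized(x):
    """Return a unit-length copy of each row; the input is left untouched."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# The inner product of unit vectors is the cosine similarity.
cosine_scores = l2_normalized(query) @ l2_normalized(database).T

# Cross-check against the textbook formula dot(a, b) / (|a| * |b|).
expected = raw_scores / (
    np.linalg.norm(query, axis=1, keepdims=True) * np.linalg.norm(database, axis=1)
)
print(raw_scores)
print(cosine_scores)
```

Large values such as 267.5 in the question are what raw inner products of unnormalized vectors look like; after normalization the same search returns values no larger than 1.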

EvilPort2 on 9 Mar 2020

Hi EvilPort2, thanks for the quick response. May I please know why we call index.train on the first half of the corpus and then add the complete corpus? Is there any way of normalizing all the vectors at once without doing a train?

Thanks in advance.

MaheshChandrra on 9 Mar 2020

I am not exactly sure of every detail of what IndexIVFFlat does underneath (@mdouze feel free to correct me), but as far as I know it is an inverted file (IVF) index based on k-means clustering rather than a tree. Training clusters points from the corpus, i.e. the vector search space, into k clusters and stores their centroids. When you add vectors, each one is assigned to its nearest centroid. To search, the index measures the distance between the query and the cluster centroids and then searches only the nearest clusters for the top nearest points, which reduces the search space. By default only the single nearest cluster is visited; raising index.nprobe visits more clusters at the cost of speed. I have chosen k = square_root(number of vectors in the corpus).
Training is a separate step from normalization: it only learns the centroids. When your vector search space is huge and you don't have enough RAM, you can train on a part of the corpus. Ideally you should train with all the vectors and not half of them like I have shown. Hence the ideal code should be something like this.

faiss.normalize_L2(vectors)
index.train(vectors)
index.add(vectors)
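
The train / add / search steps described above can be sketched in plain NumPy. This is only a toy illustration under stated assumptions, not Faiss code: the sizes are made up, ivf_search and l2_normalized are hypothetical helpers, and the "training" step just samples centroids where Faiss would run k-means. The parameter name nprobe mirrors the Faiss setting.

```python
import numpy as np

rng = np.random.default_rng(0)
num_vectors, dim, nlist = 1000, 16, 8  # hypothetical toy sizes

def l2_normalized(x):
    """Return a unit-length copy of each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

vectors = l2_normalized(rng.standard_normal((num_vectors, dim)).astype(np.float32))

# "Training": Faiss runs k-means here. This sketch simply samples nlist
# vectors as centroids to stay short (an assumption, not what Faiss does).
centroids = vectors[rng.choice(num_vectors, size=nlist, replace=False)]

# "Adding": each vector goes into the list of its best-matching centroid.
assignment = np.argmax(vectors @ centroids.T, axis=1)
inverted_lists = [np.flatnonzero(assignment == c) for c in range(nlist)]

def ivf_search(query, k, nprobe):
    """Search only the nprobe lists whose centroids best match the query."""
    probe = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    scores = vectors[candidates] @ query
    order = np.argsort(-scores)[:k]
    return scores[order], candidates[order]

query = l2_normalized(rng.standard_normal((1, dim)).astype(np.float32))[0]
approx_scores, approx_ids = ivf_search(query, k=5, nprobe=2)
exact_scores, exact_ids = ivf_search(query, k=5, nprobe=nlist)  # probes every list
print(approx_scores)
print(exact_scores)
```

Probing every list gives the same result as a brute-force scan, which is what a flat index such as IndexFlatIP does; a flat index therefore needs no training step at all.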

EvilPort2 on 10 Mar 2020

👍1

Also, just a small note. Since you want cosine similarity, it will range from -1 to +1.
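
If a score between 0 and 1 is still needed, one possible convention is a linear rescale of the cosine values. This is an assumption about what the application wants, not something Faiss does, and the scores below are made-up example values.

```python
import numpy as np

# Hypothetical cosine scores, as an inner-product search on
# normalized vectors would return them.
cosine_scores = np.array([1.0, 0.5, 0.0, -0.5, -1.0], dtype=np.float32)

# Linear rescale from [-1, 1] to [0, 1]; the ranking is unchanged.
unit_scores = (cosine_scores + 1.0) / 2.0
print(unit_scores)
```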

EvilPort2 on 10 Mar 2020

My bad, I forgot about negative similarity. Thanks for pointing that out.
One last question: does Faiss work well for creating indexes on a corpus of 6M embeddings?

Thanks for the quick response and the fix, @EvilPort2. Got it fixed.

MaheshChandrra on 11 Mar 2020

No activity, closing.

mdouze on 1 Apr 2020

Hi, I have tried the code above, but I get this error:

Traceback (most recent call last):
  File "faiss_method_.py", line 266, in <module>
    faiss.normalize_L2(train_vectors)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/faiss/__init__.py", line 674, in normalize_L2
    fvec_renorm_L2(x.shape[1], x.shape[0], swig_ptr(x))
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/faiss/swigfaiss.py", line 886, in fvec_renorm_L2
    return _swigfaiss.fvec_renorm_L2(d, nx, x)
TypeError: in method 'fvec_renorm_L2', argument 3 of type 'float *'

So could you please help me?
Thanks.

ucasiggcas on 31 May 2020

train_vectors should be of dtype float32 (np.random.rand returns float64).
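
A minimal NumPy sketch of that conversion, with a made-up array shape. As far as I know, Faiss's Python wrappers expect float32 data laid out contiguously, so np.ascontiguousarray covers both requirements in one call.

```python
import numpy as np

# np.random.rand returns float64, which is what triggers the TypeError above.
train_vectors = np.random.rand(1000, 64)
print(train_vectors.dtype)  # float64

# Convert to a C-contiguous float32 array before passing it to Faiss.
train_vectors = np.ascontiguousarray(train_vectors, dtype=np.float32)
print(train_vectors.dtype, train_vectors.flags["C_CONTIGUOUS"])  # float32 True
```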

mdouze on 31 May 2020

👍1

Faiss is awesome for searching in a huge number of vectors. I think the search time will vary with your vector size and also with the type of index you use. For 6M vectors I think you can go for either the IVFFlat or the HNSW index type. Or you can use a combination of both called IVF65536_HNSW32 (I don't know how it works).

EvilPort2 on 31 May 2020
