Hi, may I please know how I can get cosine similarities, not cosine distances, while searching for similar documents? I've used IndexFlatIP as the index, as it gives the inner product.
distances, indices = index.search(query_vectors, k)
When I try to do a search I'm getting the values below:
results = index.search(query_vector, 10)
print(results)#prints distances and similar ids
(array([[267.5353 , 234.20415, 227.57852, 226.83115, 225.78455, 220.038 ,
218.0101 , 217.20752, 217.03021, 215.2745 , 215.01762, 214.11276,
213.06128, 212.98251, 212.56494, 210.98376, 210.3661 , 209.87708,
209.74539, 209.55539]], dtype=float32),
array([[ 3205711, 5535941, 5639730, 5572735, 5803736, 5819228,
5692490, 2974726, 11847732, 3104495, 2989770, 5845608,
3132981, 127403668, 127401208, 5728888, 5799607, 5799609,
5669756, 5579338]]))
Can someone please help me understand the distances I received in the output above (distances, ids)? How do I get cosine similarity in the range of 0 to 1?
You need to normalize your query vectors and the search space vectors. Something like this should do.
import numpy as np
import faiss

num_vectors = 1000000
vector_dim = 1024
# faiss expects float32 input; np.random.rand returns float64
vectors = np.random.rand(num_vectors, vector_dim).astype('float32')
#sample index code
quantizer = faiss.IndexFlatIP(vector_dim)
index = faiss.IndexIVFFlat(quantizer, vector_dim, int(np.sqrt(num_vectors)), faiss.METRIC_INNER_PRODUCT)
train_vectors = vectors[:int(num_vectors/2)].copy()
faiss.normalize_L2(train_vectors)
index.train(train_vectors)
faiss.normalize_L2(vectors)
index.add(vectors)
#index creation done
#let's search
query_vector = np.random.rand(10, vector_dim).astype('float32')
faiss.normalize_L2(query_vector)
D, I = index.search(query_vector, 100)
print(D)
Please note: faiss.normalize_L2() normalizes the input array in place. No copy is created, and the function returns None. If you still need the original vectors, make a copy yourself before calling faiss.normalize_L2().
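If you would rather not mutate your input at all, one option is a small helper that returns a normalized copy. The sketch below uses plain NumPy; `l2_normalized_copy` is a made-up name for illustration, not part of faiss, and keeping all-zero rows as zeros is my own choice here.

```python
import numpy as np

def l2_normalized_copy(x):
    """Return a float32 copy of x with unit-length rows; x itself is untouched."""
    out = np.array(x, dtype=np.float32, copy=True)
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    out /= norms
    return out

rng = np.random.default_rng(0)
original = rng.random((4, 8), dtype=np.float32)
backup = original.copy()

unit = l2_normalized_copy(original)
print(np.linalg.norm(unit, axis=1))  # row norms of the normalized copy
```

The result can be passed to index.add() or index.search() while `original` keeps its raw values.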
Hope this helps.
Hi EvilPort2, thanks for the quick response. May I please know why we call index.train on the first half of the corpus and then add the complete corpus? Is there any way of normalizing all the vectors at once without doing a train?
Thanks in advance.
I am not exactly sure about every detail of what IndexIVFFlat does underneath. But as far as I know, it uses an inverted file (IVF) structure for approximate search, not a KD tree (@mdouze feel free to correct me). With IVF you first create k clusters from the points in the corpus, i.e. the vector search space, using k-means. The training is done for this clustering to happen. Now to search a vector, you find which of the k clusters is nearest to the query vector by comparing the query with the cluster centroids. Only the nearest clusters (nprobe of them, 1 by default) are then searched for the top nearest points, which reduces the search space. I have chosen k = square_root(number of vectors in the corpus).
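To make the cluster-then-search idea concrete, here is a toy version in plain NumPy. It is only a sketch of the idea and not what faiss actually runs: the names (`sq_dists`, `inverted_lists`, `ivf_search`), the fixed 10 k-means rounds and the use of L2 distance for assigning vectors to clusters are all my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, nlist = 16, 2000, 20

# Unit-length corpus, so the inner product equals the cosine similarity.
xb = rng.standard_normal((n, d)).astype(np.float32)
xb /= np.linalg.norm(xb, axis=1, keepdims=True)

def sq_dists(x, c):
    # squared L2 distance from each row of x to each row of c
    return (x ** 2).sum(1, keepdims=True) - 2.0 * x @ c.T + (c ** 2).sum(1)

# "train": a few rounds of k-means to get nlist centroids
centroids = xb[rng.choice(n, nlist, replace=False)].copy()
for _ in range(10):
    assign = sq_dists(xb, centroids).argmin(axis=1)
    for c in range(nlist):
        members = xb[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# "add": store every vector id in the list of its nearest centroid
assign = sq_dists(xb, centroids).argmin(axis=1)
inverted_lists = [np.flatnonzero(assign == c) for c in range(nlist)]

def ivf_search(q, k, nprobe=1):
    # pick the nprobe nearest clusters, then scan only their members
    probe = sq_dists(q[None, :], centroids)[0].argsort()[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    sims = xb[candidates] @ q
    order = np.argsort(-sims)[:k]
    return sims[order], candidates[order]

q = xb[0]
D_approx, I_approx = ivf_search(q, k=5, nprobe=1)      # scans one cluster
D_exact, I_exact = ivf_search(q, k=5, nprobe=nlist)    # scans everything
```

With nprobe equal to nlist every list is scanned, so the result matches a brute-force search. Smaller nprobe values scan fewer vectors and may miss some true neighbours.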
When your vector search space is huge and you don't have enough RAM you can take a part of the corpus and train. Ideally you should train with all the vectors and not half of them like I have shown. Hence the ideal code should be something like this.
faiss.normalize_L2(vectors)
index.train(vectors)
index.add(vectors)
Also, just a small note. Since you want cosine similarity, it will range from -1 to +1.
My bad, I forgot about negative similarity. Thanks for pointing that out.
One last query: does faiss work well for creating indexes on a corpus of 6M embeddings?
Thanks for the quick response and the fix @EvilPort2, got it fixed.
no activity, closing.
Hi,
I have tried the code, but I get this error:
Traceback (most recent call last):
File "faiss_method_.py", line 266, in <module>
faiss.normalize_L2(train_vectors)
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/faiss/__init__.py", line 674, in normalize_L2
fvec_renorm_L2(x.shape[1], x.shape[0], swig_ptr(x))
File "/home/xulm1/anaconda3/lib/python3.7/site-packages/faiss/swigfaiss.py", line 886, in fvec_renorm_L2
return _swigfaiss.fvec_renorm_L2(d, nx, x)
TypeError: in method 'fvec_renorm_L2', argument 3 of type 'float *'
So could you please help me?
Thanks.
train_vectors should be of dtype float32
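For anyone hitting the same TypeError: NumPy creates float64 arrays by default, and the traceback above shows the wrapper expecting a `float *`. A minimal sketch of the conversion (the array names are only for illustration):

```python
import numpy as np

vectors = np.random.rand(1000, 128)
print(vectors.dtype)  # float64

# Convert to a C-contiguous float32 array before handing it to faiss.
vectors32 = np.ascontiguousarray(vectors, dtype=np.float32)
print(vectors32.dtype)  # float32
```

After the conversion, calls like faiss.normalize_L2(vectors32) should accept the array.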
Regarding the question about a corpus of 6M embeddings: Faiss is awesome for searching in a huge number of vectors. The search time will vary with your vector size and the type of index you use. For 6M vectors you can go for either an IVFFlat or an HNSW index type. Or you can use a combination of both called IVF65536_HNSW32 (I don't know exactly how it works).