QUESTION: I want to use FAISS on a dataset in my application domain. Here are the properties of the specific dataset:
low dimensional d<20, data distribution is uniform, vectors are dense, number of vectors in dataset are < 50000, input space is euclidean, (just consider as 20 dimensional 50,000 vectors created with random uniform distribution.
But the problem is: my dataset grows incrementally. In other words, initially, there is only one randomly generated vector in dataset. The query point will be generated randomly, its ONE nearest neighbor will be determined from dataset. After this, the query point will be added into dataset. This process is repeated. At any time, there will be a randomly generated query point, its nearest neighbor will be determined and it will be inserted to dataset.
For this, which indexing scheme to use? and what will be the proper way?
I have done this using IndexFlatL2, which is obviously brute-force and slow. Now I want to do the faster search, but it requires index.train(xb) step. But my xb has only one vector and it has to be built incrementally, as mentioned above. Based on the properties of my dataset that I know in advance, how can I use them to do faster search?
Side note: As my dataset is not too high dimensional and vectors are only a few thousand, will the Product Quantization be useful still, or we should skip product quantization for further speedup because I dont think the memory requirements are too high to use PQ?
There will be some regimes in which PQ provides a speedup for search (where the overhead of PQ lookup is overridden by the win in reducing memory bandwidth requirements), but typically the technique is used for compression rather than speedup and in many applications it will only be slower. Adding vectors it is certainly slower.
In the regime in which you are in, IndexIVFFlat makes the most sense, but this is an approximate index. The train() data needs to be on a representative sample of your data distribution. In the very tiny database regime (say, < 1000 vectors or so), a flat index will likely remain superior. As you are periodically adding new vectors, you may have to decide for yourself whether or not you wish to retrain an index / build a new one from scratch.
An exact non-exhaustive search space partitioning scheme like a k-D tree will likely work best in your low-ish dimensional (d < 20) use case. This kind of index can be built incrementally. There is no implementation of this in Faiss, however (it is typically designed for high dimensions, usually 32 - 2048 or so), so you'd have to find another library for it.
thank you. There are some points that I want to highlight:
ANN of exact NN, any of them is acceptable. I just need speed boost.
yes, kdtree supports incremental insertion but they perform poorly when d> 5.
i have used hnsw against kdtree, and hnsew performed better for d> 5
There is another observation that I cant explain: calculating nearest neighbor by L2FLAT was very fast as compared to my custom code for traditional exhaustive distance calculation to determine NN. Which is amazing because L2FLAT was also supposed to calculate exhaustive distances. Why it is faster?
... There is another observation that I cant explain: calculating nearest neighbor by L2FLAT was very fast as compared to my custom code for traditional exhaustive distance calculation to determine NN. Which is amazing because L2FLAT was also supposed to calculate exhaustive distances. Why it is faster?
FAISS is highly optimized, e.g. using SIMD and BLAS.
Here is the detailed expanation.
Just one observation: if the data is indeed uniform in a hyper-cube in low dimensions, then hashing will probably provide better search results than an IVF. This is because the IVF clustering cannot do anything meaningful on uniform data.
Unfortunately, no hashing algorithm is implemented in Faiss (except for binary indexes).