I have 3 billion vectors and need to split the dataset to build the index (IVFPQ).
First I need to train a global index, but the dataset is too large for a single machine to handle.
So I need to train on a distributed system, but I don't know how to implement that.
I know Facebook Research has developed a distributed k-means. Can you share the code?
What are the Faiss plans for supporting much larger datasets?
You should use a sample of the vectors for training, not all of them. See
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
for the size of the sample.
I now sample 1% of the dataset for training, so the training set has 30 million vectors.
But training is very slow. I use an IVFPQ index on a 40-core CPU, and the number of inverted-list centroids is 4 * sqrt(N).
Training has been running for 48 hours and is still in the coarse quantizer training stage.
I have 3 problems:
Sample the vectors, i.e. take 50 * k of the 3B vectors, where k is the number of centroids. This can be done online: do one pass over the data and keep each vector with probability 50 * k / 3e9.
See the guidelines -> IVF1048576_HNSW32,PQx
You can use GPUs to do the training, see https://gist.github.com/mdouze/46d6bbbaabca0b9778fca37ed2bcccf6
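For reference, the one-pass sampling suggested above could look roughly like the sketch below. It is not taken from Faiss: `keep_probability` and `sample_stream` are hypothetical helper names, and the demo stream of integer ids stands in for the real vectors so the example runs quickly. The full-scale probability assumes k = 1048576 (from IVF1048576) and 3e9 vectors, as in this thread.

```python
import random

def keep_probability(n_centroids, n_total, per_centroid=50):
    """Per-vector keep probability so that roughly per_centroid * n_centroids vectors survive."""
    return min(1.0, per_centroid * n_centroids / n_total)

def sample_stream(vectors, p, seed=0):
    """Single pass over an iterable; keep each item independently with probability p."""
    rng = random.Random(seed)
    return [v for v in vectors if rng.random() < p]

# Numbers from the thread: 3e9 vectors, k = 1048576 centroids.
p_full = keep_probability(1048576, 3_000_000_000)

# Small stand-in stream (assumption for the demo): 200,000 ids, k = 100.
n_demo, k_demo = 200_000, 100
p_demo = keep_probability(k_demo, n_demo)
sample = sample_stream(range(n_demo), p_demo)

print(f"full-scale keep probability: {p_full:.4f}")
print(f"demo: kept {len(sample)} of {n_demo}, target about {50 * k_demo}")
```

Because each vector is kept independently, the sample size is only approximately 50 * k, which should be fine for k-means training. In a real pipeline the kept vectors would be appended to an array or file as they are read, rather than collected in a Python list.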
No activity, closing.
Thank you.