Faiss: How to train in a distributed system (IVFPQ)?

Created on 20 Aug 2019  ·  5 Comments  ·  Source: facebookresearch/faiss

Summary

I have 3 billion vectors, so building the index (IVFPQ) requires splitting the dataset.
First I need to train a global index, but the dataset is too large for a single machine to handle.
So I need to train in a distributed system, but I don't know how to implement it.

I know Facebook Research developed a distributed k-means; can you share the code?
What is Faiss's next plan for even larger datasets?

question

All 5 comments

You should use a sample of the vectors to do the training, not all of them. See
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
for the size of the sample.

Now I sample 1% of the dataset for training, so the training set is 30 million vectors.
But training is very slow. I use an IVFPQ index on a 40-core CPU, and the number of inverted-list centroids is 4 * sqrt(N).
The training has been running for 48 hours and is still waiting on the coarse quantization step.
I have 3 questions:

  1. How can I reduce the number of vectors in an elegant way?
  2. What is the most suitable index type for our case?
  3. I want to complete the training within 2 days, so what resources should I use: CPU or GPU, how many cores, etc.?

  1. Sample the vectors, i.e. take 50 * k of the 3B vectors, where k is the number of centroids. This can be done online: make one pass over the data and keep each vector with probability 50 * k / 3e9.

  2. See the guidelines -> IVF1048576_HNSW32,PQx

  3. You can use GPUs to do the training, see https://gist.github.com/mdouze/46d6bbbaabca0b9778fca37ed2bcccf6
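
For anyone landing here later, the one-pass sampling from point 1 could look roughly like the sketch below. It is only an illustration, not code from Faiss: the names `keep_probability` and `sample_training_vectors` are made up, k is taken to be the number of IVF centroids (1048576 for the suggested IVF1048576 index), and a small generator stands in for the real 3B-vector stream. Each vector is kept independently, so the sample size comes out close to 50 * k rather than exactly 50 * k.

```python
import random
from typing import Iterable, Iterator, Sequence


def keep_probability(n_total: int, n_centroids: int, per_centroid: int = 50) -> float:
    """Probability of keeping one vector so that about per_centroid * k survive."""
    return min(1.0, per_centroid * n_centroids / n_total)


def sample_training_vectors(
    vectors: Iterable[Sequence[float]],
    n_total: int,
    n_centroids: int,
    per_centroid: int = 50,
    seed: int = 0,
) -> Iterator[Sequence[float]]:
    """Single pass over the stream; keep each vector independently with probability p."""
    p = keep_probability(n_total, n_centroids, per_centroid)
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    for vec in vectors:
        if rng.random() < p:
            yield vec


# Numbers from this thread: 3e9 vectors, k = 1048576 centroids.
p_thread = keep_probability(3_000_000_000, 1_048_576)
print(f"keep probability for the 3B case: {p_thread:.4f}")

# Toy stand-in for the real data: 100,000 two-dimensional vectors, 100 centroids.
n_total, n_centroids = 100_000, 100
stream = ([float(i), float(i) + 0.5] for i in range(n_total))
sample = list(sample_training_vectors(stream, n_total, n_centroids))
print(f"kept {len(sample)} of {n_total} vectors (target is about {50 * n_centroids})")
```

Because the pass is streaming, the full dataset never has to fit in memory; only the kept vectors do, and those are what would be handed to the index's training step.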

No activity, closing.

Thank you.
