Faiss: How to train in a distributed system (IVFPQ)?

Created on 20 Aug 2019 · 5 Comments · Source: facebookresearch/faiss

Summary

I have 3 billion vectors, so building the index (IVFPQ) requires splitting the data set.
First I need to train a global index, but the data set is too big for a single machine to handle.
So I need to train in a distributed system, but I don't know how to implement that.

I know Facebook Research has developed a distributed k-means; could you share the code?
What is the Faiss roadmap for even larger data sets?

question

hashyong

All 5 comments

You should use a sample of the vectors for the training, not all of them. See
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
for the size of the sample.

mdouze on 22 Aug 2019

I now sample 1% of the dataset for training, so the training set has 30 million vectors.
But training is very slow. I use an IVFPQ index on a 40-core CPU, and the number of inverted-list centroids is 4 * sqrt(N).
Training has been running for 48 hours and is still in the coarse quantization stage.
I have 3 questions:

1. How can I sample the data down to a smaller size in an elegant way?
2. What is the most suitable index type for our case?
3. I want to complete the training within 2 days, so what resources should I use: CPU or GPU, how many cores, etc.?

hashyong on 22 Aug 2019
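
For reference, the sizes mentioned in the comment above can be worked out with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes N in 4 * sqrt(N) is the full data set size (3 billion), which the comment does not state explicitly, and the helper name `ivf_sizing` is made up for illustration.

```python
import math


def ivf_sizing(n_vectors: int, sample_fraction: float, multiplier: float = 4.0):
    """Return (number of centroids, training-set size, training points per centroid).

    Assumes the centroid count is multiplier * sqrt(n_vectors), with n_vectors
    being the size of the full data set (an assumption, see the text above).
    """
    n_centroids = round(multiplier * math.sqrt(n_vectors))
    n_train = round(n_vectors * sample_fraction)
    return n_centroids, n_train, n_train / n_centroids


# 3 billion vectors, 1% sample, 4 * sqrt(N) centroids
n_centroids, n_train, per_centroid = ivf_sizing(3_000_000_000, 0.01)
print(f"centroids: {n_centroids}")
print(f"training vectors: {n_train}")
print(f"training vectors per centroid: {per_centroid:.1f}")
```

Under these assumptions the coarse quantizer is a k-means over 30 million points with a few hundred thousand centroids, which gives a sense of why this stage dominates the training time on CPU.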

1. Sample the vectors, i.e. take 50 * k of the 3B vectors, where k is the number of centroids. This can be done online: make one pass over the data and keep each vector with probability 50 * k / 3e9.
2. See the guidelines -> IVF1048576_HNSW32,PQx
3. You can use GPUs to do the training, see https://gist.github.com/mdouze/46d6bbbaabca0b9778fca37ed2bcccf6

mdouze on 22 Aug 2019
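
The one-pass sampling suggested in the answer above could look roughly like the following. This is a minimal sketch, not code from Faiss: the function `bernoulli_sample` and its parameters are hypothetical, k is taken to be the 1048576 centroids of the suggested IVF1048576 index, and a `range` stands in for the real stream of vectors.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def bernoulli_sample(vectors: Iterable[T], n_total: int, n_wanted: int,
                     seed: int = 0) -> Iterator[T]:
    """Yield each item independently with probability n_wanted / n_total.

    Makes a single pass over the input and keeps nothing in memory, so it
    works on a stream. The output size is random, close to n_wanted.
    """
    p = min(1.0, n_wanted / n_total)
    rng = random.Random(seed)
    for v in vectors:
        if rng.random() < p:
            yield v


# Numbers from the thread: k centroids (assumed 1048576), 3e9 vectors in total.
k = 1_048_576
n_total = 3_000_000_000
n_wanted = 50 * k
keep_probability = n_wanted / n_total
print(f"target sample size: {n_wanted}, keep probability: {keep_probability:.4f}")

# Small-scale demo on stand-in data: keep about 5,000 of 100,000 items.
demo = list(bernoulli_sample(range(100_000), n_total=100_000, n_wanted=5_000))
print(f"demo sample size: {len(demo)}")
```

Because every vector is kept or dropped independently, the sampling can run separately on each shard of the data with the same probability, and the per-shard samples can then be concatenated into one training set.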

No activity, closing.

mdouze on 2 Sep 2019

Thank you.

hashyong on 8 Sep 2019
