Faiss: question about Ondisk IVF

Created on 15 Nov 2020 · 5 Comments · Source: facebookresearch/faiss

As shown in https://github.com/facebookresearch/faiss/blob/master/demos/demo_ondisk_ivf.py, the data is divided into 4 parts, which are added to the index sequentially.

1. Why is the data divided into parts? Is that necessary? If so, how many parts are appropriate for a given data size?
2. What would happen if we added the data to the index directly and then used the merge_ondisk function, which, as I understand it, makes the index searchable on disk?
3. My data has shape around (80M, 32) and I chose 128 blocks. The flat IVFIP index gives 1-recall@1024=0.55 on my test data. I then tried merging the on-disk sub-block indexes in several ways:
   a. Merging only the first 2 files into a populated index gives 1-recall@1024=0.22.
   b. Merging the first 10 files into a populated index gives 1-recall@1024=0.20.
   It is confusing that the on-disk index performs worse, which concerns me. Is this expected?

question

tangzhy
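
For the question of how many parts to use, a rough back-of-the-envelope estimate may help. The sketch below assumes that the limiting factor is RAM (the first reply below says as much), that vectors are stored as float32, and that only the raw vector bytes count; it ignores ids and index overhead. The function name `min_parts` and the 4 GB budget are hypothetical, chosen only for illustration.

```python
import math


def min_parts(n_vectors, dim, ram_budget_bytes, bytes_per_component=4):
    """Smallest number of equal parts whose raw vectors each fit in the RAM budget."""
    total_bytes = n_vectors * dim * bytes_per_component
    return max(1, math.ceil(total_bytes / ram_budget_bytes))


# Shape from the question: about 80M vectors of dimension 32, assumed float32.
total_gb = 80_000_000 * 32 * 4 / 1e9  # raw vector size in GB

# Hypothetical budget of 4 GB per part (an assumption, not a value from the thread).
parts = min_parts(80_000_000, 32, ram_budget_bytes=4 * 10**9)
print(total_gb, parts)
```

In practice the budget would need headroom for the ids and for the index structures themselves, so this is a lower bound rather than a recommendation.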

All 5 comments

1. The number of parts is constrained by the available RAM.
2. It would be very slow (and we will drop support for merge on disk at some point).
3. This is not normal. The on-disk search should give the exact same results as in-RAM.

mdouze on 17 Nov 2020
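
The claim that on-disk and in-RAM search should return identical results can be checked directly. The sketch below is pure Python and makes two assumptions: that "1-recall@k" means the fraction of queries whose true nearest neighbor appears among the first k returned ids, and that each search result is available as one list of ids per query. The helper names and the toy ids are hypothetical, not data from this thread.

```python
def one_recall_at_k(true_nn, results, k):
    """Fraction of queries whose true nearest neighbor is among the first k result ids."""
    hits = sum(1 for gt, row in zip(true_nn, results) if gt in row[:k])
    return hits / len(true_nn)


def same_results(results_a, results_b):
    """True if two searches returned identical id lists for every query."""
    return len(results_a) == len(results_b) and all(
        list(a) == list(b) for a, b in zip(results_a, results_b)
    )


# Toy, made-up ids for 3 queries (illustration only).
true_nn = [7, 2, 9]
in_ram = [[7, 1, 3], [4, 2, 8], [5, 6, 0]]
on_disk = [[7, 1, 3], [4, 2, 8], [5, 6, 0]]

print(same_results(in_ram, on_disk))
print(one_recall_at_k(true_nn, in_ram, k=3))
```

Such a comparison is only like-for-like when both indexes contain the same vectors and are searched with the same parameters; whether that held in the experiment above is not stated in the thread.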

Hi @mdouze, when you say, "we will drop support for merge on disk at some point", what do you suggest as a better alternative to the on-disk search if the index cannot fit in RAM?

jeremyephron on 18 Nov 2020

Sorry my mistake, I meant "we will drop support for add on disk at some point".
Merge will become the only way to build an on-disk index.

mdouze on 18 Nov 2020

👍2
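
The merge-based build mentioned above can be sketched roughly as follows, loosely following the structure of demo_ondisk_ivf.py: train once, add each part to a fresh copy of the trained index, write each part to its own file, then merge the files. This is a minimal sketch under stated assumptions, not the maintainers' implementation. The helper names `block_ranges` and `build_ondisk_index`, the file names, and the default factory string are illustrative; `merge_ondisk` is taken from `faiss.contrib.ondisk`.

```python
def block_ranges(n_vectors, n_parts):
    """Split [0, n_vectors) into n_parts contiguous, near-equal (start, end) ranges."""
    return [
        (i * n_vectors // n_parts, (i + 1) * n_vectors // n_parts)
        for i in range(n_parts)
    ]


def build_ondisk_index(xt, xb, workdir, n_parts, factory="IVF4096,Flat"):
    """Hypothetical helper: build an on-disk IVF index from float32 arrays xt (train) and xb (base)."""
    # Imported lazily so block_ranges can be used without faiss installed.
    import numpy as np
    import faiss
    from faiss.contrib.ondisk import merge_ondisk

    # Train once and save the empty trained index.
    index = faiss.index_factory(xb.shape[1], factory)
    index.train(xt)
    trained_path = f"{workdir}/trained.index"
    faiss.write_index(index, trained_path)

    # Add each part to a fresh copy of the trained index and write it out.
    block_paths = []
    for bno, (i0, i1) in enumerate(block_ranges(xb.shape[0], n_parts)):
        index = faiss.read_index(trained_path)
        index.add_with_ids(xb[i0:i1], np.arange(i0, i1, dtype="int64"))
        path = f"{workdir}/block_{bno}.index"
        faiss.write_index(index, path)
        block_paths.append(path)

    # Merge all block files into one on-disk inverted-list file.
    index = faiss.read_index(trained_path)
    merge_ondisk(index, block_paths, f"{workdir}/merged_index.ivfdata")
    faiss.write_index(index, f"{workdir}/populated.index")
    return index


print(block_ranges(10, 4))
```

Only one part needs to be resident in RAM at a time during the add stage, which is why the part count can be chosen from the memory budget. Every block file has to be passed to the merge for the resulting index to contain all vectors.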

Hi @mdouze,
I created the initial index (large dataset) and then saved it to disk (as it uses up almost all RAM and also to handle restarts). Afterwards, I load this index using read_index() to support search. This works fine.
In parallel new data has to be added (in batches, periodically every few minutes) without blocking search. What would be the recommended way to do this? The number of items in the new data varies significantly over batches.