I am working on a very large dataset (~100 million vectors of 2048 dimensions). I want to create a memmap for this data (as done in bench_polysemous_1bn.py). However, the dataset is distributed in about 100 binary files. What is the best way to do this? Do I need to combine all the binary files into one?
OS: Linux 16.04
Hi
Either solution is possible; mmapping 100 files is not a problem.
Note that 2048-dimensional vectors are quite large; you may want to reduce them, e.g. by PCA.
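For illustration, here is a minimal sketch of mapping each file separately and streaming the vectors in batches, so the files never need to be concatenated. It assumes each file is a headerless, row-major dump of float32 values (which may not match your format; the .fvecs/.bvecs files used in bench_polysemous_1bn.py carry a per-vector header, for instance). The helper names `open_shards` and `iter_batches` are hypothetical, not part of Faiss.

```python
import numpy as np


def open_shards(paths, dim, dtype=np.float32):
    """Memory-map each raw binary file as a read-only (n_i, dim) array.

    Assumption: every file is a headerless, row-major dump of `dtype` values.
    """
    shards = []
    for path in paths:
        # With no explicit shape, np.memmap maps the whole file as a 1-D array.
        flat = np.memmap(path, dtype=dtype, mode="r")
        if flat.size % dim != 0:
            raise ValueError(f"{path}: size is not a multiple of dim={dim}")
        # Reshaping a memmap gives a view; nothing is read from disk yet.
        shards.append(flat.reshape(-1, dim))
    return shards


def iter_batches(shards, batch_size):
    """Yield in-memory batches of at most `batch_size` rows, shard by shard."""
    for shard in shards:
        for start in range(0, shard.shape[0], batch_size):
            # np.array copies only this slice from the mapping into RAM.
            yield np.array(shard[start:start + batch_size])


# Hypothetical usage with an already trained Faiss index:
#   for batch in iter_batches(open_shards(paths, dim=2048), 100_000):
#       index.add(batch)
```

Keeping one memmap per file avoids a large copy on disk; the only cost is that a batch never spans two files, so the last batch of each file may be shorter than `batch_size`.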
Hi @mdouze, I have 500M vectors with 1000 dimensions. I tried reducing them to 512 dimensions with PCA first, but the search accuracy is low, about 50% compared to exhaustive search. I think the PCA transform loses information from the original vectors. Are there any other kinds of pre-processing that reduce the dimension with less information loss?