Faiss: Using np.memmap() for large separate files

Created on 12 Jun 2018 · 2 Comments · Source: facebookresearch/faiss

Summary

I am working with a very large dataset (~100 million vectors of 2048 dimensions) and want to create a memmap for it (as done in bench_polysemous_1bn.py). However, the dataset is split across about 100 binary files. What is the best way to do this? Do I need to combine all the binary files into one?

Platform

OS: Linux 16.04

Faiss version:

Faiss compilation options:

Running on :

[ ] CPU

Reproduction instructions

out-of-scope question

Source

khetanmayank

Most helpful comment

Hi
Either solution is possible; mmapping 100 files is not a problem.
Note that 2048-dimensional vectors are quite large; you may want to reduce them, e.g. by PCA.

mdouze on 12 Jun 2018

👍2
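
A minimal sketch of the "mmap each file separately" option, using only numpy. It assumes each file is a headerless, row-major dump of float32 vectors (the actual on-disk format in the question is not stated), and the helper names `open_shards` and `iter_chunks` are hypothetical. The demo writes three tiny temporary files in place of the real 100.

```python
import os
import tempfile

import numpy as np


def open_shards(paths, dim, dtype=np.float32):
    """Memory-map each raw binary file as a read-only (n_i, dim) array.

    Assumption: files contain only vector data, no header.
    """
    row_bytes = dim * np.dtype(dtype).itemsize
    shards = []
    for path in paths:
        size = os.path.getsize(path)
        if size % row_bytes != 0:
            raise ValueError(f"{path}: size is not a multiple of one vector")
        shards.append(
            np.memmap(path, dtype=dtype, mode="r", shape=(size // row_bytes, dim))
        )
    return shards


def iter_chunks(shards, chunk_rows):
    """Yield in-memory chunks of at most chunk_rows vectors, shard by shard."""
    for shard in shards:
        for start in range(0, shard.shape[0], chunk_rows):
            # Copy only this slice from the mapped file into RAM.
            yield np.ascontiguousarray(shard[start:start + chunk_rows])


# Demo with small synthetic shards (sizes and dim are illustrative only).
tmpdir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
dim = 8
paths, parts = [], []
for i, n in enumerate([5, 3, 4]):
    data = rng.random((n, dim), dtype=np.float32)
    path = os.path.join(tmpdir, f"shard_{i}.bin")
    data.tofile(path)
    paths.append(path)
    parts.append(data)

shards = open_shards(paths, dim)
total = sum(chunk.shape[0] for chunk in iter_chunks(shards, chunk_rows=2))
print(total)
```

Each chunk produced this way could then be passed to a Faiss index with `index.add(chunk)`, so the files never need to be concatenated on disk.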

All 2 comments

Hi @mdouze. I have 500M vectors with 1000 dimensions. I first tried reducing them to 512 dimensions with PCA, but the search accuracy is low, about 50% compared to exhaustive search. I think the PCA transform lost information from the original vectors. Are there other kinds of pre-processing that reduce the dimension with less information loss?
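
The thread has no reply to this question. As a diagnostic only, one way to check how much information a PCA projection discards is to measure the fraction of total variance kept by the leading components. The sketch below does this in plain numpy on synthetic data. The function name `pca_retained_variance` and all sizes are hypothetical, and it would normally be run on a sample of the vectors rather than all of them.

```python
import numpy as np


def pca_retained_variance(x, d_out):
    """Fraction of total variance kept by the top d_out principal components."""
    x = np.asarray(x, dtype=np.float64)
    centered = x - x.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance along each principal axis (sorted in decreasing order).
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return float(var[:d_out].sum() / var.sum())


# Synthetic example: 64-dimensional vectors that mostly live in a
# 16-dimensional subspace, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal((2000, 16))
mixing = rng.standard_normal((16, 64))
x = latent @ mixing + 0.01 * rng.standard_normal((2000, 64))

kept_16 = pca_retained_variance(x, 16)
kept_4 = pca_retained_variance(x, 4)
print(kept_16, kept_4)
```

If the retained variance at the target dimension is already close to 1, the accuracy drop is unlikely to come from the PCA step alone.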