Faiss: Using np.memmap() for large separate files

Created on 12 Jun 2018 · 2 Comments · Source: facebookresearch/faiss

Summary

I am working on a very large dataset (~100 million vectors of 2048 dimensions). I want to create a memmap for this data (as done in bench_polysemous_1bn.py). However, the dataset is distributed in about 100 binary files. What is the best way to do this? Do I need to combine all the binary files into one?

Platform

OS: Linux 16.04

Faiss version:

Faiss compilation options:

Running on :

  • [ ] CPU

Reproduction instructions

out-of-scope question

All 2 comments

Hi,
Either solution is possible; mmapping 100 files is not a problem.
Note that 2048 dimensions is quite large for a vector; you may want to reduce the dimensionality, e.g. by PCA.
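For illustration, mapping each file separately could look like the sketch below. It assumes every file is a headerless, row-major dump of float32 vectors, which the question does not confirm. The helper names `open_shards` and `iter_batches`, the file names, and the tiny sizes are all made up for the demo.

```python
import os
import tempfile

import numpy as np


def open_shards(paths, dim, dtype=np.float32):
    """Memory-map each raw binary file as a read-only (n_i, dim) array."""
    vector_bytes = dim * np.dtype(dtype).itemsize
    shards = []
    for path in paths:
        n_bytes = os.path.getsize(path)
        if n_bytes % vector_bytes != 0:
            raise ValueError(f"{path}: size is not a whole number of vectors")
        n = n_bytes // vector_bytes
        shards.append(np.memmap(path, dtype=dtype, mode="r", shape=(n, dim)))
    return shards


def iter_batches(shards, batch_size):
    """Yield (global_start_id, batch); only the current batch is copied to RAM."""
    start_id = 0
    for shard in shards:
        for i in range(0, shard.shape[0], batch_size):
            yield start_id + i, np.ascontiguousarray(shard[i:i + batch_size])
        start_id += shard.shape[0]


# Tiny demo: three hypothetical shard files with 5, 3 and 4 vectors of dim 8.
dim = 8
tmp_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
parts = [rng.random((n, dim), dtype=np.float32) for n in (5, 3, 4)]
paths = []
for k, part in enumerate(parts):
    path = os.path.join(tmp_dir, f"part_{k:03d}.bin")
    part.tofile(path)  # raw bytes, no header
    paths.append(path)

shards = open_shards(paths, dim)
total = sum(s.shape[0] for s in shards)
seen = [(start, batch.shape[0]) for start, batch in iter_batches(shards, batch_size=4)]
```

With a Faiss index, each batch could then be passed to `index.add(batch)`, so the 100 files never need to be merged or loaded at once.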

Hi @mdouze. I have 500M vectors with 1000 dimensions. I tried reducing them to 512 dimensions with PCA first, but the search accuracy is low: about 50% compared to exhaustive search. I think the PCA transform lost some of the original information. Is there any other kind of pre-processing that reduces the dimension with less information loss?
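One way to quantify the loss described above is to compare nearest neighbours found after PCA with those from exact search on the full vectors. The sketch below is a minimal version of that check in plain NumPy rather than Faiss, on small synthetic data. The helper names `fit_pca` and `nearest` and all sizes are assumptions for illustration, and the numbers it produces say nothing about the poster's dataset.

```python
import numpy as np


def fit_pca(x, d_out):
    """Return (mean, projection matrix, fraction of variance kept)."""
    mean = x.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, s, vt = np.linalg.svd(x - mean, full_matrices=False)
    explained = float((s[:d_out] ** 2).sum() / (s ** 2).sum())
    return mean, vt[:d_out].T, explained


def nearest(xq, xb):
    """Index of the exact L2 nearest database vector for each query."""
    d2 = (xq ** 2).sum(1)[:, None] - 2.0 * xq @ xb.T + (xb ** 2).sum(1)[None, :]
    return d2.argmin(axis=1)


# Synthetic stand-in data: 2000 database vectors and 100 queries of dim 64.
rng = np.random.default_rng(0)
xb = rng.standard_normal((2000, 64))
xq = rng.standard_normal((100, 64))

mean, proj, explained = fit_pca(xb, d_out=32)
exact = nearest(xq, xb)                                   # full-dimension search
reduced = nearest((xq - mean) @ proj, (xb - mean) @ proj)  # search after PCA
recall_at_1 = float((exact == reduced).mean())
```

Here `explained` is the fraction of variance the kept components retain, and `recall_at_1` is the fraction of queries whose nearest neighbour is unchanged by the reduction. Isotropic random data like this is close to a worst case for PCA; real descriptors usually have a faster-decaying spectrum.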
