Faiss: train faiss with large dataset

Created on 16 Jan 2019 · 12 Comments · Source: facebookresearch/faiss

Summary

Platform

OS:

Faiss version:

Faiss compilation options:

Running on:

  • [ ] CPU
  • [!] GPU

Interface:

  • [ ] C++
  • [!] Python

Reproduction instructions

Here is the problem. I want to build an index for a huge dataset of 1B vectors. I would like to train the index on more data, but because of limited RAM I can only read 100M vectors and train on those. The data is stored in a txt file. I saw in bench_gpu_1bn.py that a large dataset can be read with np.memmap, but I cannot figure out how to use this method with a txt file. Is there a good way to read a large txt file, or to convert my own data to the .fvecs format?
Thank you very much.

question

All 12 comments

You probably don't want to train the index on the whole dataset (the example benchmark uses 1M vectors for training IIRC – @mdouze?).
Regarding np.memmap, you can look at this example.

Please refer to
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
for reasonable amounts of training data.
The memory map will not work with text files. So either write some code to get blocks of text data or convert to a binary format like .fvecs that can be memory-mapped.
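
Both options in the comment above can be combined: read the text file in blocks and append each block to a .fvecs file, which can then be memory-mapped. The following is a minimal sketch, assuming one whitespace-separated vector per line and a little-endian machine. The function names `text_to_fvecs` and `mmap_fvecs` and the `block_rows` parameter are illustrative choices, not part of Faiss.

```python
import numpy as np


def text_to_fvecs(txt_path, fvecs_path, block_rows=100_000):
    """Convert a text file (one vector per line) to .fvecs, block by block.

    Each .fvecs record is an int32 dimension followed by that many float32
    values. Returns (number of vectors, dimension).
    """
    n_written, dim, block = 0, None, []

    def flush(dst):
        values = np.asarray(block, dtype=np.float32)
        records = np.empty((len(block), dim + 1), dtype=np.int32)
        records[:, 0] = dim                      # per-record dimension header
        records[:, 1:] = values.view(np.int32)   # reinterpret the float bits
        dst.write(records.tobytes())

    with open(txt_path, "r") as src, open(fvecs_path, "wb") as dst:
        for line in src:
            fields = line.split()
            if not fields:
                continue                         # skip blank lines
            if dim is None:
                dim = len(fields)                # first vector fixes the dimension
            elif len(fields) != dim:
                raise ValueError("inconsistent vector length in text file")
            block.append([float(v) for v in fields])
            if len(block) == block_rows:
                flush(dst)
                n_written += len(block)
                block.clear()
        if block:                                # final partial block
            flush(dst)
            n_written += len(block)
    return n_written, dim


def mmap_fvecs(fvecs_path):
    """Memory-map a .fvecs file as a read-only (n, d) float32 array."""
    raw = np.memmap(fvecs_path, dtype=np.int32, mode="r")
    d = int(raw[0])
    # Drop the leading dimension column of every record.
    return raw.view(np.float32).reshape(-1, d + 1)[:, 1:]
```

Only `block_rows` vectors are held in RAM during conversion, and the array returned by `mmap_fvecs` is paged in from disk on access, so slices such as `x[:1_000_000]` can be read without loading the whole file.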

Thanks for your reply. I finally figured out how to create a large dataset that can be read with a memory map: write your data through a memory map, and you can then read it back directly with a memory map. It seems unnecessary to convert the data to the .fvecs format, which I tried without finding a good way to do it.
I have another question: can I build an index with string IDs instead of integer IDs?
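
The poster does not show the code for the memory-map approach described above. A minimal sketch of what it might look like follows, assuming the number of vectors and their dimension are known in advance and the text file holds one whitespace-separated vector per line. The names `text_to_memmap` and `load_memmap` are hypothetical.

```python
import numpy as np


def text_to_memmap(txt_path, bin_path, n_vectors, dim, block_rows=100_000):
    """Fill a float32 memory map of shape (n_vectors, dim) from a text file.

    Returns the number of rows actually written.
    """
    out = np.memmap(bin_path, dtype=np.float32, mode="w+",
                    shape=(n_vectors, dim))
    row, block = 0, []
    with open(txt_path, "r") as src:
        for line in src:
            fields = line.split()
            if not fields:
                continue                          # skip blank lines
            block.append([float(v) for v in fields])
            if len(block) == block_rows:
                out[row:row + len(block)] = block  # written through to the file
                row += len(block)
                block.clear()
    if block:                                      # final partial block
        out[row:row + len(block)] = block
        row += len(block)
    out.flush()                                    # push pending pages to disk
    del out
    return row


def load_memmap(bin_path, n_vectors, dim):
    """Reopen the file read-only; data is paged in only when accessed."""
    return np.memmap(bin_path, dtype=np.float32, mode="r",
                     shape=(n_vectors, dim))
```

The raw file stores no header, so the shape and dtype have to be recorded separately and passed again when reopening it.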

Thanks for replying. I have already solved that problem. After training the index, I want to change nprobe to increase accuracy, but no matter how I change nprobe using faiss.ParameterSpace().set_index_parameter(saved_index, "nprobe", 2000), the search always returns the same result. Even worse, I can set nprobe to a value larger than nlist. The index I use is read from disk, where I saved it after training.

@tf24-karatzhong You can use index.setNumProbes(2000), as you can see here.

About string IDs: this is not supported. See issue #641.

I tried index.setNumProbes(2000), but it raises an error:
'IndexPreTransform' object has no attribute 'setNumProbes'

see
https://github.com/facebookresearch/faiss/wiki/FAQ#how-can-i-set-nprobe-on-an-opaque-index

Yes, I saw that FAQ entry before. The code I used is:
    import faiss
    import numpy as np
    saved_index = faiss.read_index('./index_file/cpu_pca_all.index')
    faiss.ParameterSpace().set_index_parameter(saved_index, "nprobe", 1000)
    saved_index.search(np.expand_dims(query_demo, 0), 10)
It returns the same result no matter what value of nprobe I set.

@tf24-karatzhong

If you use GPU indices, replace ParameterSpace with GpuParameterSpace.

I trained the index on GPU, but I have to convert the GPU index to a CPU index in order to save the trained index.
I also tried the following code:
    import faiss
    import numpy as np
    ps = faiss.GpuParameterSpace()
    saved_index = faiss.read_index('./index_file/cpu_pca_all.index')
    res = faiss.StandardGpuResources()
    gpu_index_f_saved = faiss.index_cpu_to_gpu(res, 0, saved_index)
    ps.set_index_parameter(gpu_index_f_saved, 'nprobe', 1)
    gpu_index_f_saved.search(np.expand_dims(query_demo, 0), 20)
With this approach I also get the same result no matter what value of nprobe I set.

@tf24-karatzhong
I have encountered a problem similar to yours. I have a large dataset, but I am limited by memory. So what is the specific solution for using memory mapping? Is it np.fromfile, perhaps? Thanks!
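
On the np.fromfile question: np.fromfile copies what it reads into RAM, whereas np.memmap maps the file and loads pages only when they are accessed. One way to use a memory map for training is to draw a random subset of rows from it and pass only that subset to the index's train method. The sketch below assumes a raw float32 file with known shape. The name `sample_training_vectors` and its parameters are hypothetical.

```python
import numpy as np


def sample_training_vectors(bin_path, n_total, dim, n_train, seed=0):
    """Copy a random subset of rows from a memory-mapped float32 file into RAM."""
    data = np.memmap(bin_path, dtype=np.float32, mode="r",
                     shape=(n_total, dim))
    rng = np.random.default_rng(seed)
    # Sorted row numbers keep the disk reads in ascending file order.
    rows = np.sort(rng.choice(n_total, size=n_train, replace=False))
    # Fancy indexing copies just the selected rows; np.array makes the result
    # a plain in-memory ndarray rather than a view of the file.
    return np.array(data[rows])
```

The returned array holds only `n_train * dim` floats, so its memory footprint is independent of the size of the full file.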
