Need guidance...
We'll have an application where we will stream a set of vectors (on the order of a billion). We cannot wait until we collect all the vectors to train an index (you recommend IMI at this scale). We are thinking of building indexes for smaller batches of vectors... once we have a batch ready, we could train the index from a sample, create an index for the batch and in the end merge all the indexes. I understand only IVF supports merging of indexes, wanted your thoughts on this approach.
Thanks
What matters for training the IVF clustering is the distribution of
vectors, so as long as you collect enough vectors compared to the number of
inverted lists, you don’t need to retrain the index with the new vectors
(provided they all follow the same distribution).
On Tue 20 Nov 2018 at 22:05, mvss80 notifications@github.com wrote:
Lucas Hosseini
Thanks Lucas for the quick response!
We don't have control over the order in which we receive vectors, so the distributions will most likely be very different from batch to batch. Will there be any issues when we merge all the indexes (one for each batch) in the end?
@mvss80, for the merge to be at all possible, the training must be done once and all the vectors must be added from the same trained index.
If you are certain that the distributions of the batches are significantly different, then you can use an IndexShards, which dispatches queries over several indexes without merging them.
Added a FAQ entry about this:
https://github.com/facebookresearch/faiss/wiki/FAQ#how-can-i-distribute-index-building-on-several-machines
@mdouze, thanks for the answer.
If I have too many batches where all indexes won't fit in memory, I tried doing this:
uber_index = faiss.IndexShards(D)
for i in range(NUM_BATCHES):
    # read created index for each batch as mmap
    sub_index = faiss.read_index("shard_"+str(i)+".index", faiss.IO_FLAG_MMAP)
    uber_index.add_shard(sub_index)
and for querying:
uber_index.nprobe = nprobe
uber_index.search(xq, 5)
This seems to work for me. Should I be using OnDiskInvertedLists for adding shards as you show for merging in demo_ondisk_ivf.py?
The mmap trick will work, but possibly with a strong impact on performance (because the data is read from disk at search time). The OnDiskInvertedLists solution is a bit better because the data is more contiguous. To build an index that fits in memory (but that gives less accurate results) you could consider a compressed index, see
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#if-very-important-then-opqx_ypqx
About setting the nprobe: I added a FAQ entry
https://github.com/facebookresearch/faiss/wiki/FAQ#how-can-i-set-nprobe-on-the-sub-indexes-of-an-indexshards-or-indexproxy
Got it. Just to clarify, I won't be able to use OnDiskInvertedLists to create IndexShards, right?
Also, when I create an IndexShards in memory after adding all sub_indexes without mmap and try writing it, I get an error. Is it not writable?
RuntimeError: Error in void faiss::write_index(const faiss::Index*, faiss::IOWriter*) at index_io.cpp:480: don't know how to serialize this type of index
OnDiskInvertedLists and IndexShards are different indeed
Storing an IndexShards is not supported, it probably does not make much sense.
If I want to query the closest points of a record already in IndexShards, I understand Faiss doesn't support query by id. When I try to reconstruct the vector, I get an error that reconstruct is not supported by IndexShards:
---> 12 print (uber_index.reconstruct(859456))
/anaconda3/envs/py36/lib/python3.6/site-packages/faiss/__init__.py in replacement_reconstruct(self, key)
151 def replacement_reconstruct(self, key):
152 x = np.empty(self.d, dtype=np.float32)
--> 153 self.reconstruct_c(key, swig_ptr(x))
154 return x
155
/anaconda3/envs/py36/lib/python3.6/site-packages/faiss/swigfaiss.py in reconstruct(self, key, recons)
1334
1335 def reconstruct(self, key, recons):
-> 1336 return _swigfaiss.Index_reconstruct(self, key, recons)
1337
1338 def reconstruct_n(self, i0, ni, recons):
RuntimeError: Error in virtual void faiss::Index::reconstruct(faiss::Index::idx_t, float*) const at Index.cpp:56: reconstruct not implemented for this type of index
I can go to the sub_index where the query vector has been indexed to reconstruct it. I can keep track of which id's went to which sub_indexes but is there a way to find this information from IndexShards?
Also, if my sub_indexes are all IVFFlat, trying to reconstruct gives me an error that it is not supported. What other index type can I use instead that supports reconstruct?
reconstruct and search_and_reconstruct are not supported by the GPU indexes (and by IndexProxy). I am not sure how hard it would be to support. @wickedfoo?
Currently the best way to handle this is to keep a hashtable with the vectors on CPU and do the reconstruction yourself.
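A minimal sketch of such a CPU-side table, assuming ids are plain integers and vectors fit in host memory. The `VectorStore` class is hypothetical and not part of Faiss; the looked-up vector is reshaped so it could be passed straight to a `search` call.

```python
import numpy as np

class VectorStore:
    """Hypothetical id -> vector table kept on the CPU next to the index."""

    def __init__(self, d):
        self.d = d
        self._vectors = {}

    def add(self, ids, vectors):
        vectors = np.asarray(vectors, dtype=np.float32)
        if vectors.ndim != 2 or vectors.shape[1] != self.d:
            raise ValueError("expected an array of shape (n, d)")
        for key, vec in zip(ids, vectors):
            self._vectors[int(key)] = vec.copy()

    def reconstruct(self, key):
        # Raises KeyError for unknown ids.
        return self._vectors[int(key)]

store = VectorStore(4)
store.add([10, 11], np.arange(8).reshape(2, 4))

# Shape (1, d): ready to be used as a query, e.g. index.search(query, k).
query = store.reconstruct(11).reshape(1, -1)
```

The same table could also record which shard each id was added to, if that mapping is needed later.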
@mdouze, looks like I've found a work-around... apologies for the long post. Please let me know if this is right.
Setting the maintain_direct_map flag to True on the IndexIVF allows me to use it as input to IndexIDMap2. I can then add records with ids to the IndexIDMap2 and call reconstruct given an id. Is this correct?
On top of this, I can add multiple IndexIDMap2s created this way into IndexShards that allows me to search across them all. But I ran into an issue when I search IndexShards with a query vector. Depending on which shard the resulting vector came from, the id of the returned vector is offset by a shard-dependent number.
Let me explain with an example. I create a sample dataset of 1M 512-d records where the first element ranges from 0 through (1M-1) and the rest of the elements are zeros.
import gc
import os
import numpy as np
import faiss
D = 512
INDEX_PATH = '/home'
data = np.zeros((1000000, D))
data[:, 0] = np.arange(0, 1000000)
Now, I'll create five separate IndexIVF/IndexIDMap2 indexes from the same dataset, read them as memory-mapped shards (shard_0 through shard_4) and then add them to IndexShards. shard_0 has ids [0, 1M), shard_1 has ids [1M, 2M) and so on.
# Create an IVF index, train and save it
trained_index = faiss.index_factory(D, "IVF10000,Flat")
trained_index.train(data.astype('float32'))
faiss.write_index(trained_index, os.path.join(INDEX_PATH, 'data_train.index'))
# Create Index Shards and add IndexIDMap2 to the index shards
uber_index = faiss.IndexShards(D)
for i in range(5):
    sub_index = faiss.read_index(INDEX_PATH + '/data_train.index')
    sub_index.make_direct_map(True)
    index_map = faiss.IndexIDMap2(sub_index)
    index_map.add_with_ids(data.astype('float32'), np.arange(1000000*i, 1000000*(i+1)))
    faiss.write_index(index_map, os.path.join(INDEX_PATH, 'shard_{}'.format(i)))
    sub_index = None
    gc.collect()
    shard_index = faiss.read_index(os.path.join(INDEX_PATH, 'shard_{}'.format(i)), faiss.IO_FLAG_MMAP)
    uber_index.add_shard(shard_index)
All of this works so far.
When I query any of the individual shards separately, I get results as expected. For example, if I query the first record data[0:1, :] from shard_0, I get the returned id as 0 for an exact match and if I query the same from shard_4, I get the id 4M.
# query shard_0
print (uber_index.at(0).search(data[0:1, :].astype('float32'), 5))
(array([[ 0., 1., 4., 9., 16.]], dtype=float32), array([[0, 1, 2, 3, 4]]))
# query shard_4
print (uber_index.at(4).search(data[0:1, :].astype('float32'), 5))
(array([[ 0., 1., 4., 9., 16.]], dtype=float32), array([[4000000, 4000001, 4000002, 4000003, 4000004]]))
# query uber_index
print (uber_index.search(data[0:1, :].astype('float32'), 6))
(array([[0., 0., 0., 0., 0., 1.]], dtype=float32), array([[ 0, 4000000, 8000000, 2000000, 6000000, 1]]))
As you can see above, the id's for exact matches should be 0, 1M, 2M, 3M, 4M instead of the above results. Looks like this is happening due to translations here.
What can we do to get the correct id's from uber_index? In other words, what do I need to do to make IndexShards work with IndexIDMap2?
Also, if I use IndexIVF with IndexIDMap2 as described above, it does not support remove_ids. Is this a functionality you are thinking of adding?
Ok, I thought you needed a GpuIndexIVF (that does not support reconstruct).
About the mapping: if you don't need the indices to be translated, set successive_ids=false in the constructor
https://rawgit.com/facebookresearch/faiss/master/docs/html/structfaiss_1_1IndexShards.html
Thank you! That worked, I get the right indices for retrieved vectors from uber_index. But if I try remove_ids on uber_index or individual shards, I get the error
RuntimeError: Error in virtual long int faiss::IndexIVF::remove_ids(const faiss::IDSelector&) at IndexIVF.cpp:275: Error: '!maintain_direct_map' failed: direct map remove not implemented
It is not implemented and low-priority for us to implement.
no activity, closing.