Faiss: Incrementally update the database

Created on 19 Jul 2017 · 8 Comments · Source: facebookresearch/faiss

In my task, I receive the dense database vectors (1x20000) one at a time. I am trying to use the
index.add() method to update the database incrementally.

Roughly my code looks like:
    import numpy as np
    import faiss

    quantizer = faiss.IndexFlatL2(16000)
    index = faiss.IndexIVFPQ(quantizer, 16000, 100, 8, 8)
    # np.random.random takes the shape as a tuple; Faiss expects float32 input
    index.train(np.random.random((100000, 16000)).astype('float32'))

In this loop I get the data from a queue which is filled in by another thread.

    while True:
        x = getData()   # x is a 1x16000 vector
        index.add(x)    # takes 100 ms :(

        # 50 nearest neighbours. This operation takes about 5 ms.
        D, I = index.search(x, 50)

I can add to and search the database incrementally. However, for d=16000 vectors with IndexIVFPQ,
the insertion time is about 100 ms, while the query time is a reasonable 5 ms.

I am wondering if I am doing something wrong? Any insights appreciated.
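
The getData() helper in the loop above is not shown in the post. As a rough sketch only, it could be backed by a standard-library queue that another thread fills. The names `data_queue` and `producer`, the `None` sentinel, and the tiny list "vectors" are assumptions made for illustration; real code would push float32 NumPy arrays of shape (1, d).

```python
import queue
import threading

data_queue = queue.Queue()

def producer(n_items, dim):
    """Stand-in for the other thread: pushes n_items vectors, then a None sentinel."""
    for i in range(n_items):
        data_queue.put([float(i)] * dim)
    data_queue.put(None)

def getData():
    """Block until the producer thread has pushed the next item."""
    return data_queue.get()

worker = threading.Thread(target=producer, args=(3, 4))
worker.start()

received = []
while True:
    x = getData()
    if x is None:        # sentinel: the producer has finished
        break
    received.append(x)   # a real loop would call index.add(x) and index.search(x, 50) here
worker.join()
print(len(received))
```

Because Queue.get() blocks until an item is available, the consumer loop does not spin while it waits for the producer.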

question

mpkuse

All 8 comments

Hi,
Normally the add and search times should be roughly equivalent. I tested your code, see
https://gist.github.com/mdouze/4c6cc0a03fb24bfd005ac7e563df9c3c
and add takes ~3 ms. Could you check the timings on your side?
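
For checking timings like this, a small helper along the following lines may be useful. It is only a sketch and not the code from the gist; `time_call` and `summarize` are hypothetical names, the workload is a stand-in, and `time.perf_counter` requires Python 3.

```python
import time

def time_call(fn, repeats=20):
    """Call fn() `repeats` times and return the per-call wall-clock times in ms."""
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms

def summarize(timings_ms):
    """Return (min, median, max) of a list of timings."""
    ordered = sorted(timings_ms)
    return ordered[0], ordered[len(ordered) // 2], ordered[-1]

# Stand-in workload. With a real index one would pass, for example,
#   lambda: index.add(x)    or    lambda: index.search(x, 50)
timings = time_call(lambda: sum(range(10000)), repeats=5)
lo, med, hi = summarize(timings)
print('min %.3f ms, median %.3f ms, max %.3f ms' % (lo, med, hi))
```

Reporting the minimum and median alongside the maximum makes it easier to tell a consistently slow call from occasional outliers.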

mdouze on 21 Jul 2017

Based on your snippet, here is my code. However, the append times are still very high. What do you think might be wrong?

    import time
    import numpy as np
    import faiss

    # 2000x16384: 2000 samples precomputed for testing purposes;
    # eventually these will be calculated online.
    S_word = np.load(S_word_filename)

    quantizer = faiss.IndexFlatL2(16384)
    index = faiss.IndexIVFPQ(quantizer, 16384, 256, 8, 8)

    # Train the index with random vectors.
    index.train(np.random.random((10000, 16384)).astype('float32'))

    number_of_nearest_neighbors = 50
    for loop_index in range(S_word.shape[0]):
        # --- Faiss index: add a single 1x16384 vector ---
        startTimeFaiss = time.time()
        index.add(np.expand_dims(S_word[loop_index], axis=0))
        print('%04d) Faiss append time : %4.2fms' % (loop_index, (time.time() - startTimeFaiss) * 1000.))

        # --- Search for the nearest neighbours of the same vector ---
        startTimeSearch = time.time()
        faiss_D, faiss_I = index.search(np.expand_dims(S_word[loop_index], axis=0), number_of_nearest_neighbors)
        print('%04d) Faiss Scoring : %4.2fms' % (loop_index, (time.time() - startTimeSearch) * 1000.))

Running the code gives the following (a sample of the prints from the loop):
0055) Faiss append time : 99.34ms
0055) Faiss Scoring : 8.12ms
0056) Faiss append time : 98.70ms
0056) Faiss Scoring : 8.24ms
0057) Faiss append time : 105.56ms
0057) Faiss Scoring : 6.83ms
0058) Faiss append time : 104.92ms
0058) Faiss Scoring : 8.12ms
0059) Faiss append time : 103.86ms
0059) Faiss Scoring : 8.22ms
0060) Faiss append time : 105.58ms
0060) Faiss Scoring : 9.67ms

mpkuse on 7 Aug 2017

Could you run the same test as in the gist?

mdouze on 10 Aug 2017

I tried your test on 2 different computers. Computer-A is an ordinary PC, and Computer-B has Titan-X cards and a dual-CPU motherboard (although I am using just 1 CPU). Details on the CPUs are at the end.

On computer-A

Failed to load GPU Faiss: No module named swigfaiss_gpu
Faiss falling back to CPU-only.
0.0116009712219
0.00266194343567
0.00661206245422
0.00261282920837

On computer-B

Failed to load GPU Faiss: No module named swigfaiss_gpu
Faiss falling back to CPU-only.
0.0727758407593
0.0106279850006
0.0393769741058
0.00264501571655

Computer-B has rather large append times in spite of having a more powerful processor. I am running the exact same code on both machines. What could be the issue in your opinion? If it is an issue with caching, are there any tips to fix it?

Configs ( cat /proc/cpuinfo )

Computer-A (4x 4GB of DDR3 synchronous RAM @ 1333 MHz; 256KB L1, 1MB L2, 8MB L3). The OS reports 8 logical processors (cpu cores: 4, siblings: 8 in the output below).

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
stepping : 7
microcode : 0x1a
cpu MHz : 1600.125
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
bugs :
bogomips : 6800.11
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual

Computer-B (8x 16GB of DDR3 synchronous RAM @ 2400 MHz; 384KB L1, 1536KB L2, 15MB L3). The OS reports 12 logical processors (cpu cores: 6, siblings: 12 in the output below).

processor : 11
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz
stepping : 1
microcode : 0xb00001a
cpu MHz : 1247.507
cache size : 15360 KB
physical id : 0
siblings : 12
core id : 5
cpu cores : 6
apicid : 11
initial apicid : 11
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
bugs :
bogomips : 6796.14
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual

mpkuse on 14 Aug 2017

@mpkuse Hi, I also need to incrementally update the database. I get some new data every day, and I need to add it to the index so that I can then search it. So I just keep the index object, add the data, and then re-train the index object. Is that right?