Add example to AnnoyTutorial where 2 parallel processes load the same model from disk and mmap the same index file.
I have prepared the code to save and fetch the model from disk in 2 parallel processes. What do we mean by mmapping the same index file? What is the use case for this scenario? I will add a description too.
@harshul1610 Thanks for taking this up. Could you please move annoy_index.load('index') into the thread and also output the memory used? Memory should not increase much, as the index stays on disk.
Also, it would be more professional to choose a less controversial example word than 'army'.
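For background on the question above: memory-mapping (mmap) makes the OS expose a file's contents as if they were a byte array in memory, paging data in lazily instead of copying the whole file into RAM; any number of processes mapping the same file share the same physical pages. A minimal standard-library sketch, with a temporary file standing in for the Annoy index file:

```python
import mmap
import os
import tempfile

# create a small stand-in for the index file
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as fh:
    fh.write(b'0123456789' * 1000)

# map it read-only: pages are loaded on demand by the OS and can be
# shared with any other process that maps the same file
with open(path, 'rb') as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    first_ten = bytes(mm[:10])
    mm.close()

os.remove(path)
print(first_ten)  # b'0123456789'
```

The use case is exactly the one in this tutorial: one process builds and saves the index, and many worker processes open it read-only without each paying the full RAM cost.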
sure. I will do it.
Is it good?
To be more explicit: each process should have its own indexer.
Is it good?
Hi @harshul1610 The code looks good.
Could you please add some text above cell 10 to provide motivation for the code?
It would be best to add a new cell where 2 separate indices are created and used without saving/loading. It should use much more memory than cell 10.
Your text will explain that memory mapping saves RAM.
Also, there is no need to create a new PR to update the code; you can just keep pushing to the existing branch.
@tmylk It is the time that clearly increases when we don't use a memory-mapped index, but the memory used by the processes is approximately the same.
When the index file is not memory mapped:
%%time
from gensim import models
from gensim.similarities.index import AnnoyIndexer
from multiprocessing import Process
import os
import psutil

# 'model' is the Word2Vec model trained earlier in the notebook
model.save('/tmp/mymodel')

def f(process_id):
    print('Process Id:', os.getpid())
    process = psutil.Process(os.getpid())
    new_model = models.Word2Vec.load('/tmp/mymodel')
    vector = new_model["science"]
    # each process builds its own Annoy index in memory
    annoy_index = AnnoyIndexer(new_model, 100)
    approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=annoy_index)
    for neighbor in approximate_neighbors:
        print(neighbor)
    print('Memory used by process ' + str(os.getpid()) + '=', process.memory_info())

p1 = Process(target=f, args=('1',))
p1.start()
p1.join()

p2 = Process(target=f, args=('2',))
p2.start()
p2.join()
Process Id: 9681
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9681= pmem(rss=224518144, vms=1353203712, shared=8687616, text=3051520, lib=0, data=1042268160, dirty=0)
Process Id: 9700
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9700= pmem(rss=224518144, vms=1353203712, shared=8687616, text=3051520, lib=0, data=1042268160, dirty=0)
CPU times: user 168 ms, sys: 16 ms, total: 184 ms
Wall time: 6.84 s
When the index file is memory mapped:
%%time
from gensim import models
from gensim.similarities.index import AnnoyIndexer
from multiprocessing import Process
import os
import psutil

# 'model' is the Word2Vec model trained earlier in the notebook
model.save('/tmp/mymodel')

def f(process_id):
    print('Process Id:', os.getpid())
    process = psutil.Process(os.getpid())
    new_model = models.Word2Vec.load('/tmp/mymodel')
    vector = new_model["science"]
    annoy_index = AnnoyIndexer()
    # mmap the index saved to disk earlier in the notebook instead of rebuilding it
    annoy_index.load('index')
    annoy_index.model = new_model
    approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=annoy_index)
    for neighbor in approximate_neighbors:
        print(neighbor)
    print('Memory used by process ' + str(os.getpid()) + '=', process.memory_info())

p1 = Process(target=f, args=('1',))
p1.start()
p1.join()

p2 = Process(target=f, args=('2',))
p2.start()
p2.join()
Results:
Process Id: 9648
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9648= pmem(rss=242716672, vms=1370664960, shared=26886144, text=3051520, lib=0, data=1042268160, dirty=0)
Process Id: 9663
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9663= pmem(rss=242716672, vms=1370664960, shared=26886144, text=3051520, lib=0, data=1042268160, dirty=0)
CPU times: user 104 ms, sys: 28 ms, total: 132 ms
Wall time: 471 ms
One thing is for sure: when we memory map the index file, there is a drastic increase in the memory shared between the processes.
Let me check that I understand what you are saying:
"cumulative RAM used by 2 processes with memory mapping from disk" = "RAM used by 2 processes, each creating its own index"
That is RAM used by the 2 processes together, not separately by each one. I don't see that statistic in your log above.
That would mean the mmapping claim by Annoy is not true: "It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data."
@tmylk updated the statistics
Thanks, that is clearer now. So it is using less memory in total, as it is shared. Shared memory is included in RSS, so it was confusing before.
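To illustrate why RSS was confusing here: when two processes map the same file, the kernel backs both mappings with a single set of physical pages, so the same bytes are counted in each process's RSS while existing in RAM only once. A standard-library sketch of that sharing, with a plain file standing in for the Annoy index (fork-based multiprocessing assumed):

```python
import mmap
import os
import tempfile
from multiprocessing import Process, Queue

def reader(path, q):
    # both child processes map the same file; the kernel backs the two
    # mappings with the same physical pages, so the data counts towards
    # each process's RSS but is stored in RAM only once
    with open(path, 'rb') as fh:
        mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
        q.put(bytes(mm[:5]))
        mm.close()

fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as fh:
    fh.write(b'hello world')

q = Queue()
procs = [Process(target=reader, args=(path, q)) for _ in range(2)]
for p in procs:
    p.start()
results = [q.get() for _ in procs]
for p in procs:
    p.join()
os.remove(path)
print(results)  # [b'hello', b'hello']
```

This is why summing the per-process RSS values overstates the total: the shared pages get counted twice, while the per-process figures for two independently built indices really are two separate copies.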
Please add some text and this output to the notebook.
Thanks for the pr!