Add example to AnnoyTutorial where 2 parallel processes load the same model from disk and mmap the same index file.
I have prepared the code to save and fetch the model from disk in 2 parallel processes. What do we mean by mmapping the same index file? What is the use case for this scenario? I will add a description too.
@harshul1610 Thanks for taking this up. Could you please move annoy_index.load('index') into the thread and also output the memory used? Memory should not increase much, as the index stays on disk.
Also, it would be more professional to choose a less controversial example word than 'army'.
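For background on the question above: memory-mapping (mmap) makes the OS expose a file's contents as if they were a byte array in memory, paging data in lazily instead of copying the whole file into RAM; any number of processes mapping the same file share the same physical pages. A minimal standard-library sketch, with a temporary file standing in for the Annoy index file:

```python
import mmap
import os
import tempfile

# create a small stand-in for the index file
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as fh:
    fh.write(b'0123456789' * 1000)

# map it read-only: pages are loaded on demand by the OS and can be
# shared with any other process that maps the same file
with open(path, 'rb') as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    first_ten = bytes(mm[:10])
    mm.close()

os.remove(path)
print(first_ten)  # b'0123456789'
```

The use case is exactly the one in this tutorial: one process builds and saves the index, and many worker processes open it read-only without each paying the full RAM cost.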
sure. I will do it.
Is it good?
To be more explicit: each process should have its own indexer.
Is it good?
Hi @harshul1610 The code looks good.
Could you please add some text above cell 10 to provide motivation for the code?
It would be best to add a new cell where 2 separate indices are created and used without saving/loading. It should use much more memory than cell 10.
Your text will explain that memory mapping saves RAM.
Also, there is no need to create a new PR to update the code; you can just keep pushing to the existing branch.
@tmylk It is the time that clearly increases when we don't use a memory-mapped index, but the memory used by the processes is approximately the same.
When the index file is not memory mapped:
%%time
from gensim import models
from gensim.similarities.index import AnnoyIndexer
from multiprocessing import Process
import os
import psutil

# 'model' is the Word2Vec model trained earlier in the notebook
model.save('/tmp/mymodel')

def f(process_id):
    print('Process Id:', os.getpid())
    process = psutil.Process(os.getpid())
    new_model = models.Word2Vec.load('/tmp/mymodel')
    vector = new_model["science"]
    # each process builds its own Annoy index in memory
    annoy_index = AnnoyIndexer(new_model, 100)
    approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=annoy_index)
    for neighbor in approximate_neighbors:
        print(neighbor)
    print('Memory used by process ' + str(os.getpid()) + '=', process.memory_info())

p1 = Process(target=f, args=('1',))
p1.start()
p1.join()

p2 = Process(target=f, args=('2',))
p2.start()
p2.join()
Process Id: 9681
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9681= pmem(rss=224518144, vms=1353203712, shared=8687616, text=3051520, lib=0, data=1042268160, dirty=0)
Process Id: 9700
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9700= pmem(rss=224518144, vms=1353203712, shared=8687616, text=3051520, lib=0, data=1042268160, dirty=0)
CPU times: user 168 ms, sys: 16 ms, total: 184 ms
Wall time: 6.84 s
When the index file is memory mapped:
%%time
from gensim import models
from gensim.similarities.index import AnnoyIndexer
from multiprocessing import Process
import os
import psutil

# 'model' is the Word2Vec model trained earlier in the notebook
model.save('/tmp/mymodel')

def f(process_id):
    print('Process Id:', os.getpid())
    process = psutil.Process(os.getpid())
    new_model = models.Word2Vec.load('/tmp/mymodel')
    vector = new_model["science"]
    annoy_index = AnnoyIndexer()
    # mmap the index saved to disk earlier in the notebook instead of rebuilding it
    annoy_index.load('index')
    annoy_index.model = new_model
    approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=annoy_index)
    for neighbor in approximate_neighbors:
        print(neighbor)
    print('Memory used by process ' + str(os.getpid()) + '=', process.memory_info())

p1 = Process(target=f, args=('1',))
p1.start()
p1.join()

p2 = Process(target=f, args=('2',))
p2.start()
p2.join()
Results:
Process Id: 9648
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9648= pmem(rss=242716672, vms=1370664960, shared=26886144, text=3051520, lib=0, data=1042268160, dirty=0)
Process Id: 9663
('organisations.', 0.6213911473751068)
('klusener', 0.6172938644886017)
('version', 0.6145751774311066)
('beveridge,', 0.6114714443683624)
('silence', 0.6113320291042328)
Memory used by process 9663= pmem(rss=242716672, vms=1370664960, shared=26886144, text=3051520, lib=0, data=1042268160, dirty=0)
CPU times: user 104 ms, sys: 28 ms, total: 132 ms
Wall time: 471 ms
One thing is for sure: when we memory map the index file, there is a drastic increase in the memory shared between the processes.
Let me check that I understand what you are saying:
"cumulative RAM used by 2 processes with memory mapping from disk" = "RAM used by 2 processes, each creating its own index"
That is RAM used by the 2 processes together, not separately by each one. I don't see that statistic in your log above.
That would mean the mmapping claim by Annoy is not true: "It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data."
@tmylk updated the statistics
Thanks, that is clearer now. So it is using less memory in total, as it is shared. Shared memory is included in RSS, so it was confusing before.
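To illustrate why RSS was confusing here: when two processes map the same file, the kernel backs both mappings with a single set of physical pages, so the same bytes are counted in each process's RSS while existing in RAM only once. A standard-library sketch of that sharing, with a plain file standing in for the Annoy index (fork-based multiprocessing assumed):

```python
import mmap
import os
import tempfile
from multiprocessing import Process, Queue

def reader(path, q):
    # both child processes map the same file; the kernel backs the two
    # mappings with the same physical pages, so the data counts towards
    # each process's RSS but is stored in RAM only once
    with open(path, 'rb') as fh:
        mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
        q.put(bytes(mm[:5]))
        mm.close()

fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as fh:
    fh.write(b'hello world')

q = Queue()
procs = [Process(target=reader, args=(path, q)) for _ in range(2)]
for p in procs:
    p.start()
results = [q.get() for _ in procs]
for p in procs:
    p.join()
os.remove(path)
print(results)  # [b'hello', b'hello']
```

This is why summing the per-process RSS values overstates the total: the shared pages get counted twice, while the per-process figures for two independently built indices really are two separate copies.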
Please add some text and this output to the notebook.
Thanks for the pr!