I am using the Similarity class to query a corpus (2.5M docs, features reduced to 300 using LSI). It gave me 77 shards, each ~57 MB, which looks OK. However, when I query the index it loads everything (the full 4 GB) into memory, which is exactly what I wanted to avoid. The documentation states that the Similarity class runs in constant memory, but that is not what happens. I used the default values for chunk and shard sizes, which are 256 and 32768 respectively. The documentation states
shardsize should be chosen so that a shardsize x chunksize matrix of floats fits comfortably into main memory.
That would be 32 MB. Why is it using 4 GB instead?
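For reference, the 32 MB figure follows directly from the documented formula (a shardsize x chunksize matrix of float32 values):

```python
shardsize = 32768    # gensim default: documents per shard
chunksize = 256      # gensim default: queries processed per batch
bytes_per_float = 4  # float32

mb = shardsize * chunksize * bytes_per_float / 2**20
print(mb)  # → 32.0
```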
Tested on:
Windows-2012ServerR2-6.3.9600-SP0
Python 3.6.4, AMD64
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1 (the same behavior occurred before updating, with both 3.6.0 and 3.4.0)
@Laubeee Similarity uses memory mapping (mmap) to load the index data into virtual memory. So while the virtual memory of your process will increase, the OS will evict pages from physical RAM when it runs low on memory.
But note that swapping is bad for performance, so this doesn't get you much. For good performance, you still want all the data in RAM. The main advantage of Similarity is that it allows dynamically adding new documents, unlike (Sparse)MatrixSimilarity.
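To make the mmap behavior concrete, here is a small NumPy sketch (illustrative only, not gensim internals; the file path and shapes are made up). Opening a memory-mapped file is cheap and enlarges virtual memory by the file size, but physical pages are only faulted in when the data is actually touched, and the OS is free to evict them again:

```python
import os
import tempfile

import numpy as np

# Stand-in for one index shard: a float32 matrix written to disk.
path = os.path.join(tempfile.mkdtemp(), "shard.dat")
shard = np.memmap(path, dtype=np.float32, mode="w+", shape=(32768, 300))
shard[:] = np.random.rand(32768, 300).astype(np.float32)
shard.flush()

# Re-open read-only: virtual memory grows by the full file size immediately,
# but physical RAM is only consumed as pages are faulted in by the query.
index = np.memmap(path, dtype=np.float32, mode="r", shape=(32768, 300))
query = np.random.rand(300).astype(np.float32)
sims = index @ query  # touches every page once; the OS may evict them later
print(sims.shape)     # (32768,)
```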
Thanks for the clarification so far.
Yet all the doc resources very strongly advise going for Similarity when RAM is an issue, so just for my understanding: when MatrixSimilarity pages are swapped out, they first need to be written to disk so they can be read back later (if needed), even if the object was read from a file in the first place. But when Similarity pages are evicted, the write can be skipped because the data is already backed by files on disk, so eviction should be faster. Is that correct?
The difference between Similarity and MatrixSimilarity is that with Similarity, the index is split across multiple smaller "shards", with each shard stored to disk and mmapped back separately. Whereas with MatrixSimilarity, the entire index is stored in RAM / on disk + mmapped back as a single large matrix.
There's no major difference in performance or RAM. Both Similarity and MatrixSimilarity can be stored to disk and mmapped back. But the Similarity sharding allows some new workflows, such as adding new documents to the index dynamically.
If you need fast approximate retrieval, check out ANN libraries such as Annoy.
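A simplified sketch of what the shard-by-shard query flow looks like (illustrative only, not gensim's actual implementation; the shapes, shard count, and the assumption of pre-normalized vectors are all made up for the example). Since each shard is scored independently and the results concatenated, the working set per query is bounded by a single shard's matrix rather than the whole index:

```python
import numpy as np

def query_sharded(shards, query):
    """Score a query against each shard in turn and concatenate the results.

    `shards` is a list of (num_docs, num_features) float32 arrays standing in
    for mmapped shard files; only one shard is touched at a time, which keeps
    per-query memory bounded regardless of total index size.
    """
    parts = []
    for shard in shards:
        # Dot product = cosine similarity, assuming rows and query
        # are already L2-normalized.
        parts.append(shard @ query)
    return np.concatenate(parts)

rng = np.random.default_rng(0)
shards = [rng.random((1000, 300), dtype=np.float32) for _ in range(3)]
query = rng.random(300, dtype=np.float32)
sims = query_sharded(shards, query)
print(sims.shape)  # (3000,)
```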