I am using the Similarity class to query a corpus (2.5M docs, features reduced to 300 using LSI). It gave me 77 shards, each ~57 MB, which looks OK. However, when I query the index it loads everything (the full 4 GB) into memory, which is exactly what I wanted to avoid. The documentation states that the Similarity class runs in constant memory, but that is not what happens. I used the default values for chunk and shard sizes, which are 256 and 32768 respectively. The documentation states
shardsize should be chosen so that a shardsize x chunksize matrix of floats fits comfortably into main memory.
That would be 32 MB. Why is it using 4 GB instead?
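For reference, the 32 MB figure follows directly from the documented formula (a shardsize x chunksize matrix of float32 values):

```python
shardsize = 32768    # gensim default: documents per shard
chunksize = 256      # gensim default: queries processed per batch
bytes_per_float = 4  # float32

mb = shardsize * chunksize * bytes_per_float / 2**20
print(mb)  # → 32.0
```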
Tested on:
Windows-2012ServerR2-6.3.9600-SP0
Python 3.6.4, AMD64
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1 (the same behavior occurred before updating, with both 3.6.0 and 3.4.0)
@Laubeee Similarity uses memory mapping (mmap) to load the index data into virtual memory. So while the virtual memory of your process will increase, the OS will evict pages from physical RAM when it runs low on memory.
But note that swapping is bad for performance, so this doesn't get you much. For good performance, you still want all the data in RAM. The main advantage of Similarity is that it allows dynamically adding new documents, unlike (Sparse)MatrixSimilarity.
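To make the mmap behavior concrete, here is a small NumPy sketch (illustrative only, not gensim internals; the file path and shapes are made up). Opening a memory-mapped file is cheap and enlarges virtual memory by the file size, but physical pages are only faulted in when the data is actually touched, and the OS is free to evict them again:

```python
import os
import tempfile

import numpy as np

# Stand-in for one index shard: a float32 matrix written to disk.
path = os.path.join(tempfile.mkdtemp(), "shard.dat")
shard = np.memmap(path, dtype=np.float32, mode="w+", shape=(32768, 300))
shard[:] = np.random.rand(32768, 300).astype(np.float32)
shard.flush()

# Re-open read-only: virtual memory grows by the full file size immediately,
# but physical RAM is only consumed as pages are faulted in by the query.
index = np.memmap(path, dtype=np.float32, mode="r", shape=(32768, 300))
query = np.random.rand(300).astype(np.float32)
sims = index @ query  # touches every page once; the OS may evict them later
print(sims.shape)     # (32768,)
```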
Thanks for the clarification so far.
Yet all the doc resources very strongly advise going for Similarity when RAM is an issue, so just for my understanding: when MatrixSimilarity pages are swapped out, they first need to be written to disk so they can be read back later (if needed), even if the object was read from a file in the first place. But when Similarity pages are evicted, the write can be skipped because the data is already backed by files on disk, so eviction should be faster. Is that correct?
The difference between Similarity and MatrixSimilarity is that with Similarity, the index is split across multiple smaller "shards", with each shard stored to disk and mmapped back separately. Whereas with MatrixSimilarity, the entire index is stored in RAM / on disk + mmapped back as a single large matrix.
There's no major difference in performance or RAM. Both Similarity and MatrixSimilarity can be stored to disk and mmapped back. But the Similarity sharding allows some new workflows, such as adding new documents to the index dynamically.
If you need fast approximate retrieval, check out ANN libraries such as Annoy.
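A simplified sketch of what the shard-by-shard query flow looks like (illustrative only, not gensim's actual implementation; the shapes, shard count, and the assumption of pre-normalized vectors are all made up for the example). Since each shard is scored independently and the results concatenated, the working set per query is bounded by a single shard's matrix rather than the whole index:

```python
import numpy as np

def query_sharded(shards, query):
    """Score a query against each shard in turn and concatenate the results.

    `shards` is a list of (num_docs, num_features) float32 arrays standing in
    for mmapped shard files; only one shard is touched at a time, which keeps
    per-query memory bounded regardless of total index size.
    """
    parts = []
    for shard in shards:
        # Dot product = cosine similarity, assuming rows and query
        # are already L2-normalized.
        parts.append(shard @ query)
    return np.concatenate(parts)

rng = np.random.default_rng(0)
shards = [rng.random((1000, 300), dtype=np.float32) for _ in range(3)]
query = rng.random(300, dtype=np.float32)
sims = query_sharded(shards, query)
print(sims.shape)  # (3000,)
```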