Gensim: min_similarity & max_distance does not work in levsim

Created on 28 Jun 2019 · 8 comments · Source: RaRe-Technologies/gensim

Labels: Hacktoberfest, good first issue, help wanted

All 8 comments

@BitMindLab please fill in the issue template fully: what you're trying, what you're expecting, what you're seeing instead.

The doc-comment clearly does not match the code's behavior: the code gains no efficiency benefit from specifying max_distance; it just sometimes returns a different value.

Either the code could be changed to match the comment intent, or the comment could be updated to not make false claims about what the code does.

Whether the code does something more efficient or not is up to the code. I proposed (much) faster early-out variants in the past, but IIRC @Witiko didn't think it was worth it.

The only problem I see is that the documentation is unclear whether supplying max_distance must speed up the computation, or simply may speed it up. The intent is may: no guarantees, except that it's not slower.

So if the user knows their max distance threshold, they should always supply it. With the current implementation, supplying the parameter has no real effect, but that may change in the future.

@Witiko @gojomo @BitMindLab @mpenkov can you clarify the docstring wording? Or even better, implement the early-out optimization more intelligently (no need to compute the full Levenshtein distance once it's known to be above the max_distance threshold for certain).

@piskvorky As discussed in https://github.com/RaRe-Technologies/gensim/pull/2016#discussion_r202326763, replacing the levsim function with a stub (i.e. Levenshtein distance is not even computed) leads to less than 2× speed improvement. My conclusion was that rather than optimize the pointwise levsim function, which will lead to virtually no speed-up, we should be computing the Levenshtein distance between all word pairs in bulk.

That's the one, thanks for digging it up :)

I assume the "2×" also depends on the particular inputs, no? I.e. the longer the strings, the more costly it is to compute exact Levenshtein (especially compared to a low max_distance threshold… for example, with max_distance=N, we can exit early by simply checking abs(len(string1) - len(string2)) > N; compare that to the O(len(string1) * len(string2)) cost of full Levenshtein).
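For illustration, here is a minimal sketch of what such an early-out could look like (a hypothetical helper, not the current levsim implementation): the length check rejects a pair in O(1), and the running row minimum aborts the DP as soon as the distance is guaranteed to exceed max_distance.

```python
def levdist_early_out(s1, s2, max_distance):
    """Levenshtein distance, or max_distance + 1 as soon as the distance
    is certain to exceed max_distance (illustrative sketch only)."""
    # O(1) early out: the distance is at least the length difference.
    if abs(len(s1) - len(s2)) > max_distance:
        return max_distance + 1
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        # Row minima never decrease, so once every cell in the current
        # row exceeds the threshold, the final distance must as well.
        if min(curr) > max_distance:
            return max_distance + 1
        prev = curr
    return prev[-1]
```

The second check is the interesting one: for long strings with a small max_distance, most of the DP table is never filled in.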

Replacing naive pair-wise computations by a better algo sounds good too. Which algo do you have in mind? Some trie / automaton? How would that work?

The average time complexity of computing the edit distance for all pairs using a pairwise algorithm is O(M²N²), where M is dictionary size and N is average word length. In fact, a bit tighter bound exists with respect to N, since a subquadratic algorithm was recently discovered, see here.

However, since N is constant for a language and M >> N, the actual average time complexity is O(M²), i.e. there is virtually no difference between computing pointwise edit distance and calling a stub, as seen in the measurements.

I am currently unaware of any algorithm for computing the edit distance in bulk, but I am looking. We can gain a little speed-up by optimizing the python-Levenshtein library, but the improvements will be insignificant.

The improvements of early-out over computing the full Levenshtein distance are substantial. I've seen 10×+ gains (again, depending on string length: the longer the two inputs, the bigger the gain). Two super-large inputs are enough to halt your computation; you cannot take an "average word length N" over squares like that.

Gensim is not an academic library, performance on real inputs and constant factors matter a lot.

If the "normal" words you tried on are all equally small, that's great, but is it true for all supported use-cases / languages?

> The average time complexity of computing the edit distance for all pairs using a pairwise algorithm is O(M²N²)

But that's not the API I see in the TermSimilarityIndex module. The API says it returns the "most similar terms for a given term":

https://github.com/RaRe-Technologies/gensim/blob/369cc638225a2080faec25c4e2f6448d31e0492b/gensim/similarities/termsim.py#L34

Finding a small set of all words from a fixed static lexicon that are within a given distance from a dynamic query word is a well-studied problem. Tries and automata are the typical solution. It can be done extremely fast.

I've done this multiple times before myself (for spelling correction, incl. with a custom edit distance to give different edit weights to typos, character accent marks "ř" vs "r", keyboard layouts "y" vs "z", etc.). It should be fun to optimize this Gensim module too. But probably not any time soon: no spare time now, no itch to self-scratch.

> If the "normal" words you tried on are all equally small, that's great, but is it true for all supported use-cases / languages?

In the benchmark, we used the dictionary of the Google News dataset (English).
We would see a larger speed-up for languages with longer words such as German.

> The API claims it's looking for "Get most similar terms for a given term":

This is true. In the SparseTermSimilarityMatrix class, we eventually call most_similar for every term, which is equivalent to computing edit distance between all word pairs in our current implementation. However, since we are only interested in the k nearest neighbors, our problem is easier.
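Since only the k nearest neighbors are needed per term, the selection step can use a bounded heap rather than a full sort. A toy sketch (the function name and signature are illustrative, not the gensim API):

```python
import heapq

def most_similar_topk(query, dictionary, distance, k=3):
    """Return the k terms closest to `query` under `distance` (sketch)."""
    # nsmallest keeps a bounded heap: O(M log k) instead of O(M log M).
    return heapq.nsmallest(k, (t for t in dictionary if t != query),
                           key=lambda t: distance(query, t))
```

This does not reduce the M distance computations per query term, of course; it only avoids sorting all M candidates.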

> Finding a small set of all words from a fixed static lexicon that are within a given distance from a dynamic query word is a well-studied problem. Tries and automata are the typical solution. It can be done extremely fast.
>
> I've done this multiple times before myself (for spelling correction, incl. with a custom edit distance function to give different edit weights to typos, character accent marks "ř" vs "r", keyboard layouts "y" vs "z", etc.). It should be fun to optimize this Gensim module too. But probably not any time soon: no spare time now, no itch to self-scratch.

Here are some references:

  1. Johnson, N., 2007. Nick's Blog. Damn Cool Algorithms, Part 1: BK-Trees. [[URL](http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees)]
  2. Johnson, N., 2010. Nick's Blog. Damn Cool Algorithms: Levenshtein Automata. [[URL](http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata)]
  3. Boytsov, L., 2011. Journal of Experimental Algorithmics. Indexing methods for approximate dictionary searching: Comparative analysis. [[URL](http://searchivarius.org/personal/leonid-boytsovs-publications#jea2011)]
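To make reference 1 concrete, here is a minimal BK-tree sketch (illustrative code, not part of gensim). The invariant is that every term in the subtree under a child keyed k is at distance exactly k from that node's term, so the triangle inequality prunes whole subtrees during a query:

```python
def levdist(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """BK-tree: children are keyed by their distance to the parent term."""

    def __init__(self, terms):
        it = iter(terms)
        self.root = (next(it), {})
        for term in it:
            self._add(term)

    def _add(self, term):
        node, children = self.root
        while True:
            d = levdist(term, node)
            if d == 0:
                return  # term already present
            if d in children:
                node, children = children[d]
            else:
                children[d] = (term, {})
                return

    def query(self, term, max_distance):
        """All (distance, term) pairs within max_distance of `term`."""
        hits, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levdist(term, node)
            if d <= max_distance:
                hits.append((d, node))
            # Triangle inequality: only children keyed within
            # [d - max_distance, d + max_distance] can contain matches.
            for key, child in children.items():
                if d - max_distance <= key <= d + max_distance:
                    stack.append(child)
        return sorted(hits)
```

Each query computes the distance only along the surviving branches, which is where the speed-up over the all-pairs loop would come from.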

Regrettably, I will have little time to work on this during July and August.
Perhaps some other contributor will take up this fun project in the meantime.

