Elasticsearch: Minhash token filter needs better documentation

Created on 5 Oct 2016 · 5 Comments · Source: elastic/elasticsearch

The Minhash Token Filter documentation only describes the interface for the token filter. That is fine for most token filters, but this one is more complicated.

1. It should list possible use cases, such as an alternative to the "more like this" query.
2. It should talk about the recommended number of shingles: 5.
3. It should give small but complete examples for 1 and 2.

In the Lucene issue, they discuss Jaccard and cosine similarities. Did that make it into the final patch? If so, should that be exposed as a setting?

:Search/Analysis >docs

rpedela

👍9

All 5 comments

@rpedela I know nothing about it. Fancy sending a PR with the details?

clintongormley on 7 Oct 2016

Also struggling to use this! Any help would be appreciated.

pkmital on 3 Oct 2017

👍3

cc @elastic/es-search-aggs

romseygeek on 14 Mar 2018

👍2

mark it

wayliew on 24 Oct 2018

For the next people who might be as confused as I was, I want to leave the following hint.

I was surprised to see the bucket_count parameter on the min_hash filter, since the Wikipedia article on MinHash says nothing about it: https://en.wikipedia.org/wiki/MinHash

I found the clue in this comment on the Lucene issue: https://issues.apache.org/jira/browse/LUCENE-6968?focusedCommentId=15263867&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15263867

After a bit more digging, the single hash and keeping the minimum set can be improved.

See:
[1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
[2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keep the minimum set, split the hash space up into 500 buckets (for a 500 hash fingerprint) and keep the minimum for each bucket. To fill an empty bucket, take the minimum from the next non-empty bucket on the right with rotation.
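
To make the summary above concrete, here is a minimal Python sketch of the bucketed scheme as the quoted comment describes it. It is not the Lucene implementation: the function names (`bucketed_minhash`, `hash_token`, `estimate_similarity`), the use of SHA-1 truncated to 32 bits, and the tiny bucket counts are all assumptions chosen for illustration. An empty bucket simply copies the minimum of the next non-empty bucket to its right, wrapping around at the end.

```python
import hashlib
from typing import Iterable, List, Optional


def hash_token(token: str) -> int:
    """Hypothetical 32-bit hash: first 4 bytes of SHA-1 (illustration only)."""
    return int.from_bytes(hashlib.sha1(token.encode("utf-8")).digest()[:4], "big")


def bucketed_minhash(
    hashes: Iterable[int],
    bucket_count: int,
    hash_space: int = 1 << 32,
    with_rotation: bool = True,
) -> List[Optional[int]]:
    """Keep the minimum hash per bucket; optionally fill empty buckets by rotation."""
    minima: List[Optional[int]] = [None] * bucket_count
    for h in hashes:
        # Map the hash to one of bucket_count equal slices of the hash space.
        b = h * bucket_count // hash_space
        if minima[b] is None or h < minima[b]:
            minima[b] = h

    if not with_rotation:
        return minima

    filled = list(minima)
    for i in range(bucket_count):
        if filled[i] is not None:
            continue
        # Look right (wrapping around) for the first originally non-empty bucket.
        for step in range(1, bucket_count):
            candidate = minima[(i + step) % bucket_count]
            if candidate is not None:
                filled[i] = candidate
                break
    return filled


def estimate_similarity(sig_a: List[Optional[int]], sig_b: List[Optional[int]]) -> float:
    """Fraction of bucket positions where both signatures hold the same value."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a is not None and a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = ["the quick brown fox jumps", "quick brown fox jumps over"]
    doc_b = ["the quick brown fox jumps", "brown fox jumps over the"]
    sig_a = bucketed_minhash((hash_token(t) for t in doc_a), bucket_count=8)
    sig_b = bucketed_minhash((hash_token(t) for t in doc_b), bucket_count=8)
    print(sig_a)
    print(sig_b)
    print(estimate_similarity(sig_a, sig_b))
```

With only one hash function, each token is hashed once and the bucket minima together form the fingerprint, which is presumably why the filter exposes bucket_count in addition to a hash count. Without rotation, short inputs leave most buckets empty, so two short documents would have few positions to compare.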