The Minhash Token Filter documentation only describes the interface for the token filter. That is fine for most token filters, but this one is more complicated.
In the Lucene issue, they discuss Jaccard and cosine similarities. Did that make it into the final patch? If so, should that be exposed as a setting?
@rpedela I know nothing about it. Fancy sending a PR with the details?
Also struggling to use this! Any help would be appreciated.
cc @elastic/es-search-aggs
mark it
For the next people who might be as confused as I was, I want to leave the following hint.
I was surprised to see the bucket_count parameter for the min_hash filter, since the Wikipedia article on MinHash says nothing about it: https://en.wikipedia.org/wiki/MinHash
I found the clue here https://issues.apache.org/jira/browse/LUCENE-6968?focusedCommentId=15263867&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15263867
After a bit more digging, the single hash and keeping the minimum set can be improved.
See:
[1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
[2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keep the minimum set, split the hash space up into 500 buckets (for a 500 hash fingerprint) and keep the minimum for each bucket. To fill an empty bucket, take the minimum from the next non-empty bucket on the right with rotation.
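
To make the summary concrete, here is a minimal Python sketch of the bucketing idea as I understand it from the quoted comment. It is not the Lucene implementation: the hash function (`hash32`, built on MD5), the 32-bit hash space, and all function names are assumptions chosen for illustration, and the rotation step simply copies the neighbouring value without the offset that the linked papers add.

```python
import hashlib

HASH_BITS = 32                 # assumed hash width, for illustration only
HASH_SPACE = 1 << HASH_BITS


def hash32(token):
    """Hypothetical stand-in hash: first 4 bytes of MD5 as an unsigned int."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def bucketed_minhash(tokens, bucket_count=500, hash_fn=hash32):
    """Split the hash space into equal buckets and keep the minimum per bucket."""
    buckets = [None] * bucket_count
    for token in tokens:
        h = hash_fn(token)
        idx = h * bucket_count // HASH_SPACE   # which slice of the hash space
        if buckets[idx] is None or h < buckets[idx]:
            buckets[idx] = h
    return buckets


def fill_with_rotation(buckets):
    """Fill each empty bucket from the next non-empty bucket to its right,
    wrapping around at the end of the list."""
    n = len(buckets)
    if all(b is None for b in buckets):
        return list(buckets)                   # nothing to borrow from
    filled = list(buckets)
    for i in range(n):
        if buckets[i] is None:
            j = (i + 1) % n
            while buckets[j] is None:          # scan the original values only
                j = (j + 1) % n
            filled[i] = buckets[j]
    return filled


def estimate_jaccard(sig_a, sig_b):
    """Fraction of bucket positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

With this layout, bucket_count is the length of the fingerprint, and a single hash pass over the tokens fills all buckets at once. Two documents would be compared by calling `fill_with_rotation(bucketed_minhash(tokens))` on each and passing both results to `estimate_jaccard`.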