Hdbscan: Relationship between two packages: hdbscan and hdbscan-with-cosine-distance

Created on 22 Apr 2019 · 4 Comments · Source: scikit-learn-contrib/hdbscan

I was learning about the HDBSCAN package and came across another package, "hdbscan-with-cosine-distance".

What's the relationship between the two? Can someone give usage examples to illustrate the differences?

Thanks.

arunmarathe

All 4 comments

The "hdbscan-with-cosine-distance" package was a fork by someone else to add cosine distance support. There were some quirks in making that happen, but if you really need cosine distance then it is an option. A better approach would be to use the current package and L2-normalize your data; in that case Euclidean distance is a monotonic function of cosine distance, so it serves as a close stand-in for it.
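The normalization step described above could look roughly like the following. This is a minimal sketch, not code from the package: the helper name `l2_normalize_rows` and the random placeholder `data` are assumptions made for illustration, and the clustering call is left as a comment so the snippet runs without hdbscan installed.

```python
import numpy as np

def l2_normalize_rows(data):
    """Scale each row (sample) to unit Euclidean length. Hypothetical helper."""
    data = np.asarray(data, dtype=float)
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    # Leave all-zero rows unchanged to avoid dividing by zero.
    norms[norms == 0.0] = 1.0
    return data / norms

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 16))  # placeholder data, stands in for real features
unit_data = l2_normalize_rows(data)

# With hdbscan installed, cluster the normalized rows with the Euclidean metric:
# import hdbscan
# labels = hdbscan.HDBSCAN(metric="euclidean").fit_predict(unit_data)
```

Normalizing per row (per sample) is what makes every point lie on the unit sphere, which is the condition the Euclidean/cosine relationship relies on.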

lmcinnes on 23 Apr 2019

Just wanted to confirm that you are referring to L2-normalizing each sample (observation/row) of the data. If so, should this be applied after scaling each feature (tag/column)?

ravimulpuri on 3 May 2019

I understand what @lmcinnes has mentioned about L2-normalizing each sample to get the equivalent of cosine distance. I was brushing up on my linear algebra basics and found this.

If we normalize each of the vectors A and B first and then compute the Euclidean distance between them (the square root of the sum of squared differences), the result is sqrt(2 - 2(A.B)). For unit vectors A.B is cos(theta), the cosine similarity, so this equals sqrt(2 - 2(cosine_similarity)) = sqrt(2(cosine_distance)), where cosine_distance = 1 - cosine_similarity. Please correct me if I am wrong @lmcinnes
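A quick numerical check of this relationship, as a sketch under the assumption that cosine distance is defined as 1 minus cosine similarity. The random vectors and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))  # two arbitrary 8-dimensional vectors

# Normalize both vectors to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine_similarity = float(a_unit @ b_unit)
cosine_distance = 1.0 - cosine_similarity
euclidean = float(np.linalg.norm(a_unit - b_unit))

# For unit vectors: ||a - b|| = sqrt(2 - 2 * cos_sim) = sqrt(2 * cos_dist)
assert np.isclose(euclidean, np.sqrt(2.0 - 2.0 * cosine_similarity))
assert np.isclose(euclidean, np.sqrt(2.0 * cosine_distance))
```

Because the square root is monotonic, ordering pairs by Euclidean distance on the normalized data gives the same ordering as cosine distance, even though the two values differ.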

ravimulpuri on 14 Jun 2019

The "hdbscan-with-cosine-distance" package was a fork by someone else to add cosine distance support. There were some quirks in making that happen, but if you really need cosine distance then it is an option. A better approach would be to use the current package and L2-normalize your data; in that case Euclidean distance is a monotonic function of cosine distance, so it serves as a close stand-in for it.

For the record, there is also the not-so-scalable option of using HDBSCAN with cosine and arc-cosine metrics thus:

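import hdbscan

# `data` is an (n_samples, n_features) array. algorithm="generic" computes the
# full pairwise distance matrix, which is why this option does not scale well.
clusterer = hdbscan.HDBSCAN(metric="cosine",
                            algorithm="generic")
result = clusterer.fit_predict(data)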

Please see #69 for more info.

rtrad89 on 18 Jul 2019

👍4
