Hdbscan: Relationship between two packages: hdbscan and hdbscan-with-cosine-distance

Created on 22 Apr 2019 · 4 Comments · Source: scikit-learn-contrib/hdbscan

I was learning about the HDBSCAN package and came across another package, "hdbscan-with-cosine-distance".

What's the relationship between the two? Can someone give usage examples to illustrate the differences?

Thanks.

All 4 comments

The "hdbscan-with-cosine-distance" package was a fork by someone else to add cosine distance support. There were some quirks in making that happen, but if you really need cosine distance then it is an option. A better approach would be to use the current package and L2-normalize your data -- in that case Euclidean distance is a close approximation to cosine distance.
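
A minimal sketch of the normalization approach described above, assuming `data` is a dense NumPy array with one sample per row and no all-zero rows. The helper names `l2_normalize_rows` and `cluster_unit_vectors` and the random sample data are illustrative and not part of hdbscan; the clustering call relies on the package's default Euclidean metric.

```python
import numpy as np


def l2_normalize_rows(data):
    """Scale each row (sample) to unit Euclidean length."""
    data = np.asarray(data, dtype=np.float64)
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("cannot normalize an all-zero row")
    return data / norms


def cluster_unit_vectors(data, min_cluster_size=5):
    """Cluster L2-normalized rows with hdbscan's default Euclidean metric."""
    import hdbscan  # imported lazily so the helper above works without it

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(l2_normalize_rows(data))


rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 10))  # illustrative data only
unit = l2_normalize_rows(sample)     # every row now has norm 1
```

With hdbscan installed, `labels = cluster_unit_vectors(sample)` would return one label per row. Normalization is done per row because cosine distance compares samples, not features.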

Just wanted to confirm that you are referring to L2-normalizing each sample (observation/row) of the data. If so, should this be applied after scaling each feature (tag/column)?

I understand what @lmcinnes mentioned about L2-normalizing each sample to get the equivalent of cosine distance. I was brushing up on my linear algebra basics and found this.

So for computing cos(theta): if we first normalize each vector (A and B) and then compute the Euclidean distance between them, i.e. the square root of the sum of squared differences between A and B, the result equals sqrt(2 - 2(A.B)). For unit vectors A.B = cos(theta), which is the cosine similarity, so this is sqrt(2 - 2*cosine_similarity) = sqrt(2*cosine_distance), where cosine_distance = 1 - cosine_similarity. Please correct me if I am wrong @lmcinnes
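
The relationship can be checked numerically. For unit vectors, ||A - B||^2 = ||A||^2 + ||B||^2 - 2(A.B) = 2 - 2cos(theta), so the Euclidean distance equals sqrt(2 * cosine_distance) with cosine_distance = 1 - cos(theta). Below is a small sketch using only the standard library; the function names and sample vectors are made up for illustration.

```python
import math


def normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 - cos(theta) between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [3.0, 1.0, -2.0]
b = [0.5, 4.0, 1.0]

lhs = euclidean_distance(normalize(a), normalize(b))
rhs = math.sqrt(2.0 * cosine_distance(a, b))
print(lhs, rhs)  # the two values agree up to floating-point error
```

Because the square root is monotonic, ranking pairs of normalized samples by Euclidean distance gives the same order as ranking them by cosine distance.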

The "hdbscan-with-cosine-distance" package was a fork by someone else to add cosine distance support. There were some quirks in making that happen, but if you really need cosine distance then it is an option. A better approach would be to use the current package and L2-normalize your data -- in that case Euclidean distance is a close approximation to cosine distance.

For the record, there is also the not-so-scalable option of using HDBSCAN with cosine and arc-cosine metrics thus:

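import hdbscan

# `data` is assumed to be an array of shape (n_samples, n_features)
clusterer = hdbscan.HDBSCAN(metric="cosine",
                            algorithm="generic")
result = clusterer.fit_predict(data)  # one cluster label per sample; -1 marks noise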

Please see #69 for more info.
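
To give a rough sense of why this option scales poorly: assuming the generic algorithm materializes at least one dense n-by-n float64 distance matrix (an assumption about the implementation, not something stated in this thread), memory grows quadratically with the number of samples. The helper below is a hypothetical back-of-the-envelope estimate and is not part of hdbscan.

```python
def dense_distance_matrix_gib(n_samples, bytes_per_value=8):
    """Approximate size in GiB of one dense n x n matrix (float64 by default)."""
    return n_samples * n_samples * bytes_per_value / 2**30


# Print the estimate for a few dataset sizes.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} samples: {dense_distance_matrix_gib(n):.3f} GiB")
```

Under that assumption, a tenfold increase in samples costs a hundredfold increase in memory. This is why normalizing the data and using the default Euclidean metric is the more practical route for large datasets.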
