I was learning about the HDBSCAN package and came across another package, "hdbscan-with-cosine-distance".
What's the relationship between the two? Can someone give usage examples to illustrate the differences?
Thanks.
The "hdbscan-with-cosine-distance" was a fork by someone else to add cosine distance support. There were some quirks in making that happen, but if you really need cosine distance then it is an option. A better approach would be to use the current package and l2-normalize your data -- in that case euclidean distance is a close approximation to cosine distance.
Just wanted to confirm: you are referring to l2-normalizing each sample (observation/row) of the data? If so, should this be applied after scaling each feature (tag/column)?
I understand what @lmcinnes has said about l2-normalizing each sample to get the equivalent of cosine distance. I was brushing up on my linear algebra basics and found this.

For computing cos(theta): if we normalize each of the vectors A and B to unit length first, and then compute the Euclidean distance as the square root of the sum of squared differences, the result is

    ||A - B|| = sqrt(2 - 2(A.B)) = sqrt(2 - 2*cos(theta)) = sqrt(2 * cosine_distance)

since cosine_distance = 1 - cos(theta). Please correct me if I am wrong @lmcinnes
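A quick numpy check of that identity (the vectors are arbitrary; note the formula involves cosine similarity cos(theta), while cosine distance is 1 - cos(theta)):

```python
import numpy as np

rng = np.random.default_rng(42)
a, b = rng.normal(size=4), rng.normal(size=4)

# Normalize both vectors to unit length
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

euclid = np.linalg.norm(a - b)   # Euclidean distance between unit vectors
cos_sim = a @ b                  # = cos(theta), since ||a|| = ||b|| = 1
cos_dist = 1.0 - cos_sim         # the usual "cosine distance"

# ||a - b|| = sqrt(2 - 2*cos_sim) = sqrt(2 * cos_dist)
assert np.isclose(euclid, np.sqrt(2 - 2 * cos_sim))
assert np.isclose(euclid, np.sqrt(2 * cos_dist))
```

Because sqrt(2 * cos_dist) is a monotone function of cosine distance, rankings of nearest neighbours are preserved, which is why Euclidean distance on normalized data is a good stand-in.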
For the record, there is also the not-so-scalable option of using HDBSCAN with cosine and arc-cosine metrics thus:

    import hdbscan

    clusterer = hdbscan.HDBSCAN(metric="cosine", algorithm="generic")
    result = clusterer.fit_predict(data)
Please see #69 for more info.