Hdbscan: Can't seem to get cosine metric working.

Created on 28 Oct 2016 · 25 comments · Source: scikit-learn-contrib/hdbscan

hdbscan==0.8.2
Can't seem to get cosine metric working:

hdbscan.hdbscan_.DistanceMetric.get_metric('cosine')

So the metric is there, but:

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric='cosine')
labels = clusterer.fit_predict(np.array(data_d2v))

yields:

ValueError: Unrecognized metric 'cosine'

bug

Most helpful comment

Sadly at this time, no. Current efficient methods require the use of space trees which, in turn, require metrics that admit the triangle inequality. Unfortunately this is not true of cosine distance, and hence things break down quickly. I am working on other approaches that would allow for more generic measures, but I honestly can't say when they will be available. In the meantime there is a semi-workaround available.

The next best thing to do is to use angular distance instead of cosine distance -- they are similar, but angular distance admits the triangle inequality. Now angular distance is not supported out of the box either (using the space trees in sklearn), but euclidean distance on l2 normalized vectors is equivalent to angular distance, so you can normalize the vectors and then use euclidean distance: thus something like

from sklearn.preprocessing import normalize

norm_data = normalize(data, norm='l2')
clusterer = HDBSCAN() # euclidean distance is the default
clusterer.fit(norm_data)

and that will be an approximation to using cosine distance.

All 25 comments

That is definitely a problem. I'll look into it as soon as I get a chance. In the meantime you can try 'arccos' as a metric, which may work better.

Same issue with 'arccos'; I had thought it might be an alias for cosine.

It seems that the issue is that we are instantiating ball/kd-trees from sklearn, which does not support such distance metrics (yet). If you have a small enough dataset then adding the keyword argument algorithm='generic' will get you the results you desire. In the meantime I'll see if I can push the relevant metrics upstream to sklearn.

So I'm also trying to use cosine similarity as my distance metric. Here's what I've tried so far:

  • Straight up using metric='cosine' doesn't work at all, because it's not implemented in sklearn.neighbors.DistanceMetric.

  • Using algorithm='generic', metric='cosine', or precomputing the cosine distances and then using metric='precomputed', makes the .fit method throw

    ValueError: Buffer dtype mismatch, expected 'double_t' but got 'float'


    Casting the input matrix to double dtype fixes the error, but the clustering run then requires a lot of memory.
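The dtype fix in the last bullet can be sketched like this (random data and illustrative sizes):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# float32 input is what triggers the "Buffer dtype mismatch" error
X = np.random.rand(300, 50).astype(np.float32)

# Cast to float64 before building the precomputed matrix; the full n x n
# matrix costs O(n^2) memory, which is the downside noted above
dist = pairwise_distances(X.astype(np.float64), metric='cosine')

# dist can then be passed to HDBSCAN(metric='precomputed')
```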


There remain some issues with this -- it's still being worked on. Apologies. In the meantime it is worth noting that if you normalize your vectors then euclidean distance is equivalent to arccos distance and all is well, so that's the right way to proceed for now.


Thanks for the hint, I totally missed that part! :D

Is there anything I could do to help?

If you can figure out the "right" way to make it work I would appreciate it. There are a few approaches, but they each have some drawbacks. I would be happy to know if there is a "better" approach. Pull requests are even better.


What ideas have you explored/worked on?

One can make custom kd-trees and ball-trees that support angular distance (I have done this, but not folded it in -- I don't want to support essentially duplicates of code from sklearn). One can make a normalized copy of the data and then run everything with euclidean distance (easy, but some other results will potentially look weird or not match the input data -- cluster exemplars, for instance -- and prediction will break). One can require distance matrices, and then fix the bug so that everything gets recast to float64 before going into the Cython code.


Personally I feel that merging your custom trees into sklearn proper would be the best approach, but I don't know enough about the specific implementation differences that could block the merge to comment further on that.

For the normalized-copy approach, if the change in vector values is a problem, maybe we could keep a scaling vector so we can recover the unscaled input data for operations that require it?

As for the Cython bug, I double-checked my code today and casting the input matrix to np.double does work, so that could be another way to do it. The downside is that the cast causes memory usage to go through the roof (it maxed out my 16GB when I clustered a (25000, 300) input with algorithm='generic', metric='cosine').
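The scaling-vector idea could be sketched as follows (pure NumPy; the variable names are just illustrative):

```python
import numpy as np

X = np.random.rand(100, 16)

# Normalize each row to unit length, but keep the per-sample norms around
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms

# Any operation that needs the original scale (e.g. reporting exemplars
# in the input space) can undo the normalization exactly
X_recovered = X_unit * norms
```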

I understand, but that is simply one of the issues with computing an all-pairs distance matrix. I'll see if I can work out something else reasonable.


@lmcinnes sorry which "issue" out of the three do you mean?

The casting to doubles -- in general, pairwise distance matrices are expensive and not a good way to go if you can help it. I should make that easier, however.


This seems to work for me:

from sklearn.metrics import pairwise_distances
from hdbscan import HDBSCAN

pw_distance = pairwise_distances(df[col_names], metric="cosine", n_jobs=-1)
clusterer = HDBSCAN(min_cluster_size=min_cluster_size, metric='precomputed')
clusterer.fit(pw_distance)

The above didn't work for me, but this did...

from sklearn.metrics import pairwise_distances
from hdbscan import HDBSCAN

distance = pairwise_distances(X, metric='cosine')
clusterer = HDBSCAN(min_cluster_size=min_cluster_size, metric='precomputed')
db = clusterer.fit(distance.astype('float64'))

Hi @lmcinnes .. I still find this issue occurring. I tried computing pairwise distances with the cosine metric and clustering on that, but it is highly memory-inefficient. Do you have any workaround for this? Thanks :)

Sadly at this time, no. Current efficient methods require the use of space trees which, in turn, require metrics that admit the triangle inequality. Unfortunately this is not true of cosine distance, and hence things break down quickly. I am working on other approaches that would allow for more generic measures, but I honestly can't say when they will be available. In the meantime there is a semi-workaround available.

The next best thing to do is to use angular distance instead of cosine distance -- they are similar, but angular distance admits the triangle inequality. Now angular distance is not supported out of the box either (using the space trees in sklearn), but euclidean distance on l2 normalized vectors is equivalent to angular distance, so you can normalize the vectors and then use euclidean distance: thus something like

from sklearn.preprocessing import normalize

norm_data = normalize(data, norm='l2')
clusterer = HDBSCAN() # euclidean distance is the default
clusterer.fit(norm_data)

and that will be an approximation to using cosine distance.
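A quick NumPy check of the relationship underlying this workaround: for unit vectors, squared euclidean distance equals exactly 2 × cosine distance, so euclidean distance on l2-normalized data is a monotone transform of cosine distance (the data below is random and purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 30))

norm_data = normalize(data, norm='l2')
u, v = norm_data[0], norm_data[1]

cos_dist = 1.0 - np.dot(u, v)   # cosine distance between unit vectors
eucl_sq = np.sum((u - v) ** 2)  # squared euclidean distance

# ||u - v||^2 = 2 - 2 * <u, v> = 2 * (1 - cos(u, v))
```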

thank you for that wonderful explanation @lmcinnes .. A pretty noob question from my end, but the application I am looking at is finding clusters of similar document vectors (inferred from a bigger pretrained model). Do you think in that scenario normalizing the vectors to accommodate the triangle inequality will do the trick?

There are no noob questions. I honestly can't say for sure -- it depends on how the document vectors were generated. If they are the result of PCA (or similar) on a TFIDF matrix then you should be fine. If they are the result of doc2vec or similar then it should work, but you may want to consider doing a non-linear dimension reduction first. If you are up for experimental code then UMAP might be one such option (by default it embeds to 2 dimensions, but supports arbitrary embedding dimension; something like 10 or 20 might be a reasonable thing to try).

To be honest I really don't know for sure -- there's not much you can do but try. On that front, please let me know what, if anything, works!

Thanks a lot @lmcinnes .. I'll try my hand at UMAP. I did try it with the l2-normalized vectors, and I am pleasantly surprised: they produce very similar cluster results to the ones I got using the precomputed cosine metric (plus the advantage of no additional RAM consumption).

But one thing I observed (not related to this issue) is the CPU usage. Computing the clusters takes a long time, and my CPU usage stays below 20% throughout. Below is my code:

clusterer1 = hdbscan.HDBSCAN(min_cluster_size=40, min_samples=10, cluster_selection_method='leaf', core_dist_n_jobs=-1)
db = clusterer1.fit(norm_data)

am i doing something wrong?

Nothing jumps out at me. The likely cause is that most of the time is spent shuffling memory to and from the CPU. These days memory bandwidth to the processor is often the actual bottleneck, and without good cache-oblivious algorithms or significant tuning it is hard to fix that.

@lmcinnes I was wondering whether it makes a difference if the normalization is done along features or along samples?

You would need to normalize across samples to make euclidean approximately the same as cosine -- you are essentially projecting all the samples onto a unit n-sphere, and then (small) euclidean distances are essentially angular distances around the n-sphere.
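Concretely, sklearn's normalize scales each row (sample) to unit length by default, which is the direction you want here; axis=0 would normalize features instead (an illustrative check on random data):

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.random.rand(10, 5)

X_samples = normalize(X, norm='l2')           # per-sample (default): projects onto the unit sphere
X_features = normalize(X, norm='l2', axis=0)  # per-feature: NOT equivalent to cosine

# Per-sample normalization gives unit-length rows;
# per-feature normalization gives unit-length columns instead
row_norms = np.linalg.norm(X_samples, axis=1)
col_norms = np.linalg.norm(X_features, axis=0)
```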

That explains it! thanks 👍

One workaround that worked for me is to convert the matrix to a sparse matrix; cosine should then work.
