Hdbscan: Too much noise found

Created on 5 Nov 2016 · 15 comments · Source: scikit-learn-contrib/hdbscan

Hi,

I am having a problem with clustering a data set. The data has been extensively filtered before clustering to remove uninteresting samples, so I would expect the vast majority of samples to cluster with some other samples. If I try other clustering algorithms that simply partition the dataset into a reasonable number of clusters, all clusters make sense and there is (almost) no noise. However, with all the min_cluster_size and min_samples parameters I have tried, hdbscan considers a lot of samples (~1/4-1/3) as noise. I can clearly see by eye that there is structure in that noise too... Is there anything else to do about it?

I'm attaching a seaborn clustermap to show that there is no real noise in the data, along with what I get from HDBSCAN (the leftmost cluster is what is detected as noise):
[image: clustermap] https://cloud.githubusercontent.com/assets/2895034/20030282/ed2e5d24-a359-11e6-8cd9-2dbff65d1cab.png
[image: hdbscan_clusters] https://cloud.githubusercontent.com/assets/2895034/20030299/1f251336-a35a-11e6-8d10-8dfe2caa615a.png

Agglomerative Clustering produces something closer to what I expect:
[image: agglomerativeclustring_fromcormatrix_11_clusters] https://cloud.githubusercontent.com/assets/2895034/20030278/e92e1386-a359-11e6-8153-4411f4c5f89c.png

Is it possible to force all points to the nearest cluster, for example?

Most helpful comment

I've merged in the prediction branch, which includes soft clustering. It is still "experimental", but I believe it should work. I'd be keen to have some people try it out, so if you are interested in this, please take a moment to clone from master and experiment. The relevant new routines are:

approximate_predict
membership_vector
all_points_membership_vectors

The associated docstrings should give you an idea of how to use them in the meantime, while I get proper documentation/tutorial material written.

All 15 comments

I am currently working on code that can provide a membership vector, giving the probability that a given point belongs to each of the found clusters. This is currently in the prediction branch and is not complete yet. It might satisfy your desire to assign everything.

The other alternative is simply to access the single_linkage_tree_ attribute, which gives you an uncondensed tree akin to robust single linkage. That should give you access to something more equivalent to the hierarchical clusterings.
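
For example, here is a minimal sketch of cutting that tree at a fixed height to get a flat clustering; the toy data and cut distance are illustrative assumptions, not recommendations:

    import hdbscan
    from sklearn.datasets import make_blobs

    # Illustrative toy data; substitute your own matrix.
    data, _ = make_blobs(n_samples=500, centers=4, random_state=0)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(data)

    # single_linkage_tree_ is the uncondensed (robust single linkage) tree.
    # Cutting it at a chosen mutual-reachability distance yields a flat
    # clustering, much like cutting an ordinary dendrogram.
    labels = clusterer.single_linkage_tree_.get_clusters(
        cut_distance=0.5, min_cluster_size=5
    )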

I'm travelling at the moment, so I can't get into too many details. Hopefully some of my colleagues can tackle this more thoroughly.

I should point out that I agree the results you are seeing are less than ideal, but I would need to know a little more about the data to start to understand why that might be the case. It would be nice to get better results here.

Thanks for the answer! Concerning the data: it is a correlation matrix with ~5000 rows and columns. What else would you like to know about it? I think I could probably share it...
(I tried clustering the raw data rather than the correlation matrix, and the results were only worse, judging by silhouette score.)
Would using single_linkage_tree from hdbscan have any advantage over hierarchical clustering?
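
For reference, a minimal sketch of feeding such a correlation matrix to HDBSCAN as a precomputed distance; the file name is hypothetical, and 1 - correlation is just one common choice of distance:

    import numpy as np
    import pandas as pd
    import hdbscan

    # Hypothetical path to the raw ~5000 x 10 matrix described above.
    raw = pd.read_csv("data.tsv", sep="\t").values

    # All-rows-vs-all-rows correlation matrix, converted to a distance:
    # correlation in [-1, 1] becomes a distance in [0, 2].
    dist = 1.0 - np.corrcoef(raw)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric="precomputed")
    labels = clusterer.fit_predict(dist)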

I would be wary of banking too much on the silhouette score; it can get a little strange if you have a lot of noise. The "single linkage tree" is actually a robust single linkage, so yes, it has some advantages over standard single linkage hierarchical clustering, most particularly that it is resistant to noise. It may well be the case, however, that your particular dataset works best with standard hierarchical clustering. In that case, please do check into shareability, because I'm always keen to see/have examples of hdbscan not working well so I can try to improve it.

Is there anything better? It doesn't look like there is much noise anyway, I think?
OK, I see, thanks!
I'll look into sharing the data; I'll have to check with someone else about this.

OK, I can share the data with you. This is the raw matrix, ~5000 by 10; the figures I showed were correlation matrices of all rows vs. all rows (which is quite big, so it's easier to re-create it on your side). Let me know if you can get it to work better with this data!
data.tsv.zip

Thanks. I'll take a look at it when I get a chance. Unfortunately I'm travelling at the moment, so I don't have much time. I'll let you know if I can do anything with the data.

I'm wondering how this membership probability vector differs from clusterer.probabilities_?

The goal of the membership probability vector (which is at last getting closer to landing in master) is to provide, for each point, a vector of probabilities of that point being a member of each cluster. This includes noise points, which are not assigned to clusters but whose "most likely" cluster you may still want to know.
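
To make the difference concrete, a small sketch using the soft clustering API that was later merged; the toy data and parameter values are illustrative:

    import hdbscan
    from sklearn.datasets import make_blobs

    blobs, _ = make_blobs(n_samples=500, centers=4, random_state=0)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                                prediction_data=True).fit(blobs)

    # probabilities_ holds one number per point: the strength of its
    # membership in the cluster it was assigned to (0.0 for noise points).
    print(clusterer.probabilities_.shape)        # (500,)

    # The membership vector instead gives one probability per cluster for
    # every point, including points labelled as noise (-1).
    soft = hdbscan.all_points_membership_vectors(clusterer)
    print(soft.shape)                            # (500, n_clusters)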

I have a similar issue with my dataset, in that the most common "cluster" is often -1 (noise). You mentioned above:

I am currently working on code that can provide a membership vector, giving the probability that a given point belongs to each of the found clusters. This is currently in the prediction branch and is not complete yet. It might satisfy your desire to assign everything.

I wonder if you have any ETA for when that branch will be considered stable and merged into master, or whether there is some way to use the single linkage or condensed tree to estimate the "closest" cluster for the noise points (even if it doesn't come with a probability).

It's coming fairly soon. I can't make any promises at this time, but I would really like to have it arrive in February or March. I understand this is fairly high priority, as there have been a number of requests for it, so I'll try to get it done ASAP.

I've merged in the prediction branch, which includes soft clustering. It is still "experimental", but I believe it should work. I'd be keen to have some people try it out, so if you are interested in this, please take a moment to clone from master and experiment. The relevant new routines are:

approximate_predict
membership_vector
all_points_membership_vectors

The associated docstrings should give you an idea of how to use them in the meantime, while I get proper documentation/tutorial material written.
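
Pending that documentation, here is a minimal usage sketch pieced together from those docstrings (the toy data and parameters are illustrative); taking an argmax over the membership vectors is one way to force every point, noise included, into its most likely cluster, as asked at the top of this thread:

    import numpy as np
    import hdbscan
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

    # prediction_data=True caches the extra structures the prediction
    # routines need.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=20,
                                prediction_data=True).fit(data)

    # Hard labels (plus membership strengths) for previously unseen points.
    new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
    labels, strengths = hdbscan.approximate_predict(clusterer, new_points)

    # Soft cluster memberships for the unseen points...
    soft_new = hdbscan.membership_vector(clusterer, new_points)

    # ...and for every training point; argmax assigns each point,
    # including noise points, to its most likely cluster.
    soft_all = hdbscan.all_points_membership_vectors(clusterer)
    forced_labels = np.argmax(soft_all, axis=1)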

I wonder what happened to those routines. Has anyone tested them? As far as I can tell, they have not been pushed to master...

They are essentially just the soft clustering routines. Unfortunately there seem to be some odd bugs that I have never had the time to track down. They may well work for your case.

So if I understood correctly, then this: membership_vector *= prob_in_some_cluster(x, tree, cluster_ids, point_dict, max_lambda_dict) should give me a membership to each cluster.
Does it include missing (noise, a.k.a. cluster = -1) points?
It looks like I need to redo the entire notebook now in order to see if I get what I need in the first place...
Since I'll be looking into it (no promises), I can check whether I can catch the bug you mentioned; do you have a reproducible example?
