HDBSCAN: Do I need to standardize my data if I am using HDBSCAN or hierarchical clustering?

Created on 7 Jul 2018 · 10 comments · Source: scikit-learn-contrib/hdbscan

As far as I know, when using a linear model you need to put your data on the same scale. However, HDBSCAN is a density-based clustering method, so do I need to normalize or scale my dataset?

I tried both options: when I standard-scale my data I get two clusters, whereas when I don't scale it I end up with 89. My fear is that if I scale, I may bias my results.

All 10 comments

The answer to this is unfortunately "it depends". It is really domain and data specific. Data can be on different scales and still be meaningful. Perhaps you are clustering populations in the tropics -- if so then the latitude values will be in a much narrower range than the longitude values. Standardising, however, would be the wrong thing to do -- while the data ranges are very different they are fundamentally on the same scale, and distance measurements are meaningful without standardising the data. On the other hand, if you want to cluster heights and weights of a population of people then standardising the data would be the more sensible option, or else your results will be swamped by one variable or the other (whichever produces a larger range of values). I think you need to ask yourself where your features are drawn from, and whether there is a natural sense of distance on the un-normalised data, or if the features really are on completely independent scales of measure.

Sorry this isn't a nice cut-and-dried answer, but the reality is that it is data dependent, and will require some domain expertise to make the right decision for a given dataset.
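To make the heights-and-weights case concrete, here is a minimal sketch with made-up synthetic data (all values and parameters are illustrative): unscaled, the weight column's much larger numeric range would dominate Euclidean distances, so we standardise before clustering.

```python
# Synthetic illustration: heights in metres span ~0.5 units while
# weights in kg span ~60 units, so unscaled Euclidean distance is
# driven almost entirely by weight.
import numpy as np
import hdbscan
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(1.65, 0.05, 250), rng.normal(1.85, 0.05, 250)])
weights = np.concatenate([rng.normal(60, 5, 250), rng.normal(90, 5, 250)])
X = np.column_stack([heights, weights])

# Standardise so each feature has unit variance before clustering.
X_scaled = StandardScaler().fit_transform(X)

labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(X_scaled)
print(np.unique(labels))  # cluster labels found (-1 marks noise)
```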

Thanks @lmcinnes for your response... As this post suggests, that was my second fear. My dataset comes from mobile tap data, and I am trying to cluster user app usage. In other words, based on clicks I would like to know what kind of behavior the user exhibits within a mobile application.

If everything is counts then they are largely on the same scale, and you probably don't need to standardise. If they are more subtle app-based measures then you would have to consult with the people who collect the data about what is going to make the most sense.

Thanks again @lmcinnes! I have a mixed data set: some columns are counts, and the other data chunk is also "counts" obtained from a TF-IDF representation. Do you think I need to scale?

You are going to need to do something; whether standard scaling is the right answer here is a trickier question. Certainly the clustering will be dominated by the count data if you don't scale. You may want to partition your features into count data and TF-IDF data, cluster each separately, and compare that with your combined clustering options -- see the sketch below. Ultimately you may have to do a bit of EDA and cluster validation -- this is all a bit more of an art than a science.
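A rough sketch of that comparison, assuming the two blocks are available as dense arrays (the random matrices below are only stand-ins for real data):

```python
import numpy as np
import hdbscan
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X_counts = rng.poisson(3.0, size=(200, 10)).astype(float)  # stand-in count block
X_tfidf = rng.random((200, 50))                            # stand-in TF-IDF block

# Cluster each feature block on its own.
labels_counts = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X_counts)
labels_tfidf = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X_tfidf)

# Adjusted Rand index: 1.0 means the partitions agree up to relabelling.
print(adjusted_rand_score(labels_counts, labels_tfidf))
```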

That's exactly what my setup looks like:

  1. count data
  2. tfidf matrix
  3. horizontal concatenation (count data, tfidf)
  4. standard_scale(count data, tfidf matrix)
  5. clustering with hdbscan

Another thing I tried: instead of using a TF-IDF matrix I used a BOW representation (count vectorizer), which as far as I understand is a counts matrix. After doing this I got the same number of clusters that I got with TF-IDF. Do you think that by using a BOW representation the two data chunks become more compatible, and can then be clustered together?
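For what it's worth, the five steps above might look roughly like this in code; `docs` and `counts` are hypothetical stand-ins for the app-usage text and click counts, and swapping `TfidfVectorizer` for `CountVectorizer` gives the BOW variant:

```python
import numpy as np
import hdbscan
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: per-user event text and per-user click counts.
docs = ["open app tap settings", "tap home tap search", "open close open",
        "search tap tap", "settings open home", "close app tap close"] * 5
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(len(docs), 2)).astype(float)

# Steps 1-2: count data plus a TF-IDF matrix
# (use CountVectorizer here instead for the BOW variant).
tfidf = TfidfVectorizer().fit_transform(docs)

# Step 3: horizontal concatenation of the two blocks.
X = hstack([counts, tfidf]).toarray()  # densified; fine for small data

# Step 4: standard scaling so the raw counts don't swamp the distances.
X = StandardScaler().fit_transform(X)

# Step 5: cluster the combined, scaled matrix with HDBSCAN.
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(X)
```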

It sounds not unreasonable. I would still spend some time doing EDA on the clusters to see if they really make sense.

Thanks, I will try the EDA.

Hi,
I am trying to detect similarities between different source codes from given data sets, and I am using the HDBSCAN technique. How can I do that? Could you please give me a hint? Is there any need to do vectorization?

This HDBSCAN implementation requires vector input data, so if you have textual data you'll need some form of vectorization.
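As a rough sketch of one common route (TF-IDF is only one choice; real source-code corpora may call for a language-aware tokeniser or embeddings):

```python
# Illustrative only: toy "source code" snippets vectorised with TF-IDF,
# then clustered with HDBSCAN.
import hdbscan
from sklearn.feature_extraction.text import TfidfVectorizer

sources = ["def add(a, b): return a + b",
           "def sub(a, b): return a - b",
           "print('hello world')",
           "for i in range(10): print(i)"] * 8

X = TfidfVectorizer(token_pattern=r"\w+").fit_transform(sources).toarray()
labels = hdbscan.HDBSCAN(min_cluster_size=4).fit_predict(X)
```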
