HDBSCAN: Using different distance metrics for different features.

Created on 10 Mar 2017 · 8 comments · Source: scikit-learn-contrib/hdbscan

I have a dataset where each sample contains lat/lon as well as speed for a given point. It would be straightforward to cluster the points based on position alone using the haversine distance as the metric. I would, however, like to include speed as a feature, which would imply using a different metric. For example, I could include all points that are both located close to, and within a given speed difference of, the other points in the cluster.

If it were possible to specify the metric for each feature, this could be solved. Are there any other suggestions on how I might do it? I am able to extract the neighbors based on each feature using ball_tree with different metrics. After that I have written manual code to generate clusters, but it is really slow compared to hdbscan. I could of course calculate a distance matrix and provide it as precomputed, but as I have a large dataset, this would take up too much memory.

Any inputs would be appreciated.
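
For reference, the precomputed route mentioned above might look roughly like the sketch below; the Earth radius, the 0.1 speed weight, and the column layout are illustrative assumptions, and the n x n matrix it builds is exactly what becomes too large for a big dataset.

import numpy as np
import hdbscan
from sklearn.neighbors import DistanceMetric

# Toy data: 500 points with columns [lat_rad, lon_rad, speed]
rng = np.random.default_rng(0)
lat_lon = np.radians(rng.uniform([50.0, 10.0], [51.0, 11.0], size=(500, 2)))
speed = rng.uniform(0.0, 30.0, size=(500, 1))
X = np.hstack([lat_lon, speed])

# Combined "distance": great-circle km plus a weighted speed difference
haversine = DistanceMetric.get_metric('haversine')
D = 6371.0 * haversine.pairwise(X[:, :2])     # n x n matrix -- this is the memory problem
D += 0.1 * np.abs(X[:, 2:3] - X[:, 2:3].T)    # illustrative speed weight of 0.1
labels = hdbscan.HDBSCAN(metric='precomputed').fit_predict(D)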

All 8 comments

In effect you need a custom metric, and in practice anything doing that is going to be slow. In principle ball trees support callable metrics, so you can write a Python function as a metric and pass it to the ball tree -- with a little work it might be possible to push that all the way through the algorithm (or it might just work now if you pass a function instead of a string as the metric and specifically use algorithm='boruvka_balltree'). The catch is that the computationally expensive stuff lives in Cython, and having to do the round trip up from C to Python for each distance call is going to be very expensive. It might be enough to get you there, though.

Longer term, what you need is a means of embedding your points in a space that effectively combines the metrics. I am actually working on building such a thing, but there is no usable code yet (just theoretical results). Sorry I can't be more help.
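
A rough sketch of that callable-metric suggestion for the lat/lon-plus-speed case might look like the following; the function itself, the 0.1 speed weight, and the min_cluster_size are illustrative assumptions, and coordinates are assumed to already be in radians:

import numpy as np
import hdbscan

EARTH_RADIUS_KM = 6371.0
SPEED_WEIGHT = 0.1  # illustrative: km of "distance" per unit of speed difference

def latlon_speed_distance(a, b):
    # a and b are rows of the form [lat_rad, lon_rad, speed]
    sin_lat = np.sin(0.5 * (b[0] - a[0]))
    sin_lon = np.sin(0.5 * (b[1] - a[1]))
    h = sin_lat ** 2 + np.cos(a[0]) * np.cos(b[0]) * sin_lon ** 2
    km = 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))
    return km + SPEED_WEIGHT * abs(a[2] - b[2])

# Every distance evaluation round-trips from C back into Python, so expect this to be slow.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                            metric=latlon_speed_distance,
                            algorithm='boruvka_balltree')
# labels = clusterer.fit_predict(X)  # X columns: [lat_rad, lon_rad, speed]

As the next comment reports, the Boruvka ball tree path may reject a callable metric outright; if so, algorithm='generic' (which falls back to computing a full pairwise distance matrix) is a possible, though memory-hungry, alternative.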

Thanks a lot for your input!
I created a custom metric function, which works fine with the ball_tree by itself, but when trying to pass it to hdbscan I receive the following error message: Cannot use Boruvka with BallTree for this metric!

If ball_tree can use this metric, why can't hdbscan?

I'll have to look into that and get back to you. It should be possible to ensure that you can pass custom metrics through, so it shouldn't be too hard to fix.

Hi guys!

First, I wanted to say - awesome job on this. I wasn't sure whether to open a new issue or not, but I have actually done the custom-metric work you are discussing in this issue: I forked a scikit-learn distro, added a custom metric to the dist_metrics.pyx Cython code, and tested it out. However, if you try to follow that pattern, you will find that the hdbscan_.py methods call the dist_metrics.pyx API slightly incorrectly. Currently, methods like

_hdbscan_prims_balltree

call into BallTree, for example, like so:
tree = BallTree(X, metric=metric, leaf_size=leaf_size, **kwargs)

To pass the metric keyword parameters properly, however, it needs to pop them out of kwargs first, something like this:

if 'metric_params' in kwargs:
    metric_params = kwargs.pop('metric_params')
else:
    metric_params = {}  # an empty dict rather than None, so the ** expansion below is always valid

tree = BallTree(X, metric=metric, leaf_size=leaf_size, **metric_params)

I imagine you'll have to do this in each of the _hdbscan_METHOD_TREETYPE functions to support custom distances properly. Changing these few lines did allow me to use a custom distance metric, however, and I think it's a pretty simple fix. Hope this is helpful.
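
For what it's worth, with a patch along those lines applied, a call could then look roughly like this; the metric name 'latlon_speed' and its 'w' keyword are hypothetical placeholders for whatever custom metric you register in dist_metrics.pyx:

import hdbscan

# hypothetical: 'latlon_speed' has been added to METRIC_MAPPING in dist_metrics.pyx
# and its __init__ accepts a speed-weight keyword 'w'
clusterer = hdbscan.HDBSCAN(metric='latlon_speed',
                            algorithm='prims_balltree',
                            metric_params={'w': 0.1})
# labels = clusterer.fit_predict(X)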

Thanks for the heads up; I believe there is a bunch of refactoring to do (including this) that I'm busily putting off for now. Hopefully I'll get there soon.

@QuiteAFoxtrot could you please share a bigger chunk of your code? I am a bit confused.
Thanks

I can't share the exact code unfortunately but I can try to help:

Because these algorithms call the distance metric so frequently, the metrics are optimized using Cython - a language that takes Python and adds static typing (among other things), so that your Cython code can be compiled down to C for speed but still be called directly from Python.

In scikit-learn, there is a dist_metrics.pyx module (note the .pyx extension indicating Cython, versus .py for regular Python code) at scikit-learn/sklearn/neighbors/dist_metrics.pyx, which lmcinnes has effectively copied into his HDBSCAN repo here. This dist_metrics.pyx is where the Cython code lives that later gets compiled into C. Step 1 of what I was describing was just to create your own Cython distance metric (using Cython for speed). Here is an example from the existing dist_metrics.pyx file:

#------------------------------------------------------------
# SEuclidean Distance
#  d = sqrt(sum((x_i - y_i)^2 / V_i))
cdef class SEuclideanDistance(DistanceMetric):
    r"""Standardized Euclidean Distance metric
    .. math::
       D(x, y) = \sqrt{ \sum_i \frac{ (x_i - y_i) ^ 2}{V_i} }
    """
    def __init__(self, V):
        self.vec = np.asarray(V, dtype=DTYPE)
        self.vec_ptr = get_vec_ptr(self.vec)
        self.size = self.vec.shape[0]
        self.p = 2

    cdef inline DTYPE_t rdist(self, DTYPE_t* x1, DTYPE_t* x2,
                              ITYPE_t size) nogil except -1:
        if size != self.size:
            with gil:
                raise ValueError('SEuclidean dist: size of V does not match')
        cdef DTYPE_t tmp, d=0
        cdef np.intp_t j
        for j in range(size):
            tmp = x1[j] - x2[j]
            d += tmp * tmp / self.vec_ptr[j]
        return d

    cdef inline DTYPE_t dist(self, DTYPE_t* x1, DTYPE_t* x2,
                             ITYPE_t size) nogil except -1:
        return sqrt(self.rdist(x1, x2, size))

    cdef inline DTYPE_t _rdist_to_dist(self, DTYPE_t rdist) nogil except -1:
        return sqrt(rdist)

    cdef inline DTYPE_t _dist_to_rdist(self, DTYPE_t dist) nogil except -1:
        return dist * dist

    def rdist_to_dist(self, rdist):
        return np.sqrt(rdist)

    def dist_to_rdist(self, dist):
        return dist ** 2

So step 1 would be to implement your cdef class MyMetric(DistanceMetric) and all of its attributes, similar to the already-existing SEuclideanDistance shown above. Also add a string-to-class entry for your new metric in the METRIC_MAPPING object, around line 68 of dist_metrics.pyx. Ultimately this code will be doing the actual distance calculation. Once you've implemented your Cython class, don't forget to compile it - unlike Python, Cython has to be compiled. (Sorry if you already know all this Cython stuff; I'm just trying to be thorough.)
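
To make step 1 concrete, here is a rough, untested sketch of what such a class could look like for the lat/lon-plus-speed case; the class name, the 'w' weight, and the reliance on sin, cos, asin, fabs and sqrt being cimported from libc.math (as the file's existing haversine metric already requires) are assumptions for illustration, not code from this thread:

#------------------------------------------------------------
# Lat/lon + speed distance (hypothetical sketch)
#  d = haversine(x[0:2], y[0:2]) + w * |x[2] - y[2]|
cdef class LatLonSpeedDistance(DistanceMetric):
    r"""Haversine distance on (lat, lon) in radians plus a weighted
    absolute speed difference; rows are [lat, lon, speed]."""
    def __init__(self, w=1.0):
        # stash the weight in self.vec so the nogil dist() below can read it
        self.vec = np.asarray([w], dtype=DTYPE)
        self.vec_ptr = get_vec_ptr(self.vec)

    cdef inline DTYPE_t dist(self, DTYPE_t* x1, DTYPE_t* x2,
                             ITYPE_t size) nogil except -1:
        cdef DTYPE_t sin_0 = sin(0.5 * (x1[0] - x2[0]))
        cdef DTYPE_t sin_1 = sin(0.5 * (x1[1] - x2[1]))
        cdef DTYPE_t h = sin_0 * sin_0 + cos(x1[0]) * cos(x2[0]) * sin_1 * sin_1
        return 2 * asin(sqrt(h)) + self.vec_ptr[0] * fabs(x1[2] - x2[2])

You would then add an entry such as 'latlon_speed': LatLonSpeedDistance to METRIC_MAPPING and rebuild the extension (typically python setup.py build_ext --inplace in the checkout) before the string name becomes usable.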

Step 2 - this is what I was really reporting to @lmcinnes: the API call he makes in his various tree-type functions is just a little bit off. If you want to replicate what I did - get it working with the dist_metrics.pyx file and have your own keyword parameters passed correctly to your Cython distance metric - you'll need to modify hdbscan/hdbscan_.py. lmcinnes wrote a bunch of different tree-type functions in hdbscan_.py:
_hdbscan_prims_kdtree (line 171)
_hdbscan_prims_balltree (line 204)
_hdbscan_boruvka_kdtree (line 235)
et cetera.
Within each of these functions, he passes the keyword parameters through to scikit-learn types like BallTree, KDTree, and DistanceMetric via the kwargs object, when he actually needs to pop the metric_params object out of kwargs and pass metric_params to these objects instead. So if you add a code block to do that, it all works fine. Or wait for this to bump up high enough on lmcinnes' priority list and he will do it for you someday :)

I hope this is helpful!

@QuiteAFoxtrot Thanks for the answer. Would you please tell me which function of the EuclideanDistance class I should change, and how, to achieve the following? (Sorry, I know almost nothing about Cython, and whatever I tried gave me a compilation error asking me to use the GIL!)

def calculate_distance(tens1, tens2):
    d = euclideanDistance(tens1, tens2)
    # normalize into [0, 1]
    mn = np.min(d)
    sqz = np.max(d) - mn
    d = (d - mn) / sqz
    d[d <= 0.001] = 0.001
    d[d >= 1.0] = 1.0
    return d

Thanks in advance
