Hdbscan: Can DBCV score be negative for HDBSCAN?

Created on 18 Apr 2018 · 11 Comments · Source: scikit-learn-contrib/hdbscan

Hi,

I am obtaining negative DBCV scores for multiple clusterings produced by varying HDBSCAN hyper-parameters. It is my understanding that this should be impossible for HDBSCAN. Referring to the paper by Moulavi et al. (SDM 2014), DBCV is a weighted sum of the Validity Index values of the clusters. For a cluster _Ci_, this index is defined as:

$$ V(C_i) \;=\; \frac{\min_{j \ne i} DSPC(C_i, C_j) \;-\; DSC(C_i)}{\max\left(\min_{j \ne i} DSPC(C_i, C_j),\; DSC(C_i)\right)} $$

So the only way this value can be negative is if the numerator is negative. Now, _DSPC(Ci, Cj)_ is the minimum distance between any pair of points where one belongs to _Ci_ and the other to _Cj_, and _DSC(Ci)_ is the longest edge in the mutual-reachability MST of _Ci_. In the case of HDBSCAN, notice that _DSC(Ci)_ must always be smaller than _DSPC(Ci, Cj)_, because otherwise cluster _Ci_ would split before it became separated from _Cj_. Another way to justify this is that HDBSCAN deletes edges in decreasing order of weight: if a heavier edge in _Ci_ (the one corresponding to _DSC(Ci)_) was retained, then the lighter edge (the so-called minimal separator corresponding to _DSPC(Ci, Cj)_) must also have been part of the connected component which defines _Ci_. Therefore, either _Ci_ and _Cj_ are one and the same cluster, or _DSPC(Ci, Cj)_ is greater than _DSC(Ci)_. Also, noise plays no role in this computation, since noise points are not part of any cluster's MST; they are only considered later, when computing the weights in the weighted sum.
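
To put the deletion-order argument in symbols (my own summary, not taken from the paper): let e* be the edge of the mutual-reachability MST whose removal separates _Ci_ from _Cj_, so that w(e*) = _DSPC(Ci, Cj)_. Since HDBSCAN removes edges in decreasing order of weight, every edge e that is still inside _Ci_ at that moment satisfies w(e) <= w(e*), hence

$$ DSC(C_i) \;=\; \max_{e \,\in\, MST(C_i)} w(e) \;\le\; w(e^{*}) \;=\; DSPC(C_i, C_j), $$

which makes the numerator above non-negative.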

If this understanding is correct, then the DBCV score for HDBSCAN must always be strictly positive; it cannot even be 0. Please correct me if I am wrong.

Otherwise, is there a bug in the implementation of the hdbscan.validity.validity_index method? Or, though much less likely, in the implementation of the core clustering algorithm?
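
For reference, here is a minimal sketch of the kind of check I am running (make_blobs is just a stand-in; the negative scores appear on my own data, which I cannot share, and min_cluster_size is arbitrary):

import numpy as np
import hdbscan
from hdbscan.validity import validity_index
from sklearn.datasets import make_blobs

# Synthetic stand-in data; the issue itself was observed on a private dataset.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X = X.astype(np.float64)  # cast to double precision, to be safe

labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)
print(validity_index(X, labels))  # expected > 0 if the reasoning above holds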

All 11 comments

The validity index code has never been thoroughly tested. It is possible there is a bug, or an assumption being violated somewhere. If you could pin down a minimal test case to reproduce this, it would be easier to track down.

I am not at liberty to share the dataset I'm working with. I will figure out a way of creating something artificial which can reproduce the error. It will take me a while to do this and I'll get back as soon as I can.

That's okay, I understand. I'm sorry I can't provide more help -- but I agree with you, this shouldn't theoretically happen, and right now I can't see anything in the code that would cause it to happen, so I am at a bit of a loss.

It seems that the resolution lies in the difference between the MST definitions used by HDBSCAN and by DBCV. While HDBSCAN builds the MST using mutual reachability distances, DBCV builds it using the _all-points_ mutual reachability distance. These are, in general, different values for any given pair of points, and this alone could cause the DBCV score of an HDBSCAN clustering to become negative. Additionally, DBCV scores are lower when the approx_min_span_tree flag is True. With it set to False, the scores have inched above 0, though that is no guarantee; the general trend, however, is as expected.
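
For illustration, this is roughly how I compare the two settings (a sketch; X stands for the dataset, which I cannot share, and min_cluster_size is arbitrary):

import hdbscan
from hdbscan.validity import validity_index

for approx in (True, False):
    # Same data, same hyper-parameters; only the MST construction differs.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                                approx_min_span_tree=approx).fit(X)
    print(approx, validity_index(X, clusterer.labels_))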

While investigating this, I implemented a variant of the DBCV score which reuses the original MST created by HDBSCAN (code pasted below). This runs faster, since the MST is pre-computed, and it shows the same general trend regarding approximate MSTs, i.e. scores with exact MSTs are higher. This variant score is always positive for HDBSCAN. As @lmcinnes pointed out in another thread, stability scores don't seem to be positively correlated with DBCV scores; I've yet to figure out whether they are correlated at all.

Thanks for your responses and for this wonderful tool.

import numpy as np


def DBCV_variant(self, multiplier=2):
    '''
    Imagined as an instance method so that it can reuse the MR-MST
    computed at clustering time instead of recomputing the all-points
    MR-MST described in Moulavi et al. (2014).

    multiplier is an integer used to replace infinite density
    separation values of clusters with a finite value.
    '''
    if not self.gen_min_span_tree:
        raise AttributeError("Minimum spanning tree not present. "
                             "Either 'gen_min_span_tree' was not set to "
                             "True, or the tree was not generated owing "
                             "to internal optimization criteria.")

    labels = self.labels_
    noise_size, *cluster_size = np.bincount(labels + 1)
    total = noise_size + np.sum(cluster_size)
    num_clusters = len(cluster_size)
    DSC = np.zeros(num_clusters)

    # Ultimately, for each Ci, we only require the
    # minimum of DSPC(Ci, Cj) over all Cj != Ci.
    # Call this value DSPC_wrt(Ci), i.e. the
    # density separation 'with respect to' Ci.
    DSPC_wrt = np.ones(num_clusters) * np.inf
    max_distance = 0

    mst_df = self.minimum_spanning_tree_.to_pandas()

    for _, edge in mst_df.iterrows():
        label1 = labels[int(edge['from'])]
        label2 = labels[int(edge['to'])]
        length = edge['distance']

        max_distance = max(max_distance, length)

        if label1 == -1 or label2 == -1:
            continue

        if label1 == label2:
            # Update the density sparseness of the cluster
            # to the sparsest (longest) edge seen so far.
            DSC[label1] = max(length, DSC[label1])
        else:
            # Check whether the density separations with
            # respect to each of these clusters can
            # be reduced.
            DSPC_wrt[label1] = min(length, DSPC_wrt[label1])
            DSPC_wrt[label2] = min(length, DSPC_wrt[label2])

    # DSPC_wrt[Ci] might be infinite if the connected component for Ci is
    # an "island" in the MR-MST, whereas for other clusters Cj and Ck the
    # MR-MST might contain an edge with one point in Cj and the other
    # in Ck. Here, we fix the infinite density separation of Ci.
    #
    # TODO: Think of a better yet efficient way to handle this.
    DSPC_wrt[np.isinf(DSPC_wrt)] = multiplier * max_distance

    V_index = [(DSPC_wrt[i] - DSC[i]) / max(DSPC_wrt[i], DSC[i])
               for i in range(num_clusters)]
    score = np.sum([(cluster_size[i] * V_index[i]) / total
                    for i in range(num_clusters)])
    return score
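
If anyone wants to try this without patching the library, the function above can simply be bound to a fitted clusterer. A usage sketch (assuming DBCV_variant and some data X are in scope):

import types
import hdbscan

# gen_min_span_tree=True is required, since the variant reuses the MR-MST.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, gen_min_span_tree=True).fit(X)
clusterer.DBCV_variant = types.MethodType(DBCV_variant, clusterer)
print(clusterer.DBCV_variant())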

I just hit upon a corner case in the computation of DBCV, and I don't recall off the top of my head how the paper handles it: what happens if the number of clusters is 1?

I see nothing in the formula that handles this, and in that case the DBCV score would be undefined, since the DSPC values are initialized to inf. To handle this, I added a new variable to my DBCV_variant method.

min_outlier_sep = np.inf # only required if num_clusters = 1

While processing the MR-MST edges, it is updated as follows.

        if label1 == -1 and label2 == -1:
            # Both endpoints are noise; the edge is irrelevant.
            continue
        elif label1 == -1 or label2 == -1:
            # Exactly one of the points is noise.
            min_outlier_sep = min(min_outlier_sep, length)
            continue

And after all MR-MST edges have been processed, it is used as follows.

    # In case min_outlier_sep is still np.inf, we assign a new value to it.
    # This only makes sense if num_clusters = 1, since it means the MR-MST
    # contains no edge between a noise point and a core point.
    min_outlier_sep = max_distance if min_outlier_sep == np.inf else min_outlier_sep

    # DSPC_wrt[Ci] might be infinite if the connected component for Ci is
    # an "island" in the MR-MST, whereas for other clusters Cj and Ck the
    # MR-MST might contain an edge with one point in Cj and the other
    # in Ck. Here, we fix the infinite density separation of Ci.
    #
    # TODO: Think of a better yet efficient way to handle this.
    correction = multiplier * (max_distance if num_clusters > 1 else min_outlier_sep)
    DSPC_wrt[np.isinf(DSPC_wrt)] = correction

I hope this is helpful. I'd be very happy if someone could point out problems with this, or with the DBCV_variant method in general; that would improve my understanding as well.

I don't recall exactly now, but I did deal with at least one case of undefined behaviour of DBCV as per the paper; it may have been this, or something else.

The DBCV value is always between −1 and +1, as stated in the original paper:
"It is easy to verify that our index produces values between −1 and +1, with greater values of the measure indicating better density-based clustering solutions"

https://pdfs.semanticscholar.org/a40f/b167d9d658242017ef6bc59a2f3933c2fded.pdf
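
A quick check of those bounds, using the notation from the question: since both quantities are non-negative,

$$ |V(C_i)| \;=\; \frac{\bigl|\min_{j \ne i} DSPC(C_i, C_j) - DSC(C_i)\bigr|}{\max\left(\min_{j \ne i} DSPC(C_i, C_j),\; DSC(C_i)\right)} \;\le\; 1, \qquad |DBCV| \;\le\; \sum_i \frac{|C_i|}{n} \;\le\; 1. $$

So negative values are legal as far as the paper is concerned; the question here is only whether an HDBSCAN clustering should ever produce them.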

Is this DBCV_variant something you'd like to consider rolling into the hdbscan package officially? Would you like me to raise a pull request?

I would be happy with a pull request at this stage.

I would just like to add that, when using precomputed distances with the "normal" version currently in use, I get numpy overflows on my dataset for d > 35 while calculating all_points_core_distance(), because d appears as an exponent on this line:

distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[distance_matrix != 0]) ** d

Using the generated MST from the clusterer, as this solution does, would probably avoid those problems :)
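
For what it's worth, the overflow could also be avoided by doing that computation in log space. Below is a sketch (all_points_core_distance_stable is a hypothetical name, not part of the package) of the all-points core distance formula from Moulavi et al., computed via logsumexp:

import numpy as np
from scipy.special import logsumexp

def all_points_core_distance_stable(distance_matrix, d):
    # Computes ((sum_{j != i} (1 / dist_ij)^d) / (n - 1)) ** (-1 / d)
    # per point, without ever materializing (1 / dist)^d, so large d
    # no longer overflows. Zero distances (self-distances or duplicate
    # points) are excluded from the sum, as in the masked original.
    n = distance_matrix.shape[0]
    log_inv = np.full_like(distance_matrix, -np.inf, dtype=float)
    nonzero = distance_matrix != 0
    log_inv[nonzero] = -np.log(distance_matrix[nonzero])
    log_mean = logsumexp(d * log_inv, axis=1) - np.log(n - 1)
    return np.exp(-log_mean / d)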

Pull request created here: https://github.com/scikit-learn-contrib/hdbscan/pull/243

The Travis CI build fails, and I can't tell whether that's due to my commits. I would appreciate some help.
