Hdbscan: Impact of adding extreme values?

Created on 8 Mar 2018 · 6 comments · Source: scikit-learn-contrib/hdbscan

I've struggled with a difficult decision that's not unique to this algorithm: how to treat my missing data. That said, I believe my implementation may be uniquely well suited to hdbscan, and I'm looking for feedback.

My missing data is not actually 'missing'; it does not exist because it is not applicable to that entry. Therefore, imputation does not make sense, since that would assign data where there should not be any.

I also do not want to convert fields with missing entries into categoricals; I don't want to lose the meaning of the data values in that field.

Therefore, I'm generating an extreme, far-outlying value for each field and filling that field's missing entries with it. This should, in principle, ensure that from a KNN perspective 'missing' is not clustered with non-missing values for that specific field. These data points with 'missing' entries also should not have a greater chance of being classified as outliers, unless being 'missing' is a very rare situation among points with an otherwise similar profile.
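For concreteness, here is a minimal sketch of that fill strategy; the column names and the sentinel magnitude are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: NaN in 'income' means "not applicable", not "unobserved".
df = pd.DataFrame({"age": [25, 40, 31],
                   "income": [50_000.0, np.nan, 62_000.0]})

# Replace each missing entry with a far-outlying sentinel so that, under a
# distance metric, 'missing' sits far from every real value in that field.
# -1e9 is an arbitrary placeholder here; choosing its scale is the hard part.
filled = df.fillna({"income": -1e9})
```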

Does this treatment seem sound? What are the drawbacks?

All 6 comments

That treatment seems sound given the application you have in mind. I agree, dealing with missing data is always tricky, and the right approach often depends on what you want to do with the data. Based on your desiderata here, I think this makes sense, as long as you want to cleanly separate missing vs. non-missing as a first-level clustering and then cluster within those sets. Of course, you could theoretically partition your data to do that as well if you wished.

As long as you want to cleanly separate missing v non-missing as a first level clustering and then cluster within those sets.

Optimally, I'd like to allow for the possibility of missing data being clustered with non-missing. If, for example, two entries were more similar on 99 fields than any other pair of entries, but one was missing the 100th field while the other was not, I'd hope they'd still cluster together.

You would need to make sure that your fill value is not too extreme to allow that. That makes it a little trickier -- I would suggest doing a little EDA on the distribution of distances to the kth nearest neighbor and choosing a suitable fill value accordingly.
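As a sketch of that EDA (the data, the value of k, and the scale here are all made up), one could inspect the empirical distribution of kth-nearest-neighbor distances with scikit-learn:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # stand-in for the fully observed rows

k = 5  # e.g. hdbscan's min_samples
# Query k+1 neighbors because, when querying the training set itself,
# the first neighbor of each point is the point itself (distance 0).
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
kth = dists[:, -1]  # distance from each point to its kth true neighbor

# Summary statistics to help pick a fill value that is clearly outlying,
# yet not so extreme that mixed missing/non-missing pairs can never be close.
print(kth.min(), np.median(kth), kth.max())
```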

My current challenge is actually for an arbitrary, rather than specific, dataset, which makes EDA a bit trickier, though it is the approach I've been pursuing.

The right amount of extremity for a fill value likely depends on use case, but this is what I am leaning toward as appropriate for my intention:

range_param = np.nanmax(data[c]) - np.nanmin(data[c])  # range of the column
sd_param = 3 * data[c].std()                           # three standard deviations
selected_param = np.nanmax([range_param, sd_param])
fill_value = np.nanmin(data[c]) - selected_param

Basically, I'm filling with a value that sits either three standard deviations or the full range below the field's minimum, whichever is greater.
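For what it's worth, that rule could be wrapped into a small helper and applied column by column; the DataFrame below is invented purely for illustration:

```python
import numpy as np
import pandas as pd

def outlier_fill_value(col):
    """Fill value: the column minimum minus the greater of the column's
    range or three standard deviations."""
    range_param = np.nanmax(col) - np.nanmin(col)
    sd_param = 3 * col.std()  # pandas .std() skips NaN (ddof=1)
    selected_param = np.nanmax([range_param, sd_param])
    return np.nanmin(col) - selected_param

# Hypothetical column with one 'not applicable' entry.
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0]})
df["x"] = df["x"].fillna(outlier_fill_value(df["x"]))
```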

That sounds like a reasonable approach under the circumstances. It will, at least, provide a starting point for some cluster analysis to see whether it is doing something sensible.

Great, I appreciate the guidance.

(Closing the issue, but will add notes if experience provides anything of interest.)
