Hdbscan: prediction.membership_vector always fails with type error

Created on 5 Jul 2017 · 12 Comments · Source: scikit-learn-contrib/hdbscan

probs = hdbscan.prediction.membership_vector(clusterer, X_train)
ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'long'

I tried this on latest master with the following metrics: manhattan, hamming, euclidean.

All 12 comments

Ah, it looks like this happens if my training X data is of type np.int64. This was hard to debug. I recommend adding type assertions somewhere.
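
A minimal sketch of the workaround this comment implies: cast the data to float64 before fitting and before calling membership_vector. The helper name `as_float64` is hypothetical and not part of hdbscan, and the hdbscan calls are left as comments because the thread only confirms that int64 input triggers the error, not that this cast resolves every case.

```python
import numpy as np


def as_float64(X):
    """Return X as a C-contiguous float64 array.

    Hypothetical helper (not part of hdbscan). It makes the dtype explicit
    up front instead of surfacing later as a Cython buffer error.
    """
    return np.ascontiguousarray(X, dtype=np.float64)


# Integer training data, similar in dtype to what the report describes.
X_train = np.array([[1, 2], [3, 4], [10, 12]], dtype=np.int64)
X_float = as_float64(X_train)
print(X_train.dtype, "->", X_float.dtype)

# With hdbscan installed, the call from the report would then look like this
# (assumption: the clusterer is also fitted on the float64 copy):
#
#   import hdbscan
#   clusterer = hdbscan.HDBSCAN(prediction_data=True).fit(X_float)
#   probs = hdbscan.prediction.membership_vector(clusterer, X_float)
```

The cast copies the data only when the input is not already a contiguous float64 array, so it should be cheap to apply unconditionally.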

Actually, the membership_vector function tends to return lots of NaNs and values like 2.77568154e-157, which I don't know how to interpret. I see that the tests for it are commented out, so I'll let this go :-) Maybe the best solution is to deprecate this function?
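
For anyone trying to reproduce the NaN problem, a small diagnostic along these lines may help quantify it. This is only a sketch: `summarize_memberships` and the `tiny` threshold are hypothetical, and the example array is made up for illustration rather than taken from real hdbscan output.

```python
import numpy as np


def summarize_memberships(probs, tiny=1e-100):
    """Count NaN and suspiciously small entries in a membership matrix.

    Hypothetical helper; `tiny` is an arbitrary threshold chosen for
    illustration, not a value defined by hdbscan.
    """
    probs = np.asarray(probs, dtype=np.float64)
    nan_mask = np.isnan(probs)
    # Non-zero entries whose magnitude is below the threshold.
    tiny_mask = ~nan_mask & (probs != 0) & (np.abs(probs) < tiny)
    return {
        "rows_with_nan": int(nan_mask.any(axis=1).sum()),
        "nan_entries": int(nan_mask.sum()),
        "tiny_entries": int(tiny_mask.sum()),
    }


# Made-up matrix containing the kinds of values described above.
example = np.array([
    [0.9, 0.1],
    [np.nan, np.nan],
    [2.77568154e-157, 0.5],
])
report = summarize_memberships(example)
print(report)
```

Running the same function on the array returned by membership_vector would give concrete counts to attach to a bug report.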

+1 to this, thanks for figuring out where the float64_t thing came from

Things work, but it was never very well tested for diverse datasets and there are certainly some issues. I'll have to see if I can find some time to sort through this and get it fixed.

I mean, I think it's also partly numpy's fault. I had an error message saying Buffer dtype mismatch, expected 'float64_t' but got 'float64'. A bit too cryptic to figure out.

I believe that's just a typo somewhere on my part. I should be able to fix that, but as for the rest ... that may take a little more time.

Happy to help if you can define some concrete TODOs.

Thank you! I'll see if I can work up a short list of things and put it here. Any and all help is greatly appreciated!

Is this a good place to ask about why this isn't in sklearn itself?

I have talked to the sklearn maintainers, and the short answer is that the algorithm (HDBSCAN*) is considered too new (i.e. it has insufficient citations in the literature) for inclusion at this time. As the algorithm gains in popularity and citations, inclusion in sklearn proper will hopefully happen.

Likewise keen to help out with this, as I'd like to see this adopted much more widely.

That makes sense; an algorithm being in sklearn implies that the technique has credibility.
