Hdbscan: prediction.membership_vector always fails with type error

Created on 5 Jul 2017 · 12 Comments · Source: scikit-learn-contrib/hdbscan

probs = hdbscan.prediction.membership_vector(clusterer, X_train)
ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'long'

I tried this on latest master with the following metrics: manhattan, hamming, euclidean.

All 12 comments

Ah, it looks like this happens if my training X data is of type np.int64. This was hard to debug. I recommend adding type assertions somewhere.

rw on 5 Jul 2017
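For anyone hitting the same error, one possible workaround is to cast the data to float64 before calling the prediction functions. The sketch below is only an illustration: the helper name ensure_float64 and the sample array are made up for this example and are not part of hdbscan.

```python
import numpy as np


def ensure_float64(X):
    """Return X as a C-contiguous float64 array (hypothetical helper, not an hdbscan API)."""
    X = np.asarray(X)
    # Reject strings, objects, and other non-numeric dtypes early, with a readable message.
    if not np.issubdtype(X.dtype, np.number):
        raise TypeError(f"expected numeric data, got dtype {X.dtype}")
    # Cast integers (e.g. np.int64) to float64, the dtype named in the error message.
    return np.ascontiguousarray(X, dtype=np.float64)


# Illustrative integer-typed data, similar in dtype to the X_train that triggered the error.
X_train = np.array([[1, 2], [3, 4], [10, 12]], dtype=np.int64)
X_float = ensure_float64(X_train)
print(X_float.dtype)
```

The converted array can then be passed in place of the original, as in hdbscan.prediction.membership_vector(clusterer, X_float). Whether the cast alone is enough for a given hdbscan version is an assumption that would need checking.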

Actually, the membership_vector function tends to return lots of NaNs and values like 2.77568154e-157, which I don't know how to interpret. I see that the tests for it are commented out, so I'll let this go :-) Maybe the best solution is to deprecate this function?

rw on 5 Jul 2017
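To quantify how often this happens, a small check along these lines might help. It is a rough sketch: summarize_memberships, the 1e-100 threshold, and the sample probs array are all assumptions made for illustration, with the array standing in for whatever membership_vector returns.

```python
import numpy as np


def summarize_memberships(probs, tiny=1e-100):
    """Count rows containing NaN and nonzero entries smaller than `tiny` (hypothetical helper)."""
    probs = np.asarray(probs, dtype=np.float64)
    # A row is flagged if any of its cluster memberships is NaN.
    nan_rows = np.isnan(probs).any(axis=1)
    # Nonzero values below the threshold, such as 2.77568154e-157.
    tiny_mask = (np.abs(probs) < tiny) & (probs != 0)
    return {
        "rows": probs.shape[0],
        "rows_with_nan": int(nan_rows.sum()),
        "tiny_values": int(tiny_mask.sum()),
    }


# Made-up stand-in for a membership_vector result, not real hdbscan output.
probs = np.array([
    [0.7, 0.3],
    [np.nan, np.nan],
    [2.77568154e-157, 0.9],
])
report = summarize_memberships(probs)
print(report)
```

Counts like these, gathered for a specific dataset and metric, could make a bug report easier to reproduce.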

+1 to this, thanks for figuring out where the float64_t thing came from

janfreyberg on 9 Nov 2017

Things work, but it was never very well tested for diverse datasets and there are certainly some issues. I'll have to see if I can find some time to sort through this and get it fixed.

lmcinnes on 10 Nov 2017

I mean, I think it's also partly numpy's fault. I had an error message saying Buffer dtype mismatch, expected 'float64_t' but got 'float64'. A bit too cryptic to figure out.

janfreyberg on 10 Nov 2017

That's definitely just a typo somewhere on my part, I believe. I should be able to fix that, but as for the rest ... that may take a little more time.

lmcinnes on 13 Nov 2017

Happy to help if you can define some concrete TODOs.

rw on 15 Nov 2017

👍1

Thank you! I'll see if I can work up a short list of things and put it here. Any and all help is greatly appreciated!

lmcinnes on 15 Nov 2017

👍2

Is this a good place to ask about why this isn't in sklearn itself?

rw on 15 Nov 2017

I have talked to the sklearn maintainers, and the short answer is that the algorithm (HDBSCAN*) is considered too new (i.e. has insufficient citations in the literature) for inclusion at this time. As the algorithm gains in popularity and citations, inclusion in sklearn proper will hopefully happen.

lmcinnes on 15 Nov 2017

Likewise keen to help out with this, as I'd like to see this adopted much more widely.

janfreyberg on 15 Nov 2017

That makes sense; an algorithm being in sklearn implies that the technique has credibility.

rw on 17 Nov 2017
