Hdbscan: buffer dtype mismatch

Created on 4 Nov 2016 · 8 comments · Source: scikit-learn-contrib/hdbscan

I am trying to run hdbscan but I get the error:

ValueError: Buffer dtype mismatch, expected 'double_t' but got 'long long'

I have attached my code below; it follows the standard example I have seen. Not sure how to proceed.
Thank you

import hdbscan
hdb = hdbscan.HDBSCAN(min_cluster_size=10)
clusters_hdb = hdb.fit_predict(feature_vects)

Most helpful comment

The dtype mismatch still seems to be an issue with metric='precomputed'. I get the error on fitting an np.float32 distance matrix, but casting to np.float64 fixes the problem.

All 8 comments

I think I need a little more information. Can you post the full stack trace that occurs at the error? Or possibly share the dataset that is failing?

Unfortunately, I can't share the dataset, but I found a dataset online that produces the same result.
The dataset is available at CrowdFlower and is labeled as "Identifying key phrases in text", so you can download it from there.
Here is the standalone code to reproduce the error.
Does this help? Thanks.

import sys

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

csv_file_path = "Key-phrases-DFE-794640.csv"

raw_corpus = pd.read_csv(csv_file_path)

# Only interested in the column raw_corpus["answer"]

# Number of features in data set
feature_size = 2000

raw_corpus = raw_corpus.fillna("")

vectorizer = CountVectorizer(
    strip_accents="ascii",
    analyzer="word",
    tokenizer=None,
    preprocessor=None,
    ngram_range=(1, 1),
    stop_words="english",
    max_df=1.00,
    min_df=0.01,
    max_features=feature_size)

vectorizer = vectorizer.fit(raw_corpus["answer"])
# CountVectorizer defaults to dtype=np.int64, so this dense array is integer-valued.
feature_vects = vectorizer.transform(raw_corpus["answer"]).toarray()
vocab = vectorizer.get_feature_names()

import hdbscan
hdb = hdbscan.HDBSCAN(min_cluster_size=10)
try:
    clusters_hdb = hdb.fit_predict(feature_vects)
except IndexError:
    # Note: the error raised here is a ValueError, not an IndexError,
    # so it propagates uncaught; hence the traceback below.
    exc_type, exc_value, exc_traceback = sys.exc_info()

Traceback (most recent call last):

File "", line 4, in
clusters_hdb = hdb.fit_predict(feature_vects)

File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 750, in fit_predict
self.fit(X)

File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 732, in fit
self._min_spanning_tree) = hdbscan(X, **kwargs)

File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 507, in hdbscan
gen_min_span_tree, **kwargs)

File "C:\Program Files\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 283, in call
return self.func(_args, *_kwargs)

File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 196, in _hdbscan_prims_kdtree
min_spanning_tree = mst_linkage_core_vector(X, core_distances, dist_metric, alpha)

File "hdbscan/_hdbscan_linkage.pyx", line 51, in hdbscan._hdbscan_linkage.mst_linkage_core_vector (hdbscan_hdbscan_linkage.c:3840)

ValueError: Buffer dtype mismatch, expected 'double_t' but got 'long long'

Is it possible that your feature vectors are all integer-valued? Can you
try casting them to float64? e.g.

clusters_hdb = hdb.fit_predict(feature_vects.astype(np.float64))

Let me know if that works. That sort of casting _should_ be happening
internally, and you shouldn't have to know or worry about the type of your
input data, but perhaps a check/conversion is missing.
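
For reference, here is a minimal self-contained sketch of that workaround; the synthetic integer feature matrix is purely illustrative, standing in for the CountVectorizer output above.

import numpy as np
import hdbscan

# Illustrative integer-valued features, e.g. raw term counts.
feature_vects = np.random.randint(0, 5, size=(100, 20))

hdb = hdbscan.HDBSCAN(min_cluster_size=10)

# Casting to float64 gives the Cython internals the 'double' buffer they expect.
clusters_hdb = hdb.fit_predict(feature_vects.astype(np.float64))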


@lmcinnes

Let me know if that works. That sort of casting should be happening
internally, and you shouldn't have to know or worry about the type of your
input data, but perhaps a check/conversion is missing.

I can confirm that this issue is still present, at least in the latest pip package: I had to cast my uchar vectors to np.float64.

Sorry about that -- it fell off my radar. Should be fixed now, and I'll try to get a new pip package out soon to remedy the problem globally. Thanks for the heads up that I had missed this one.

The dtype mismatch still seems to be an issue with metric='precomputed'. I get the error on fitting an np.float32 distance matrix, but casting to np.float64 fixes the problem.
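
A minimal sketch of that workaround, assuming a square pairwise-distance matrix (pairwise_distances is just one way to build one):

import numpy as np
import hdbscan
from sklearn.metrics import pairwise_distances

# Illustrative data; any (n_samples, n_features) array works.
X = np.random.rand(100, 5).astype(np.float32)

# A float32 distance matrix triggers the buffer dtype mismatch, so cast
# to float64 before fitting.
distance_matrix = pairwise_distances(X).astype(np.float64)

hdb = hdbscan.HDBSCAN(min_cluster_size=10, metric='precomputed')
labels = hdb.fit_predict(distance_matrix)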

I'll see if I can hunt that down. Sorry about the issue, but I'm glad that at least casting can provide a workaround for now.

FYI, same problem as of 8/6/2019 with integer input and metric='precomputed'. Casting to float64 fixed it.
