Hi,
first of all, thanks for this nice algorithm.
I managed to use it without problems with `min_cluster_size` up to 1000, but when I try larger sizes (2000 or 3000) I end up with a strange error, which I paste below. I must admit I have absolutely no clue what I might be doing wrong. Apparently at some point an integer is created that is out of bounds.
I need large cluster sizes since I perform some measurements on the clusters that require fairly high statistics to be meaningful.
Is this an intrinsic limitation of the algorithm, or a bug?
Thanks for your help,
Andrea.
MaybeEncodingError Traceback (most recent call last)
<ipython-input-113-ee5141ffadc3> in <module>()
5
6 scanner = hdbscan.HDBSCAN(min_cluster_size=3000,core_dist_n_jobs=8)
----> 7 labels_=scanner.fit_predict(myMatrix[[c for c in myMatrix.columns if c not in ['T','E']]].as_matrix())
8
9 labels=pd.Series(labels_,index=myMatrix.index)
~/Envs/DSpy3/lib/python3.5/site-packages/hdbscan/hdbscan_.py in fit_predict(self, X, y)
884 cluster labels
885 """
--> 886 self.fit(X)
887 return self.labels_
888
~/Envs/DSpy3/lib/python3.5/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
862 self._condensed_tree,
863 self._single_linkage_tree,
--> 864 self._min_spanning_tree) = hdbscan(X, **kwargs)
865
866 if self.prediction_data:
~/Envs/DSpy3/lib/python3.5/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
589 approx_min_span_tree,
590 gen_min_span_tree,
--> 591 core_dist_n_jobs, **kwargs)
592 else: # Metric is a valid BallTree metric
593 # TO DO: Need heuristic to decide when to go to boruvka;
~/Envs/DSpy3/lib/python3.5/site-packages/sklearn/externals/joblib/memory.py in __call__(self, *args, **kwargs)
281
282 def __call__(self, *args, **kwargs):
--> 283 return self.func(*args, **kwargs)
284
285 def call_and_shelve(self, *args, **kwargs):
~/Envs/DSpy3/lib/python3.5/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
285 leaf_size=leaf_size // 3,
286 approx_min_span_tree=approx_min_span_tree,
--> 287 n_jobs=core_dist_n_jobs, **kwargs)
288 min_spanning_tree = alg.spanning_tree()
289 # Sort edges of the min_spanning_tree by weight
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__ (hdbscan/_hdbscan_boruvka.c:5195)()
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds (hdbscan/_hdbscan_boruvka.c:5915)()
~/Envs/DSpy3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
766 # consumption.
767 self._iterating = False
--> 768 self.retrieve()
769 # Make sure that we get a last message telling us we are done
770 elapsed_time = time.time() - self._start_time
~/Envs/DSpy3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
717 ensure_ready = self._managed_backend
718 backend.abort_everything(ensure_ready=ensure_ready)
--> 719 raise exception
720
721 def __call__(self, iterable):
~/Envs/DSpy3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
680 # check if timeout supported in backend future implementation
681 if 'timeout' in getfullargspec(job.get).args:
--> 682 self._output.extend(job.get(timeout=self.timeout))
683 else:
684 self._output.extend(job.get())
/usr/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
606 return self._value
607 else:
--> 608 raise self._value
609
610 def _set(self, i, obj):
MaybeEncodingError: Error sending result: '[(array([[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
1.41421356e+00, 1.41421356e+00, 1.41421356e+00],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
1.00122462e+00, 1.00122462e+00, 1.00122462e+00],
[ 0.00000000e+00, 0.00000000e+00, 1.00000000e-04, ...,
1.16887719e+00, 1.16887719e+00, 1.16887719e+00],
...,
[ 0.00000000e+00, 9.90149509e-03, 9.90149509e-03, ...,
1.44791749e+00, 1.44791749e+00, 1.44792649e+00],
[ 0.00000000e+00, 9.90099010e-03, 9.90099010e-03, ...,
1.42667049e+00, 1.42667049e+00, 1.42667049e+00],
[ 0.00000000e+00, 0.00000000e+00, 2.00000000e-04, ...,
1.05598088e+00, 1.05598088e+00, 1.05598088e+00]]), array([[ 0, 127921, 67104, ..., 173516, 104788, 104348],
[ 1, 137202, 152209, ..., 132662, 180350, 250154],
[ 2, 149585, 29951, ..., 66168, 185454, 50577],
...,
[ 70358, 134531, 9396, ..., 219884, 230679, 8889],
[ 70359, 230328, 230747, ..., 255354, 6058, 27905],
[154623, 70360, 156591, ..., 238950, 105002, 130022]]))]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'
I think the issue is in the parallel computation of core distances for points, where there is too much information getting passed through. Ultimately, though, the issue here has more to do with the `min_samples` parameter value (hdbscan doesn't cope well with very large `min_samples` values -- I need to add that to the FAQ at some point). The trick is that, if unspecified, `min_samples` is set to the same value as `min_cluster_size`. I would suggest that you specify it separately in this case -- that large a `min_samples` value is probably problematic. So you could use something more like:
scanner = hdbscan.HDBSCAN(min_cluster_size=3000, min_samples=100, core_dist_n_jobs=8)
And that may work better. Let me know if that helps at all.
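To spell out the default behaviour described above (a small sketch of the two configurations, not hdbscan internals verbatim):

```python
import hdbscan

# With min_samples unspecified, hdbscan falls back to min_cluster_size when
# fitting, so this effectively runs with min_samples=3000 ...
implicit = hdbscan.HDBSCAN(min_cluster_size=3000, core_dist_n_jobs=8)

# ... whereas setting it explicitly keeps the core-distance neighbour arrays
# (the data shipped between worker processes) much smaller.
explicit = hdbscan.HDBSCAN(min_cluster_size=3000, min_samples=100, core_dist_n_jobs=8)
```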
Hi, just to report that your suggestion worked :)
I also had this problem, and the suggestion by @lmcinnes solved the issue.
The problem is only partly due to the behaviour of HDBSCAN's `min_samples` parameter. The `multiprocessing` backend of `joblib` struggles with making `pickle` dumps of large arrays, see https://bugs.python.org/issue17560 .
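As an aside on where the error text comes from: I believe the `'i'` in the message is the signed 32-bit length header that Python 3.5's `multiprocessing` packs in front of each pickled result, so any result over about 2 GiB overflows it. A tiny illustration of that limit, independent of hdbscan:

```python
import struct

# multiprocessing's connection code (on Python 3.5) prefixes each pickled
# payload with a signed 32-bit length, packed using the 'i' format.
struct.pack('!i', 2**31 - 1)  # the largest payload size that still fits

# Anything larger reproduces the message from the traceback above:
# struct.error: 'i' format requires -2147483648 <= number <= 2147483647
struct.pack('!i', 2**31)
```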
If - for whatever reason - you are keen to have `min_samples` be a large number, one solution is to change the backend of `joblib`. One can, for example, use the `dask.distributed` backend. This, I believe, bypasses `pickle`, and thus avoids the `MaybeEncodingError` above. A word of warning: this solution makes intensive use of your machine's RAM.
Instructions on how to change the backend of `joblib` are found here. A good example for the `dask.distributed` backend is also found here.
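A rough sketch of what that can look like, assuming a reasonably recent `dask.distributed` and an hdbscan build that routes its parallelism through the standalone `joblib` (older builds use the copy vendored in `sklearn.externals.joblib`, so the exact import may differ):

```python
import numpy as np
import hdbscan
import joblib
from dask.distributed import Client

# Start a local dask scheduler and workers; as noted above, this can be
# very RAM-hungry because intermediate results live in worker memory.
client = Client()

# Hypothetical stand-in for the real feature matrix.
X = np.random.rand(300000, 10)

# Route joblib's parallel calls through dask instead of the default
# multiprocessing backend, side-stepping the oversized pickle transfer.
with joblib.parallel_backend('dask'):
    scanner = hdbscan.HDBSCAN(min_cluster_size=3000, core_dist_n_jobs=8)
    labels = scanner.fit_predict(X)
```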