Hdbscan: MaybeEncodingError with large min_cluster_size

Created on 14 Jun 2017 · 3 Comments · Source: scikit-learn-contrib/hdbscan

Hi,

first of all, thanks for this nice algorithm.

I managed to use it without problems with min_cluster_size up to 1000, but when I try larger sizes (2000 or 3000) I end up with a strange error, which I paste below. I must admit I have absolutely no clue what I may be doing wrong. Apparently, at some point an integer is created that is out of bounds.

I need large cluster sizes since I perform some measurements on the clusters that require fairly high statistics to be meaningful.

Is this an intrinsic limitation of the algorithm, or a bug?

Thanks for your help,

Andrea.

MaybeEncodingError                        Traceback (most recent call last)
<ipython-input-113-ee5141ffadc3> in <module>()
      5 
      6 scanner =  hdbscan.HDBSCAN(min_cluster_size=3000,core_dist_n_jobs=8)
----> 7 labels_=scanner.fit_predict(myMatrix[[c for c in myMatrix.columns if c not in ['T','E']]].as_matrix())
      8 
      9 labels=pd.Series(labels_,index=myMatrix.index)

~/Envs/DSpy3/lib/python3.5/site-packages/hdbscan/hdbscan_.py in fit_predict(self, X, y)
    884             cluster labels
    885         """
--> 886         self.fit(X)
    887         return self.labels_
    888 

~/Envs/DSpy3/lib/python3.5/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
    862          self._condensed_tree,
    863          self._single_linkage_tree,
--> 864          self._min_spanning_tree) = hdbscan(X, **kwargs)
    865 
    866         if self.prediction_data:

~/Envs/DSpy3/lib/python3.5/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    589                                              approx_min_span_tree,
    590                                              gen_min_span_tree,
--> 591                                              core_dist_n_jobs, **kwargs)
    592         else:  # Metric is a valid BallTree metric
    593             # TO DO: Need heuristic to decide when to go to boruvka;

~/Envs/DSpy3/lib/python3.5/site-packages/sklearn/externals/joblib/memory.py in __call__(self, *args, **kwargs)
    281 
    282     def __call__(self, *args, **kwargs):
--> 283         return self.func(*args, **kwargs)
    284 
    285     def call_and_shelve(self, *args, **kwargs):

~/Envs/DSpy3/lib/python3.5/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
    285                                  leaf_size=leaf_size // 3,
    286                                  approx_min_span_tree=approx_min_span_tree,
--> 287                                  n_jobs=core_dist_n_jobs, **kwargs)
    288     min_spanning_tree = alg.spanning_tree()
    289     # Sort edges of the min_spanning_tree by weight

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__ (hdbscan/_hdbscan_boruvka.c:5195)()

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds (hdbscan/_hdbscan_boruvka.c:5915)()

~/Envs/DSpy3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    766                 # consumption.
    767                 self._iterating = False
--> 768             self.retrieve()
    769             # Make sure that we get a last message telling us we are done
    770             elapsed_time = time.time() - self._start_time

~/Envs/DSpy3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
    717                     ensure_ready = self._managed_backend
    718                     backend.abort_everything(ensure_ready=ensure_ready)
--> 719                 raise exception
    720 
    721     def __call__(self, iterable):

~/Envs/DSpy3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
    680                 # check if timeout supported in backend future implementation
    681                 if 'timeout' in getfullargspec(job.get).args:
--> 682                     self._output.extend(job.get(timeout=self.timeout))
    683                 else:
    684                     self._output.extend(job.get())

/usr/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
    606             return self._value
    607         else:
--> 608             raise self._value
    609 
    610     def _set(self, i, obj):

MaybeEncodingError: Error sending result: '[(array([[  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          1.41421356e+00,   1.41421356e+00,   1.41421356e+00],
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          1.00122462e+00,   1.00122462e+00,   1.00122462e+00],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e-04, ...,
          1.16887719e+00,   1.16887719e+00,   1.16887719e+00],
       ..., 
       [  0.00000000e+00,   9.90149509e-03,   9.90149509e-03, ...,
          1.44791749e+00,   1.44791749e+00,   1.44792649e+00],
       [  0.00000000e+00,   9.90099010e-03,   9.90099010e-03, ...,
          1.42667049e+00,   1.42667049e+00,   1.42667049e+00],
       [  0.00000000e+00,   0.00000000e+00,   2.00000000e-04, ...,
          1.05598088e+00,   1.05598088e+00,   1.05598088e+00]]), array([[     0, 127921,  67104, ..., 173516, 104788, 104348],
       [     1, 137202, 152209, ..., 132662, 180350, 250154],
       [     2, 149585,  29951, ...,  66168, 185454,  50577],
       ..., 
       [ 70358, 134531,   9396, ..., 219884, 230679,   8889],
       [ 70359, 230328, 230747, ..., 255354,   6058,  27905],
       [154623,  70360, 156591, ..., 238950, 105002, 130022]]))]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'

All 3 comments

I think the issue is in the parallel computation of core distances for points, where too much information is getting passed through. Ultimately, the issue here has more to do with the min_samples parameter value: hdbscan doesn't cope well with very large min_samples values (I need to add that to the FAQ at some point). The trick is that, if unspecified, min_samples is set to the same value as min_cluster_size. I would suggest specifying it separately in this case -- a min_samples value that large is probably problematic. So you could use something more like:

scanner = hdbscan.HDBSCAN(min_cluster_size=3000, min_samples=100, core_dist_n_jobs=8)

And that may work better. Let me know if that helps at all.
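
For context, a minimal sketch of the difference (X here is a placeholder for your feature matrix, and the parameter values are purely illustrative):

import hdbscan

# If min_samples is left unspecified, it defaults to min_cluster_size,
# so this computes core distances over 3000 neighbours per point:
scanner = hdbscan.HDBSCAN(min_cluster_size=3000, core_dist_n_jobs=8)

# Setting min_samples separately keeps the core-distance computation small
# while still requiring clusters of at least 3000 points:
scanner = hdbscan.HDBSCAN(min_cluster_size=3000, min_samples=100, core_dist_n_jobs=8)
labels = scanner.fit_predict(X)  # X: your (n_samples, n_features) array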

Hi, just to report that your suggestion worked :)

I also had this problem, and the suggestion by @lmcinnes solved the issue.

The problem is only partly due to the behaviour of HDBSCAN's min_samples parameter. The multiprocessing backend of joblib struggles with making pickle dumps of large arrays; see https://bugs.python.org/issue17560 .

If, for whatever reason, you are keen to use a large value for min_samples, one solution is to change the backend of joblib. One can, for example, use the dask.distributed backend. This, I believe, bypasses pickle and thus avoids the MaybeEncodingError above. A word of warning: this solution makes intensive use of your machine's RAM.

Instructions on how to change the backend of joblib are found here. A good example for the dask.distributed backend is also found here.
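
A minimal sketch of switching the backend, assuming a sufficiently recent joblib and dask.distributed (and that your hdbscan build uses the standalone joblib rather than the copy vendored in sklearn.externals, as older releases did); X is a placeholder for your feature matrix:

import hdbscan
from dask.distributed import Client
from joblib import parallel_backend

client = Client()  # start a local dask.distributed scheduler and workers

with parallel_backend('dask'):
    # Core distances are now computed through the dask backend instead of
    # multiprocessing, avoiding the large pickle transfers.
    scanner = hdbscan.HDBSCAN(min_cluster_size=3000, core_dist_n_jobs=8)
    labels = scanner.fit_predict(X)  # X: your feature matrix (placeholder)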
