hdbscan: Boruvka joblib error

Created on 4 Feb 2016  ·  12 comments  ·  Source: scikit-learn-contrib/hdbscan

hdbscan 0.6.5, sklearn 0.17.0
Calling HDBSCAN.fit() with algorithm='boruvka_kdtree' or algorithm='boruvka_balltree', I sometimes get the following error. It works fine with algorithm='prims_kdtree' or algorithm='prims_balltree'.

Traceback (most recent call last):
File "", line 1, in
File "c:\python2764\Lib\multiprocessing\forking.py", line 380, in main
prepare(preparation_data)
File "c:\python2764\Lib\multiprocessing\forking.py", line 495, in prepare
'parents_main', file, path_name, etc
...( references to my code calling HDBSCAN.fit() )...
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan_.py", line 531, in fit
self._min_spanning_tree) = hdbscan(X, self.get_params())
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan_.py", line 363, in hdbscan
gen_min_span_tree)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\memory.py", line 283, in __call__
return self.func(*args, **kwargs)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan_.py", line 163, in _hdbscan_boruvka_kdtree
alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric, leaf_size=leaf_size // 3)
File "hdbscan/_hdbscan_boruvka.pyx", line 335, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__ (hdbscan_hdbscan_boruvka.c:4746)
File "hdbscan/_hdbscan_boruvka.pyx", line 364, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds (hdbscan_hdbscan_boruvka.c:5401)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 771, in __call__
n_jobs = self._initialize_pool()
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 518, in _initialize_pool
raise ImportError('[joblib] Attempting to do parallel computing '
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

bug

All 12 comments

I believe this is a Windows related error, and unfortunately my access to Windows systems to test and debug on is very limited. I'll try to get to this when I can, but I can't promise a swift resolution in this case.

It seems this is a known issue with joblib on Windows ... you'll see similar problems with the scikit-learn RandomForest if you specify n_jobs. You need to wrap any part of your code that isn't function definitions or imports in an

if __name__ == "__main__":

to have it work properly. See http://pythonhosted.org/joblib/parallel.html for details, or https://www.mail-archive.com/[email protected]/msg08093.html for examples of other ways this error can crop up.
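For concreteness, the guard pattern can be sketched with the standard library's multiprocessing module (used here instead of hdbscan so the example is self-contained; joblib spawns worker processes the same way on platforms without fork):

```python
import multiprocessing as mp

def square(x):
    # Work executed in a worker process.
    return x * x

if __name__ == "__main__":
    # Without this guard, each spawned worker re-imports the module,
    # re-executes the pool creation, and tries to spawn workers of its
    # own -- the recursion that joblib's ImportError protects against
    # on Windows (and under the 'spawn' start method generally).
    with mp.Pool(2) as pool:
        results = pool.map(square, range(5))
    print(results)  # [0, 1, 4, 9, 16]
```

Any top-level code that creates worker pools, including a script-level call to HDBSCAN.fit() with one of the Boruvka algorithms, belongs under the same guard.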

Indeed this solves the issue; however, maybe you could allow the user to run with the equivalent of n_jobs=1?

Sorry for the long delay. Getting started on this now. In master you can set core_dist_n_jobs=1 to achieve this. This should appear in the next release.

I see this issue on macOS in a Jupyter notebook working with scikit-learn and multicore processing. This MWE tickles the issue:

import numpy as np, numpy.random as npr
from sklearn import cluster

data = npr.poisson(1, (100, 10))
algorithm = cluster.KMeans
algorithm_kwargs = dict(n_clusters=4, n_jobs=-1)
estimator = algorithm(**algorithm_kwargs)
labels = estimator.fit_predict(data)

The code runs as expected when executed as a plain Python script.

However, under a Jupyter notebook, it throws this error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-64-5a0071cf17dc> in <module>()
      3 algorithm_kwargs = dict(n_clusters=4,n_jobs=-1)
      4 estimator = algorithm(**algorithm_kwargs)
----> 5 labels = estimator.fit_predict(data)

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in fit_predict(self, X, y)
    915             Index of the cluster each sample belongs to.
    916         """
--> 917         return self.fit(X).labels_
    918 
    919     def fit_transform(self, X, y=None):

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y)
    894                 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
    895                 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 896                 return_n_iter=True)
    897         return self
    898 

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in k_means(X, n_clusters, init, precompute_distances, n_init, max_iter, verbose, tol, random_state, copy_x, n_jobs, algorithm, return_n_iter)
    361                                    # Change seed to ensure variety
    362                                    random_state=seed)
--> 363             for seed in seeds)
    364         # Get results with the lowest inertia
    365         labels, inertia, centers, n_iters = zip(*results)

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    747         self._aborting = False
    748         if not self._managed_backend:
--> 749             n_jobs = self._initialize_backend()
    750         else:
    751             n_jobs = self._effective_n_jobs()

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _initialize_backend(self)
    545         try:
    546             n_jobs = self._backend.configure(n_jobs=self.n_jobs, parallel=self,
--> 547                                              **self._backend_args)
    548             if self.timeout is not None and not self._backend.supports_timeout:
    549                 warnings.warn(

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in configure(self, n_jobs, parallel, **backend_args)
    303         if already_forked:
    304             raise ImportError(
--> 305                 '[joblib] Attempting to do parallel computing '
    306                 'without protecting your import on a system that does '
    307                 'not support forking. To use parallel-computing in a '

ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

I'm not sure if this is an issue with joblib or some downstream issue with Jupyter, but it is a problem.

The very same issue arises in HDBSCAN, e.g. the following alternate code tickles the issue:

import hdbscan

algorithm = hdbscan.HDBSCAN
algorithm_kwargs = dict(min_cluster_size=10,allow_single_cluster=True)

I think this is an upstream issue with joblib, but thanks for reporting. I'll try to look into it when I can get some time and ensure that it is upstream, and check if there is a way to workaround the issue internally in hdbscan.

Thanks. Or possibly upstream with Jupyter—I didn’t have this issue until recently, after upgrading my stack. Whether it’s Jupyter or joblib, joblib’s error message is borked.

Why is this closed? It is still an issue with the latest joblib (0.12) when running in Jupyter on a Mac. It works fine for a while, then randomly starts failing, forcing me to restart the whole kernel and lose a lot of work.

The original issue (relating to windows) was closed, and this was never re-opened. I haven't been able to reproduce it myself, and as far as I can tell it is a joblib issue that I can't do much about (I don't pretend to know or understand joblib well enough to suggest a fix).

Ah yes I need to post this on the joblib repo, assuming they have one

Thanks!

On Fri, Jun 29, 2018 at 1:47 PM Simon Hughes notifications@github.com
wrote:

Posted here: joblib/joblib#709
https://github.com/joblib/joblib/issues/709


