hdbscan: Boruvka joblib error

Created on 4 Feb 2016  ·  12 comments  ·  Source: scikit-learn-contrib/hdbscan

hdbscan 0.6.5, sklearn 0.17.0
Calling HDBSCAN.fit() with algorithm='boruvka_kdtree' or algorithm='boruvka_balltree', I sometimes get the following error. It works fine with algorithm='prims_kdtree' or algorithm='prims_balltree'.

Traceback (most recent call last):
File "", line 1, in
File "c:\python2764\Lib\multiprocessing\forking.py", line 380, in main
prepare(preparation_data)
File "c:\python2764\Lib\multiprocessing\forking.py", line 495, in prepare
'parents_main', file, path_name, etc
...( references to my code calling HDBSCAN.fit() )...
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan_.py", line 531, in fit
self._min_spanning_tree) = hdbscan(X, self.get_params())
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan_.py", line 363, in hdbscan
gen_min_span_tree)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\memory.py", line 283, in __call__
return self.func(*args, **kwargs)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan_.py", line 163, in _hdbscan_boruvka_kdtree
alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric, leaf_size=leaf_size // 3)
File "hdbscan/_hdbscan_boruvka.pyx", line 335, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__ (hdbscan_hdbscan_boruvka.c:4746)
File "hdbscan/_hdbscan_boruvka.pyx", line 364, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds (hdbscan_hdbscan_boruvka.c:5401)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 771, in __call__
n_jobs = self._initialize_pool()
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 518, in _initialize_pool
raise ImportError('[joblib] Attempting to do parallel computing '
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

bug

All 12 comments

I believe this is a Windows related error, and unfortunately my access to Windows systems to test and debug on is very limited. I'll try to get to this when I can, but I can't promise a swift resolution in this case.

It seems this is a known issue with joblib on Windows ... you'll see similar problems with the scikit-learn RandomForest if you specify n_jobs. You need to wrap any part of your code that isn't function definitions or imports in an

if __name__ == "__main__":

to have it work properly. See http://pythonhosted.org/joblib/parallel.html for details, or https://www.mail-archive.com/[email protected]/msg08093.html for examples of other ways this error can crop up.
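For concreteness, the guard pattern can be sketched with the standard library's multiprocessing module (used here instead of hdbscan so the example is self-contained; joblib spawns worker processes the same way on platforms without fork):

```python
import multiprocessing as mp

def square(x):
    # Work executed in a worker process.
    return x * x

if __name__ == "__main__":
    # Without this guard, each spawned worker re-imports the module,
    # re-executes the pool creation, and tries to spawn workers of its
    # own -- the recursion that joblib's ImportError protects against
    # on Windows (and under the 'spawn' start method generally).
    with mp.Pool(2) as pool:
        results = pool.map(square, range(5))
    print(results)  # [0, 1, 4, 9, 16]
```

Any top-level code that creates worker pools, including a script-level call to HDBSCAN.fit() with one of the Boruvka algorithms, belongs under the same guard.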

Indeed this solves the issue; however, maybe you could allow the user to run with the equivalent of n_jobs=1?

Sorry for the long delay. Getting started on this now. In master you can set core_dist_n_jobs=1 to achieve this. This should appear in the next release.

I see this issue on macOS in a Jupyter notebook working with scikit-learn and multicore processing. This MWE tickles the issue:

import numpy as np, numpy.random as npr
from sklearn import cluster

data = npr.poisson(1, (100, 10))
algorithm = cluster.KMeans
algorithm_kwargs = dict(n_clusters=4, n_jobs=-1)
estimator = algorithm(**algorithm_kwargs)
labels = estimator.fit_predict(data)

The code runs as expected when executed as a plain Python script.

However, under a Jupyter notebook, it throws this error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-64-5a0071cf17dc> in <module>()
      3 algorithm_kwargs = dict(n_clusters=4,n_jobs=-1)
      4 estimator = algorithm(**algorithm_kwargs)
----> 5 labels = estimator.fit_predict(data)

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in fit_predict(self, X, y)
    915             Index of the cluster each sample belongs to.
    916         """
--> 917         return self.fit(X).labels_
    918 
    919     def fit_transform(self, X, y=None):

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y)
    894                 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
    895                 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 896                 return_n_iter=True)
    897         return self
    898 

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in k_means(X, n_clusters, init, precompute_distances, n_init, max_iter, verbose, tol, random_state, copy_x, n_jobs, algorithm, return_n_iter)
    361                                    # Change seed to ensure variety
    362                                    random_state=seed)
--> 363             for seed in seeds)
    364         # Get results with the lowest inertia
    365         labels, inertia, centers, n_iters = zip(*results)

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    747         self._aborting = False
    748         if not self._managed_backend:
--> 749             n_jobs = self._initialize_backend()
    750         else:
    751             n_jobs = self._effective_n_jobs()

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _initialize_backend(self)
    545         try:
    546             n_jobs = self._backend.configure(n_jobs=self.n_jobs, parallel=self,
--> 547                                              **self._backend_args)
    548             if self.timeout is not None and not self._backend.supports_timeout:
    549                 warnings.warn(

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in configure(self, n_jobs, parallel, **backend_args)
    303         if already_forked:
    304             raise ImportError(
--> 305                 '[joblib] Attempting to do parallel computing '
    306                 'without protecting your import on a system that does '
    307                 'not support forking. To use parallel-computing in a '

ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

I'm not sure if this is an issue with joblib or some downstream issue with Jupyter, but it is a problem.

The very same issue arises in HDBSCAN, e.g. the following alternate code tickles the issue:

import hdbscan

algorithm = hdbscan.HDBSCAN
algorithm_kwargs = dict(min_cluster_size=10,allow_single_cluster=True)

I think this is an upstream issue with joblib, but thanks for reporting. I'll try to look into it when I can get some time and ensure that it is upstream, and check if there is a way to workaround the issue internally in hdbscan.

Thanks. Or possibly upstream with Jupyter—I didn’t have this issue until recently, after upgrading my stack. Whether it’s Jupyter or joblib, joblib’s error message is borked.

Why is this closed? It is still an issue with the latest joblib (0.12) when running in Jupyter on a Mac. It works fine for a while, then randomly starts failing, forcing me to restart the whole kernel and lose a lot of work.

The original issue (relating to windows) was closed, and this was never re-opened. I haven't been able to reproduce it myself, and as far as I can tell it is a joblib issue that I can't do much about (I don't pretend to know or understand joblib well enough to suggest a fix).

Ah yes I need to post this on the joblib repo, assuming they have one

Thanks!

On Fri, Jun 29, 2018 at 1:47 PM Simon Hughes notifications@github.com
wrote:

Posted here: joblib/joblib#709
https://github.com/joblib/joblib/issues/709


