Failed to create the local cluster. I got a different cases of error when I put the extra worker argument or not.
When I run following code:
from dask.distributed import LocalCluster
cluster = LocalCluster()
Three kinds of error string are shown and then failed to create the local cluster.
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>
Traceback (most recent call last):
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/nanny.py", line 414, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <zmq.eventloop.ioloop.ZMQIOLoop object at 0x7fb340c0df90>>, <Task finished coro=<Nanny._on_exit() done, defined at /home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/nanny.py:440> exception=TypeError('addresses should be strings or tuples, got None')>)
Traceback (most recent call last):
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/nanny.py", line 443, in _on_exit
await self.scheduler.unregister(address=self.worker_address)
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/core.py", line 861, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/core.py", line 660, in send_recv
raise exc.with_traceback(tb)
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/core.py", line 513, in handle_comm
result = await result
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 2208, in remove_worker
address = self.coerce_address(address)
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 4946, in coerce_address
raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.utils - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/utils.py", line 656, in log_errors
yield
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 2208, in remove_worker
address = self.coerce_address(address)
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 4946, in coerce_address
raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.core - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/core.py", line 513, in handle_comm
result = await result
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 2208, in remove_worker
address = self.coerce_address(address)
File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 4946, in coerce_address
raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
http://paste.debian.net/1158552/
When I put extra worker parameter local_directory into LocalCluster() like following code:
from dask.distributed import LocalCluster
cluster = LocalCluster(local_directory="/tmp/dask-worker-space")
Now only the Error type A above shown and then failed to create the local cluster.
http://paste.debian.net/1158551/
from dask.distributed import LocalCluster
cluster = LocalCluster()
from dask.distributed import LocalCluster
cluster = LocalCluster(local_directory="/tmp/dask-worker-space")
2.20.0
(jupyter) idblab@debian-20200402:~$ conda list
# packages in environment at /home/idblab/anaconda3/envs/jupyter:
#
# Name Version Build Channel
dask 2.20.0 py_0
dask-core 2.20.0 py_0
distributed 2.20.0 py37_0
Python 3.7.7Linux debian-20200402 5.7.0-2-amd64 #1 SMP Debian 5.7.10-1 (2020-07-26) x86_64 GNU/Linuxconda install dask distributed -c conda-forgeThank you for raising this issue @jmkim . I'm trying to reproduce this problem and running into some questions.
I'm curious, are you running this in Jupyter when it fails? If so, can you try running this in ipython and see if it also fails there?
Also, does this problem persist if you upgrade to 2.22.0 ?
Thank you for closing @jmkim . I'm curious, did you resolve your issue? If so, do you know what made things work for you?
I mis-closed the issue, sorry @mrocklin ! I am re-opening the issue now.
Thank you for your fast reply and sorry for my late feedback.
Now I tried with these versions:
(jupyter) idblab@debian-20200402:~$ conda list
# packages in environment at /home/idblab/anaconda3/envs/jupyter:
#
# Name Version Build Channel
dask 2.22.0 py_0 conda-forge
dask-core 2.22.0 py_0 conda-forge
distributed 2.22.0 py37hc8dfbb8_0 conda-forge
Just with iPython or Jupyter Lab works very well without any errors.
However, I am using Google Colab, with local runtime (Jupyter Lab backend) and here the errors are occurred.
When I create a Python 3 kernel in Jupyter Lab directly, those above codes works well, but when I create a kernel to Jupyter Lab using Google Colab, it shows the error like above.
I am not familiar at all with connecting to local runtimes, but I am wondering if this is due to google colab running with an old version of distributed (1.25.3 if you run import distributed; distributed.__version__) and tornado (5.1.1).
Having a look at this, apparently it is also using their own shells and kernels
You might try to check if the interactive shell you get is as expected:
import IPython
a = IPython.get_ipython()
print(a)
print(a.kernel)
On my local jupyter this returns:
<ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f71c8b44690>
<ipykernel.ipkernel.IPythonKernel object at 0x7f71c8074390>
While on google colab
<google.colab._shell.Shell object at 0x7ff372caef98>
<google.colab._kernel.Kernel object at 0x7ff372caee80>
As I have no idea what is different about these shells and kernels there might be some code in there that might not be playing nice with distributed.
Also, having another look at the notebook I linked in this dask issue: It now has the correct tornado, but requires a cloudpickle>=1.5.
In this example notebook I am able to install distributed and run the LocalCluster on google colab itself. So I suspect any leftover issue might be in the interaction between your kernel and google colab.
Hi, thank you!
import distributed; distributed.__version__
Showes me:
'2.22.0'
And,
import IPython
a = IPython.get_ipython()
print(a)
print(a.kernel)
Shows me:
<ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f8e414c1a90>
<ipykernel.ipkernel.IPythonKernel object at 0x7f8e3fed83d0>
Alright, so IPython should run within your kernel.
My last guess would be that maybe the way that processes are spawned trough your interface might be broken.
Does LocalCluster(processes=False) work?
If that doesn't solve your issue, I am out of ideas to further debug this. So hopefully someone else then has an idea how to continue on this.
Thank you @sroet for helping to debug this.
If these suggestions don't work I wonder if there is some way to ask folks at Google Colab how their environments differ from vanilla Jupyter? Unfortunately with proprietary systems like Colab it's hard for us to test and understand issues. It's much easier with open systems like Jupyter.
If processes=False does not work I would advice you to raise an issue on this repo
Because:
The only thing that is left as a possible origin of the failure is: The proxy that lets Colab connect to your local kernel.
Therefor I would advice you to raise an issue on the repo that hosts the code that is used to establish the connection.
@jmkim Just checking in, did you manage to resolve or make progress on this issue?
Me still have an intent to solve this issue, but because of me in hurry, I am temporarily using the Jupyter Lab instead of Google Colab. But will use Colab again in a week.
Let me follow your instruction in a week. Sorry for my delays.
I'm also getting all A, B, C, and D errors, but I'm running on HPC, not using Jupyter Notebook.
That is useful information. Thank you. If you have logs from one of the
workers that would be helpful.
On Tue, Sep 1, 2020 at 9:21 AM Jaeyoung Chun notifications@github.com
wrote:
I'm also getting all A, B, C, and D errors, but I'm running on HPC, not
using Jupyter Notebook.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/dask/distributed/issues/4007#issuecomment-684976373,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AACKZTBMN65ZJCZUSCWIFYDSDUNR5ANCNFSM4PRUAC5Q
.
Here's the stderr captured by LSF: stderr.txt.gz
This is how I set up LocalCluster:
worker_kwargs = {
"n_workers": n_workers,
"memory_limit": "64G",
"memory_target_fraction": 0.95,
"memory_spill_fraction": 0.99,
"memory_pause_fraction": False,
# "memory_terminate_fraction": False,
}
# do not kill worker at 95% memory level
dask.config.set({"distributed.worker.memory.terminate": False})
dask.config.set({"distributed.scheduler.allowed-failures": 50})
# setup Dask distributed client
cluster = LocalCluster(**worker_kwargs)
client = Client(cluster)
The first entry in those logs is this
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/site-packages/distributed/process.py", line 191, in _run
target(*args, **kwargs)
File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/site-packages/distributed/nanny.py", line 728, in _run
worker = Worker(**worker_kwargs)
File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/site-packages/distributed/worker.py", line 489, in __init__
os.makedirs(local_directory)
File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/os.py", line 223, in makedirs
mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/fscratch/chunj/rna-velocity/CMI/dask-worker-space'
Looking at recent code it looks like this line was changed
- os.makedirs(local_directory)
+ os.makedirs(local_directory, exist_ok=True)
Please upgrade Dask and distributed
conda install dask
or
pip install dask[complete] --upgrade
I'm using Dask 2.21.0. Looks like I can upgrade to 2.25.0, but are you saying the following exception caused all those A, B, C, and D errors later?
FileExistsError: [Errno 17] File exists: '/fscratch/chunj/rna-velocity/CMI/dask-worker-space'
That's my guess, yes. Is it easy for you to try?
Okay. Will try and let you know. Thanks.
So far I haven't seen this issue since the upgrade.
I'm glad to hear it. I'm going to close this for now in hopes that the issue is resolved. We'll reopen if this still persists for someone on latest release.