Distributed: Multiple different errors while creating the Dask LocalCluster

Created on 1 Aug 2020  Â·  20Comments  Â·  Source: dask/distributed

What happened

Failed to create the local cluster. I got a different cases of error when I put the extra worker argument or not.

Case 1: without worker parameter

When I run following code:

from dask.distributed import LocalCluster
cluster = LocalCluster()

Three kinds of error string are shown and then failed to create the local cluster.

Error type A

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>
Traceback (most recent call last):
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

Error type B

tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <zmq.eventloop.ioloop.ZMQIOLoop object at 0x7fb340c0df90>>, <Task finished coro=<Nanny._on_exit() done, defined at /home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/nanny.py:440> exception=TypeError('addresses should be strings or tuples, got None')>)
Traceback (most recent call last):
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/nanny.py", line 443, in _on_exit
    await self.scheduler.unregister(address=self.worker_address)
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/core.py", line 861, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/core.py", line 660, in send_recv
    raise exc.with_traceback(tb)
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/core.py", line 513, in handle_comm
    result = await result
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 2208, in remove_worker
    address = self.coerce_address(address)
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 4946, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None

Error type C

distributed.utils - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/utils.py", line 656, in log_errors
    yield
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 2208, in remove_worker
    address = self.coerce_address(address)
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 4946, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None

Error type D

distributed.core - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/core.py", line 513, in handle_comm
    result = await result
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 2208, in remove_worker
    address = self.coerce_address(address)
  File "/home/idblab/anaconda3/envs/jupyter/lib/python3.7/site-packages/distributed/scheduler.py", line 4946, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None

Full logs

http://paste.debian.net/1158552/

Case 2: with worker parameter

When I put extra worker parameter local_directory into LocalCluster() like following code:

from dask.distributed import LocalCluster
cluster = LocalCluster(local_directory="/tmp/dask-worker-space")

Now only the Error type A above shown and then failed to create the local cluster.

Full logs

http://paste.debian.net/1158551/

What you expected to happen

  • This is my first time for using Dask. I couldn't load Dask at first and never before, so I don't know how the expected behaviour.

Minimal Complete Verifiable Example

Case 1: without worker parameter

from dask.distributed import LocalCluster
cluster = LocalCluster()

Case 2: with worker parameter

from dask.distributed import LocalCluster
cluster = LocalCluster(local_directory="/tmp/dask-worker-space")

Anything else we need to know?

  • Potential related issue: #3955 (same with Error type A)

Environment

  • Dask version: 2.20.0
    (jupyter) idblab@debian-20200402:~$ conda list # packages in environment at /home/idblab/anaconda3/envs/jupyter: # # Name Version Build Channel dask 2.20.0 py_0 dask-core 2.20.0 py_0 distributed 2.20.0 py37_0
  • Python version: Python 3.7.7
  • Operating System: Linux debian-20200402 5.7.0-2-amd64 #1 SMP Debian 5.7.10-1 (2020-07-26) x86_64 GNU/Linux
  • Install method (conda, pip, source): conda install dask distributed -c conda-forge

All 20 comments

Thank you for raising this issue @jmkim . I'm trying to reproduce this problem and running into some questions.

I'm curious, are you running this in Jupyter when it fails? If so, can you try running this in ipython and see if it also fails there?

Also, does this problem persist if you upgrade to 2.22.0 ?

Thank you for closing @jmkim . I'm curious, did you resolve your issue? If so, do you know what made things work for you?

I mis-closed the issue, sorry @mrocklin ! I am re-opening the issue now.

Thank you for your fast reply and sorry for my late feedback.

Now I tried with these versions:

(jupyter) idblab@debian-20200402:~$ conda list
# packages in environment at /home/idblab/anaconda3/envs/jupyter:
#
# Name                    Version                   Build  Channel
dask                      2.22.0                     py_0    conda-forge
dask-core                 2.22.0                     py_0    conda-forge
distributed               2.22.0           py37hc8dfbb8_0    conda-forge

Just with iPython or Jupyter Lab works very well without any errors.

However, I am using Google Colab, with local runtime (Jupyter Lab backend) and here the errors are occurred.

When I create a Python 3 kernel in Jupyter Lab directly, those above codes works well, but when I create a kernel to Jupyter Lab using Google Colab, it shows the error like above.

I am not familiar at all with connecting to local runtimes, but I am wondering if this is due to google colab running with an old version of distributed (1.25.3 if you run import distributed; distributed.__version__) and tornado (5.1.1).

Having a look at this, apparently it is also using their own shells and kernels
You might try to check if the interactive shell you get is as expected:

import IPython
a = IPython.get_ipython()
print(a)
print(a.kernel)

On my local jupyter this returns:

<ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f71c8b44690>
<ipykernel.ipkernel.IPythonKernel object at 0x7f71c8074390>

While on google colab

<google.colab._shell.Shell object at 0x7ff372caef98>
<google.colab._kernel.Kernel object at 0x7ff372caee80>

As I have no idea what is different about these shells and kernels there might be some code in there that might not be playing nice with distributed.

Also, having another look at the notebook I linked in this dask issue: It now has the correct tornado, but requires a cloudpickle>=1.5.

In this example notebook I am able to install distributed and run the LocalCluster on google colab itself. So I suspect any leftover issue might be in the interaction between your kernel and google colab.

Hi, thank you!

import distributed; distributed.__version__

Showes me:

'2.22.0'

And,

import IPython
a = IPython.get_ipython()
print(a)
print(a.kernel)

Shows me:

<ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f8e414c1a90>
<ipykernel.ipkernel.IPythonKernel object at 0x7f8e3fed83d0>

Alright, so IPython should run within your kernel.

My last guess would be that maybe the way that processes are spawned trough your interface might be broken.

Does LocalCluster(processes=False) work?

If that doesn't solve your issue, I am out of ideas to further debug this. So hopefully someone else then has an idea how to continue on this.

Thank you @sroet for helping to debug this.

If these suggestions don't work I wonder if there is some way to ask folks at Google Colab how their environments differ from vanilla Jupyter? Unfortunately with proprietary systems like Colab it's hard for us to test and understand issues. It's much easier with open systems like Jupyter.

If processes=False does not work I would advice you to raise an issue on this repo

Because:

  1. You have confirmed your code runs on your local kernel
  2. The notebook in my previous comment show it runs on the Colab kernels as well (albeit with a force install)

The only thing that is left as a possible origin of the failure is: The proxy that lets Colab connect to your local kernel.
Therefor I would advice you to raise an issue on the repo that hosts the code that is used to establish the connection.

@jmkim Just checking in, did you manage to resolve or make progress on this issue?

Me still have an intent to solve this issue, but because of me in hurry, I am temporarily using the Jupyter Lab instead of Google Colab. But will use Colab again in a week.

Let me follow your instruction in a week. Sorry for my delays.

I'm also getting all A, B, C, and D errors, but I'm running on HPC, not using Jupyter Notebook.

That is useful information. Thank you. If you have logs from one of the
workers that would be helpful.

On Tue, Sep 1, 2020 at 9:21 AM Jaeyoung Chun notifications@github.com
wrote:

I'm also getting all A, B, C, and D errors, but I'm running on HPC, not
using Jupyter Notebook.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/dask/distributed/issues/4007#issuecomment-684976373,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AACKZTBMN65ZJCZUSCWIFYDSDUNR5ANCNFSM4PRUAC5Q
.

Here's the stderr captured by LSF: stderr.txt.gz

This is how I set up LocalCluster:

    worker_kwargs = {
        "n_workers": n_workers,
        "memory_limit": "64G",
        "memory_target_fraction": 0.95,
        "memory_spill_fraction": 0.99,
        "memory_pause_fraction": False,
        # "memory_terminate_fraction": False,
    }

    # do not kill worker at 95% memory level
    dask.config.set({"distributed.worker.memory.terminate": False})
    dask.config.set({"distributed.scheduler.allowed-failures": 50})

    # setup Dask distributed client
    cluster = LocalCluster(**worker_kwargs)
    client = Client(cluster)

The first entry in those logs is this

Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/site-packages/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/site-packages/distributed/nanny.py", line 728, in _run
    worker = Worker(**worker_kwargs)
  File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/site-packages/distributed/worker.py", line 489, in __init__
    os.makedirs(local_directory)
  File "/home/chunj/miniconda3/envs/seqc/lib/python3.7/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/fscratch/chunj/rna-velocity/CMI/dask-worker-space'

Looking at recent code it looks like this line was changed

-     os.makedirs(local_directory)
+    os.makedirs(local_directory, exist_ok=True)

Please upgrade Dask and distributed

conda install dask

or

pip install dask[complete] --upgrade

I'm using Dask 2.21.0. Looks like I can upgrade to 2.25.0, but are you saying the following exception caused all those A, B, C, and D errors later?

FileExistsError: [Errno 17] File exists: '/fscratch/chunj/rna-velocity/CMI/dask-worker-space'

That's my guess, yes. Is it easy for you to try?

Okay. Will try and let you know. Thanks.

So far I haven't seen this issue since the upgrade.

I'm glad to hear it. I'm going to close this for now in hopes that the issue is resolved. We'll reopen if this still persists for someone on latest release.

Was this page helpful?
0 / 5 - 0 ratings