Distributed: Unable to start a Client after update to latest dask

Created on 19 Jul 2017 · 18 Comments · Source: dask/distributed

@mrocklin suggested that I file an issue here.

In a restarted notebook I run:

import distributed
client = distributed.Client()

And get hundreds of these errors:

tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x7fc162990be0> after timeout
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
    future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.6/site-packages/distributed/nanny.py", line 300, in start
    yield self._wait_until_running()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda/lib/python3.6/site-packages/distributed/nanny.py", line 386, in _wait_until_running
    raise ValueError("Worker not started")
ValueError: Worker not started
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x7fc1629b82b0> after timeout
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
    future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.6/site-packages/distributed/nanny.py", line 300, in start
    yield self._wait_until_running()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda/lib/python3.6/site-packages/distributed/nanny.py", line 386, in _wait_until_running
    raise ValueError("Worker not started")
ValueError: Worker not started

I can reproduce this with:

docker run -it --rm quantumtinkerer/jupyter-research:latest bash

# create a new env or just use the current one where `distributed` is already installed
conda create --yes -n dask python=3.6 dask distributed
source activate dask

python
import distributed
c = distributed.Client()

The Docker image is based on jupyter/docker-stacks/base-notebook.

The weird thing is that it only happens on our server, where we run JupyterHub. When I try it on a different machine, there doesn't seem to be an issue.

Any idea on how I can debug this?

My first guess would be some networking issue. You might try the following:

  1. Use Client(processes=False) which will avoid networking issues entirely
  2. Try setting up a dask-scheduler and dask-worker processes manually to see if that produces more fine-grained error messages

Thanks for the suggestions.

Client(processes=False) works without any problems:

In [1]: from distributed import Client

In [2]: c = Client(processes=False)

In [3]: x = c.map(lambda x: x, range(10))

In [4]: x[0].result()
Out[4]: 0

Your second suggestion also works and doesn't produce any error messages:

tinkerer@831d7a9c2063:~$ dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO -   Scheduler at:     tcp://172.18.0.3:8786
distributed.scheduler - INFO -        http at:              0.0.0.0:9786
distributed.scheduler - INFO -       bokeh at:             0.0.0.0:54793
distributed.scheduler - INFO - Local Directory:    /tmp/scheduler-y3mgi77n
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://172.18.0.3:36821
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.18.0.3:36821
distributed.scheduler - INFO - Receive client connection: Client-0fc0b012-6c87-11e7-85c4-0242ac120003
distributed.scheduler - INFO - Receive client connection: Client-13d6ffe6-6c87-11e7-85c4-0242ac120003
tinkerer@831d7a9c2063:~$ dask-worker 172.18.0.3:8786
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.18.0.3:60427'
distributed.worker - INFO -       Start worker at:     tcp://172.18.0.3:36821
distributed.worker - INFO -              nanny at:           172.18.0.3:60427
distributed.worker - INFO -               http at:           172.18.0.3:36811
distributed.worker - INFO -              bokeh at:           172.18.0.3:8789
distributed.worker - INFO - Waiting to connect to:      tcp://172.18.0.3:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         48
distributed.worker - INFO -                Memory:                   80.94 GB
distributed.worker - INFO -       Local Directory:            worker-hovrtlmh
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:            tcp://172.18.0.3:8786
distributed.worker - INFO - -------------------------------------------------
In [1]: from distributed import Client

In [2]: c = Client('172.18.0.3:8786')

In [3]: x = c.map(lambda x: x, range(10))

In [4]: x[0].result()
Out[4]: 0

You might now try with addresses like localhost:8786 and 127.0.0.1:8786 which Client() might be using instead.

localhost:8786 and 127.0.0.1:8786 both work without problems too.

Perhaps it's a problem with starting processes within Python within your container? Honestly I'm not sure what else to check. You could try experimenting with LocalCluster directly (this is what Client() creates) to try to isolate the issue.

Maybe this is a more useful error message:

from distributed import LocalCluster
c = LocalCluster(n_workers=1)
ConnectionRefusedError                    Traceback (most recent call last)
<ipython-input-2-4b4cba0717b2> in <module>()
      1 from distributed import LocalCluster
----> 2 c = LocalCluster(n_workers=1)

/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in __init__(self, n_workers, threads_per_worker, processes, loop, start, ip, scheduler_port, silence_logs, diagnostics_port, services, worker_services, nanny, **worker_kwargs)
    114 
    115         if start:
--> 116             sync(self.loop, self._start, ip)
    117 
    118         clusters_to_close.add(self)

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    232         e.wait(1000000)
    233     if error[0]:
--> 234         six.reraise(*error[0])
    235     else:
    236         return result[0]

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    684         if value.__traceback__ is not tb:
    685             raise value.with_traceback(tb)
--> 686         raise value
    687 
    688 else:

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in f()
    221                 raise RuntimeError("sync() called from thread of running loop")
    222             yield gen.moment
--> 223             result[0] = yield make_coro()
    224         except Exception as exc:
    225             logger.exception(exc)

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053 
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in _start(self, ip)
    146         yield self._start_all_workers(
    147             self.n_workers, ncores=self.threads_per_worker,
--> 148             services=self.worker_services, **self.worker_kwargs)
    149 
    150         self.status = 'running'

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053 
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in _start_all_workers(self, n_workers, **kwargs)
    152     @gen.coroutine
    153     def _start_all_workers(self, n_workers, **kwargs):
--> 154         yield [self._start_worker(**kwargs) for i in range(n_workers)]
    155 
    156     @gen.coroutine

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053 
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in callback(f)
    826             for f in children:
    827                 try:
--> 828                     result_list.append(f.result())
    829                 except Exception as e:
    830                     if future.done():

/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in _start_worker(self, port, processes, death_timeout, **kwargs)
    171               death_timeout=death_timeout,
    172               silence_logs=self.silence_logs, **kwargs)
--> 173         yield w._start()
    174 
    175         self.workers.append(w)

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053 
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/distributed/nanny.py in _start(self, addr_or_port)
    133 
    134         logger.info('        Start Nanny at: %r', self.address)
--> 135         response = yield self.instantiate()
    136         if response == 'OK':
    137             assert self.worker_address

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053 
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/distributed/nanny.py in instantiate(self, comm)
    187             try:
    188                 yield gen.with_timeout(timedelta(seconds=self.death_timeout),
--> 189                                        self.process.start())
    190             except gen.TimeoutError:
    191                 yield self._close(timeout=self.death_timeout)

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053 
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/distributed/nanny.py in start(self)
    296         self.stopped = Event()
    297         self.status = 'starting'
--> 298         yield self.process.start()
    299         if self.status == 'starting':
    300             yield self._wait_until_running()

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053 
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/opt/conda/lib/python3.6/site-packages/distributed/process.py in _call_and_set_future(loop, future, func, *args, **kwargs)
     23 def _call_and_set_future(loop, future, func, *args, **kwargs):
     24     try:
---> 25         res = func(*args, **kwargs)
     26     except:
     27         # Tornado futures are not thread-safe, need to

/opt/conda/lib/python3.6/site-packages/distributed/process.py in _start()
    115 
    116         def _start():
--> 117             process.start()
    118             state.is_alive = True
    119             state.pid = process.pid

/opt/conda/lib/python3.6/multiprocessing/process.py in start(self)
    103                'daemonic processes are not allowed to have children'
    104         _cleanup()
--> 105         self._popen = self._Popen(self)
    106         self._sentinel = self._popen.sentinel
    107         _children.add(self)

/opt/conda/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
    289         def _Popen(process_obj):
    290             from .popen_forkserver import Popen
--> 291             return Popen(process_obj)
    292 
    293     class ForkContext(BaseContext):

/opt/conda/lib/python3.6/multiprocessing/popen_forkserver.py in __init__(self, process_obj)
     33     def __init__(self, process_obj):
     34         self._fds = []
---> 35         super().__init__(process_obj)
     36 
     37     def duplicate_for_child(self, fd):

/opt/conda/lib/python3.6/multiprocessing/popen_fork.py in __init__(self, process_obj)
     18         sys.stderr.flush()
     19         self.returncode = None
---> 20         self._launch(process_obj)
     21 
     22     def duplicate_for_child(self, fd):

/opt/conda/lib/python3.6/multiprocessing/popen_forkserver.py in _launch(self, process_obj)
     49             set_spawning_popen(None)
     50 
---> 51         self.sentinel, w = forkserver.connect_to_new_process(self._fds)
     52         util.Finalize(self, os.close, (self.sentinel,))
     53         with open(w, 'wb', closefd=True) as f:

/opt/conda/lib/python3.6/multiprocessing/forkserver.py in connect_to_new_process(self, fds)
     64             raise ValueError('too many fds')
     65         with socket.socket(socket.AF_UNIX) as client:
---> 66             client.connect(self._forkserver_address)
     67             parent_r, child_w = os.pipe()
     68             child_r, parent_w = os.pipe()

ConnectionRefusedError: [Errno 111] Connection refused

Indeed. It looks like multiprocessing's forkserver start method isn't working in your Docker container. You could try to diagnose this issue, or you could set the following in your ~/.dask/config.yaml file:

multiprocessing-method: fork
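
To help diagnose this independently of dask, one option is to check whether each multiprocessing start method can launch a child process at all inside the container. The following is a minimal sketch using only the standard library; the helper names `_child` and `start_method_works` are made up for illustration and are not part of dask or distributed. It only tests a single child process, so it would not reveal a failure that appears only when many workers start at once.

```python
import multiprocessing as mp


def _child(queue):
    # Runs in the child process; report back so the parent knows it started.
    queue.put("ok")


def start_method_works(method, timeout=10):
    """Return True if a child process can be started with `method`.

    `method` is a multiprocessing start method name such as "fork",
    "forkserver" or "spawn". Any exception (including a timeout while
    waiting for the child) is printed and treated as a failure.
    """
    try:
        ctx = mp.get_context(method)
        queue = ctx.Queue()
        proc = ctx.Process(target=_child, args=(queue,))
        proc.start()
        result = queue.get(timeout=timeout)
        proc.join(timeout)
        return result == "ok"
    except Exception as exc:
        print(f"{method}: failed with {exc!r}")
        return False


if __name__ == "__main__":
    # Save this as a script and run it inside the container.
    for method in ("fork", "forkserver", "spawn"):
        print(method, start_method_works(method))
```

If "forkserver" fails here while "fork" succeeds, the problem is likely in the container environment rather than in distributed itself.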

@basnijholt did the fix above work for you? Any thoughts on why your docker container might not support multiprocessing's forkserver context?

Hi @mrocklin, the config setting didn't fix the issue. Also, multiprocessing's forkserver seems to work fine outside of dask.

I don't know whether I made a mistake before or if it's because I rebuilt the image, but

from distributed import LocalCluster
c = LocalCluster(n_workers=1)

now always seems to work. However, when I remove the argument n_workers=1, I get the errors I posted above.

I really have no idea how to debug this further.

I found that the error message in https://github.com/dask/distributed/issues/1173 is the same as here. Also https://github.com/dask/distributed/issues/1176 looks like it is related.

Could it be related?

I am trying to find out where it went wrong.

I have dask installed from master, and when going from distributed 1.16.3 (which works without any problems!) to 1.17.0, I get this error message:

tinkerer@49db00b00191:~$ python -c "from distributed import Client; Client()"
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7efcf864c378>, <tornado.concurrent.Future object at 0x7efcf863d9e8>)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 626, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/tinkerer/distributed/distributed/deploy/local.py", line 205, in _stop_worker
    self.workers.remove(w)
ValueError: list.remove(x): x not in list
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7efcf85fd620>, <tornado.concurrent.Future object at 0x7efcf85ebcc0>)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 626, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/tinkerer/distributed/distributed/deploy/local.py", line 205, in _stop_worker
    self.workers.remove(w)
ValueError: list.remove(x): x not in list

distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:36245, Worker: tcp://127.0.0.1:34719
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:36836, Worker: tcp://127.0.0.1:42068
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:47697, Worker: tcp://127.0.0.1:47945
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:45307, Worker: tcp://127.0.0.1:41254
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:48735, Worker: tcp://127.0.0.1:50321
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:47502, Worker: tcp://127.0.0.1:59905
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:40890, Worker: tcp://127.0.0.1:58914
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:42038, Worker: tcp://127.0.0.1:52788
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:38179, Worker: tcp://127.0.0.1:59325
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:51669, Worker: tcp://127.0.0.1:58987
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:56143, Worker: tcp://127.0.0.1:36458
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:37794, Worker: tcp://127.0.0.1:44148
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:55608, Worker: tcp://127.0.0.1:42182
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:40623, Worker: tcp://127.0.0.1:53282
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:34862, Worker: tcp://127.0.0.1:47097
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:36134, Worker: tcp://127.0.0.1:58843
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:43778, Worker: tcp://127.0.0.1:46244
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:43686, Worker: tcp://127.0.0.1:58723
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:41067, Worker: tcp://127.0.0.1:46625
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:54479, Worker: tcp://127.0.0.1:42986
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:54245, Worker: tcp://127.0.0.1:58110
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:33818, Worker: tcp://127.0.0.1:49849
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:37342, Worker: tcp://127.0.0.1:35576
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:41074, Worker: tcp://127.0.0.1:46154
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:43502, Worker: tcp://127.0.0.1:37528
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:59109, Worker: tcp://127.0.0.1:59705
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:36833, Worker: tcp://127.0.0.1:58870
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:60854, Worker: tcp://127.0.0.1:42344
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:47033, Worker: tcp://127.0.0.1:42909
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:40017, Worker: tcp://127.0.0.1:60163
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:56715, Worker: tcp://127.0.0.1:51788
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:41448, Worker: tcp://127.0.0.1:34166
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:41054, Worker: tcp://127.0.0.1:49984
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:55017, Worker: tcp://127.0.0.1:35094
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:53124, Worker: tcp://127.0.0.1:39542
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:51436, Worker: tcp://127.0.0.1:41198
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:48015, Worker: tcp://127.0.0.1:58791
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:51187, Worker: tcp://127.0.0.1:45604
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:37403, Worker: tcp://127.0.0.1:51274
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:39201, Worker: tcp://127.0.0.1:38590
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:58616, Worker: tcp://127.0.0.1:43231
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:40599, Worker: tcp://127.0.0.1:60809
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:55720, Worker: tcp://127.0.0.1:60737
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:59056, Worker: tcp://127.0.0.1:52085
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7efcf8718b70>, <tornado.concurrent.Future object at 0x7efcf8709be0>)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 626, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/tinkerer/distributed/distributed/deploy/local.py", line 205, in _stop_worker
    self.workers.remove(w)
ValueError: list.remove(x): x not in list

That last error message is repeated many times until the kernel unblocks.

Then when I go to 1.17.1 these errors repeat and the kernel is blocked indefinitely:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tinkerer/distributed/distributed/core.py", line 424, in send_recv_from_rpc
    result = yield send_recv(comm=comm, op=key, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/tinkerer/distributed/distributed/core.py", line 310, in send_recv
    response = yield comm.read()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/tinkerer/distributed/distributed/comm/tcp.py", line 166, in read
    convert_stream_closed_error(e)
  File "/home/tinkerer/distributed/distributed/comm/tcp.py", line 106, in convert_stream_closed_error
    raise CommClosedError(str(exc))
distributed.comm.core.CommClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
    future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/tinkerer/distributed/distributed/core.py", line 427, in send_recv_from_rpc
    % (e, key,))
distributed.comm.core.CommClosedError: Stream is closed: while trying to call remote method 'register'
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x7f9638185e80> after timeout
Traceback (most recent call last):
  File "/home/tinkerer/distributed/distributed/comm/tcp.py", line 152, in read
    n_frames = yield stream.read_bytes(8)
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
tornado.iostream.StreamClosedError: Stream is closed

I hope this can give you a hint of where to look further.

EDIT

Actually I checked commit by commit and found that ef0c397865faa8c9ee68faa2e557d99933a9e066 is the last working commit.

I should have a bit of time to look at this starting tomorrow.

Sorry for the long delay. Working through a backlog of issues.

This worked fine for me:

(dask) tinkerer@061c3142985e:~$ python
Python 3.6.2 | packaged by conda-forge | (default, Jul 23 2017, 22:59:30) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import distributed
>>> c = distributed.Client()
>>> 

Taking a closer look at the more recent comments here now.

It looks like this error, ValueError: list.remove(x): x not in list, was resolved a while ago (though after 1.17.0).

I don't see anything in the commit that follows that commit, d1b4dee692e92de36fac036f15a86b30ac6bed4c, that would cause the behavior that you're seeing. I don't suppose you're able to help me find a different reproducible example?

@basnijholt thank you for access to your system. I tried updating to master with !pip install git+https://github.com/dask/distributed.git --upgrade and things seem to work now.

Yes, absolutely great!

I did try it with master last week.

Commit 7985689ad26e02f90d16494581d9729f49d9bbf2 seems to be the last non-working commit and 4fc2dbd2e22dae05efce5df854ba7f584e7961d6 fixes it.

Thanks a lot, I guess this issue is solved 👍

It's a bit concerning that we don't know exactly what happened, but yes, I'm happy that this is resolved as well. Thank you for reporting and again you have my apologies for the long delay.
