@mrocklin suggested that I file an issue here.
In a restarted notebook I run:
import distributed
client = distributed.Client()
And get hundreds of these errors:
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x7fc162990be0> after timeout
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/opt/conda/lib/python3.6/site-packages/distributed/nanny.py", line 300, in start
yield self._wait_until_running()
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
yielded = self.gen.send(value)
File "/opt/conda/lib/python3.6/site-packages/distributed/nanny.py", line 386, in _wait_until_running
raise ValueError("Worker not started")
ValueError: Worker not started
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x7fc1629b82b0> after timeout
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/opt/conda/lib/python3.6/site-packages/distributed/nanny.py", line 300, in start
yield self._wait_until_running()
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
yielded = self.gen.send(value)
File "/opt/conda/lib/python3.6/site-packages/distributed/nanny.py", line 386, in _wait_until_running
raise ValueError("Worker not started")
ValueError: Worker not started
I can reproduce this with:
docker run -it --rm quantumtinkerer/jupyter-research:latest bash
# create a new env or just use the current one where `distributed` is already installed
conda create --yes -n dask python=3.6 dask distributed
source activate dask
python
import distributed
c = distributed.Client()
The Docker image is based on jupyter/docker-stacks/base-notebook.
The weird thing is that it only happens on our server, where we run JupyterHub. When I try it on a different machine, there doesn't seem to be an issue.
Any idea on how I can debug this?
My first guess would be some networking issue. You might try the following:
Client(processes=False) which will avoid networking issues entirely
Thanks for the suggestions.
Client(processes=False) works without any problems:
In [1]: from distributed import Client
In [2]: c = Client(processes=False)
In [3]: x = c.map(lambda x: x, range(10))
In [4]: x[0].result()
Out[4]: 0
Using your second suggestion doesn't give any error messages and works:
tinkerer@831d7a9c2063:~$ dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Scheduler at: tcp://172.18.0.3:8786
distributed.scheduler - INFO - http at: 0.0.0.0:9786
distributed.scheduler - INFO - bokeh at: 0.0.0.0:54793
distributed.scheduler - INFO - Local Directory: /tmp/scheduler-y3mgi77n
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://172.18.0.3:36821
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.18.0.3:36821
distributed.scheduler - INFO - Receive client connection: Client-0fc0b012-6c87-11e7-85c4-0242ac120003
distributed.scheduler - INFO - Receive client connection: Client-13d6ffe6-6c87-11e7-85c4-0242ac120003
tinkerer@831d7a9c2063:~$ dask-worker 172.18.0.3:8786
distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.0.3:60427'
distributed.worker - INFO - Start worker at: tcp://172.18.0.3:36821
distributed.worker - INFO - nanny at: 172.18.0.3:60427
distributed.worker - INFO - http at: 172.18.0.3:36811
distributed.worker - INFO - bokeh at: 172.18.0.3:8789
distributed.worker - INFO - Waiting to connect to: tcp://172.18.0.3:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 48
distributed.worker - INFO - Memory: 80.94 GB
distributed.worker - INFO - Local Directory: worker-hovrtlmh
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.18.0.3:8786
distributed.worker - INFO - -------------------------------------------------
In [1]: from distributed import Client
In [2]: c = Client('172.18.0.3:8786')
In [3]: x = c.map(lambda x: x, range(10))
In [4]: x[0].result()
Out[4]: 0
You might now try with addresses like localhost:8786 and 127.0.0.1:8786 which Client() might be using instead.
localhost:8786 and 127.0.0.1:8786 both work without problems too.
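The same address check can be done without dask at all. The following is only a minimal sketch using the standard library: the helper name `can_connect` is hypothetical, and port 8786 is taken from the scheduler log above. It only tells you whether something accepts TCP connections on that address, not whether a healthy scheduler is behind it.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and tries each resolved address
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and name-resolution failures
        return False


if __name__ == "__main__":
    # 8786 is the scheduler port from the log above; adjust if yours differs
    for host in ("localhost", "127.0.0.1"):
        print(host, can_connect(host, 8786))
```

If both addresses report True while `Client()` still fails, plain TCP reachability is probably not the cause.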
Perhaps it's a problem with starting processes within Python within your container? Honestly I'm not sure what else to check. You could try experimenting with LocalCluster directly (this is what Client() creates) to try to isolate the issue.
Maybe this is a more useful error message:
from distributed import LocalCluster
c = LocalCluster(n_workers=1)
ConnectionRefusedError Traceback (most recent call last)
<ipython-input-2-4b4cba0717b2> in <module>()
1 from distributed import LocalCluster
----> 2 c = LocalCluster(n_workers=1)
/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in __init__(self, n_workers, threads_per_worker, processes, loop, start, ip, scheduler_port, silence_logs, diagnostics_port, services, worker_services, nanny, **worker_kwargs)
114
115 if start:
--> 116 sync(self.loop, self._start, ip)
117
118 clusters_to_close.add(self)
/opt/conda/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
232 e.wait(1000000)
233 if error[0]:
--> 234 six.reraise(*error[0])
235 else:
236 return result[0]
/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
684 if value.__traceback__ is not tb:
685 raise value.with_traceback(tb)
--> 686 raise value
687
688 else:
/opt/conda/lib/python3.6/site-packages/distributed/utils.py in f()
221 raise RuntimeError("sync() called from thread of running loop")
222 yield gen.moment
--> 223 result[0] = yield make_coro()
224 except Exception as exc:
225 logger.exception(exc)
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1061 if exc_info is not None:
1062 try:
-> 1063 yielded = self.gen.throw(*exc_info)
1064 finally:
1065 # Break up a reference to itself
/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in _start(self, ip)
146 yield self._start_all_workers(
147 self.n_workers, ncores=self.threads_per_worker,
--> 148 services=self.worker_services, **self.worker_kwargs)
149
150 self.status = 'running'
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1061 if exc_info is not None:
1062 try:
-> 1063 yielded = self.gen.throw(*exc_info)
1064 finally:
1065 # Break up a reference to itself
/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in _start_all_workers(self, n_workers, **kwargs)
152 @gen.coroutine
153 def _start_all_workers(self, n_workers, **kwargs):
--> 154 yield [self._start_worker(**kwargs) for i in range(n_workers)]
155
156 @gen.coroutine
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in callback(f)
826 for f in children:
827 try:
--> 828 result_list.append(f.result())
829 except Exception as e:
830 if future.done():
/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1061 if exc_info is not None:
1062 try:
-> 1063 yielded = self.gen.throw(*exc_info)
1064 finally:
1065 # Break up a reference to itself
/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in _start_worker(self, port, processes, death_timeout, **kwargs)
171 death_timeout=death_timeout,
172 silence_logs=self.silence_logs, **kwargs)
--> 173 yield w._start()
174
175 self.workers.append(w)
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1061 if exc_info is not None:
1062 try:
-> 1063 yielded = self.gen.throw(*exc_info)
1064 finally:
1065 # Break up a reference to itself
/opt/conda/lib/python3.6/site-packages/distributed/nanny.py in _start(self, addr_or_port)
133
134 logger.info(' Start Nanny at: %r', self.address)
--> 135 response = yield self.instantiate()
136 if response == 'OK':
137 assert self.worker_address
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1061 if exc_info is not None:
1062 try:
-> 1063 yielded = self.gen.throw(*exc_info)
1064 finally:
1065 # Break up a reference to itself
/opt/conda/lib/python3.6/site-packages/distributed/nanny.py in instantiate(self, comm)
187 try:
188 yield gen.with_timeout(timedelta(seconds=self.death_timeout),
--> 189 self.process.start())
190 except gen.TimeoutError:
191 yield self._close(timeout=self.death_timeout)
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1061 if exc_info is not None:
1062 try:
-> 1063 yielded = self.gen.throw(*exc_info)
1064 finally:
1065 # Break up a reference to itself
/opt/conda/lib/python3.6/site-packages/distributed/nanny.py in start(self)
296 self.stopped = Event()
297 self.status = 'starting'
--> 298 yield self.process.start()
299 if self.status == 'starting':
300 yield self._wait_until_running()
/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
/opt/conda/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)
/opt/conda/lib/python3.6/site-packages/distributed/process.py in _call_and_set_future(loop, future, func, *args, **kwargs)
23 def _call_and_set_future(loop, future, func, *args, **kwargs):
24 try:
---> 25 res = func(*args, **kwargs)
26 except:
27 # Tornado futures are not thread-safe, need to
/opt/conda/lib/python3.6/site-packages/distributed/process.py in _start()
115
116 def _start():
--> 117 process.start()
118 state.is_alive = True
119 state.pid = process.pid
/opt/conda/lib/python3.6/multiprocessing/process.py in start(self)
103 'daemonic processes are not allowed to have children'
104 _cleanup()
--> 105 self._popen = self._Popen(self)
106 self._sentinel = self._popen.sentinel
107 _children.add(self)
/opt/conda/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
289 def _Popen(process_obj):
290 from .popen_forkserver import Popen
--> 291 return Popen(process_obj)
292
293 class ForkContext(BaseContext):
/opt/conda/lib/python3.6/multiprocessing/popen_forkserver.py in __init__(self, process_obj)
33 def __init__(self, process_obj):
34 self._fds = []
---> 35 super().__init__(process_obj)
36
37 def duplicate_for_child(self, fd):
/opt/conda/lib/python3.6/multiprocessing/popen_fork.py in __init__(self, process_obj)
18 sys.stderr.flush()
19 self.returncode = None
---> 20 self._launch(process_obj)
21
22 def duplicate_for_child(self, fd):
/opt/conda/lib/python3.6/multiprocessing/popen_forkserver.py in _launch(self, process_obj)
49 set_spawning_popen(None)
50
---> 51 self.sentinel, w = forkserver.connect_to_new_process(self._fds)
52 util.Finalize(self, os.close, (self.sentinel,))
53 with open(w, 'wb', closefd=True) as f:
/opt/conda/lib/python3.6/multiprocessing/forkserver.py in connect_to_new_process(self, fds)
64 raise ValueError('too many fds')
65 with socket.socket(socket.AF_UNIX) as client:
---> 66 client.connect(self._forkserver_address)
67 parent_r, child_w = os.pipe()
68 child_r, parent_w = os.pipe()
ConnectionRefusedError: [Errno 111] Connection refused
Indeed. It looks like multiprocessing's forkserver solution isn't working in your Docker container. You could try to diagnose this issue, or you could set the following in your ~/.dask/config.yaml file:
multiprocessing-method: fork
@basnijholt did the fix above work for you? Any thoughts on why your docker container might not support multiprocessing's forkserver context?
Hi @mrocklin, the config setting didn't fix the issue. Also, multiprocessing's forkserver seems to work fine outside of dask.
I don't know whether I made a mistake before or whether it's because I rebuilt the image, but
from distributed import LocalCluster
c = LocalCluster(n_workers=1)
now always seems to work. However, by removing the argument n_workers=1 I get the errors I posted above.
I really have no idea how to debug this further.
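For reference, one way to check multiprocessing's start methods outside of dask is sketched below. This is an assumption about how such a check could look, not what was actually run in this thread; `try_start_method` is a hypothetical helper name. It starts a single trivial child process with the requested start method and reports either the child's exit code or the error raised while starting it.

```python
import multiprocessing
import time


def try_start_method(method):
    """Start one trivial child with the given start method.

    Returns the child's exit code (0 on success), or the OSError raised
    while starting it (ConnectionRefusedError is a subclass of OSError).
    """
    ctx = multiprocessing.get_context(method)
    # time.sleep is importable in the child, so nothing from __main__ must be pickled
    proc = ctx.Process(target=time.sleep, args=(0,))
    try:
        proc.start()
    except OSError as exc:
        return exc
    proc.join(timeout=30)
    return proc.exitcode
```

Calling `try_start_method("forkserver")` and `try_start_method("fork")` inside the container and comparing the results shows whether the start method alone fails. A `ConnectionRefusedError` returned for "forkserver" would match the traceback posted earlier, while 0 for both would point back at dask.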
I found that the error message in https://github.com/dask/distributed/issues/1173 is the same as here. Also https://github.com/dask/distributed/issues/1176 looks like it is related.
Could it be related?
I am trying to find out where it went wrong.
I am on dask master, and in going from distributed 1.16.3 (which works without any problems!) to 1.17.0, I get this error message:
tinkerer@49db00b00191:~$ python -c "from distributed import Client; Client()"
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7efcf864c378>, <tornado.concurrent.Future object at 0x7efcf863d9e8>)
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 605, in _run_callback
ret = callback()
File "/opt/conda/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 626, in _discard_future_result
future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
yielded = self.gen.send(value)
File "/home/tinkerer/distributed/distributed/deploy/local.py", line 205, in _stop_worker
self.workers.remove(w)
ValueError: list.remove(x): x not in list
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7efcf85fd620>, <tornado.concurrent.Future object at 0x7efcf85ebcc0>)
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 605, in _run_callback
ret = callback()
File "/opt/conda/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 626, in _discard_future_result
future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
yielded = self.gen.send(value)
File "/home/tinkerer/distributed/distributed/deploy/local.py", line 205, in _stop_worker
self.workers.remove(w)
ValueError: list.remove(x): x not in list
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:36245, Worker: tcp://127.0.0.1:34719
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:36836, Worker: tcp://127.0.0.1:42068
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:47697, Worker: tcp://127.0.0.1:47945
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:45307, Worker: tcp://127.0.0.1:41254
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:48735, Worker: tcp://127.0.0.1:50321
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:47502, Worker: tcp://127.0.0.1:59905
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:40890, Worker: tcp://127.0.0.1:58914
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:42038, Worker: tcp://127.0.0.1:52788
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:38179, Worker: tcp://127.0.0.1:59325
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:51669, Worker: tcp://127.0.0.1:58987
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:56143, Worker: tcp://127.0.0.1:36458
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:37794, Worker: tcp://127.0.0.1:44148
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:55608, Worker: tcp://127.0.0.1:42182
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:40623, Worker: tcp://127.0.0.1:53282
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:34862, Worker: tcp://127.0.0.1:47097
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:36134, Worker: tcp://127.0.0.1:58843
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:43778, Worker: tcp://127.0.0.1:46244
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:43686, Worker: tcp://127.0.0.1:58723
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:41067, Worker: tcp://127.0.0.1:46625
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:54479, Worker: tcp://127.0.0.1:42986
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:54245, Worker: tcp://127.0.0.1:58110
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:33818, Worker: tcp://127.0.0.1:49849
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:37342, Worker: tcp://127.0.0.1:35576
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:41074, Worker: tcp://127.0.0.1:46154
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:43502, Worker: tcp://127.0.0.1:37528
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:59109, Worker: tcp://127.0.0.1:59705
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:36833, Worker: tcp://127.0.0.1:58870
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:60854, Worker: tcp://127.0.0.1:42344
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:47033, Worker: tcp://127.0.0.1:42909
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:40017, Worker: tcp://127.0.0.1:60163
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:56715, Worker: tcp://127.0.0.1:51788
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:41448, Worker: tcp://127.0.0.1:34166
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:41054, Worker: tcp://127.0.0.1:49984
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:55017, Worker: tcp://127.0.0.1:35094
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:53124, Worker: tcp://127.0.0.1:39542
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:51436, Worker: tcp://127.0.0.1:41198
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:48015, Worker: tcp://127.0.0.1:58791
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:51187, Worker: tcp://127.0.0.1:45604
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:37403, Worker: tcp://127.0.0.1:51274
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:39201, Worker: tcp://127.0.0.1:38590
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:58616, Worker: tcp://127.0.0.1:43231
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:40599, Worker: tcp://127.0.0.1:60809
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:55720, Worker: tcp://127.0.0.1:60737
distributed.nanny - CRITICAL - Unable to unregister with scheduler None. Nanny: tcp://127.0.0.1:59056, Worker: tcp://127.0.0.1:52085
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7efcf8718b70>, <tornado.concurrent.Future object at 0x7efcf8709be0>)
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 605, in _run_callback
ret = callback()
File "/opt/conda/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 626, in _discard_future_result
future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
yielded = self.gen.send(value)
File "/home/tinkerer/distributed/distributed/deploy/local.py", line 205, in _stop_worker
self.workers.remove(w)
ValueError: list.remove(x): x not in list
and that last error message repeated many times until the kernel unblocks.
Then when I go to 1.17.1 these errors repeat and the kernel is blocked indefinitely:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/tinkerer/distributed/distributed/core.py", line 424, in send_recv_from_rpc
result = yield send_recv(comm=comm, op=key, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/home/tinkerer/distributed/distributed/core.py", line 310, in send_recv
response = yield comm.read()
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/home/tinkerer/distributed/distributed/comm/tcp.py", line 166, in read
convert_stream_closed_error(e)
File "/home/tinkerer/distributed/distributed/comm/tcp.py", line 106, in convert_stream_closed_error
raise CommClosedError(str(exc))
distributed.comm.core.CommClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/home/tinkerer/distributed/distributed/core.py", line 427, in send_recv_from_rpc
% (e, key,))
distributed.comm.core.CommClosedError: Stream is closed: while trying to call remote method 'register'
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x7f9638185e80> after timeout
Traceback (most recent call last):
File "/home/tinkerer/distributed/distributed/comm/tcp.py", line 152, in read
n_frames = yield stream.read_bytes(8)
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
tornado.iostream.StreamClosedError: Stream is closed
I hope this can give you a hint of where to look further.
Actually I checked commit by commit and found that ef0c397865faa8c9ee68faa2e557d99933a9e066 is the last working commit.
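Checking commit by commit can be shortened with a binary search, which is what `git bisect` automates. The sketch below illustrates the idea only: `last_good_index` and the `works` callback are hypothetical, and it assumes the commits are ordered oldest to newest, the first one works, the last one fails, and the breakage persists once introduced within that range.

```python
def last_good_index(commits, works):
    """Return the index of the last working commit.

    Assumes commits[0] works, commits[-1] fails, and every commit after
    the first failing one also fails (within this range).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] works, commits[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(commits[mid]):
            lo = mid  # breakage was introduced after mid
        else:
            hi = mid  # breakage was introduced at or before mid
    return lo
```

Here `works(commit)` would install that commit and run the reproducer, for example `python -c "from distributed import Client; Client()"`, returning True if it starts cleanly. The number of commits that need testing grows with the logarithm of the range rather than its length.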
I should have a bit of time to look at this starting tomorrow.
Sorry for the long delay. Working through a backlog of issues.
This worked fine for me:
(dask) tinkerer@061c3142985e:~$ python
Python 3.6.2 | packaged by conda-forge | (default, Jul 23 2017, 22:59:30)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import distributed
>>> c = distributed.Client()
>>>
Taking a closer look at the more recent comments here now.
It looks like this error, ValueError: list.remove(x): x not in list, was resolved a while ago (though after 1.17.0).
I don't see anything in the commit that follows that commit, d1b4dee692e92de36fac036f15a86b30ac6bed4c, that would cause the behavior that you're seeing. I don't suppose you're able to help me find a different reproducible example?
@basnijholt thank you for access to your system. I tried updating to master with !pip install git+https://github.com/dask/distributed.git --upgrade and things seem to work now:

Yes, absolutely great!
I did try it with master last week.
Commit 7985689ad26e02f90d16494581d9729f49d9bbf2 seems to be the last non-working commit and 4fc2dbd2e22dae05efce5df854ba7f584e7961d6 fixes it.
Thanks a lot, I guess this issue is solved 👍
It's a bit concerning that we don't know exactly what happened, but yes, I'm happy that this is resolved as well. Thank you for reporting and again you have my apologies for the long delay.