When using c = Client() from a script, I get this (on an 8 core machine):
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
and then it hangs.
Strange thing is, it does work OK from ipython or from a live python session. But if I create a script with the following code and try to run it ("python3 test.py") it hangs:
from distributed import Client, local_client
client = Client()
Breaking with ctrl-c gives the following traceback:
Traceback (most recent call last):
File "test_fib.py", line 3, in <module>
client = Client()
Traceback (most recent call last):
File "<string>", line 1, in <module>
Interrupted, shutting down File "/usr/lib64/python3.5/multiprocessing/forkserver.py", line 164, in main
File "/usr/lib/python3.5/site-packages/distributed/client.py", line 366, in __init__
rfds = [key.fileobj for (key, events) in selector.select()]
File "/usr/lib64/python3.5/selectors.py", line 441, in select
self.start(timeout=timeout)
File "/usr/lib/python3.5/site-packages/distributed/client.py", line 396, in start
sync(self.loop, self._start, **kwargs)
File "/usr/lib/python3.5/site-packages/distributed/utils.py", line 142, in sync
e.wait(1000000)
fd_event_list = self._epoll.poll(timeout, max_ev)
KeyboardInterrupt
File "/usr/lib64/python3.5/threading.py", line 549, in wait
signaled = self._cond.wait(timeout)
File "/usr/lib64/python3.5/threading.py", line 297, in wait
gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
/usr/lib64/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 192 leaked semaphores to clean up at shutdown
len(cache))
Interactive (ipython) sessions only _seem_ to work. When submitting a task, I get the same error, and tasks never finish...
Is it possible that you have a dask-scheduler process running somewhere that you're not aware of?
Thought about that too, but:
ps ax | grep python
2099 ? S 0:51 /usr/bin/python /usr/bin/cherrytree
18422 pts/3 S+ 0:00 grep python
vincent@localhost:~/PycharmProjects/modis/modask$ python3 test.py
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
^CTraceback (most recent call last):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib64/python3.5/multiprocessing/forkserver.py", line 164, in main
File "test.py", line 2, in <module>
Interrupted, shutting down
c = Client()
File "/usr/lib/python3.5/site-packages/distributed/client.py", line 366, in __init__
rfds = [key.fileobj for (key, events) in selector.select()]
self.start(timeout=timeout)
File "/usr/lib/python3.5/site-packages/distributed/client.py", line 396, in start
sync(self.loop, self._start, **kwargs)
File "/usr/lib64/python3.5/selectors.py", line 441, in select
fd_event_list = self._epoll.poll(timeout, max_ev)
File "/usr/lib/python3.5/site-packages/distributed/utils.py", line 142, in sync
e.wait(1000000)
KeyboardInterrupt
File "/usr/lib64/python3.5/threading.py", line 549, in wait
signaled = self._cond.wait(timeout)
File "/usr/lib64/python3.5/threading.py", line 297, in wait
gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
/usr/lib64/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 192 leaked semaphores to clean up at shutdown
len(cache))
I even tried rebooting, but still get the same error...
OK, here's my diagnosis. Correct me if I'm wrong.
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Scheduler at: 192.168.0.102:8786
distributed.scheduler - INFO - http at: 192.168.0.102:9786
distributed.bokeh.application - INFO - Web UI: http://192.168.0.102:8787/status/
distributed.scheduler - INFO - -----------------------------------------------
- I can start a dask-worker only if I do not ask for more than 1 process. Running 'dask-worker' without any options on the command line works OK, but gives me 8 threads in a single process:
dask-worker 192.168.0.102:8786
distributed.nanny - INFO - Start Nanny at: 192.168.0.102:42355
distributed.worker - INFO - Start worker at: 192.168.0.102:43587
distributed.worker - INFO - bokeh at: 192.168.0.102:8789
distributed.worker - INFO - http at: 192.168.0.102:42791
distributed.worker - INFO - nanny at: 192.168.0.102:42355
distributed.worker - INFO - Waiting to connect to: 192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 20.13 GB
distributed.worker - INFO - Local Directory: /tmp/nanny-ek_4y9c5
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: 192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Connection from 192.168.0.102:55708 to Worker
distributed.nanny - INFO - Nanny 192.168.0.102:42355 starts worker process 192.168.0.102:43587
dask-worker 192.168.0.102:8786 --nprocs 2
distributed.nanny - INFO - Start Nanny at: 192.168.0.102:38651
distributed.nanny - INFO - Start Nanny at: 192.168.0.102:34199
distributed.worker - INFO - Start worker at: 192.168.0.102:40403
distributed.worker - INFO - bokeh at: 192.168.0.102:8789
distributed.worker - INFO - http at: 192.168.0.102:39409
distributed.worker - INFO - nanny at: 192.168.0.102:38651
distributed.worker - INFO - Waiting to connect to: 192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 10.07 GB
distributed.worker - INFO - Local Directory: /tmp/nanny-ls73mwak
distributed.worker - INFO - -------------------------------------------------
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8789 is already in use
distributed.worker - INFO - Registered to: 192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Connection from 192.168.0.102:36504 to Worker
distributed.worker - INFO - Start worker at: 192.168.0.102:34675
distributed.worker - INFO - bokeh at: 192.168.0.102:38211
distributed.worker - INFO - http at: 192.168.0.102:40115
distributed.worker - INFO - nanny at: 192.168.0.102:34199
distributed.worker - INFO - Waiting to connect to: 192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 10.07 GB
distributed.worker - INFO - Local Directory: /tmp/nanny-l5qz8na2
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: 192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Connection from 192.168.0.102:40528 to Worker
distributed.nanny - INFO - Nanny 192.168.0.102:38651 starts worker process 192.168.0.102:40403
distributed.nanny - INFO - Nanny 192.168.0.102:34199 starts worker process 192.168.0.102:34675
Are you on Linux? Try netstat -tnlp to see which process is listening on the 8787 port. You may want to run that command as root to get more information.
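If netstat is not at hand, a rough equivalent check can be done from Python with the standard library. This is only a sketch: the helper name `port_is_listening` and the default host `127.0.0.1` are assumptions for illustration (the scheduler in this thread listens on 192.168.0.102), and unlike `netstat -tnlp` it cannot tell you which process owns the port.

```python
import socket


def port_is_listening(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper, not part of distributed or Bokeh.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return s.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8787 is the scheduler's Bokeh port mentioned in this thread
    print(port_is_listening(8787))
```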
Also, a simple suggestion. Instead of:
from distributed import Client, local_client
client = Client()
Have you tried:
from distributed import Client, local_client
if __name__ == "__main__":
    client = Client()
Thanks for the suggestions. Yes, I have tried that, along with various other things, and netstat too. Of course there is nothing on port 8787 when I have no scheduler running, and there is the scheduler's bokeh server on it when I have it running. It is the second (and each successive) worker process (not thread!) that triggers the error. Did you see my previous comment https://github.com/dask/distributed/issues/726#issuecomment-265390727 ? I think that diagnoses the situation pretty well...
There are two distinct bokeh servers now, one on the scheduler, 8787, and one on the worker, 8789.
Each worker spins up its own bokeh server, first on 8789 if available, and then on a random port, if unavailable. This is why you see two messages:
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8789 is already in use
distributed.worker - INFO - bokeh at: 192.168.0.102:38211
While I agree that the error is perhaps scarier than we would like, this shouldn't affect your normal operation.
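The "preferred port first, random port otherwise" behaviour described above can be sketched with plain sockets. This is not the code distributed or Bokeh actually use; the function name `bind_preferred_or_random` and the host are made up for illustration. It only shows why a second process on the same machine ends up on a different port after one failed attempt.

```python
import socket


def bind_preferred_or_random(preferred_port, host="127.0.0.1"):
    """Listen on preferred_port if it is free, else on an OS-chosen port.

    Hypothetical helper illustrating the fallback, not library code.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, preferred_port))
    except OSError:
        # Preferred port is taken: start over and let the OS pick (port 0).
        sock.close()
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind((host, 0))
    sock.listen(1)
    return sock


if __name__ == "__main__":
    first = bind_preferred_or_random(8789)   # gets 8789 if it is free
    second = bind_preferred_or_random(8789)  # 8789 is now taken, so falls back
    print(first.getsockname()[1], second.getsockname()[1])
    first.close()
    second.close()
```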
Ah, good to know. Maybe change the "critical" into "warning" or "info" :-)
Leaves the question why a simple file with just this:
from distributed import Client
c = Client()
print(c)
fails (by just hanging). The "print(c)" line is never reached. Probably not related to the "Cannot start bokeh server..." lines it spews out then, but also not quite what I'd expect.
Oh wait, that does work when I put it in an "if __name__ == "__main__":" block.
So many subtleties... sorry for the noise! I suppose this can be closed then.
Interesting example. Slightly modified:
from distributed import Client
print(1)
c = Client()
print(2)
print(c)
mrocklin@carbon:~$ python foo.py
1
1
1
1
1
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
This shows that something is going on when forking the process. Generally you're right that it seems unwise to have a live client within the importable module side of the code.
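The repeated "1" is what you would expect if each child process re-runs the script's module-level code. With multiprocessing's spawn and forkserver start methods (forkserver.py shows up in the tracebacks above), the child re-imports the main file under the name `__mp_main__` rather than `__main__`. The sketch below imitates that with `runpy` instead of starting real processes; the script text and the `run_as` helper are invented for the demonstration.

```python
import os
import runpy
import tempfile

# A stand-in for test.py: one unguarded statement, one guarded statement.
SCRIPT = '''
events.append("module level")        # runs whenever the file is executed
if __name__ == "__main__":
    events.append("guarded block")   # runs only when executed as a script
'''


def run_as(run_name):
    """Execute SCRIPT under the given module name and report what ran."""
    events = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_script.py")
        with open(path, "w") as f:
            f.write(SCRIPT)
        # init_globals injects the shared list the script appends to
        runpy.run_path(path, init_globals={"events": events}, run_name=run_name)
    return events


as_script = run_as("__main__")     # like "python3 test.py"
as_child = run_as("__mp_main__")   # like a spawn/forkserver child re-importing it
print(as_script)
print(as_child)
```

Anything outside the guard, such as an unguarded `Client()`, would run again in every child, which fits the repeated output and the repeated port 8787 messages.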
Regarding changing CRITICAL to WARNING that'll be a bit tricky. We don't control Bokeh's logging levels.
Of course.
OK, for now I'll just use the workaround to never have a live client within the importable part.
Thing is, this actually came up when I was trying to get to the root cause of my dask calculations failing. I have just found out that the failure was due to my using delayed within a local_client block. I have now changed that to use f=local_client.submit and f.result(), instead of creating a (complex) delayed and calling d.compute() on it. Using the submit vocabulary works; using delayed caused trouble.
Q: is using delayed within local_client blocks supposed to work?
Q2: if yes, I can open a new issue and try to come up with a small reproducible example. Let me know if that would be worth the effort.
Yes, everything you do with a normal client should also work with a local_client
A reproducible error would be of great value here.
So, there are several concerns here:
- The if __name__ == "__main__" idiom should be mentioned in the docs
I managed to get a reproducible example for the delayed from local_client issue. I opened a new issue: #733.