Distributed: "Cannot start Bokeh server, port 8787 is already in use" while instantiating a Client

Created on 6 Dec 2016 · 16 Comments · Source: dask/distributed

When using c = Client() from a script, I get this (on an 8 core machine):

bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use

and then it hangs.

Strange thing is, it does work OK from ipython or from a live python session. But if I create a script with the following code and try to run it ("python3 test.py") it hangs:

from distributed import Client, local_client

client = Client()

Breaking with ctrl-c gives the following traceback:

Traceback (most recent call last):
  File "test_fib.py", line 3, in <module>
    client = Client()
Traceback (most recent call last):
  File "<string>", line 1, in <module>

Interrupted, shutting down  File "/usr/lib64/python3.5/multiprocessing/forkserver.py", line 164, in main
  File "/usr/lib/python3.5/site-packages/distributed/client.py", line 366, in __init__

    rfds = [key.fileobj for (key, events) in selector.select()]
  File "/usr/lib64/python3.5/selectors.py", line 441, in select
    self.start(timeout=timeout)
  File "/usr/lib/python3.5/site-packages/distributed/client.py", line 396, in start
    sync(self.loop, self._start, **kwargs)
  File "/usr/lib/python3.5/site-packages/distributed/utils.py", line 142, in sync
    e.wait(1000000)
    fd_event_list = self._epoll.poll(timeout, max_ev)
KeyboardInterrupt
  File "/usr/lib64/python3.5/threading.py", line 549, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib64/python3.5/threading.py", line 297, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
/usr/lib64/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 192 leaked semaphores to clean up at shutdown
  len(cache))

All 16 comments

Interactive (ipython) sessions only _seem_ to work. When submitting a task, I get the same error, and tasks never finish...

Is it possible that you have a dask-scheduler process running somewhere that you're not aware of?

Thought about that too, but:

ps ax | grep python
 2099 ?        S      0:51 /usr/bin/python /usr/bin/cherrytree
18422 pts/3    S+     0:00 grep python
vincent@localhost:~/PycharmProjects/modis/modask$ python3 test.py
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
^CTraceback (most recent call last):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python3.5/multiprocessing/forkserver.py", line 164, in main
  File "test.py", line 2, in <module>

Interrupted, shutting down
    c = Client()
  File "/usr/lib/python3.5/site-packages/distributed/client.py", line 366, in __init__
    rfds = [key.fileobj for (key, events) in selector.select()]
    self.start(timeout=timeout)
  File "/usr/lib/python3.5/site-packages/distributed/client.py", line 396, in start
    sync(self.loop, self._start, **kwargs)
  File "/usr/lib64/python3.5/selectors.py", line 441, in select
    fd_event_list = self._epoll.poll(timeout, max_ev)
  File "/usr/lib/python3.5/site-packages/distributed/utils.py", line 142, in sync
    e.wait(1000000)
KeyboardInterrupt
  File "/usr/lib64/python3.5/threading.py", line 549, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib64/python3.5/threading.py", line 297, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
/usr/lib64/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 192 leaked semaphores to clean up at shutdown
  len(cache))

I even tried rebooting, but still get the same error...

OK, here's my diagnosis. Correct me if I'm wrong.

  • I can start a dask-scheduler from the command line without problems:
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO -   Scheduler at:        192.168.0.102:8786
distributed.scheduler - INFO -        http at:        192.168.0.102:9786
distributed.bokeh.application - INFO - Web UI: http://192.168.0.102:8787/status/
distributed.scheduler - INFO - -----------------------------------------------
  • I can start a dask-worker only if I do not request more than one process. Running 'dask-worker' without any options on the command line works OK, but gives me 8 threads in a single process:
dask-worker 192.168.0.102:8786
distributed.nanny - INFO -         Start Nanny at:        192.168.0.102:42355
distributed.worker - INFO -       Start worker at:        192.168.0.102:43587
distributed.worker - INFO -              bokeh at:        192.168.0.102:8789
distributed.worker - INFO -               http at:        192.168.0.102:42791
distributed.worker - INFO -              nanny at:        192.168.0.102:42355
distributed.worker - INFO - Waiting to connect to:        192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   20.13 GB
distributed.worker - INFO -       Local Directory:        /tmp/nanny-ek_4y9c5
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:        192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Connection from 192.168.0.102:55708 to Worker
distributed.nanny - INFO - Nanny 192.168.0.102:42355 starts worker process 192.168.0.102:43587
  • If I start a dask-worker, requesting more than 1 process, things go wrong:
dask-worker 192.168.0.102:8786 --nprocs 2
distributed.nanny - INFO -         Start Nanny at:        192.168.0.102:38651
distributed.nanny - INFO -         Start Nanny at:        192.168.0.102:34199
distributed.worker - INFO -       Start worker at:        192.168.0.102:40403
distributed.worker - INFO -              bokeh at:        192.168.0.102:8789
distributed.worker - INFO -               http at:        192.168.0.102:39409
distributed.worker - INFO -              nanny at:        192.168.0.102:38651
distributed.worker - INFO - Waiting to connect to:        192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                   10.07 GB
distributed.worker - INFO -       Local Directory:        /tmp/nanny-ls73mwak
distributed.worker - INFO - -------------------------------------------------
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8789 is already in use
distributed.worker - INFO -         Registered to:        192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Connection from 192.168.0.102:36504 to Worker
distributed.worker - INFO -       Start worker at:        192.168.0.102:34675
distributed.worker - INFO -              bokeh at:        192.168.0.102:38211
distributed.worker - INFO -               http at:        192.168.0.102:40115
distributed.worker - INFO -              nanny at:        192.168.0.102:34199
distributed.worker - INFO - Waiting to connect to:        192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                   10.07 GB
distributed.worker - INFO -       Local Directory:        /tmp/nanny-l5qz8na2
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:        192.168.0.102:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Connection from 192.168.0.102:40528 to Worker
distributed.nanny - INFO - Nanny 192.168.0.102:38651 starts worker process 192.168.0.102:40403
distributed.nanny - INFO - Nanny 192.168.0.102:34199 starts worker process 192.168.0.102:34675
  • If I start dask-worker with --nprocs 8, indeed I get 7 times the "CRITICAL - Cannot start Bokeh server, port 8789 is already in use" message. To me that indicates that each worker process tries to claim the same port, but of course only the first succeeds, so all subsequent processes fail to start a bokeh server on the same port.

Are you on Linux? Try netstat -tnlp to see which process is listening on the 8787 port. You may want to run that command as root to get more information.
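If netstat is not at hand, a rough equivalent of the "is anything listening on 8787?" check can be done from Python. This is a minimal sketch using only the standard library; port_in_use is a hypothetical helper name, and unlike netstat -tnlp it cannot tell you which process owns the port.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration; it only reports whether
    something accepts connections, not which process it is.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8787 is the scheduler's Bokeh port mentioned in this thread
    print("port 8787 in use:", port_in_use(8787))
```

Run it once with no scheduler running and once with dask-scheduler started to compare the two answers.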

Also, a simple suggestion. Instead of:

from distributed import Client, local_client

client = Client()

Have you tried:

from distributed import Client, local_client

if __name__ == "__main__":
    client = Client()

Thanks for the suggestions. Yes, I have tried that. And different things. And netstat, too. Of course there is nothing on port 8787 when I have no scheduler running, and there is the scheduler's bokeh server on it when I have it running. It is the second (and successive) worker process (not thread!) that triggers the error. Did you see my previous comment https://github.com/dask/distributed/issues/726#issuecomment-265390727 ? I think that diagnoses the situation pretty well...

There are two distinct bokeh servers now, one on the scheduler, 8787, and one on the worker, 8789.

Each worker spins up its own bokeh server, first on 8789 if available, and then on a random port, if unavailable. This is why you see two messages:

bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8789 is already in use
distributed.worker - INFO -              bokeh at:        192.168.0.102:38211

While I agree that the error is perhaps scarier than we would like, this shouldn't affect your normal operation.
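The "preferred port first, then a random one" behaviour described above can be sketched with plain sockets. This is not the code that distributed or Bokeh actually run; bind_with_fallback is a hypothetical helper that only illustrates the pattern, assuming a failed bind is reported as OSError.

```python
import socket


def bind_with_fallback(preferred_port, host="127.0.0.1"):
    """Listen on preferred_port if it is free, otherwise on any free port.

    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, preferred_port))
    except OSError:
        # Preferred port is taken: start over and let the OS choose
        # a free port by binding to port 0.
        sock.close()
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind((host, 0))
    sock.listen()
    return sock


if __name__ == "__main__":
    first = bind_with_fallback(8789)
    second = bind_with_fallback(8789)  # 8789 is now taken, so this falls back
    print("first listener on port", first.getsockname()[1])
    print("second listener on port", second.getsockname()[1])
    first.close()
    second.close()
```

In the worker logs above, the CRITICAL line corresponds to the failed first attempt and the "bokeh at: ...:38211" line to the port that was chosen instead.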

Ah, good to know. Maybe change the "critical" into "warning" or "info" :-)

Leaves the question why a simple file with just this:

from distributed import Client
c = Client()
print(c)

fails (by just hanging). The "print(c)" line is never reached. Probably not related to the "Cannot start bokeh server..." lines it spews out then, but also not quite what I'd expect.

Oh wait, that does work when I put it in an 'if __name__ == "__main__":' block.
So many subtleties... sorry for the noise! I suppose this can be closed then.

Interesting example. Slightly modified:

from distributed import Client
print(1)
c = Client()
print(2)
print(c)
mrocklin@carbon:~$ python foo.py
1
1
1
1
1
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use
bokeh.server.server - CRITICAL - Cannot start Bokeh server, port 8787 is already in use

This shows that something is going on when forking the process. Generally you're right that it seems unwise to have a live client within the importable module side of the code.
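The repeated "1" lines can be reproduced without dask. The sketch below assumes a multiprocessing start method that re-imports the main module in each child (here "spawn"; the tracebacks earlier in this thread point at forkserver, which behaves similarly in this respect). It writes a small script to a temporary file, runs it, and counts how often the module-level print executes. The names SCRIPT and count_module_level_runs are hypothetical.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# The script under test: the module-level print stands in for `c = Client()`.
SCRIPT = textwrap.dedent('''
    import multiprocessing as mp

    print("module-level code ran")

    def child():
        pass

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")
        procs = [ctx.Process(target=child) for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
''')


def count_module_level_runs():
    """Run SCRIPT as a real file and count the module-level prints."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo_script.py"
        path.write_text(SCRIPT)
        result = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True, text=True, check=True,
        )
    return result.stdout.count("module-level code ran")


if __name__ == "__main__":
    # One line from the parent plus one per spawned child.
    print(count_module_level_runs())
```

Anything placed at module level, including Client(), runs again in every child, whereas code under the if __name__ == "__main__" guard runs only in the parent. That is why the guard stops each child from trying to start its own cluster.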

Regarding changing CRITICAL to WARNING, that'll be a bit tricky. We don't control Bokeh's logging levels.

Of course.
OK, for now I'll just use the workaround to never have a live client within the importable part.

Thing is, this actually came up when I was trying to get to the root cause of my dask calculations failing. I have just found out that that was due to me using delayed within a local_client block. I have now changed that to using f = local_client.submit and f.result(), instead of creating a (complex) delayed and calling d.compute() on it. Using the submit vocabulary works; using delayed caused trouble.
Q: is using delayed within local_client blocks supposed to work?
Q2: if yes, I can open a new issue and try to come up with a small reproducible example. Let me know if that would be worth the effort.

Yes, everything you do with a normal client should also work with a local_client

A reproducible error would be of great value here.

So, there are several concerns here:

  • the if __name__ == "__main__" idiom should be mentioned in the docs
  • worker processes' embedded Bokeh servers should probably not try to all listen on the same (8789) port or, if they should, they should try to tone down logging on the Bokeh side

I managed to get a reproducible example for the delayed from local_client issue. I opened a new issue: #733 .
