Distributed: Potential race condition in Nanny

Created on 9 Jul 2020 · 23Comments · Source: dask/distributed

Hi everybody, since a few days we're seeing "random" failures in our CI due to distributed emitting:

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 1>>
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

This feels like a race condition in some situation, e.g. closing of the Nanny because the periodic callbacks are still running but Nanny.process is already None.

I think (still investigating) we've started seeing this only after 2.20 was released.

I cannot attach an MFE yet simply because we don't have one 😬 we only experience this on CI. I'd welcome any sort of feedback. I can try to give some context though: the error is triggered in a Jupyter Notebook by a cell which calls scipy.minimize from a DASK worker (cell 14 here -- look for optimize.minimize). I doubt it's ever going to be useful, but here's an excerpt of the raw log (part of which I pasted above): note that it repeats over and over again for hundreds/thousands of lines...

Thanks!

Source

FabioLuporini

👍1

Most helpful comment

I just ran into this issue on my local machine, so it is alive and well.

from distributed import Client
client = Client(asynchronous=False)

Results in this being outputted over and over again:

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
  File "/home/arthur/.local/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/home/arthur/.local/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

Known Causes:

Running this when os.getcwd() is in a directory without write privileges triggers the error every time.
My Django project via PyCharm's console also triggers it every time.
- Investigating why is still ongoing.

Version Information

help(distributed) gives version 2.30.0
Ubuntu 20.04.1 LTS
Kernel 5.4.0-48-generic
Dask installed via pip3 install dask[complete] as root
free -h:

              total        used        free      shared  buff/cache   available
Mem:           15Gi        10Gi       318Mi       762Mi       4.8Gi       4.1Gi
Swap:          31Gi       1.0Gi        31Gi

EmperorArthur on 12 Oct 2020

👍2

All 23 comments

Thank you for raising this issue @FabioLuporini .

At first glance I don't see any obvious cause. A minimal reproducer would be useful. Alternatively, given that this started affecting you only recently, you could also try git bisect on this repository to see if you can isolate the commit that caused this change. This might be hard though if you can only reproduce the issue in CI.

mrocklin on 9 Jul 2020

I could...but it's gonna take forever :grimacing:

so, after an afternoon of CI re-spins, I think we're quite confident to say that 2.19 doesn't exhibit this behaviour

I guess we're gonna enforce <2.20 for now

FabioLuporini on 9 Jul 2020

I could...but it's gonna take forever 😬

There are only 13 commits between 2.19 and 2.20. It might not be terrible.

On Thu, Jul 9, 2020 at 9:49 AM Fabio Luporini notifications@github.com
wrote:

I could...but it's gonna take forever 😬

so, after an afternoon of CI re-spins, I think we're quite confident to
say that 2.19 doesn't exhibit this behaviour

I guess we're gonna enforce <2.20 for now

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/dask/distributed/issues/3955#issuecomment-656236554,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AACKZTCDK24XT7IAD5WIOADR2XYIXANCNFSM4OVSA4DQ
.

mrocklin on 9 Jul 2020

OK, I thought it was worse than that. Do you suggest I do it on all commits or just on the merge commits (haven't checked the commit history, will do shortly) ?

FabioLuporini on 9 Jul 2020

We squash-and-merge, so every commit in history is valid and clean

On Thu, Jul 9, 2020 at 10:01 AM Fabio Luporini notifications@github.com
wrote:

OK, I thought it was worse than that. Do you suggest I do it on all
commits or just on the merge commits (haven't checked the commit history,
will do shortly) ?

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/dask/distributed/issues/3955#issuecomment-656242216,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AACKZTBHXFLWWERVVZDJBO3R2XZVZANCNFSM4OVSA4DQ
.

mrocklin on 9 Jul 2020

I'm also having this issue. I just wanted to add that we're experiencing this about 1 in 30 times.

jrdzha on 24 Jul 2020

Thanks @jrdzha . Do you have time to try the git-bisect process? I would expect this to take around 30-45 minutes.

mrocklin on 24 Jul 2020

@mrocklin I'm quite busy for the next few days, but I'll take a look after that.

jrdzha on 28 Jul 2020

@mrocklin sorry -- we did try it (OK, not really git bisect for the reasons explained above, but a conceptually analogous thing). We though bumped into other issues, and I eventually forgot to update this thread. I'll let @rhodrin explain our findings 'cause I don't quite remember what the latest news from this front are

FabioLuporini on 28 Jul 2020

@mrocklin Yes, sorry for the delayed response. So we set up an Actions matrix to run our tests many times for each commit between 2.19 and 2.20, and later some earlier and later versions.

Firstly, failures are rare and not particularly consistent (possibly related to load on the github hosted runners). One day we may see a failure 1 in 10 times, the next day we could see ~100 runs with no failure. We managed to produce failures on, I think, all commits between 2.19 and 2.20 and even into some post 2.20 commits. We then tried to produce failures on some older versions, but that day we saw absolutely no failures on any version between 2.16 and 2.21 - I think we ran the tests ~80 times on each version. After that I took a break from this.

So in summary, this one is a real mystery. But my inclining is that it's something memory related: we only managed to produce failures when running a large suite of tests, repeatedly running the failing test individually never produced any failures. Also, I've not managed to reproduce this on my local machine (which is pretty decent).

rhodrin on 28 Jul 2020

@rhodrin Yes, for our testing the fails seem to occur randomly. We run our tests using LocalCluster on a 'local' machine (not our Kubernetes cluster) though and they are failing there.

jrdzha on 29 Jul 2020

OK, it seems like the next thing to do here then is for someone to produce a reproducible example.

mrocklin on 29 Jul 2020

Just wanted to chime in that we are consistently seeing the nanny NoneType errors when running LocalCudaCluster/LocalCluster jobs on cloud instances (code gets wrapped into a container/AWS SageMaker Estimator) however the issue seems to not occur when running with containers on a local machine. Interestingly these errors are non-critical and the computations continue to completion, however the errors dominate/clobber the output.

nanny_issues_7_30_2020.txt

https://github.com/rapidsai/cloud-ml-examples/blob/master/aws/notebooks/sagemaker_rapids_hpo.ipynb
https://github.com/rapidsai/cloud-ml-examples/blob/master/aws/code/rapids_cloud_ml.py

miroenev on 31 Jul 2020

cc @quasiben

mrocklin on 31 Jul 2020

@miroenev it looks like your issue comes is explained at the top of your log output:

FileExistsError: [Errno 17] File exists: '/opt/ml/output/dask-worker-space

There have been some changes to dask-worker directory creation. @FabioLuporini can you upgrade to latest (2.22.0) and try again ?

quasiben on 4 Aug 2020

@quasiben I think they are separate errors.

My hypothesis is that the dask_worker_space issue is due to the way SageMaker attaches the same dataset S3 volume to multiple instances when running HPO. I've since tried to set the temporary directory location in a place that SageMaker's documentation suggests will not overlap between instances but its not entirely clear whether this fixes the dask_worker_space 'error.' We are less concerned about this since it has minimal impact on the output logs (only shows up once) and in practice does not seem to impact the performance of the upstream dask computations.

The nanny memory_monitor error however is quite annoying and dominates the output logs. I've tried to remove it from the logging configuration but it still persists.

https://github.com/rapidsai/cloud-ml-examples/blob/42d8ab333f8750beafd8d59a7f86991207c1488c/aws/code/rapids_cloud_ml.py#L430

miroenev on 4 Aug 2020

@quasiben We're now using 2.22.0. We haven't seen this issue yet, but will report back if we do.

rhodrin on 5 Aug 2020

I just ran into this issue on my local machine, so it is alive and well.

from distributed import Client
client = Client(asynchronous=False)

Results in this being outputted over and over again:

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
  File "/home/arthur/.local/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/home/arthur/.local/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

Known Causes:

Running this when os.getcwd() is in a directory without write privileges triggers the error every time.
My Django project via PyCharm's console also triggers it every time.
- Investigating why is still ongoing.

Version Information

help(distributed) gives version 2.30.0
Ubuntu 20.04.1 LTS
Kernel 5.4.0-48-generic
Dask installed via pip3 install dask[complete] as root
free -h:

              total        used        free      shared  buff/cache   available
Mem:           15Gi        10Gi       318Mi       762Mi       4.8Gi       4.1Gi
Swap:          31Gi       1.0Gi        31Gi

EmperorArthur on 12 Oct 2020

👍2

I recommend getting logs and then posting them here

client = Client(asynchronous=False, silence_logs=False)

mrocklin on 12 Oct 2020

Huh, now those logs are interesting:

I suspect it has to do with how PyCharm configures everything. Especially since I confirmed I can start it via a shell/pipenv.

Edit: I have changed the "ProjectName", but everything else is exactly as the system output it. Including the "" directory.

import os
os.getcwd()
'/home/arthur/src/ProjectName'
from distributed import Client
client = Client(asynchronous=False, silence_logs=False)
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:42219
distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:34001'
distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:45005'
distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:45505'
distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:41067'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Worker process 200663 exited with status 1
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:45005'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Worker process 200661 exited with status 1
distributed.nanny - INFO - Worker process 200666 exited with status 1
distributed.nanny - INFO - Worker process 200667 exited with status 1
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:34001'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:41067'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:45505'
distributed.scheduler - INFO - Receive client connection: Client-7ec868f1-0d58-11eb-8f4e-b7ca8d0cd9a4
distributed.core - INFO - Starting established connection
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
  File "/home/arthur/.local/share/virtualenvs/ProjectName-GY1GZPBf/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/home/arthur/.local/share/virtualenvs/ProjectName-GY1GZPBf/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

With the last "tornado.application Error" repeating endlessly until I killed the console.

EmperorArthur on 13 Oct 2020

I've been able to turn this error on if I set silence_logs='error'.

This example doesn't work on my local:

import contextlib
from concurrent.futures._base import as_completed

import dask
import distributed
from distributed.cfexecutor import ClientExecutor


@contextlib.contextmanager
def executor():
    with dask.config.set({"distributed.worker.daemon": True}):
        with distributed.LocalCluster(
                n_workers=5,
                processes=True,
                silence_logs='error'
        ) as cluster:
            with distributed.Client(cluster, ) as client:
                yield ClientExecutor(client)

def add(z):
    return z+ 1

if __name__ == '__main__':

    with executor() as pool:
        futures = [pool.submit(add, k) for k in range(10)]
        for f in as_completed(futures):
            print(f.result())

Failing with:

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
  File "/home/cjwright/mc/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/home/cjwright/mc/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

This works though:

import contextlib
from concurrent.futures._base import as_completed

import dask
import distributed
from distributed.cfexecutor import ClientExecutor


@contextlib.contextmanager
def executor():
    with dask.config.set({"distributed.worker.daemon": True}):
        with distributed.LocalCluster(
                n_workers=5,
                processes=True,
                # silence_logs='error'
        ) as cluster:
            with distributed.Client(cluster, ) as client:
                yield ClientExecutor(client)

def add(z):
    return z+ 1

if __name__ == '__main__':

    with executor() as pool:
        futures = [pool.submit(add, k) for k in range(10)]
        for f in as_completed(futures):
            print(f.result())

Does that work for a reproducible example?

(dask version 2020.12.0)

CJ-Wright on 2 Jan 2021

❤1

I meet a very similar problem. I ask many nodes to use a single dask-worker-space in the shared file system. When a node want to retire its own workers from cluster.adapt(), it will accidentally kill workers of other nodes.The problem gone when I set a local_directory for each nodes.

I am not sure whether this "using a single dask-worker-space among nodes" is a bad code should be avoided from the beginning. Or this is because of bugs when dealing with "using a single dask-worker-space among nodes". If it's the first case, I would recommend to support this, because when I want a quick use of Dask on a cluster, my intuition won't tell me: "set the local_directory first, otherwise the program will crash". If it's the second cases, I would like to help you to recreate this bug on your side.

yuhengy on 3 Apr 2021

@CJ-Wright when I run your first code snippet with the latest dask and distributed release (v2021.04.0). I see AttributeError you posted, however at the top of the traceback I see

Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/Users/james/miniforge3/envs/distributed/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/james/miniforge3/envs/distributed/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/james/projects/dask/distributed/distributed/process.py", line 192, in _run
    target(*args, **kwargs)
  File "/Users/james/projects/dask/distributed/distributed/nanny.py", line 726, in _run
    logger.setLevel(silence_logs)
  File "/Users/james/miniforge3/envs/distributed/lib/python3.8/logging/__init__.py", line 1409, in setLevel
    self.level = _checkLevel(level)
  File "/Users/james/miniforge3/envs/distributed/lib/python3.8/logging/__init__.py", line 194, in _checkLevel
    raise ValueError("Unknown level: %r" % level)
ValueError: Unknown level: 'error'

However setting silence_logs=logging.ERROR or silence_logs=False runs successfully. Is this also the case for you?