Hi everybody, since a few days we're seeing "random" failures in our CI due to distributed emitting:
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 1>>
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/distributed/nanny.py", line 414, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
This feels like a race condition in some situation, e.g. closing of the Nanny because the periodic callbacks are still running but Nanny.process is already None.
I think (still investigating) we've started seeing this only after 2.20 was released.
I cannot attach an MFE yet simply because we don't have one 😬 we only experience this on CI. I'd welcome any sort of feedback. I can try to give some context though: the error is triggered in a Jupyter Notebook by a cell which calls scipy.minimize from a DASK worker (cell 14 here -- look for optimize.minimize). I doubt it's ever going to be useful, but here's an excerpt of the raw log (part of which I pasted above): note that it repeats over and over again for hundreds/thousands of lines...
Thanks!
Thank you for raising this issue @FabioLuporini .
At first glance I don't see any obvious cause. A minimal reproducer would be useful. Alternatively, given that this started affecting you only recently, you could also try git bisect on this repository to see if you can isolate the commit that caused this change. This might be hard though if you can only reproduce the issue in CI.
I could...but it's gonna take forever :grimacing:
so, after an afternoon of CI re-spins, I think we're quite confident to say that 2.19 doesn't exhibit this behaviour
I guess we're gonna enforce <2.20 for now
I could...but it's gonna take forever 😬
There are only 13 commits between 2.19 and 2.20. It might not be terrible.
On Thu, Jul 9, 2020 at 9:49 AM Fabio Luporini notifications@github.com
wrote:
I could...but it's gonna take forever 😬
so, after an afternoon of CI re-spins, I think we're quite confident to
say that 2.19 doesn't exhibit this behaviourI guess we're gonna enforce <2.20 for now
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/dask/distributed/issues/3955#issuecomment-656236554,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AACKZTCDK24XT7IAD5WIOADR2XYIXANCNFSM4OVSA4DQ
.
OK, I thought it was worse than that. Do you suggest I do it on all commits or just on the merge commits (haven't checked the commit history, will do shortly) ?
We squash-and-merge, so every commit in history is valid and clean
On Thu, Jul 9, 2020 at 10:01 AM Fabio Luporini notifications@github.com
wrote:
OK, I thought it was worse than that. Do you suggest I do it on all
commits or just on the merge commits (haven't checked the commit history,
will do shortly) ?—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/dask/distributed/issues/3955#issuecomment-656242216,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AACKZTBHXFLWWERVVZDJBO3R2XZVZANCNFSM4OVSA4DQ
.
I'm also having this issue. I just wanted to add that we're experiencing this about 1 in 30 times.
Thanks @jrdzha . Do you have time to try the git-bisect process? I would expect this to take around 30-45 minutes.
@mrocklin I'm quite busy for the next few days, but I'll take a look after that.
@mrocklin sorry -- we did try it (OK, not really git bisect for the reasons explained above, but a conceptually analogous thing). We though bumped into other issues, and I eventually forgot to update this thread. I'll let @rhodrin explain our findings 'cause I don't quite remember what the latest news from this front are
@mrocklin Yes, sorry for the delayed response. So we set up an Actions matrix to run our tests many times for each commit between 2.19 and 2.20, and later some earlier and later versions.
Firstly, failures are rare and not particularly consistent (possibly related to load on the github hosted runners). One day we may see a failure 1 in 10 times, the next day we could see ~100 runs with no failure. We managed to produce failures on, I think, all commits between 2.19 and 2.20 and even into some post 2.20 commits. We then tried to produce failures on some older versions, but that day we saw absolutely no failures on any version between 2.16 and 2.21 - I think we ran the tests ~80 times on each version. After that I took a break from this.
So in summary, this one is a real mystery. But my inclining is that it's something memory related: we only managed to produce failures when running a large suite of tests, repeatedly running the failing test individually never produced any failures. Also, I've not managed to reproduce this on my local machine (which is pretty decent).
@rhodrin Yes, for our testing the fails seem to occur randomly. We run our tests using LocalCluster on a 'local' machine (not our Kubernetes cluster) though and they are failing there.
OK, it seems like the next thing to do here then is for someone to produce a reproducible example.
Just wanted to chime in that we are consistently seeing the nanny NoneType errors when running LocalCudaCluster/LocalCluster jobs on cloud instances (code gets wrapped into a container/AWS SageMaker Estimator) however the issue seems to not occur when running with containers on a local machine. Interestingly these errors are non-critical and the computations continue to completion, however the errors dominate/clobber the output.
https://github.com/rapidsai/cloud-ml-examples/blob/master/aws/notebooks/sagemaker_rapids_hpo.ipynb
https://github.com/rapidsai/cloud-ml-examples/blob/master/aws/code/rapids_cloud_ml.py
cc @quasiben
@miroenev it looks like your issue comes is explained at the top of your log output:
FileExistsError: [Errno 17] File exists: '/opt/ml/output/dask-worker-space
There have been some changes to dask-worker directory creation. @FabioLuporini can you upgrade to latest (2.22.0) and try again ?
@quasiben I think they are separate errors.
My hypothesis is that the dask_worker_space issue is due to the way SageMaker attaches the same dataset S3 volume to multiple instances when running HPO. I've since tried to set the temporary directory location in a place that SageMaker's documentation suggests will not overlap between instances but its not entirely clear whether this fixes the dask_worker_space 'error.' We are less concerned about this since it has minimal impact on the output logs (only shows up once) and in practice does not seem to impact the performance of the upstream dask computations.
The nanny memory_monitor error however is quite annoying and dominates the output logs. I've tried to remove it from the logging configuration but it still persists.
@quasiben We're now using 2.22.0. We haven't seen this issue yet, but will report back if we do.
I just ran into this issue on my local machine, so it is alive and well.
from distributed import Client
client = Client(asynchronous=False)
Results in this being outputted over and over again:
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
File "/home/arthur/.local/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/home/arthur/.local/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
os.getcwd() is in a directory without write privileges triggers the error every time.help(distributed) gives version 2.30.0pip3 install dask[complete] as rootfree -h: total used free shared buff/cache available
Mem: 15Gi 10Gi 318Mi 762Mi 4.8Gi 4.1Gi
Swap: 31Gi 1.0Gi 31Gi
I recommend getting logs and then posting them here
client = Client(asynchronous=False, silence_logs=False)
Huh, now those logs are interesting:
I suspect it has to do with how PyCharm configures everything. Especially since I confirmed I can start it via a shell/pipenv.
Edit: I have changed the "ProjectName", but everything else is exactly as the system output it. Including the "" directory.
import os
os.getcwd()
'/home/arthur/src/ProjectName'
from distributed import Client
client = Client(asynchronous=False, silence_logs=False)
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:42219
distributed.scheduler - INFO - dashboard at: 127.0.0.1:8787
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:34001'
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:45005'
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:45505'
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:41067'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.8/runpy.py", line 264, in run_path
code, fname = _get_code_from_file(run_name, path_name)
File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Worker process 200663 exited with status 1
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.8/runpy.py", line 264, in run_path
code, fname = _get_code_from_file(run_name, path_name)
File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:45005'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.8/runpy.py", line 264, in run_path
code, fname = _get_code_from_file(run_name, path_name)
File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.8/runpy.py", line 264, in run_path
code, fname = _get_code_from_file(run_name, path_name)
File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Worker process 200661 exited with status 1
distributed.nanny - INFO - Worker process 200666 exited with status 1
distributed.nanny - INFO - Worker process 200667 exited with status 1
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:34001'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:41067'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:45505'
distributed.scheduler - INFO - Receive client connection: Client-7ec868f1-0d58-11eb-8f4e-b7ca8d0cd9a4
distributed.core - INFO - Starting established connection
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
File "/home/arthur/.local/share/virtualenvs/ProjectName-GY1GZPBf/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/home/arthur/.local/share/virtualenvs/ProjectName-GY1GZPBf/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
With the last "tornado.application Error" repeating endlessly until I killed the console.
I've been able to turn this error on if I set silence_logs='error'.
This example doesn't work on my local:
import contextlib
from concurrent.futures._base import as_completed
import dask
import distributed
from distributed.cfexecutor import ClientExecutor
@contextlib.contextmanager
def executor():
with dask.config.set({"distributed.worker.daemon": True}):
with distributed.LocalCluster(
n_workers=5,
processes=True,
silence_logs='error'
) as cluster:
with distributed.Client(cluster, ) as client:
yield ClientExecutor(client)
def add(z):
return z+ 1
if __name__ == '__main__':
with executor() as pool:
futures = [pool.submit(add, k) for k in range(10)]
for f in as_completed(futures):
print(f.result())
Failing with:
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
File "/home/cjwright/mc/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/home/cjwright/mc/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
This works though:
import contextlib
from concurrent.futures._base import as_completed
import dask
import distributed
from distributed.cfexecutor import ClientExecutor
@contextlib.contextmanager
def executor():
with dask.config.set({"distributed.worker.daemon": True}):
with distributed.LocalCluster(
n_workers=5,
processes=True,
# silence_logs='error'
) as cluster:
with distributed.Client(cluster, ) as client:
yield ClientExecutor(client)
def add(z):
return z+ 1
if __name__ == '__main__':
with executor() as pool:
futures = [pool.submit(add, k) for k in range(10)]
for f in as_completed(futures):
print(f.result())
Does that work for a reproducible example?
(dask version 2020.12.0)
I meet a very similar problem. I ask many nodes to use a single dask-worker-space in the shared file system. When a node want to retire its own workers from cluster.adapt(), it will accidentally kill workers of other nodes.The problem gone when I set a local_directory for each nodes.
I am not sure whether this "using a single dask-worker-space among nodes" is a bad code should be avoided from the beginning. Or this is because of bugs when dealing with "using a single dask-worker-space among nodes". If it's the first case, I would recommend to support this, because when I want a quick use of Dask on a cluster, my intuition won't tell me: "set the local_directory first, otherwise the program will crash". If it's the second cases, I would like to help you to recreate this bug on your side.
@CJ-Wright when I run your first code snippet with the latest dask and distributed release (v2021.04.0). I see AttributeError you posted, however at the top of the traceback I see
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
File "/Users/james/miniforge3/envs/distributed/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/Users/james/miniforge3/envs/distributed/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/james/projects/dask/distributed/distributed/process.py", line 192, in _run
target(*args, **kwargs)
File "/Users/james/projects/dask/distributed/distributed/nanny.py", line 726, in _run
logger.setLevel(silence_logs)
File "/Users/james/miniforge3/envs/distributed/lib/python3.8/logging/__init__.py", line 1409, in setLevel
self.level = _checkLevel(level)
File "/Users/james/miniforge3/envs/distributed/lib/python3.8/logging/__init__.py", line 194, in _checkLevel
raise ValueError("Unknown level: %r" % level)
ValueError: Unknown level: 'error'
However setting silence_logs=logging.ERROR or silence_logs=False runs successfully. Is this also the case for you?
Most helpful comment
I just ran into this issue on my local machine, so it is alive and well.
Results in this being outputted over and over again:
Known Causes:
os.getcwd()is in a directory without write privileges triggers the error every time.Version Information
help(distributed)gives version 2.30.0pip3 install dask[complete]as rootfree -h: