Scikit-image: When running in multi-process Pooch tries to create a directory that already exist

Created on 8 May 2020  路  25Comments  路  Source: scikit-image/scikit-image

Description

Previous version was good, but I cannot use new scikit-learn in a docker when running first time in multi-processing setting. Pooch tries to create same directory in first run in __init__ and that results in crash.
It is not a good idea to try to create directory in __init__, specially without any lock. Even with the lock you should be careful not to create the lock on a filesystem that does not support file locks.

Way to reproduce

  1. In a docker, when skimage is never imported before
  2. Import skimage in multiple processes at once
  3. This is a race condition on which process creates the cache first
# Place the full code we need to recreate your issue here
from skimage import transform
File "/miniconda/envs/py37/lib/python3.7/site-packages/skimage/__init__.py", line 133, in <module>
    from skimage import transform
  File "/miniconda/envs/py37/lib/python3.7/site-packages/skimage/__init__.py", line 133, in <module>
    from .data import data_dir
  File "/miniconda/envs/py37/lib/python3.7/site-packages/skimage/data/__init__.py", line 86, in <module>
    from .data import data_dir
  File "/miniconda/envs/py37/lib/python3.7/site-packages/skimage/data/__init__.py", line 86, in <module>
    urls=registry_urls,
  File "/miniconda/envs/py37/lib/python3.7/site-packages/pooch/core.py", line 371, in create
    urls=registry_urls,
  File "/miniconda/envs/py37/lib/python3.7/site-packages/pooch/core.py", line 371, in create
    path = make_local_storage(path, env, version)
  File "/miniconda/envs/py37/lib/python3.7/site-packages/pooch/utils.py", line 256, in make_local_storage
    path = make_local_storage(path, env, version)
    os.makedirs(path)  File "/miniconda/envs/py37/lib/python3.7/site-packages/pooch/utils.py", line 256, in make_local_storage

  File "/miniconda/envs/py37/lib/python3.7/os.py", line 221, in makedirs
    os.makedirs(path)
  File "/miniconda/envs/py37/lib/python3.7/os.py", line 221, in makedirs
        mkdir(name, mode)mkdir(name, mode)

FileExistsErrorFileExistsError: : [Errno 17] File exists: '/root/.cache/scikit-image/0.17.0'
[Errno 17] File exists: '/root/.cache/scikit-image/0.17.0'
Traceback (most recent call last):
  File "/miniconda/envs/py37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/miniconda/envs/py37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/miniconda/envs/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/miniconda/envs/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)

Version information

# Paste the output of the following python commands
from __future__ import print_function
import sys; print(sys.version)
import platform; print(platform.platform())
import skimage; print("scikit-image version: {}".format(skimage.__version__))
import numpy; print("numpy version: {}".format(numpy.__version__))
3.7.6 (default, Jan  8 2020, 19:59:22)
[GCC 7.3.0]
Linux-4.4.0-176-generic-x86_64-with-debian-stretch-sid
scikit-image version: 0.17.0
numpy version: 1.18.1
bug

Most helpful comment

No mkdir will not use locks. Just importing should not try to create files/folders.
Any other function, if it touches the filesystem, can be protected by a barrier/filelock by the caller.
So it should be documented which functions write to disk (read is fine).

All 25 comments

Thanks for the report @dashesy! We'll try to get a fix for this out ASAP, but as a preliminary workaround, can you run skimage.data.download_all() in a separate process first?

@leouieda, any ideas for how best to deal with this?

@jni one thing that might help is to wrap the pooch.create call in a function instead of using it as a global. For example:

def get_pooch():
    return pooch.create(...)

def download_all():
    pooch = get_pooch()
    ...

This is not ideal since it would only delay the crash to the first call of a download function. But at least it wouldn't crash on import.

@dashesy do you have any suggestion for a safe and cross-platform way to create the cache directory when using multiprocessing? This is definitely a test case we should have added.

I opened fatiando/pooch#170 to see what we can do on our end to fix this issue and prevent it from happening to others. Any ideas would be much appreciated.

Thanks for the prompt reply!
I think delaying download_all will fix this the best practical way.

  1. It is not very clean to try to fix this with a separate process first, because usually the import from skimage import transform is done on the module level, but the fix would require calling barrier, then import in master process, then let other processes import, a little ugly for imports.
  2. Another workaround is to call the import just once if skimage is baked in the docker (not installed afterwards).

  3. A cross-platform way of creating a folder also is not very straight-forward, it will require file locks and is flimsy at best.

So the best solution IMO is to delay the folder creation until download_all is called, and to solve that in multi-process scenario would be easier because barrier could be used before calling dawnload_all. So it would need just documentation, because there are many ways to call barrier that skimage may not rely on just one backend. We use skimage for randomizing resizing algorithm, so we never will need that.

since you are running this on docker, can you force remove pooch. As of 0.17, I think it is only used to run part of the tests.

conda remove pooch --force

I'm not a big fan of including pooch by default. It pulls in the whole network stack, but I feel like that is the spirit of conda-forge.

@dashesy ah, so the proximal problem is that pooch tries to create a cache directory at import time, is that right? So @leouieda's solution of delaying fetcher creation would work for you, right?

def get_pooch():
    return pooch.create(...)

def download_all():
    pooch = get_pooch()
    ...

? If you can confirm locally that this fixes the problem, we can get a fix in 0.17.2 probably before the weekend is up. =) Thanks for your patience with this!

What if pooch just added the exist ok =True flag? Would that solve things too?

If that fix works, we could also try to postpone the folder creation from pooch.create to pooch.Pooch.fetch (only when the download is actually triggered). This would require more effort in Pooch and might take a while since I'm a bit short on time. The quickest thing would be what @jni mentioned.

What if pooch just added the exist ok =True flag? Would that solve things too?

That might also work. We tried that initially but I don't remember why we removed it (might have been a 2.7 compatibility issue).

fatiando/pooch#171 adds the exists_ok argument and a test to make sure it works. I'll make a bug fix release of pooch as soon as that is ready. If that works, there might not be a need to make changes to skimage at the moment.

@dashesy PR fatiando/pooch#171 seems to have worked. I could recreate your error and adding exist_ok to os.makedirs fixed the errors. Would it be possible for you to try installing pooch from the master branch (https://github.com/fatiando/pooch/) to see if it solves your problem? If it does, I can make a release today still.

'exist_ok' will only delay the inevitable in the race condition. You will get an IO error because it really doesn't call a barrier, or file lock.

Delaying the fetcher will fix this.

So mkdir doesn't acquire file locks for itself?

Download all is not the only way to get files.
Pooch is called transparently whenever you try to get any other dataset as well.

Is it important that all fetching is multiprocess safe to you, or just importing?

No mkdir will not use locks. Just importing should not try to create files/folders.
Any other function, if it touches the filesystem, can be protected by a barrier/filelock by the caller.
So it should be documented which functions write to disk (read is fine).

Ok, I can see your point. The reason why we decided to create a data dir is twofold:

  1. We wanted data_dir to be a real directory, at all times, so that the users may use old behavior that used file system operations to list available datasets. https://github.com/scikit-image/scikit-image/pull/3945#issuecomment-498141893
  2. We wanted data_dir to be populated with a README file to explain why some data was missing. https://github.com/scikit-image/scikit-image/pull/3945#issuecomment-509641378

I'm not sure how conditioning this on a a call to a function accomplishes these two tasks.

We could do something for python 3.7+, using __getattr__, but not for python 3.6.

I think the workaround is that in your docker creation procession, at the very end, run

python -c "import skimage"

once.

You don't need to download all the datasets, and needlessly grow the size of your docker image.

It will grow it by a few MB, but at least not with the ALL the datasets.

@hmaarrfk @dashesy fatiando/pooch#173 should fix this issue for good. It delays the creation of the data folder until a file is actually downloaded. So it would be safe to call pooch.create in the module level. Just waiting on a review before merging (any feedback you might have is appreciated).

ha! I assumed that was already the case.

The issue for us is that we actually fetch a README.md to populate the directory. on bootup.

I actually think your PR https://github.com/fatiando/pooch/pull/171 fixed the multiple file creation issue in parallel issue.

We probably have an issue with the README.md creation.

Were you able to recreate the OP's issue with your test case?

Using mkdir(..., exist_ok=True) seemed to solve the directory creation for me.

Readme creation will probably result in a file corruption, but that is not a big issue because it is not a data.

@dashesy I take your thumbs up to be that https://github.com/fatiando/pooch/pull/171 did fix your particular crashes upon startup?

If so, I'm going to try and synthesize this discussion (along with some other thoughts of mine) in a separate issue.

Note that:

  • 0.17.1 had a hard dependency on pooch
  • 0.17.2 has an optional dependency on pooch

My target is that in 0.17, we don't change the bahavior too much, so that we can focus on a longer term solution for 0.18. That said, I don't want to leave you stuck on 0.16 if there is no workaround.

Yes I tried and it did not crash (I copied over the changes to these files as patch instead of building from master). So it looks good, thanks.
Will open another issue later if I see it in real scenario.

Can we close this issue then? Which would be amazing! =)

I was kinda keeping it open as a reminder, but I guess the original issue has been closed

We have summarized the key points in #4719

@hmaarrfk I was able to reproduce the issue by running in multiple processes/threads. But as @dashesy mentioned, that would just be masking the issue that we shouldn't be touching the file system at import time.

I'll make a pooch 1.1.1 release today which should fix the issues on our side as well.

Thank you for your fast response, @leouieda!

Was this page helpful?
0 / 5 - 0 ratings