Sometimes I'm working on a python package which is just a directory with contents constantly being altered. I just want to send that directory to my dask workers without making a zipfile or egg or whatever. Currently I use these functions to get the job done:
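import io, os, tarfile

def fn_to_targz_string(fn):
    # Pack a file or directory into an in-memory gzipped tar archive.
    with io.BytesIO() as bt:
        with tarfile.open(fileobj=bt, mode='w:gz') as tf:
            tf.add(fn, arcname=os.path.basename(fn))
        # Read only after the archive is closed, so the gzip stream is complete.
        bt.seek(0)
        s = bt.read()
    return s

def extract_targz_string(s, *args, **kwargs):
    # Local import keeps the function self-contained.
    import io, tarfile
    with io.BytesIO() as bt:
        bt.write(s)
        bt.seek(0)
        with tarfile.open(fileobj=bt, mode='r:gz') as tf:
            # Extra arguments are passed through to TarFile.extractall.
            tf.extractall(*args, **kwargs)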
If we used tarfile in this way for Client.upload_file, it wouldn't matter whether the user wanted to upload a file or a directory. It should just work.
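To illustrate the "file or directory, it just works" point, here is a minimal round-trip sketch. The names pack_path, unpack_bytes and roundtrip_demo are made up for this example and are not part of distributed; the sketch only shows that the same tarfile calls handle a lone file and a whole directory.

```python
import io
import os
import tarfile
import tempfile


def pack_path(path):
    """Return a gzipped tar archive of a file or directory as bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        # normpath strips a trailing slash so basename is never empty.
        # TarFile.add recurses into directories by default.
        tf.add(path, arcname=os.path.basename(os.path.normpath(path)))
    return buf.getvalue()


def unpack_bytes(data, dest):
    """Extract an archive produced by pack_path into dest."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tf:
        tf.extractall(dest)


def roundtrip_demo():
    """Pack one directory and one lone file, unpack both, list what arrived."""
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        pkg = os.path.join(src, "mypkg")
        os.makedirs(pkg)
        with open(os.path.join(pkg, "__init__.py"), "w") as f:
            f.write("VALUE = 1\n")
        single = os.path.join(src, "helper.py")
        with open(single, "w") as f:
            f.write("X = 2\n")
        # The same two calls are used for the directory and the single file.
        for path in (pkg, single):
            unpack_bytes(pack_path(path), dst)
        return sorted(os.listdir(dst))
```

Calling roundtrip_demo() packs both inputs and returns the top-level names found in the destination directory.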
I believe the only necessary changes would be replacing the beginning of Client._upload_file with something like fn_to_targz_string (shown above) and the beginning of Worker.upload_file with something like extract_targz_string (shown above). If you wanted to be complete, you could also add something to the belly of Worker.upload_file so that the module would be properly inserted into names_to_import.
Generally adding directory support to upload file seems sensible to me.
However, I think that relying on tar-gz might be problematic just because these aren't universally deployed (Dask runs on Windows clusters). Even though it's quite old, perhaps the zipfile module in the stdlib could be used here? We could also just open up all of the files and send all of their contents in a big nested dict.
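For comparison, the zipfile variant might look roughly like the sketch below. The helper names (dir_to_zip_bytes, zip_bytes_to_dir, zip_roundtrip_demo) are hypothetical, and one known limitation is that empty directories are not carried over, because only files are written to the archive.

```python
import io
import os
import tempfile
import zipfile


def dir_to_zip_bytes(directory):
    """Zip a directory tree in memory, keeping the top-level folder name."""
    directory = os.path.abspath(directory)
    parent = os.path.dirname(directory)
    buf = io.BytesIO()
    # ZIP_DEFLATED needs zlib; the default ZIP_STORED would avoid that.
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the parent, e.g. "mypkg/sub/mod.py".
                zf.write(full, arcname=os.path.relpath(full, parent))
    return buf.getvalue()


def zip_bytes_to_dir(data, dest):
    """Extract an archive produced by dir_to_zip_bytes into dest."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(dest)


def zip_roundtrip_demo():
    """Zip a small package, unzip it elsewhere, and read one file back."""
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        pkg = os.path.join(src, "mypkg")
        os.makedirs(os.path.join(pkg, "sub"))
        with open(os.path.join(pkg, "sub", "mod.py"), "w") as f:
            f.write("X = 1\n")
        zip_bytes_to_dir(dir_to_zip_bytes(pkg), dst)
        with open(os.path.join(dst, "mypkg", "sub", "mod.py")) as f:
            return f.read()
```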
tarfile has been in the standard library since Python 2.3... do you know if it is one of the optional components or is it mandatory? I kind of assumed that even if the computer doesn't have its own zlib library, the zlib code gets baked into CPython. I don't have a Windows machine so I can't easily check.
Ah, I didn't realize that. Yeah, if it's in the standard library then I'm all for tar files.
@jacksonloper is uploading directories something that you have time to contribute?
Sure, but I can't test on Windows machines :(.
Whoops, wrong button. I'll submit a PR.
That's just fine
Any update? I found this PR https://github.com/dask/distributed/pull/939 but it was closed. I'm wondering if there's a way to upload a directory and import only the importable files in that directory.
Any update on this? Uploading a project folder for the workers to use is a very common use case, and currently there is no easy way to do it.
People keep asking if there is any update, but clearly this issue hasn't been updated. If you are interested in working on resolving this then that would be great!
I might take a look at implementing this. @mrocklin from an API design standpoint, would you rather have this functionality built into the existing upload_file function or create a new upload_dir function?
My preference would be the same function if it can be done. You might also want to be aware of https://github.com/dask/distributed/pull/4238 by @ian-r-rose which is currently modifying the upload_file function (although in a different way).
Does it make sense to have upload_file overloaded for directories? I understand the preference for it being in one code path, but from a developer experience/ergonomics standpoint it seems more straightforward to have a wrapper function called upload_dir (or similar).
After thinking about this more, it does seem that having a separate upload_dir function would be useful beyond just clarity and ergonomics. In my case, for example, the code is set up like:
src/
    foo/
    bar/
    baz/
        a.py
        b.py
and Python files import each other using import src.baz.a. When I upload files to dask workers, I want to only upload the baz folder but preserve the src/baz hierarchy.
I was thinking an API like def upload_dir(path_to_dir, path_prefix="") (kwarg naming tbd) would be helpful in this case. To go through some examples:
upload_dir('src/baz') would upload just the baz directory and add baz to the Python path on the workers
upload_dir('src/baz', path_prefix='src/') would upload the baz directory to a src/baz directory on the workers and add src/baz to the Python path
Edit:
On second thought, it might even be more intuitive to have upload_dir('src/baz') upload the _contents_ of the baz folder with no parent directory, and upload_dir('src/baz', path_prefix='src/baz') would upload the contents of baz into a 'src/baz' folder on the remote workers.
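To make the "contents only, optional prefix" variant concrete, here is a rough sketch of how local files could map to paths on the workers. plan_upload and demo are hypothetical helpers written for this comment; they do not touch distributed at all and only compute the path mapping, with worker-side paths always written using forward slashes.

```python
import os
import posixpath
import tempfile


def plan_upload(path_to_dir, path_prefix=""):
    """Map each worker-side relative path to the local file it comes from.

    Hypothetical helper: the contents of path_to_dir are placed under
    path_prefix, with no parent directory added automatically.
    """
    plan = {}
    for root, _dirs, files in os.walk(path_to_dir):
        rel_root = os.path.relpath(root, path_to_dir)
        # Files directly inside path_to_dir get no intermediate folders.
        parts = [] if rel_root == os.curdir else rel_root.split(os.sep)
        for name in files:
            remote = posixpath.join(path_prefix, *parts, name)
            plan[remote] = os.path.join(root, name)
    return plan


def demo():
    """Build src/baz/{a.py,b.py} in a temp dir and plan both call styles."""
    with tempfile.TemporaryDirectory() as tmp:
        baz = os.path.join(tmp, "src", "baz")
        os.makedirs(baz)
        for name in ("a.py", "b.py"):
            open(os.path.join(baz, name), "w").close()
        bare = sorted(plan_upload(baz))
        prefixed = sorted(plan_upload(baz, path_prefix="src/baz"))
    return bare, prefixed
```

With no prefix the files land at the top level on the worker; with path_prefix='src/baz' the src/baz hierarchy is preserved, so import src.baz.a could keep working provided the parent of src is on the Python path.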
@mrocklin thoughts on the above design?