Dask: How to specify the directory that dask uses for partd?

Created on 14 Oct 2016 · 5 Comments · Source: dask/dask

Apparently, dask writes to the /tmp folder during disk-based shuffle operations. On the system I am using, /tmp is mounted on a very small partition (30 GB), which causes the following error after some calculations:

IOError: [Errno 28] No space left on device

Traceback

File "[path_to_anaconda]/lib/python2.7/site-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "[path_to_anaconda]/lib/python2.7/site-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "[path_to_anaconda]/lib/python2.7/site-packages/dask/dataframe/shuffle.py", line 395, in shuffle_group_3
p.append(d, fsync=True)
File "[path_to_anaconda]/lib/python2.7/site-packages/partd/encode.py", line 25, in append
self.partd.append(data, **kwargs)
File "[path_to_anaconda]/lib/python2.7/site-packages/partd/file.py", line 41, in append
f.write(v)

How can I specify the folder that dask uses for the shuffle? What else could I do to avoid this problem? I do not have administrative privileges, so remounting /tmp on a larger partition is not an option.

So far, I only saw the /tmp folder grow bigger. At which point does dask delete the files?

All 5 comments

Setting the environment variable TMPDIR seems to be one way to influence this behaviour.

In the file file.py in the partd repository, I found the following code:

class File(Interface):
    def __init__(self, path=None):
        if not path:
            path = tempfile.mkdtemp('.partd')

So as long as dask does not set the path explicitly, I guess this should always work, because tempfile looks up the location of the temporary folder in the TMPDIR environment variable.
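To check that reasoning without running a shuffle, here is a minimal sketch that repeats the tempfile.mkdtemp('.partd') call from the snippet above while TMPDIR points somewhere else. The helper name partd_style_tempdir and the scratch directory are made up for illustration and are not part of dask or partd. Note that tempfile caches its location after the first lookup, so in practice TMPDIR has to be set before the process creates its first temporary file; the sketch clears that cache by hand.

```python
import os
import shutil
import tempfile


def partd_style_tempdir(scratch_root):
    """Create a directory with the same call partd uses, with TMPDIR set to scratch_root.

    Hypothetical helper for illustration only.
    """
    old_env = os.environ.get("TMPDIR")
    old_cache = tempfile.tempdir
    os.environ["TMPDIR"] = scratch_root
    tempfile.tempdir = None  # drop the cached location so TMPDIR is read again
    try:
        return tempfile.mkdtemp(".partd")  # same call as in the partd snippet
    finally:
        # Restore the previous environment and cache so nothing else is affected.
        if old_env is None:
            del os.environ["TMPDIR"]
        else:
            os.environ["TMPDIR"] = old_env
        tempfile.tempdir = old_cache


if __name__ == "__main__":
    root = tempfile.mkdtemp()  # stand-in for a directory on a larger partition
    print(partd_style_tempdir(root))  # a '...partd' directory located under root
    shutil.rmtree(root)
```

The directory has to exist and be writable, otherwise tempfile falls back to its other candidate locations.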

Doesn't the environment variable influence other applications as well? It seems it would be best to be able to optionally specify a temp directory within dask itself, so that it is decoupled from the rest of the system. Dask may be working with really huge files that will not fit in, for example, the default temp directory location on Windows on a small SSD C: drive.

You can set this for dask with the following:

dask.set_options(temporary_directory='/path/to/tmp')

This can also operate as a context manager.

Now it's

dask.config.set({'temporary_directory': '/path/to/tmp'})
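For a temporary override, the newer call can also be used as a context manager. A minimal sketch, assuming a dask version that provides dask.config.set; '/path/to/tmp' is the same placeholder as above and the variable name inside is only for illustration:

```python
import dask

# The override applies only inside the with-block; the previous value is
# restored when the block exits.
with dask.config.set({"temporary_directory": "/path/to/tmp"}):
    inside = dask.config.get("temporary_directory")

print(inside)
```

Any shuffle that should spill to that directory would have to be computed inside the block.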

Also, note that setting DASK_TEMPORARY_DIRECTORY doesn't work; I'm not sure why.

Thanks for the update @letmaik.

"Also, note that setting DASK_TEMPORARY_DIRECTORY doesn't work"

Hmmm that seems strange. I'm not able to reproduce:

In [1]: import os

In [2]: os.environ.get("DASK_TEMPORARY_DIRECTORY")

In [3]: os.environ["DASK_TEMPORARY_DIRECTORY"] = "foo"

In [4]: import dask

In [5]: dask.config.get("temporary_directory")
Out[5]: 'foo'