dask.delayed not using dask.cache inside delayed function

Created on 15 Jan 2018 · 4 Comments · Source: dask/dask

Is this behaviour correct? I would expect the calls to f inside g to be using the cache.

dask version 0.16.1

import dask.delayed
import dask.cache

_cache = dask.cache.Cache(1e9)
_cache.register()

@dask.delayed(pure=True)
def f(x, y=1):
    print('compute f {}'.format((x, y)))
    return x, y

@dask.delayed(pure=True)
def g(x, z, y=1):
    z(x)
    z(x)
    z(x)
    print('compute b {}'.format((x, y, z)))
    return x, y, z(x)

f(1).compute()
g(1, f).compute()

Output:

compute f (1, 1)
compute f (1, 1)
compute f (1, 1)
compute f (1, 1)
compute b (1, 1, <function f at 0x7f6fa4dc7a60>)
compute f (1, 1)


cottrell

All 4 comments

Right, the cache can only capture functions that were part of a task graph. Calling delayed functions from within delayed functions is not typical. I'm actually confused about why those calls run at all rather than just creating delayed objects within your function.

mrocklin on 15 Jan 2018
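To illustrate the point about the task graph, here is a minimal sketch that does not use dask.cache at all. It only counts how often f really runs, assuming the default scheduler; the `calls` list and the `runs_*` names are made up for this illustration. Two graph tasks built from the same pure call share a key and run once, while plain calls inside a task body are invisible to the scheduler and run every time.

```python
import dask
from dask import delayed

calls = []  # hypothetical bookkeeping: records every real execution of f


def f(x, y=1):
    calls.append((x, y))
    return x, y


delayed_f = delayed(f, pure=True)

# Two graph tasks built from the same pure call get the same key,
# so the scheduler runs f only once for both of them.
first, second = delayed_f(1), delayed_f(1)
dask.compute(first, second)
runs_as_graph_tasks = len(calls)


@delayed
def g(x):
    # Plain Python calls inside a task body: the scheduler never sees
    # these, so nothing at the graph level can deduplicate or cache them.
    f(x)
    f(x)
    return f(x)


calls.clear()
g(1).compute()
runs_inside_task = len(calls)

print(runs_as_graph_tasks, runs_inside_task)
```

Anything that works at the level of graph keys can only help with the first case.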


Yes, this is expected behaviour. When you call g and pass the delayed f as a parameter, the running function g sees the un-delayed, "concreted" version of f: a normal function, which it can then call. Further to the reply above, I would say that passing delayed functions as parameters to delayed functions is even rarer than attempting to call them. This raises the question: what are you trying to achieve?

martindurant on 15 Jan 2018
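A small sketch of what the task body actually receives when a delayed function is passed as an argument. The helper `inspect_argument` is hypothetical and exists only for this illustration; the rest uses dask.delayed as in the original report.

```python
from dask import delayed
from dask.delayed import Delayed


def f(x, y=1):
    return x, y


delayed_f = delayed(f, pure=True)


@delayed
def inspect_argument(z):
    # Report whether z is still a Delayed, and what calling it returns.
    return isinstance(z, Delayed), z(1)


# Computing the wrapper on its own yields the original callable.
unwrapped = delayed_f.compute()

# Passing the wrapper as an argument: it is computed before the task runs,
# so the body sees the plain function and calling it executes f directly.
received_delayed, value = inspect_argument(delayed_f).compute()

print(unwrapped is f, received_delayed, value)
```

This is why the calls to z(x) in the original g print immediately instead of producing new delayed objects.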


Yes, we were actually observing this un-delayed, "concreted" behaviour and wondering if this pattern was encouraged or discouraged. :) And basically you are saying this only accidentally works (sort of), since f itself is a Delayed instance and has an f.compute() that returns the original callable. It is not currently obvious to me why f.compute() should return the callable. I also have not quite digested the fact that the wrapped function delayed(f) and the wrapped function result delayed(f)(1) are both Delayed instances.

Sounds like the encouraged pattern is to only pass fully specified delayeds (all args given) to other delayeds to avoid any depth. Probably something using factories instead like this?

@dask.delayed(pure=True)
def g2(x, z, y=1):
    print('compute g2 {}'.format((x, y, z)))
    return x, y, z

def factory(x, z, y=1):
    return g2(x, z(x), y=y)  # call z outside g2 so that z(x) is a task in the graph

factory(3, f)

For what it is worth, this came up while tinkering/thinking about the following things in the context of caching:

a delayed function produces a list of args, we map another delayed over the list of those args
subclass/hijacking the caching to insert a persistence layer (for example, write/append parquet and return a convenience view/slice of the parquet chunk).

We are fans of dirty/clean graphs to economize dev workflows where the data or the code underlying some parts of a DAG evolves, so we are probably coming at this from a strange angle.

cottrell on 15 Jan 2018

> It is not currently obvious to me why f.compute() should return the callable. I also have not quite digested the fact that both the wrapped function delayed(f) and the wrapped function result delayed(f)(1) are both Delayed instances.

delayed(object) -> Delayed in all cases. If the underlying object is a callable you can call it to return a delayed computation, but you can also call delayed on objects and then call their methods. This is made efficient by having a few small Delayed subclasses to handle special cases (like wrapping a function).
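A short sketch of that statement, using a made-up `add` function and a small list as the wrapped objects. Every intermediate here is a Delayed (or a subclass of it), whether it wraps a function, a call, a plain value, or a method call on a wrapped value.

```python
from dask import delayed
from dask.delayed import Delayed


def add(a, b):
    return a + b


wrapped_function = delayed(add)         # wraps the callable itself
wrapped_call = wrapped_function(1, 2)   # calling it gives a delayed computation
wrapped_value = delayed([3, 1, 2])      # wraps a plain object
method_call = wrapped_value.count(1)    # method calls are delayed as well

all_delayed = all(
    isinstance(obj, Delayed)
    for obj in (wrapped_function, wrapped_call, wrapped_value, method_call)
)

print(all_delayed, wrapped_call.compute(), method_call.compute())
```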

> Sounds like the encouraged pattern is to only pass fully specified delayeds (all args given) to other delayeds to avoid any depth.

You shouldn't call delayed functions inside delayed functions. In other projects, we've used the following pattern:

from dask import delayed

def f(x, y, z):
    ...  # ordinary, non-delayed implementation

delayed_f = delayed(f)

@delayed
def g(x, y, z):
    return f(x, y, z) + 1  # call the non-delayed version in the delayed function

# Call the delayed version outside of delayed functions
delayed_f(1, 2, 3) + g(1, 2, 3)
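For a version that runs end to end, the sketch below fills in a body for f. The body (a simple sum) is an assumption made purely for illustration and is not from the original comment.

```python
from dask import delayed


def f(x, y, z):
    # Hypothetical body, chosen only so the example can run.
    return x + y + z


delayed_f = delayed(f)


@delayed
def g(x, y, z):
    # Inside a delayed function, call the plain f.
    return f(x, y, z) + 1


# Outside delayed functions, use the delayed version so the call
# becomes a task in the graph.
total = delayed_f(1, 2, 3) + g(1, 2, 3)
result = total.compute()

print(result)
```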

> a delayed function produces a list of args, we map another delayed over the list of those args

You can't iterate over a delayed value unless you know its length (see the nout parameter). In this case I'd recommend having non-delayed functions that return lists of delayed values, and delayed functions that operate on each of these values (see, e.g., the to_delayed/from_delayed functions in each collection). In general, you can't do anything that makes the shape of the graph depend on the results of a delayed function.
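A rough sketch of both suggestions. The functions `min_max`, `double`, and `make_values` are hypothetical names invented for this example; only `delayed`, `nout`, and `compute` come from dask.

```python
from dask import compute, delayed


def min_max(values):
    return min(values), max(values)


# nout=2 declares the length of the result, so it can be unpacked
# while the graph is being built.
low, high = delayed(min_max, nout=2)([3, 1, 2])

# Without nout the length is unknown and iteration is refused.
try:
    list(delayed(min_max)([3, 1, 2]))
    iterable_without_nout = True
except TypeError:
    iterable_without_nout = False


@delayed
def double(v):
    return 2 * v


def make_values(n):
    # A non-delayed function returning a list of delayed values: the list
    # length is known up front, so mapping over it is ordinary Python.
    return [delayed(i) for i in range(n)]


doubled = compute(*[double(v) for v in make_values(3)])

print(compute(low, high), iterable_without_nout, doubled)
```

In both cases the number of tasks is fixed before anything is computed, which is the constraint described above.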

> We are a fan of dirty/clean graphs to economize dev workflows where the data or the code underlying some parts of a dag evolves so we are probably coming at this from a strange angle.

Alternatively, you might find the futures interface more flexible if you need graphs that evolve over time.