Julia: API Request : Interrupt and terminate a task

Created on 27 Mar 2014  ·  24 comments  ·  Source: JuliaLang/julia

Most helpful comment

I think we should look closely at the Trio approach to I/O in the future—i.e. post 1.0 (so this may have to be optional in 1.x or it may have to wait until 2.0). It has some really nice properties, including:

  1. Every task spawned within a function finishes before that function returns unless you spawn the task within an explicit "task nursery" that outlives the spawning function.

  2. There's a natural parent task to handle every child task failure—no more task failures disappearing into the void. This has been a frequent point of contention between @JeffBezanson and myself; the Trio approach provides a nice clean solution that could make both of us happy.

  3. There's a clear structure for cancellation of tasks and subtasks: if you kill a task, that also kills all subtasks. In particular, this means that instead of having timeout arguments on every possible blocking operation, you can do external cancellation of blocking tasks correctly—this composes better and means you don't have to wait until a potentially indefinite chain of timers expires.

The most effective way forward may be to see if I can get @JeffBezanson and @njsmith in a room together some time since Nathaniel is a much more effective explainer of and advocate for the Trio model than I am. (Hope you don't mind the ping, Nathaniel!)

All 24 comments

Related: #4037

Also related: #1700

We can already do this using schedule. The only problem is that the task can catch the exception and retry; there is no way to be sure the task ends. Maybe the interface can just be t.state = :done. We just need to update schedule to drop finished tasks.

We should probably still have terminate(t::Task) = (t.state = :done; :ok) defined. Just seems odd that in this particular case, we expect the user to access a member field directly, while in other cases, a Task object is effectively used as an opaque handle.

Nah, that won't be sufficient anyway. We should remove the task from whatever wait queue it is in, so it can be GC'd.
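For reference, `schedule` already supports raising an exception inside a task, which is the mechanism under discussion. A rough sketch (assumes Julia ≥ 1.3 for `TaskFailedException`; note the runtime does not promise this is safe on arbitrary tasks, which is part of why the thread continues):

```julia
c = Channel{Int}(1)       # something for the task to block on
t = @async take!(c)       # task parks in the channel's wait queue
sleep(0.1)                # let it start and block

# Wake the task with an exception instead of a value; its wait point
# rethrows the exception. As noted above, the task could still catch
# it and retry, so this does not guarantee the task ends.
schedule(t, InterruptException(), error=true)

try
    wait(t)
catch err
    err isa TaskFailedException   # wraps the InterruptException
end
```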

Bumping. Do we have this functionality yet? Do we want it?

Don't have it yet. We should. Killing a task should also release whatever resource it is waiting on - file fd, socket, remote reference, etc.

I don't think we should add this, since "releasing whatever resource" is generally impractical and buggy. I don't know of any APIs that have this sort of feature but don't warn you not to use it due to the infeasibility of cleaning up state afterwards:
http://docs.oracle.com/javase/1.5.0/docs/guide/misc/threadPrimitiveDeprecation.html
https://msdn.microsoft.com/en-us/library/windows/desktop/ms686717%28v=vs.85%29.aspx
https://internals.rust-lang.org/t/thread-cancel-support/3056 (rust doesn't have it, this is a discussion on why not)

After reading those links, an appropriate solution would be to define an interrupt(t::Task).

Currently it should just throw an InterruptException in the target task if it is waiting on I/O, remote reference, Condition, etc. For compute bound tasks, if and when we have tasks scheduled on different threads, it could send an interrupt signal to the specific thread (if possible).

+1 for interrupt(t::Task). @amitmurthy Do you have any idea of how to throw the InterruptException on the task? I feel like this solution would also help me figure out how to implement a stacktrace(t::Task) method (i.e., run stacktrace() in the task for debugging).

stacktrace(::Task) is much easier than interrupt(::Task)

Another reference on this topic is https://news.ycombinator.com/item?id=13470452

Note, that the current preferred mechanism of aborting another Task is to close the shared resource and let the runtime clean it up synchronously. This has the benefit of being reliable, easy to code, and already exists. It also is less racy, since close is stateful (once closed, the resource remains closed) rather than an edge-driven event, and typically also an expected condition (so it doesn't require any extra effort to handle).
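Concretely, the close-the-shared-resource pattern looks like this (a minimal sketch using a Channel):

```julia
c = Channel{Int}()        # resource shared between the task and its canceller
t = @async take!(c)       # task blocks waiting for data
sleep(0.1)                # give it a chance to park

close(c)                  # stateful: once closed, the channel stays closed

# take! on a closed, empty Channel raises, so the task fails and the
# runtime cleans it up synchronously; waiting on the task rethrows.
cancelled = try
    wait(t); false
catch err
    true
end
```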

That works for libuv resources implementing close and Channels only. For tasks waiting on a remotecall or waiting on a Future/RemoteChannel users have no access to the Condition variables the task is waiting on. And implementing close(::Condition) which would invalidate all current and future calls on a Condition object I think is not correct. If we do that we may as well have interrupt(::Task) call close on the waiting condition which would bring us back to the issue of proper cleanup in the libuv case. Right?

No, it would still be different because it would no longer be specific to intended interruption. For example, you might end up aborting a call to close or showerror instead of the intended job.

I think the remotecall functions generally have async versions which return a handle to the Channel? I think in most other cases, the resource is passed in as an argument which gives the caller some leverage. In the worst case for remotecall, since the resource argument is worker-pid, you could rmprocs(p) to kill / close the connection to that remote worker.

I think the remotecall functions generally have async versions which return a handle to the Channel

The calls return a Future and a wait on a Future results in a remote task that waits on the backing channel waiting for data. On the caller we are waiting on a Condition which will be triggered by a response from the remote wait.

In a statement like @async remotecall_fetch(....) we only have access to a Task object. rmprocs(p) seems like an overkill but is probably the correct way to do it currently, as we don't have a means to interrupt the specific remote task.

Right, but I thought that's why remotecall is available. And terminating that Task wouldn't actually notify the remote worker to stop, but might confuse / corrupt it when it tries to report its results. I realize that ensuring cancel-ability may require thinking about how it'll work and threading out handles to the objects that can be used to stop the work. But I don't see how it could be done any other way. There's no guarantee in the @async example that it isn't actually implemented as @async wait(@async remotecall_fetch()) (hopefully not intentionally...), so all that killing the Task directly accomplishes is destroying the monitoring process.

Someone has written an article that draws an equivalence between @schedule-like behavior and the evils of goto.

https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful

There is a Julia-specific thread to this conversation at https://discourse.julialang.org/t/schedule-considered-harmful/10540

As someone who learned programming on old-fashioned BASICs like Applesoft BASIC on the Apple II+, and then had to move to procedural programming in C, I initially had the same gut-response to this article as I did back then. I'd call it "instinctive repulsion". But of course we do all now recognize goto as evil, so I reset my thinking and gave it an honest review.

In doing so, I have come to the conclusion that the article makes some great points. There is no good way to universally handle uncaught exceptions at the top of a @schedule'd task other than to drop them on the floor. There isn't really a global (or even local) repository of outstanding Tasks that were created by @schedule, so you can't really even know what's running, or what may have been left out there by a black-box function call you have made.

But the reason I mention this article in this particular discussion is that I think having to support such unbounded @schedule calls may be one of the reasons that the ability to cancel a task is so hard. I haven't reviewed the Trio Python library itself or delved into the details of the Nursery concept as described other than to recognize it as similar to wrapping @async in @sync. But it does occur to me that there may be some concepts in there relating to "checkpoints" that begin to enable task cancels. (https://trio.readthedocs.io/en/latest/reference-core.html#checkpoints). Perhaps, if @schedule itself is dropped and the only way to schedule a task is to wrap an @async inside a @sync, then dealing with the resulting fallout cleaning up resources a task is holding on to may become easier.
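For what it's worth, wrapping @async inside @sync already gives the core nursery guarantee (children finish before the enclosing block returns, and child failures propagate) in current Julia. A small self-contained sketch:

```julia
function structured_sum(xs)
    squares = zeros(Int, length(xs))
    @sync for (i, x) in enumerate(xs)
        @async begin
            sleep(rand() / 100)   # simulate some I/O latency
            squares[i] = x^2
        end
    end
    # @sync guarantees every child task has finished (or failed) here,
    # and child failures are rethrown at this point instead of vanishing.
    sum(squares)
end
```

What @sync does not give you, and what the nursery discussion is about, is passing that scope into callees and cancelling everything inside it.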

Anyhow, just some food for thought here. I'm not necessarily proposing anything, but rather hoping to move the idea of task cancellation back into active discussion.


The most effective way forward may be to see if I can get @JeffBezanson and @njsmith in a room together

Sounds like a fun time to me :-)

The most effective way forward may be to see if I can get @JeffBezanson and @njsmith in a room together some time since Nathaniel is a much more effective and explainer of and advocate for the Trio model than I am.

BTW, we've just created a virtual room for cross-language discussions of structured concurrency, in case anyone is interested: https://trio.discourse.group/c/structured-concurrency

Cool, Nathan, I’ve joined :)

Nice, I've joined as well. Reading over the blog post I think this approach makes a lot of sense. (As a side note — it also means that it was "correct" to inherit loggers from their parent task. Phew!)

I think cancellation in Trio works nicely because Python forces you to write await, which marks the (potential) checkpoints. As await can only be used inside functions marked async, you don't need to care about cancellation inside blocking (non-async) functions.

Are there any plans/ideas for (1) how to make non-I/O (compute-intensive) functions cancellable and (2) how to mark them as such in a way that the callers can know that they have to prepare for cancellation? I guess passing around cancellation tokens (see also https://vorpus.org/blog/timeouts-and-cancellation-for-humans/) and handling exit manually as in Go's errgroup would be an option. But it sounds like a very clumsy API to use.
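Absent language support, a hand-rolled cancellation token is roughly the clumsy-but-workable baseline being described here. All names are illustrative, not an existing API:

```julia
struct CancelToken
    flag::Threads.Atomic{Bool}
end
CancelToken() = CancelToken(Threads.Atomic{Bool}(false))
cancel!(tok::CancelToken) = (tok.flag[] = true; tok)
iscancelled(tok::CancelToken) = tok.flag[]

# A compute-bound function makes itself cancellable only by
# volunteering checkpoints inside its hot loop.
function busy_sum(tok::CancelToken, n)
    acc = 0
    for i in 1:n
        iscancelled(tok) && return nothing   # manual checkpoint
        acc += i
    end
    acc
end
```

Note that nothing in busy_sum's signature tells the caller it cooperates with cancellation, which is exactly problem (2) above.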

BTW, I played around a bit to see how Trio-like API would look like and how passing around "nursery" would work (although it's more like Go's errgroup as there is no cancellation support). Here is a demo:

function bg(nursery)
    @with_nursery nursery begin
        println("in nursery")
        Threads.@spawn begin
            sleep(0.1)
            Threads.@spawn begin
                sleep(0.1)
                println("world")
            end
            println("hello")
        end
    end
    println("out of nursery")
end

function demo()
    @sync begin
        println("launching background tasks")
        bg(@get_nursery)
        bg(@get_nursery)
        println("synchronising background tasks")
    end
    println("synchronized background tasks")
end


A quick-and-dirty implementation of @get_nursery and @with_nursery (no cancellation support at all!):

struct TaskVector
    tasks::Vector{Any}
    lock::Threads.SpinLock
end

TaskVector(tasks) = TaskVector(tasks, Threads.SpinLock())

function Base.push!(v::TaskVector, x)
    lock(v.lock)
    try
        return push!(v.tasks, x)
    finally
        unlock(v.lock)
    end
end

macro get_nursery()
    var = esc(Base.sync_varname)
    quote
        TaskVector($var)
    end
end

macro with_nursery(nursery, body)
    ex = quote
        let $(Base.sync_varname) = $nursery
            $body
        end
    end
    return esc(ex)
end

demo() should print

launching background tasks
in nursery
out of nursery
in nursery
out of nursery
synchronising background tasks
hello
hello
world
world
synchronized background tasks

I think Kotlin could be interesting here since it does structured concurrency without async/await keywords. Kotlin's approach to making computation cancellable is to call the yield function or check the isActive variable. See: https://kotlinlang.org/docs/reference/coroutines/cancellation-and-timeouts.html#making-computation-code-cancellable

But it looks like there is no mechanism for callers to know if the function is cancellable?

I started implementing a very minimal version of structured concurrency here https://github.com/tkf/Awaits.jl. Basically I implemented what I mentioned in the last bits of https://github.com/JuliaLang/julia/issues/32677#issuecomment-520115386. The idea is to define a macro @await body which is expanded to (roughly speaking)

ans = $body
if ans isa Exception
    cancel!($context)
    return ans
end
ans

where $context is a variable that tracks cancellation tokens (and tasks). There is also @go body for (a thread-safe version of) @spawn @await body. Another important construct is @check which expands to

shouldstop($context) && return Cancelled()

where Cancelled <: Exception. This way, compute-intensive functions can define checkpoints manually by inserting @check. Those functions can also "throw" an error by returning an Exception when they hit a condition where the whole computation should stop. Callers of such cancellable functions can wrap the call with @await which then explicitly marks a checkpoint. Sub-computations can be cancelled individually by something like

context = @cancelscope begin
    @go ...
    @go ...
end
cancel!(context)
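De-macroed, the same idea can be sketched as plain functions (all names illustrative; this is a sketch of the expansion, not Awaits.jl's actual API):

```julia
struct Cancelled <: Exception end

shouldstop(ctx) = ctx[]       # ctx: a Ref{Bool} standing in for the context

# What @check expands to, written out by hand: return (don't throw)
# a Cancelled value at a checkpoint.
function step(ctx, x)
    shouldstop(ctx) && return Cancelled()
    x + 1
end

# What @await expands to: if the callee returned an Exception,
# cancel/propagate upward instead of continuing.
function pipeline(ctx, x)
    ans = step(ctx, x)
    ans isa Exception && return ans
    step(ctx, ans)
end
```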

I also wrote some minimal documentation https://tkf.github.io/Awaits.jl/dev/ and tests https://github.com/tkf/Awaits.jl/blob/master/test/test_simple.jl

