Julia: Taking Structured Concurrency Seriously

Created on 13 Sep 2019 · 73 comments · Source: JuliaLang/julia

In Julia 1.3 the thread-safe runtime for truly parallel tasks has arrived, which will greatly increase interest in concurrent computation in Julia's core numerical and technical computing community. For the moment, Threads.@spawn is experimental, but now is a good time to think about APIs for expressing concurrent computation in a safe and composable way, with the aim of using such an API as an ecosystem gets built on top of the new functionality.

There's been a lot of recent excitement and hard work on libraries to support so-called structured concurrency, with the claim that this is fundamentally a better model for composable concurrent computation.

I'm convinced that people are onto something here and that the overall ideas are an excellent fit for the Julia language. However, there's a lot of prior art to absorb so I have started a document with the aim to do two things:

  • Survey existing ideas and implementations
  • Explore how these ideas fit within the Julia language and runtime.

Click here to read the WIP document

Meta note

I'm using this julep as a guinea pig to try out a go-proposals-style workflow, as discussed at https://github.com/JuliaLang/julia/issues/33239. So let's try discussing this here (non-substantive procedural comments can go to the Juleps repo).

Labels: julep, parallel

Most helpful comment

Cool. We shouldn't let Java beat us to this :)

All 73 comments

I'm copying some comments I made in https://github.com/c42f/Juleps

Discuss "black box rule" and robust cancellation separately?

I think it might be better to emphasize that structured concurrency has a few sub-concepts.

My understanding is that what njs calls the _"black box rule"_ is a sub-concept of structured concurrency, and that it's more fundamental than robust cancellation. So, I'd suggest introducing the "black box rule" and cancellation as different sub-concepts from the get-go. I think this is particularly important in Julia because many compute-intensive tasks don't need cancellation, and people may not want the runtime and conceptual overhead of robust cancellation. An alternative term may be _call tree_.

One counter-example to my argument that the "black box rule" is more fundamental is Go's errgroup and context. errgroup is the implementation of the "black box rule" while context is the implementation of robust cancellation. errgroup is implemented on top of context, which may contradict my argument that the "black box rule" (errgroup) is more fundamental than robust cancellation (context). But I think the fact that they are implemented in separate packages in Go at least backs up my argument that those are different concepts.

So, maybe it is a good idea to organize the sections like this?

  • Structured Concurrency

    • Background terminology

    • The black box rule

    • Robust task cancellation

    • ...
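The distinction above can be made concrete with Julia's existing primitives: @sync already enforces the black-box part with no cancellation machinery at all. A minimal sketch, where sum_halves is an illustrative name rather than any existing API:

```julia
# The caller of `sum_halves` can treat it as a black box: the @sync
# block guarantees both child tasks have finished before the function
# returns, so no task leaks out of the call.
function sum_halves(xs)
    result = Threads.Atomic{Int}(0)
    h = length(xs) ÷ 2
    @sync begin
        @async Threads.atomic_add!(result, sum(@view xs[1:h]))
        @async Threads.atomic_add!(result, sum(@view xs[h+1:end]))
    end
    return result[]
end

sum_halves(collect(1:100))  # == 5050
```

Nothing here knows how to cancel anything; the call tree structure alone is what the "black box rule" demands.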

A benefit of having different color

I suggested to add the following paragraph at the end of Syntax section https://github.com/c42f/Juleps/pull/2

On the other hand, a syntax such as async/await is a useful visual marker that emphasises when cancellation can happen (await) and which functions are cancellable (async). Note that this does not have to be implemented at the language level. For example, Go's context and errgroup force the reader to recognize where cancellation can happen (listening to the Done channel) and which functions can be cancelled (those accepting a Context as an argument).
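A rough sketch of what the Go context/errgroup convention could look like in Julia, with a done channel standing in for Context.Done; cancellable_count and the calling protocol are illustrative, not an existing API:

```julia
# Sketch only: `done` plays the role of Go's ctx.Done() channel. A
# function that accepts it advertises in its signature that it is
# cancellable; you can't call it without passing the token.
function cancellable_count(done::Channel, limit)
    n = 0
    for _ in 1:limit
        isready(done) && return n  # cooperative cancellation point
        n += 1
    end
    return n
end

done = Channel{Nothing}(1)
put!(done, nothing)              # request cancellation before we start
cancellable_count(done, 10^6)    # returns 0: cancelled at the first check
```

As in Go, the visual marker lives in the call signature rather than in dedicated syntax.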

Also quoting another comment:

I just noticed that you touched this already in "The challenge of preemptive cancellation" but with relation to preemptive cancellation. But maybe it's useful to touch it a bit here in a different context?

Are go statements harmful?

Copying comment from https://github.com/c42f/Juleps/pull/3

OK... I think this can be very controversial. But I suggest to add:

It is important to emphasize that, just as the _absence_ of a goto statement that can cross function boundaries is a major feature of modern languages, the _absence_ of the "go statement" (@spawn in Julia 1.3) is the major feature of structured concurrency. This is because it is impossible for a caller to know that a function follows structured concurrency without a language-level guarantee.

To emphasize that this has a larger impact in Julia than in (say) Python, I also added:

Note that the situation in Julia is more serious compared to other languages which have functions with "two colors" (see below) and a pluggable event loop (e.g., Python), where it is possible to enforce the _absence_ of the "go statement" at the library level.

Are go statements harmful?

I'm convinced they are harmful. But luckily this isn't a matter of opinion if we can agree that

  1. Structured concurrent programs are simpler to reason about because they are more restricted in form and have many benefits in composability.
  2. There is no significant penalty for giving up the flexibility of an unstructured approach.

The first of these doesn't seem controversial to me.

Lucky for us, the second can be attacked empirically due to the efforts in Trio: @njsmith has repeatedly stated that they've found no reason to back off from their insistence that all tasks are started from a nursery. The Kotlin people seem to agree (though they do have a nod to backward compatibility with the global coroutine scope). I think the other empirical answer we can have to this is to try implementing a suite of standard concurrency patterns ourselves in Awaits.jl or equivalent prototype.

A benefit of having different color

syntax such as async/await is a useful visual marker that emphasise when the cancellation can happen (await) and what functions are cancellable (await)

I agree this can be seen as a benefit and I've updated the document. In a cooperatively concurrent system it also marks where task switching can happen, which can otherwise be a source of timing bugs, for example https://github.com/JuliaWeb/HTTP.jl/issues/382#issuecomment-497723980

Lucky for us, the second can be attacked empirically due to the efforts in Trio

I'd also add that Trio convinced the Python core devs to add structured concurrency to Python itself: Yury Selivanov - Asyncio in Python 3.7 and 3.8 - YouTube. (Although it says asyncio.TaskGroup (≈ nursery) is likely to be in Python 3.8, I don't think it has been added yet.)

when the cancellation can happen (await) and what functions are cancellable (await)

Oops, the second await should be async. Thanks for fixing it in the Julep :)

In a cooperatively concurrent system it also marks where task switching can happen which can otherwise be a source of timing bugs

Just to be clear, we are already in the Go/Kotlin group, as there is no async/await. We can still take Go's context approach of passing around cancellation tokens (the Done channel in Go's context). This would make cancellable functions clear from the call signature (you can't call one unless you pass the cancellation token). However, we can't use this mechanism for I/O functions, because existing I/O functions do not have this signature and their color is the same as that of synchronous functions.

I requested adding that paragraph because I thought it was important to recognize the pros and cons of having async/await syntax, to assess the advantages and disadvantages that already exist in Julia. I think the main discussion point might be "Should we pass around cancellation tokens explicitly?"

By the way, it'd be nice to have a list of compute-intensive applications that require cancellation.

One example I found/implemented was stoppable reduce in Transducers.jl. It can be used to implement parallel findfirst.
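A toy sketch (not Transducers.jl's actual implementation) of what a stoppable parallel findfirst could look like: worker tasks scan chunks, and a shared atomic index lets the earliest match stop the other workers early. parallel_findfirst is an illustrative name:

```julia
# Each worker scans its own chunk and records its first match; workers
# bail out as soon as a known match precedes their current position.
function parallel_findfirst(pred, xs; nchunks = 4)
    found = Threads.Atomic{Int}(typemax(Int))  # earliest match so far
    len = length(xs)
    chunk = cld(len, nchunks)
    @sync for c in 1:nchunks
        lo = (c - 1) * chunk + 1
        hi = min(c * chunk, len)
        @async for i in lo:hi
            i >= found[] && break              # an earlier match already won
            if pred(xs[i])
                Threads.atomic_min!(found, i)  # record, then stop this worker
                break
            end
        end
    end
    return found[] == typemax(Int) ? nothing : found[]
end

parallel_findfirst(iseven, [1, 3, 5, 8, 9, 10])  # == 4
```

The early `break` is the "stoppable" part: compute-oriented cancellation here needs no exceptions, just a shared flag that workers check.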

The following thread about how Rust's new async/await relates to SC needs a careful read. There's many interesting points related to cancellation and zero cost abstractions.

https://trio.discourse.group/t/structured-concurrency-in-rust/73

We can still take Go's context approach to pass around cancellation tokens (it's the Done channel in Go's context). This would clarify cancellable functions from the call signature (you can't call it unless you pass the cancellation token). However, we can't use this mechanism for I/O functions because existing I/O functions do not have this signature and their color is the same as synchronous functions.

We could add versions of IO which use this mechanism and make deprecating the old ones a long-term (Julia 2.0) goal. So I think it's worth considering, if only to understand which points in the design space we're giving up on. Actually, a full implementation of SC seems to be a Julia 2.0 goal in any case, because we already have bare @async and it would certainly be breaking to change how that works.

The action of passing around a context could be wrapped up with macros to form the expected lexically scoped chain of custody for cancellation without much syntactic overhead. This would indeed give us colored functions (colored by calling convention), but at least the syntactic price would be no worse than python and there's no language change required.

Here's a related question: are there any systems which don't have colored functions, and which also have a compelling cancellation story?

  • Modern Go has them with Contexts
  • Kotlin has them with suspend functions and coroutine builders
  • Python has them with async/await

Over on the Julep, @vchuravy notes about the Tapir PR:

Note that this is primarily about:

  • Compiler representation of tasks
  • Experimenting with Cilk-style semantics (e.g. may happen in parallel or what I have taken to call structured parallelism)

My thought was that the outcome of the Tapir effort could be visible to users if we find that restricting semantics in a targeted way would greatly improve optimization opportunities. For example, maybe it could enable us to transform symmetric coroutines into lightweight state machines in some cases?

@vchuravy I think it would be really valuable if you could provide a summary of how you see this effort fitting into the bigger picture.

We could add versions of IO which use this mechanism and make deprecating the old ones a long term (Julia 2.0) goal.

I didn't think of this. That's an interesting possibility.

we already have bare @async

Good point. I forgot about it. I still think we can pretend to have structured concurrency before 2.0, since it is very likely that people will be lured to @spawn (or whatever API we'll have) by the potential performance benefits.

The action of passing around a context could be wrapped up with macros to form the expected _lexically scoped_ chain of custody for cancellation without much syntactic overhead.

That's actually the strategy I used in Awaits.jl.

are there any systems which _don't_ have colored functions, and which also have a compelling cancellation story?

The only thing somewhat related to this I know of is thredo (built on top of curio, an async I/O library that inspired trio). In his EuroPython 2018 talk (David Beazley - Die Threads - YouTube), the author talked about designing a threading library from scratch, motivated by the two-color problem and also by robust cancellation (the talk is mostly a demo of thredo). The thredo API does not enforce the "black box rule", but it has cancellation and also an (optional) ThreadGroup (≈ nursery). Cancellation can happen only if you use thread building blocks like Queue, Lock, etc. imported from thredo. But it seems that this is treated as a limitation rather than as a feature for marking cancellable functions. For example, it has a monkey-patching helper to replace stdlib sockets used in third-party libraries with the thredo implementation. It also has implications for CPU-intensive code, as it's cancellable only when it hits curio's event loop. He mentions that this is actually a desirable feature, and I guess we all agree that randomly inserting checkpoints would be awful for a performance-oriented language like Julia.

But I've never played with it and I don't know if there is a discussion on structured concurrency aspect of thredo. AFAICT this is an experimental library and I don't think it's used in the wild. I'm just mentioning it since it seems to fill the design space of "cancellation without color."

are there any systems which _don't_ have colored functions, and which also have a compelling cancellation story?

Java's Thread.interrupt was mentioned as an unsuccessful story in https://trio.discourse.group/t/structured-concurrency-in-rust/73/25

Hm. Cancellation seems to be a really, really hard problem, and I'm sure I don't understand how hard yet! But I'm naive enough to still think there's some hope, for example kill -9 is an example of timely, preemptive cancellation which actually works most of the time.

So what is it that makes kill -9 work when other things don't? It's (partly?) the strong isolation between resources of one process and another, allowing the resources to be gc'd by the OS kernel.

On slack, @vtjnash had some interesting things to say about that. To quote directly

I think the difference in practice is that sending kill -9 actually has the effect of closing all of the open resources. That’s a scenario the other processes are usually written to handle in some way.
That it also stops running the code is somewhat irrelevant, because it’s already been cut off from all side-effect channels

I think what's interesting is that in the real world cancellation is definitely a thing: if your process starts behaving weirdly, you kill -9 it. If that locks up sibling processes on your OS, you restart the machine. If that causes a distributed system to fail, you might need to restart a whole bunch of machines.

There's plenty of messy examples outside of computing. Consider companies: if a business unit fails to make a profit, it's eventually restructured or terminated and its duties redistributed. If that causes the company to fail to service its debts, the company goes into bankruptcy and gets forcibly restructured or just plain killed off (leaving the investors to do the garbage collection).

It's all very messy, but I think there's some kind of interesting informal hierarchy of supervision going on, and I've got a feeling that this is where Erlang supervisors might come in. More reading to do...

Why are you considering preemptive cancellation? Isn't the difficulty of it the motivation for cooperative cancellation? I don't think kill -9 or even a reboot is a good model for a safe cancellation. For example, this won't work if you have a filesystem-based lock (and that's why you need to do rm -f .git/index.lock).

Even in the non-computing example, if a company holding the copyright or license for an artwork got a hard shutdown, it might not be able to release the resource (= copyright), making the artwork unusable. (OK, I'm not a lawyer and I have no idea if that's a reasonable scenario.)

Why are you considering preemptive cancellation?

An excellent question. Maybe because I'm naive and I want things which I don't know are impossible yet ;-)

Let's lay out some desirable properties for cancelling things which I hope we can agree would be nice. This is not a judgement about whether these are mutually possible! I've labeled each property for sake of further discussion. I would like cancellation to be:

  • Graceful There should be a way to define actions to take for graceful shutdown. This is inherently specific to the code running, as described in https://trio.discourse.group/t/graceful-shutdown/93
  • Releasing Resources which were acquired must be released. There's various ways to handle this (code cooperatively cleaning things up, vs garbage collecting the resources). Your lock files example occurs because the supervisor (the OS) doesn't know about the lock file as a resource which must be reclaimed. On the other hand, the OS is very good at cleaning up things like file handles.
  • Timely It should occur by some deadline. Because buggy infinite loops are a fact of the world no matter how much we wish they'd go away.
  • Ergonomic The system needs to be both simple and have high perceived value so that the ecosystem converges on writing code which handles cancellation correctly. Very nicely put by njsmith here.

It seems there's several models of cancellation which could be considered which might satisfy some subset of these properties.

  1. Cooperative cancellation with task-defined handlers (Trio / deferred pthread_cancel style) where cancel points are highly restricted, and tasks may run cleanup code.
  2. Hard preemptive cancellation (kill -9 style) where you need enough isolation so that resources can be GC'd. Seems incompatible with cleanup defined in the task itself. Has several known in-process APIs which have been a historical disaster; Thread.stop etc.
  3. Preemptive cancellation with task-defined cleanup handlers (InterruptException style) — a weird mixture where cancel points are "almost" everywhere, but task-defined cleanup can still run. Known to be a reliability disaster if just tossed into a language without thinking very hard about what "almost" means. (Possibly rust-like?)
  4. Resource-based cancellation; @vtjnash pointed this out on slack — "Julia-style cancellation, aka OO cancellation. Any IO object can be cancelled by calling close, which initiates termination of outstanding requests". This seems to be a very different model conceptually as it focuses on closing resources rather than cancelling computations (though closing a resource may lead to cancelling a computation).
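Model 4 can be demonstrated with Base alone: closing a Channel terminates the task blocked on it, cancelling the computation via its resource rather than by signalling the task directly. A minimal sketch:

```julia
# `take!` on a closed Channel throws, so closing the resource cancels
# whatever computation is blocked on it; no cancellation token needed.
ch = Channel{Int}(0)
t = @async try
    take!(ch)   # blocks until a value arrives or the channel is closed
    :got_value
catch
    :cancelled  # close(ch) makes take! throw, landing here
end
close(ch)       # cancel by closing the resource
fetch(t)        # == :cancelled
```

This works even if the close happens before the task reaches take!, since take! on an already-closed channel throws immediately.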

There is a formal property of parallel programs that was developed for Cilk and is now used in Tapir/LLVM, called serial semantics (also called the "serial elision" or "serial projection" property): it refers to a parallel program that can be correctly executed by running the tasks serially.

One nice thing about serial semantics is that it enables various analyses, e.g. efficient detection of race conditions, or static compiler optimizations in Tapir. So it is a nice property for a language to allow the programmer to express unambiguously.
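The property can be made concrete with the classic task-parallel fib: the program below is correct even if every spawned task is run serially at the point it is created, because the children never communicate with each other. pfib is an illustrative name:

```julia
# A program with the serial-projection property: replacing the @async
# spawn with a plain recursive call yields the same answer, because the
# child task and its sibling never synchronize except at the join.
function pfib(n)
    n <= 1 && return n
    t = @async pfib(n - 1)  # may run in parallel with the line below...
    y = pfib(n - 2)
    return fetch(t) + y     # ...but running it serially is also correct
end

pfib(10)  # == 55
```

Contrast this with the Channel ping-pong example below, whose serial projection deadlocks: that code is concurrent, not merely parallel.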

Are serial semantics a strictly stronger restriction than "structured concurrency"?

@stevengj In Tapir PR https://github.com/JuliaLang/julia/pull/31086, @vchuravy said that

It is important to note that the semantics of this representation are parallel and not concurrent, by this extent this will not and cannot replace Julia Tasks.

So, I think it is not relevant to the cancellation part of structured concurrency. But I think it is related to the "black box rule" part (see my comment above https://github.com/JuliaLang/julia/issues/33248#issuecomment-531139728) of structured concurrency. I'm not familiar with the serial-projection property, but it seems to imply that the "black box rule" must be followed (e.g., both disallow leaking tasks). However, as @vchuravy demonstrated, you can write concurrent code following the "black box rule":

In order to exemplify this issue see the following Julia task code:

@sync begin
    ch1 = Channel(0)
    ch2 = Channel(0)
    @async begin
        take!(ch1)
        put!(ch2, 1)
    end
    @async begin
        put!(ch1, 1)
        take!(ch2)
    end
end

Doing a serial projection of this code leads to a deadlock.

So, I think the serial-projection property is a strictly stronger requirement than the "black box rule", too.

@c42f IIUC, I guess I can (almost) summarise your models by how a Cancel-Triggered Exception (CTE) behaves, like this?

| | CTE can happen anywhere | CTE happens only at await etc. |
|-------------------|-------------------------|--------------------------------|
| CTE catchable | Model 3 (e.g., Java) | Model 1 (e.g., Trio) |
| CTE not catchable | Model 2 (e.g., kill -9) | N/A |

But I don't understand what "Julia-style" is. It seems to fit in Model 3. Also, I don't understand why "resource-based cancellation" is not doable in Model 1. Trio has I/O resources that can be used for communication (locks, channels, etc.) and they are aware of cancellation.

So, I think serial-projection property is strictly stronger requirement than "black box rule", too.

Okay, so should we be talking about ways for users to express the stronger semantics (serial projection) or the weaker one (black box rule)? The former seems to allow strictly more tools and optimizations, so maybe it is more useful?

I'm in favor of having dedicated syntaxes to express the serial-projection property and black box rule.

I wonder if it is possible to "just" introduce a different version of @sync. That is to say for concurrent tasks we use

@sync begin
    ...
end

and for parallel computations with the serial-projection property we use

@sync_tapir begin  # maybe there is a better name
    ...
end

The body ... contains @spawn etc. In particular, replacing @sync_tapir with @sync is valid (the other way around is not always true). This way, we can reuse functions for both parallel and concurrent programs.

Naively thinking, this seems to be possible if we create a local "task group context" (a more sophisticated version of the vector assigned to Base.sync_varname; aka nursery) of different types in @sync and @sync_tapir. It is better to use a type-based solution (rather than, say, a macro-based solution) because it lets us pass around the task group context and launch tasks inside other functions without immediately synchronizing them inside those functions. (FYI, if you want to see an actual (toy) code example, see, e.g., https://github.com/tkf/Awaits.jl/blob/24712770164cbd22c4de2568dc4ace1c75553eca/test/test_simple.jl#L21-L44 and Awaits.@go.)
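The type-based task-group pattern being described might be sketched as follows; TaskGroup, spawn!, and synchronize! are illustrative names (Awaits.jl's actual API differs):

```julia
# A task group that callers own and helpers spawn into. The helper
# launches work but never joins it; the group's owner does the join,
# preserving the black box rule at the owner's level.
struct TaskGroup
    tasks::Vector{Task}
end
TaskGroup() = TaskGroup(Task[])

spawn!(f, g::TaskGroup) = push!(g.tasks, @async f())
synchronize!(g::TaskGroup) = foreach(wait, g.tasks)

function producer!(g::TaskGroup, ch)
    spawn!(() -> put!(ch, 42), g)  # spawns into the caller's group
    return nothing                 # returns without waiting
end

g = TaskGroup()
ch = Channel{Int}(1)
producer!(g, ch)
synchronize!(g)      # the owner, not the helper, joins the tasks
result = take!(ch)   # == 42
```

Because producer! cannot run without being handed a TaskGroup, its ability to spawn is visible in its call signature, which is the point of the type-based approach.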

Yes it seems like the serial projection property is rather restrictive and can't replace the general use of Tasks. But it's also a rather natural fit for data parallel problems in technical computing where the amount of parallel compute available at runtime is irrelevant to the algorithm.

So naively I'd favor having syntax for both; with @sync_tapir being an assertion by the programmer that the serial projection is valid for the contained code.

it let us use the pattern to pass around the task group context and launch tasks inside other functions without immediately synchronizing them inside those functions

I'm not so sure about this: serial projection seems to be a local property of an algorithm, and passing a @sync_tapir-generated context to an algorithm which doesn't have the serial projection property could cause deadlock. But perhaps I'm misunderstanding.

So naively I'd favor having syntax for both; with @sync_tapir being an assertion by the programmer that the serial projection is valid for the contained code.

I am worried about the additional complexity for the user of reasoning about the difference between a task with SPP and a concurrent task.
I am working on a couple of ideas, but I am hoping that we don't have to introduce a separate language concept. If that doesn't work out (you know, research), we may want to consider having parallel constructs like for loops with @threads or map use tasks with SPP.

I totally agree that it's nice to have composable high-level abstractions like map, (parallel) foreach and reduce so that users can first try to solve their problem without getting into the details of parallel computing. (BTW, making reduce composable is where you start needing something like transducers :wink:)

But I'm very curious to know whether all programs with SPP are expressible by higher-order functions (or rather, by reduce, as the other examples I brought up can be expressed in terms of it). If not, maybe it makes sense to have @sync_tapir as a low-level "unsafe" construct?

if all programs with SPP are expressible by higher-order functions

Actually, the fib example in the Tapir PR is a good example where reduce is not enough.

If not, maybe it makes sense to have @sync_tapir as low-level "unsafe" construct?

Yes, I think this will be the case. But let's not get ahead of ourselves; I have many issues to solve first :)

Personally, I think the 1.3 golang-style API, with @spawn being analogous to Go's go statement, is already significantly better than what most languages offer, and enough to single-handedly win over quite a few programmers frustrated with how the mainstream languages do it.

If we're going for structured concurrency as a possible way to handle task lifetimes, I think it's worth exploring the full parameter space of new ways to do it instead of specifically locking onto Trio's way of doing it.

For example, I think the biggest new trend in handling colored async code is algebraic effects (see https://overreacted.io/algebraic-effects-for-the-rest-of-us/ for an example of how they'd look in a dynamic language), where task spawning is handled by an effect handler. This has the advantage of being a much more general way of preserving the function abstraction for functions that do special things, rather than for async functions specifically. It's also nicer than trio's way of passing nurseries to inner scopes, in the same way that async/await or monadic programming is nicer than passing callbacks to inner scopes.

On the other hand, effects are not exactly something I'd expect the compiler team to want to implement in Julia in the near term. But they do illustrate the value of waiting, and of thinking carefully about interactions with other possible additions that could lead to a cleaner design, before committing to a specific, highly opinionated way to handle concurrency.

If we're going for structured concurrency as a possible way to handle task lifetimes, I think it's worth exploring the full parameter space of new ways to do it instead of specifically locking onto Trio's way of doing it.

Absolutely agreed, and thanks for the link. I think @andyferris was also encouraging me to look into effect systems but I'd largely forgotten about that good advice (sorry Andy :grimacing:). Reading the link, it appears that the Julia logging system already implements an effects system with the exact same nested task-local semantics you'd get from a real effects system! But without language support for effects it's clunkier than you'd like. In particular, it's much more difficult to define new log handlers than it should be. I think this is because the logging system actually uses two different effects in concert, corresponding to the shouldlog and handle_message functions.
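The dynamically scoped handler behavior can be seen directly in the Logging stdlib: with_logger installs a handler for the dynamic extent of a block, much like an effect handler would:

```julia
using Logging

# `with_logger` scopes the handler dynamically: any code called from
# inside the do-block, however deep in the call tree, logs to `buf`.
buf = IOBuffer()
with_logger(SimpleLogger(buf, Logging.Info)) do
    @info "hello from a dynamically scoped handler"
end
msg = String(take!(buf))  # contains the message above
```

The @info macro here plays the role of "performing" the effect, and the installed logger plays the role of the handler; the nesting behavior follows task and dynamic scope rather than lexical scope.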

These ideas also seem somewhat related to an idea I was tossing around for unifying exceptions with error codes via pattern matching (discussed at the bottom of the https://github.com/JuliaLang/julia/issues/7026 thread).

In all, very interesting; thank you!

Thank you @saolof, I was only very vaguely aware of algebraic effects and the blog post was a nice introduction. I can imagine algebraic effects would be useful for other things, like eliminating the dynamic dispatch of log handlers:

(function (logger)
    try
        f()
    handle log :: LogEvent
        resume with handle(logger, log)
    end
end)(current_logger())

I agree with @c42f that it interacts a lot with the "chain of custody" proposal #7026. Also, I think naively applying this to I/O handlers would give us "colored functions." That is to say, the whole intermediate call chain would have to "chain" the fact that it performs I/O.

But, IIUC, algebraic effects are mostly orthogonal to the discussion of structured concurrency. I think it's possible to have or not have structured concurrency with algebraic effects. That is to say, enforcing the black box rule is equivalent to disallowing a "global effect handler"; i.e., disallowing spawn when there is no associated try-handle block (= nursery).


Speaking of colored functions, "I'm not feeling the async pressure" (Armin Ronacher's Thoughts and Writings; Hacker News) presents yet another compelling reason to avoid introducing colors. The post is about the importance of backpressure and how easy it is to shoot yourself in the foot with async programming. I think the part relevant to this issue is that it is impossible to fix an API in a backward compatible manner when you have colored functions and the fix requires changing a sync function into an async one (see the discussion around await writer.drain()).
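The backpressure the post describes is easy to demonstrate in Julia, where a bounded Channel makes put! a natural blocking flow-control point with no extra color on the producer:

```julia
# A bounded Channel gives backpressure for free: `put!` blocks once the
# buffer is full, so a fast producer is paced by its consumer.
ch = Channel{Int}(2)         # capacity 2: producer can run ahead by two
produced = Int[]
producer = @async for i in 1:5
    put!(ch, i)              # blocks when the buffer is full
    push!(produced, i)
end
sleep(0.1)                   # give the producer a chance to run
stalled_at = copy(produced)  # == [1, 2]: stalled on put!(ch, 3)
foreach(_ -> take!(ch), 1:5) # consumer drains; producer resumes
wait(producer)               # producer finishes all five items
```

Because put! is an ordinary blocking call, the same producer code works whether or not its caller is concurrent, which is exactly the property the post argues colored async APIs lose.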

For example, I think the biggest new trend in handling colored async code is algebraic effects (see https://overreacted.io/algebraic-effects-for-the-rest-of-us/ for an example of how they'd look in a dynamic language), where the task spawning is handled by an effect handler.

So I've done a bit more thinking/reading about this, and I can see how algebraic effects allow the user to define their own schedulers for tasks spawned in a given dynamic scope (see for example the start of this talk: https://www.janestreet.com/tech-talks/effective-programming/). That's extremely cool and interesting.

But I think @tkf has a good point:

But, IIUC, algebraic effect is mostly orthogonal to the discussion of structured concurrency. I think it's possible to have or not have structured concurrency with algebraic effect.

Yes, it's also unclear to me how algebraic effects ensure the key property of structured concurrency: that control flow is naturally delimited by the local structure of the source code (aka the black-box rule). Rather, it seems that task lifetime, when implemented with algebraic effects, is controlled by the enclosing spawn handler. The handler may well be near the root of the call tree, which brings us back to mostly unstructured concurrency.

That is to say, enforcing black box rule is equivalent to disallowing "global effect handler"; i.e., disallow spawn when there is no associated try-handle block (= nursery).

I think the black box rule is much stricter than this: it requires that task lifetime is lexically rather than dynamically scoped?

it requires that task lifetime is lexically rather than dynamically scoped?

Ah, yes, you are right.

...So it ends up being a matter of lexical vs dynamic scoping. We already have some examples of dynamic control flow for concurrency with the @async and @sync macros, except that currently using @async without an enclosing @sync does not propagate an exception to the origin of the stack.

Handling concurrency with dynamic scoping has a few advantages:
1) It ensures that all tasks have a parent to propagate exceptions to at the moment of their creation (and fails early otherwise), which gives an easy way to cancel subtasks by throwing a cancellation exception at a yield point like in Trio. It makes ^C work as expected. It provides the expressiveness benefits of SC and a short happy eyeballs implementation.

2) Unlike lexically scoped SC, it allows higher order functions to have both synchronous and asynchronous functions passed to them without having to _know_ whether they are being used in a synchronous or asynchronous context. Higher order functions which are pure except for calls to their arguments are effect-polymorphic by default, which means that you completely avoid the issue mentioned in "what color is your function?"

So it does have the advantage of being very _expressive_ and minimizing verbosity.

With that said, when you consider interactions with lexical closures and callbacks, purely lexical scoping for trio-style nurseries does introduce the possibility of having two nurseries whose lifetimes overlap without one fully enclosing the other, which I don't think you can get with dynamic scoping.

Handling concurrency with dynamic scoping has a few advantages:

  1. It ensures that all tasks have a parent to propagate exceptions to at the moment of their creation (and fails early otherwise), which gives an easy way to cancel subtasks by throwing a cancellation exception at a yield point like in Trio.

This is an advantage over unstructured concurrency and not over structured concurrency, right?

Also, I think a dynamically scoped task group (= nursery) has a major disadvantage in that there is no way for a _caller_ to tell whether a function can spawn tasks that may outlive the call (unless it is combined with chain of custody). On the other hand, with a lexically scoped task group you are _forced_ to know that the function may spawn tasks, as otherwise you can't call it.

2. Unlike lexically scoped SC, It allows higher order functions to have both synchronous and asynchronous functions passed to them without it having to _know_ whether it is being used in a synchronous or asynchronous context.

It's possible to have structured concurrency without colored functions. Kotlin, and Go with context and errgroup, do it. So, I don't think this is a valid advantage of dynamically scoped task groups over lexically scoped ones.

Also, I think dynamically scoped task group (= nursery) has a major disadvantage in that there is no way for a caller to tell if a function can spawn tasks that may outlive the call (unless it is combined with chain of custody). On the other hand, with lexically scoped task group you are forced to know that the function may spawn tasks as otherwise you can't call it.

This is exactly what is meant by function color. It's quite desirable to have calls to pure (sync) and effectful (async) functions look exactly similar so that you can write higher order functions _once_ instead of having a thousand nearly-identical dispatches which are identical up to nursery/scope arguments depending on whether you are passing a sync or async function to it. With language support the compiler could still infer effect signatures for type stable code just like it can infer exception signatures, since effects are basically resumable exceptions.

As for dynamic vs lexical, I think it's largely a question of preference and of what will be the least surprising to users. Lexical has some nontrivial edge cases since you can end up passing around a closure over a nursery whose scope might have expired. If you want your cancellation mechanism to be done with (dynamically resolved) cancellation exceptions thrown from yield points, if that closure is called from a different coroutine you can end up with a cancellation exception being thrown in a task that isn't a subtask of the nursery that expired.

This is exactly what is meant by function color.

By different color I thought it usually means whether the function yields or not (async or sync). Spawning or not is another property. But I get that function color can be generalized to it. That's why you can "fix" it by chain of custody.

It's quite desirable to have calls to pure (sync) and effectful (async) functions look exactly similar so that you can write higher order functions _once_ instead of having a thousand nearly-identical dispatches which are identical up to nursery/scope arguments depending on whether you are passing a sync or async function to it.

I don't think this is a problem since we can just use closures. Something like

withtaskgroup() do nursery
    foldl((a, b) -> f(nursery, a, b), xs; init=x0)
end
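A rough, runnable approximation of this pattern with today's Base primitives, since `withtaskgroup` is hypothetical here: an explicit task vector plus a final join plays the role of the nursery passed into the fold.

```julia
# `Threads.@spawn x^2` plays the role of `f(nursery, a, b)`: the fold body
# registers work with a "task group" (here just a task vector) it received
# from the enclosing scope.
function sum_of_squares(xs)
    tasks = Task[]
    foldl(xs; init = nothing) do _, x
        push!(tasks, Threads.@spawn x^2)
        nothing
    end
    # The "end of the task group": join every spawned task before returning,
    # so no work outlives the call.
    return sum(fetch, tasks)
end
@assert sum_of_squares(1:4) == 30
```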

Lexical has some nontrivial edge cases since you can end up passing around a closure over a nursery whose scope might have expired.

You can make it hard to misuse. For example, one idea may be to avoid referring to the task group context/nursery explicitly like using Awaits.@go instead of withtaskgroup above. Also, compared to this edge case, I think black box rule is a fundamental property; we all assume it in synchronous programming.

I understand dynamically scoped task group can get rid of this edge case. However, as you said, this is at the cost of not being able to refer to multiple task groups. This property is actually very useful. For example, this is required for implementing early deterministic termination in Transducers.reduce. You can see that I am accessing multiple nurseries in this prototype implemented using Trio.

So it ends up being a matter of lexical vs dynamic scoping.

Yes, I think this is accurate. SC needs lexical scoping to ensure the black box rule.

This is exactly what is meant by function color. It's quite desirable to have calls to pure (sync) and effectful (async) functions look exactly similar so that you can write higher order functions _once_ instead of having a thousand nearly-identical dispatches which are identical up to nursery/scope arguments depending on whether you are passing a sync or async function to it.

Terminology-wise I've been thinking of "colored functions" as generic terminology for splitting the world of functions into disjoint sets based on any property or properties (async-related or not). That is, in the graph coloring sense where the call graph obeys certain constraints based on the color of functions within it. Perhaps this is not how other people use it though.

At the moment Julia does not have colored functions at the language level; instead it's similar to Go: any function can arbitrarily call @async. This is helpful for composability but it has various downsides which the Go community seem to be addressing by explicit context passing. That solves some problems but passing contexts around gives you colored functions again by virtue of the calling convention. With extra boilerplate :grimacing:

An interesting use case of unstructured async is waiting for data to come back from a remote call. For example, Signals.jl uses this to rather neatly hide some detail in async_remote.

https://github.com/TsurHerman/Signals.jl/blob/master/src/async_remote.jl

Of course, this exact API isn't possible with structured concurrency. At the very least you'd have to pass a nursery though I wonder whether there's a more natural equivalent in structured concurrency land. For that matter, what do actors look like with structured concurrency? It turns out that someone's already trying to answer some of these questions; here's an actor library built on top of trio:

https://github.com/goodboy/tractor

At a quick browse it looks to work out quite naturally. Also their todo's include "support for reactive programming primitives" so some kind of signals may make an appearance.

Isn't this hard to do in structured concurrency because it can hide exceptions? I'm not familiar with reactive programming, but if a value that causes an exception is pushed but never pulled by a downstream, I suppose the exception is not going to be reported anywhere?

Right, I think reactive programming is an interesting case because it's not clear how it relates to structured concurrency and it seems like there might be a mismatch. But after a bit of thought I've got a feeling that it's a superficial mismatch.

With an async reactive system you've got an event loop running somewhere and this loop would need to be explicitly started (hence provide a place to propagate unhandled exceptions, and to cancel the event loop). On the other hand, for handled exceptions within signal propagation it seems to be an API choice how the signals library handles those. They could be dropped, propagated through the signal graph as error values, or something else. Those choices are available regardless of having structured concurrency or not.

this loop would need to be explicitly started

Yeah, I think that's the only place structured concurrency makes it "harder." ATM you can do this

function __init__()
    @async eventloop()
end

https://github.com/TsurHerman/Signals.jl/blob/b377eaedcd007fc5d56abe73b242e014cc330936/src/eventloop.jl#L13-L15

If we have structured concurrency you need to do something like this

witheventloop() do
    # code using Signals.jl here
end
# (maybe unhandled errors thrown here)

Here's the swift concurrency manifesto. They are going with some variant of actors + other stuff. https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782

Great design discussion and review of other models from go to akka etc

And here's an actors implementation by @richiejp https://github.com/richiejp/Actors.jl

The Swift manifesto is interesting but doesn't seem to address structured concurrency at all.

Sorry for the noise, I just skimmed/pattern-matched and assumed it was exhaustive.

Thanks for bringing it to my attention in any case. I wouldn't expect structured concurrency in there—Swift is still addressing their basic concurrency design, they haven't gotten around to structured concurrency yet, although it would probably behoove them to consider it sooner rather than later.

I just noticed this, but it looks like Java (Loom) is going to have structured concurrency (and nestable effect handlers?)

Why Continuations are Coming to Java - YouTube

Cool. We shouldn't let Java beat us to this :)

OK, I bit the bullet and clicked through to the 45-minute video. He clarifies that it is about one-shot continuations, which we already have via Tasks. Continuations are even more flexible than tasks, so farther away from structured concurrency along that spectrum. His view seemed to be that structured concurrency primitives are good but orthogonal to the existence of Tasks themselves, which I agree with --- i.e. @sync and schedule both exist and you use them as desired.

the 45-minute video

Do you mean the Java one? I don't think it's a good introduction video to structured concurrency. It just briefly mentions structured concurrency. I think a good introduction would be Nathaniel J. Smith - Trio: Async concurrency for mere mortals - PyCon 2018 - YouTube.

(There are a bunch of links in Structured concurrency resources - Structured concurrency - Trio forum and @c42f's Julep. Though I don't know what is the most concise introduction.)

@sync and schedule both exist and you use them as desired

It was a bit of an exaggeration when I said "Java (Loom) is going to have structured concurrency." It's not very clear from the talk if they are going to introduce structured concurrency in a strict form; i.e., disallowing @spawn/@async/schedule without @sync.

To me it seems like we have "goto" now, and might add "while" loops in the future. What's the problem?

--- https://github.com/JuliaLang/julia/pull/35686#issuecomment-622650037

The problem is that, as soon as you have a "goto" _that can cross function boundaries_ (so not a C-like goto; more like longjmp, I guess?), a function call can't be used as the unit of reasoning about the flow of the program. It can return, throw, or continue execution anywhere. Furthermore, the mere _existence_ of such an API is enough to break the abstraction. As soon as it's provided by the language, you can't expect a function call to return (or throw) when it finishes its job.

It's the same for structured concurrency. As soon as @sync-less @spawn/@async/schedule is possible (i.e., the language violates "black box rule"), you can't rely on the function call to be the unit of reasoning of the program.
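The contrast can be demonstrated with today's primitives: inside `@sync`, a function's concurrent side effects cannot escape the call, while a bare `@async` lets them outlive it.

```julia
# Respects the black box rule: @sync joins the spawned task before the
# function returns, so callers can treat the call as an ordinary black box.
function well_scoped!(flag)
    @sync @async begin
        sleep(0.01)
        flag[] = true
    end
    return nothing
end

flag = Ref(false)
well_scoped!(flag)
@assert flag[]            # the effect completed strictly within the call

# Violates it: the task can still be running after the call returns.
function leaky!(flag)
    @async begin
        sleep(0.01)
        flag[] = true
    end
    return nothing
end

flag2 = Ref(false)
leaky!(flag2)
@assert !flag2[]          # still pending after the call has returned
sleep(0.05)
@assert flag2[]           # ...and finishes later, outside any scope
```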

We've had Tasks since v0.1, so if you need structured concurrency before Tasks we are definitely a bit late.

--- https://github.com/JuliaLang/julia/pull/35686#issuecomment-622653980

Yeah, I agree that it's impossible to have strict structured concurrency in 1.x.

My rant https://github.com/JuliaLang/julia/pull/35686#issuecomment-622605368 was that stabilizing @spawn without structured concurrency introduces yet another API to be deprecated before getting to strict structured concurrency. @spawn is also arguably the most attractive API for starting a task because it lets people use parallelism. So I thought there was a good chance that it would attract people to start using structured concurrency if (non-strict) structured concurrency were introduced before stabilizing @spawn.

BTW, I think the main primitive we need for implementing structured concurrency is something like schedule(task, exception; error=true) (or more like schedule_soon?) that works even when the caller does not own task. By "it works" I mean that it doesn't break the state of the scheduler. I think it should be allowed to break user code that is not exception-safe, even when it is doing I/O. How hard is it to implement?
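As a minimal sketch of the existing building block: `schedule(task, exc; error=true)` can already wake a task that is blocked at a yield point, delivering the exception there. This is roughly the primitive being asked for, minus the safety guarantees for arbitrary task states (running, queued, owned by another scheduler).

```julia
# A task blocked on an unbuffered Channel sits at a yield point; waking it
# with error=true makes the exception surface from `take!`.
ch = Channel{Int}()                # unbuffered: take! blocks
t = @task try
        take!(ch)                  # the yield point where cancellation lands
        :finished
    catch e
        e isa InterruptException ? :cancelled : rethrow()
    end
schedule(t)
yield()                            # let `t` run until it blocks on take!
schedule(t, InterruptException(); error = true)
@assert fetch(t) === :cancelled
```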

We also need to embed cancellation points in the synchronization APIs like locks and channels but I think it'd be less hard.

@c42f What do you think? What are the main primitives we need for implementing cancellation?

Do you mean the Java one? I don't think it's a good introduction video to structured concurrency. It just briefly mentions structured concurrency. I think a good introduction would be Nathaniel J. Smith - Trio: Async concurrency for mere mortals - PyCon 2018 - YouTube.

Yes, I read the blog post, watched the video, bought the t-shirt. Look, "goto considered harmful" was 1960s clickbait. I also disagree with the analogy to "goto". The problem with goto is that it violates the procedural abstraction: naming a textual line in a program is not sufficient to describe a control flow state when there is a stack. Continuations fix that problem by wrapping a whole control state (whatever it is) in a first class object. I don't consider it entirely accurate to say that continuations are an abstraction violation.

The dichotomy between structured and not-structured concurrency is more illusory than first advertised. As has been discussed here, passing around nurseries gives you basically unrestricted dynamic task spawning again.

Anyway, I'm in favor of adding nurseries in some way, and cancellation. It should be possible to put a task in a state such that it will throw an exception at the next cancel point. From what I can tell that's basically what C# does (and others)? But of course, you might never reach a cancel point, and even if you do the exception might be caught. That's ok with me, I think it's unavoidable, just another reason this is not the same as "delete goto and everything is fixed".

It strikes me that many of the things said about tasks here could also be said about just closures. They allow variable bindings and their values to outlive a function invocation, and for part of the function to be "re-entered" after it has returned, but nobody thinks that's an abstraction violation.

I don't consider it entirely accurate to say that continuations are an abstraction violation.

I agree. But the problem is not Task, it's schedule. Likewise, I don't think closures are a problem because they don't get executed by themselves. The caller always controls the execution and can see the end of it.

As has been discussed here, passing around nurseries gives you basically unrestricted dynamic task spawning again.

I don't think it's unrestricted. You can always reason about the scope (= nursery) of the task lexically because you either see @sync or the nursery as an argument.

But of course, you might never reach a cancel point, and even if you do the exception might be caught.

Yes, I agree it's unavoidable and especially hard to do in Julia because nobody wants random cancellation points to be inserted in their carefully written tight loops. Though I think it's possible to reduce some pitfalls. For example, it's possible to make cancellation exception harder to be caught accidentally by something like "chain of custody" #7026 coupled with a better exception hierarchy.

For example, it's possible to make cancellation exception harder to be caught accidentally

That's an interesting idea --- we could have those exceptions run all finally blocks but always propagate to the end of a Task.

That's an interesting idea --- we could have those exceptions run all finally blocks but always propagate to the end of a Task.

Yeah agreed that is an interesting idea. On one hand, I'm inclined to say this feels like overkill, because people are already used to this kind of thing for things like InterruptException, but on the other hand, there are plenty of places that just check e isa InterruptException && rethrow(), which would now block the task cancellation, which would be a shame. So yeah, I see the appeal of skipping the catch blocks for a task cancellation. 👍 And/or having a special _kind_ of exception you can throw which is able to skip catch blocks (similar to how exit() certainly skips catch blocks 😛)


Just chiming in here to say I agree with @tkf's assessment! When you and Jameson and I discussed task cancellation, I also remarked that I think simply the ability to "cancel" a task such that it will throw an exception at the next yield-point would be enough to implement structured concurrency as discussed here. Basically everything else we already have.

This is particularly nice because when discussing task cancellation, @vtjnash's main concern was that you might cancel a task you directly spawned, say A, without realizing that A has spawned B, and B continues running. He instead suggested that you want to close some resource which is threaded through from A to B and all tasks thereafter. But we noted that sometimes the resource in question is simply the CPU, and there is nothing else obvious to close.

The nice thing about structured concurrency here I guess, is that you can set things up such that the task will _also cancel all the tasks its waiting on_, so you can indeed close the whole set of spawned computation. This seems like a nice solution to the above problem, and I think the only thing missing before we can have this is the ability to safely throw an exception onto a Task. :)

@NHDaly Hi, good to have a CSP specialist chiming in :)

you can set things up such that the task will _also cancel all the tasks its waiting on_

Yeah, I agree that it's a nice property that this is the _default_ behavior. But I'd like to add that it doesn't stop you from writing more complex handling of tasks as you can pass around multiple nurseries, if you really need to do it. It's also nice that this complexity is apparent in the source code as all the nurseries are always lexically visible.


By the way, the idea of using "narrower" catch and combining it with clever exception hierarchy is simply a clone of how it works in Python (and trio.Cancelled):

BaseException
 +-- SystemExit           # thrown by `exit()`
 +-- KeyboardInterrupt    # thrown by SIGINT
 +-- trio.Cancelled       # canceling tasks
 +-- ...
 +-- Exception            # "user errors" inherit Exception
      +-- NameError
      +-- ...
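A hypothetical Julia analogue of that hierarchy (every type name below is invented for illustration): the key property is that cancellation and other control-flow signals do not subtype the "user error" root, so narrow handlers can't swallow them by accident.

```julia
# Invented hierarchy mirroring the Python one above: control-flow signals
# (exit, interrupt, cancellation) live outside the user-error subtree.
abstract type ControlSignal <: Exception end
struct CancelledException <: ControlSignal end   # analogue of trio.Cancelled

abstract type UserError <: Exception end
struct NameLookupError <: UserError end          # analogue of NameError

# A "narrow" handler predicate that only swallows user errors:
swallows(e) = e isa UserError
@assert swallows(NameLookupError())
@assert !swallows(CancelledException())          # cancellation propagates
```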

Also, I just noticed that there is a similar discussion in #15514. I think it's a reasonable solution even for InterruptException and exit() as well but @Keno seems to have another idea https://github.com/JuliaLang/julia/issues/35524#issuecomment-616687657?

For Ctrl-C handling in Structured Concurrency, Structured Concurrency Kickoff - Structured concurrency - Trio forum seems to be an interesting discussion (it started as a general discussion but it later focuses on Ctrl-C, from what I can tell by skimming it). (Edit: not much new info, I think)

@NHDaly Hi, good to have a CSP specialist chiming in :)

😛 Lol hardly! I've just been reading up on all this within the past year and a half or so as i've been learning more about concurrency and multithreading in julia. :) You might be thinking that because of https://github.com/NHDaly/CspExamples.jl, but that was something I wrote to _learn_ CSP and to learn about julia's concurrency, not because i was already an expert :)


Thanks for the links + context!

Hey guys sorry to be silent here for a while.

Yes, I agree it's unavoidable and especially hard to do in Julia because nobody wants random cancellation points to be inserted in their carefully written tight loops.

Agreed, although efficiency is just one aspect. The more difficult part is that inserting cancel points can break the programmer's expectations in terms of resource handling. I think I was harping on about this recently with the fairly trivial example of Base.lock() not being async signal-safe for InterruptException (ah yes, here it is: https://github.com/JuliaLang/julia/issues/35524#issuecomment-625538029). We can obviously fix Base, but I think this emphasizes that writing code which is safe for use with InterruptException is currently subtle.

There seem to be some options for cancel points:

  • They could be restricted to certain function calls where the programmer is already thinking about error handling (trio uses IO for this)
  • We could have language syntax + semantics which makes resource acquisition and cleanup atomic with respect to cancellation. For this to succeed, it would have to be sufficiently nice that users write cancel-safe code by default.
  • Rely on users inserting explicit cancel points into their code
  • Make cancellation a property of resources which are being used by the code, rather than of the code itself. (I include this because Jameson has mentioned it on multiple occasions, though I don't fully understand how it would work.)
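The third option can be sketched with nothing more than an atomic flag; `checkcancel` and `cancel_requested` are invented names, not proposed API:

```julia
# Cooperative cancellation via an explicit, user-placed cancel point.
const cancel_requested = Threads.Atomic{Bool}(false)

checkcancel() = cancel_requested[] && throw(InterruptException())

function accumulate_until_cancelled(n)
    acc = 0
    for i in 1:n
        acc += i
        i % 1024 == 0 && checkcancel()  # amortized check in the hot loop
    end
    return acc
end

@assert accumulate_until_cancelled(10) == 55
cancel_requested[] = true
@assert try
    accumulate_until_cancelled(10_000)
    false
catch e
    e isa InterruptException
end
```

The cost model is exactly the concern raised above: the programmer chooses where the checks go, so tight loops stay clean, but forgotten checks mean unbounded time to cancellation.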

As has been discussed here, passing around nurseries gives you basically unrestricted dynamic task spawning again.

I don't think it's unrestricted. You can always reason about the scope (= nursery) of the task lexically because you either see @sync or the nursery as an argument.

Yes, exactly!

I think it's interesting to compare to Go, where people started off allowing for goroutines to be spawned anywhere, but current practice converged on passing the context everywhere (I think? I'm not a Go programmer though.) The point is to have a middle ground which is less onerous and more composable than passing a context everywhere explicitly, while also avoiding the downsides of being able to implicitly start new branches of a computation with side effects which outlive the caller.

There seem to be some options for cancel points:

For completeness, I think we might need something like disable_sigint but more generic (not just for sigint). Trio has CancelScope.shield, asyncio has shield, Kotlin has NonCancellable, and it's planned(?) for Project Loom (Java).

  • We could have language syntax + semantics which makes resource acquisition and cleanup atomic with respect to cancellation.

Are you thinking something better than the do block? Maybe the f(...)! syntax https://github.com/JuliaLang/julia/issues/7721#issuecomment-170942938?

But I think it's reasonable to assume that it'd take some time to land. Meanwhile, how about discourage using lock(l)/trylock(l) and recommending/implementing lock(f, l)/trylock(f, l)?
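For reference, the `lock(f, l)` form already exists in Base and is the cancel-safer shape: the release happens in a `finally` block, so the lock is freed even if the body throws (or a cancellation exception arrives at a yield point inside the body).

```julia
l = ReentrantLock()

# Normal path: the lock is held only for the body, released automatically.
@assert lock(() -> 1 + 1, l) == 2
@assert !islocked(l)

# Error path: the lock is still released.
try
    lock(l) do
        error("boom")
    end
catch
end
@assert !islocked(l)
```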

Are you thinking something better than the do block? Maybe the f(...)! syntax #7721 (comment)?

Oh yes nice find, I don't think I've seen this discussion. It would be doubly nice if it could remove more uses of finalizers which are just generally problematic in many ways :)

Here's a conversation which touched on @sync vs Experimental.@sync from the Julia Computing slack (we didn't set out to discuss this, but it ended up relevant so I said I'd repost it here for everyone to read):

  • Chris Foster 4:55 PM
    @jeffbezanson Here's the trio issue by @njsmith describing what's wrong with trio's handling of child task errors: https://github.com/python-trio/trio/issues/611
    I think they reach mostly same conclusion we did today (I guess not by coincidence! I did read through that thread a while back.)
    Of course we've got the advantage that our error handling syntax is half baked right now :smile:
    So at least if we get the internal representation of concurrent errors correct we can eventually make a nice API to go with it.

  • Jeff Bezanson 1:24 AM
    To fill everyone in, Chris & I had a discussion about task error handling and we decided:

    • Make excstack a first-class object and replace the .backtrace field in Task with it. A bunch of the C code for munging it can go away and/or move to julia.
    • Combine CapturedException, CompositeException, and TaskFailedException into one thing similar to trio's MultiError. Need a name for this! PropagatedExceptions? ExceptionTree?
    • That new thing should always be used instead of passing tasks around, since Tasks are hard to serialize. Instead we need to process a RawExceptionStack when it's serialized. Not sure exactly how --- we could replace the raw stack with a processed one inside ExceptionTree, or the raw stack itself could be internally converted to a processed form.
  • Jameson Nash 4:43 AM
    I’m not sure I entirely agree about the last point, since it implies we can’t start printing information to the user about an error in one task until we have confirmation that all tasks in a sync block have finished. Accordingly, I think we may want to think about moving towards having show_error do more of the work to process it, so that sync doesn’t need to wait for it to be in a consumable form.

  • Jeff Bezanson 4:46 AM
    We don't need to ban printing tasks --- a task will still contain its exception and backtrace, and you can still look at them.
    sync can also return the raw trace data from tasks, it doesn't need to process them.

  • Jameson Nash 4:49 AM
    The current intent for Experimental.@sync however is to be unable to even get the trace data from all of the Tasks

  • Jeff Bezanson 4:59 AM
    It doesn't seem right to me to insist on that in every case. Surely there are cases where you need to potentially collect errors from multiple tasks?

  • Jameson Nash 5:07 AM
    You can’t have both that and early return, unless the return set is the list of Tasks which lets show_error then do advanced diagnostics

  • Jeff Bezanson 5:11 AM
    I think it's fine to have "return as soon as there is one failure, and tell me just that one failure". But I think what trio does is as soon as there is one failure, try to cancel all the other tasks, then wait for them, then gather exceptions from all the failed ones. So we might need to provide both options.
    Does show_error need anything from a Task besides the exception and backtrace?

  • Jameson Nash 5:19 AM
    Yeah, “try to cancel” is horribly problematic though on many levels

  • Jeff Bezanson 5:19 AM
    Ok but that is a separable issue --- you can also just do what @sync does today, with no canceling.

  • Jameson Nash 5:20 AM
    Yes, but then you can’t collect the errors. You either must return the Tasks, or just the first error.

  • Jeff Bezanson 5:21 AM
    Why? You could still wait for all of them to finish.

  • Jameson Nash 5:21 AM
    I think if it has the Task object itself, we may be able to do some more interesting work showing data dependency cycles, meaning we can print much more useful stacktraces
    Because not waiting for them all to finish is explicitly why Experimental.@sync exists as an improvement over @sync

  • Jeff Bezanson 5:23 AM
    So waiting for a set of tasks to finish is inherently wrong?
    I just don't buy that there is never a need to propagate exceptions from multiple tasks. It seems pretty fundamental in trio.

  • Jameson Nash 5:23 AM
    Not necessarily, that’s just what’s being proposed there
    And I agree we do want to get the exception from multiple tasks, I’m just saying we can’t have both the experimental design proposal, and follow the last bullet point (of not returning the Task itself)
    And if we return the Task itself, we can also print much more interesting backtraces that the runtime system couldn't have computed, but must be deferred until error printing time (when we can examine the heap)

  • Jeff Bezanson 5:28 AM
    Ok, well I'm not totally opposed to throwing the Tasks themselves; I see you could do things like check which of them might be waiting for others.
    And Tasks are convenient in that they already bundle an exception and a backtrace; otherwise we need to awkwardly pass around two values for each error.

  • Jeff Bezanson 5:36 AM
    Another question is, do we always have a Task whenever we want to propagate an exception, or do we sometimes need to just propagate a bare pair of exception+backtrace?

  • Jameson Nash 6:04 AM
    We might need to propagate a bare pair if the root task (the one with the sync call) failed

  • Chris Foster 10:26 PM

    Yeah, “try to cancel” is horribly problematic though on many levels

    I'd love to understand better why this is.
    Cancellation seems fairly fundamental to the desire for usable structured concurrency rather than do-what-the-heck-you-like concurrency.
    Experimental.@sync just gives up on scoping of child tasks so I'm not at all convinced it's a long term improvement over @sync

  • Chris Foster 10:43 PM
    I understand that cancellation is hard because resource management is hard (more or less). Preemptive cancellation leaves resources dangling so you really can't do that without complete isolation between tasks enforced by the runtime (erlang style) so the runtime can GC them. Clearly that's completely incompatible with the Julia task model.
    Ok, so what are the options for cooperative cancellation?
    There's the trio / pthreads / etc style where cancellation is level triggered and causes a small well-documented set of IO functions to return an error. That seems pretty reasonable for non-buggy programs which do IO.
    @jameson I think you said the pthreads-like option is insufficient last time we discussed this, but I don't think you explained exactly why.

  • Chris Foster 10:57 PM
    Problem is, we can't cancel compute with that model and a lot of compute is what people are likely to be doing in Julia...
    So is this the main sticking point: there's no known models for safely cancelling compute which would fit with our runtime?

  • Jameson Nash 11:08 PM
    I’ve just never seen a cancellation system that isn’t some combination of:

    • rife with mistakes in the choice of functions (IO is the worst possible choice, IMO)
    • fails to provide any useful bound on time to completion (this is the thing you actually want to cancel)
    • incompatible with proper cleanup of resources (esp. trio, which tries to pretend otherwise)
    • deprecated due to those issues
  • Chris Foster 11:12 PM
    Hmm, interesting. So why is IO the worst choice, and what would be a better one?
    Time to completion is a real killer.

  • Jameson Nash 11:13 PM
    Transactional memory would probably be the main one
    IO is the worst choice because it guarantees you’ll leave the other actors in a broken state
    The goal is to cancel pure work (which might include IO with nobody listening), but instead it cancels the work most likely to lead to permanent corruption
    (and not to say I don’t use kill -9 with abandon, but I usually then do need to go fix the filesystem)
    My proposal (if I had time to work on this more), would be to make sure we have a way to close the dead IO objects explicitly when the nursery is exited
    So I’d propose ignoring tasks completely, and make open the primitive that’s dynamically scoped inside a nursery
    For the same reason actually too that r = open(cat *); wait(r); read(r) is a deadlock, because it waits for the Task instead of the unit of work (the readall call needs to happen before the wait )

  • Chris Foster 11:21 PM
    So is your point something like... it's not so much nesting of task lifetime which is important in structuring concurrency, but something related — nesting of any side effects the tasks may have?

  • Jameson Nash 11:22 PM
    I also think this isn’t something you can retrofit into an existing language, but fortunately, Julia has strong safe management of IO resources and exposes those handles typically to the user, so it’s something we should be able to do
    Right.

  • Chris Foster 11:23 PM
    Ok very interesting. Very easy to confuse those two things, but it's side effects which really matter

  • Jameson Nash 11:23 PM
    Actor model (including Trio) gets you pretty far by asserting that one task must be equivalent to one side effect

  • Chris Foster 11:30 PM
    I'm just trying to digest all that. So the idea is that if tasks communicate exclusively with things which can be opened, then ensuring those channels are closed in a nested way can ensure side effects are nested?
    What about shared memory though?

  • Jameson Nash 11:30 PM
    But that doesn’t seem to nest correctly? For example, what if I wanted to write:

    with(database) do:
        for client in accept(port):
            @async with(client) do:
                handle(client)
    
    • Chris Foster
      So, I feel like this is a really important point you're making here. But I don't understand it — would you be able to expand a little?

    • Jameson Nash
      If one client dies, we don’t want it taking down the entire database. But so on client death, it may need to do IO with the database to cleanup and notify other live clients

    • Jameson Nash
      But if the database dies, we need all of the tasks to exit without attempting to cleanup the database (which is already dead)

  • Jameson Nash 11:32 PM
    Shared memory I think also has a close function?
    Not all of them are created via the open verb as well. Some (like Channel) are simply constructed.

  • Chris Foster 11:35 PM

    Shared memory I think also has a close function?

    Sure, in principle, so no more sharing of normal Arrays? I'm not sure I'm understanding :slightly_smiling_face:

  • Jameson Nash 12:37 AM
    Ah, I thought you meant the stdlib of that name. I’ve done some design thinking on adding close to Array itself (currently available via empty! + reshape), but that seems tangential to this.

  • Chris Foster 12:51 AM
    Well I'm supposing that mutating memory (in a way which is visible to other tasks) should be correctly nested as much as other side effects should be.
    But I don't understand how that can be made practical.

  • Jeff Bezanson 1:00 AM
    This is a reason I think the current @sync is maybe not so crazy. If you can set up your tasks to work with cancellation, you can probably also set them up to just terminate. Cancellation just feels more magical and automated. Waiting for everything is a pain for debugging, but we can address that with better introspection and tooling i hope.

  • Jameson Nash 1:14 AM
    I think you might have close on a Lock, since that’s generally necessary for the shared resource (the Array itself then being single-owner-at-a-time with exchange controlled by the lock)? I haven’t looked into that though.
    Yeah, we could probably implement deadlock detection, and then the current @sync design also might just magically work
    (deadlock here meaning detection that everyone in the sync-set is waiting for an internal, so non-IO, event)

I don't get why "IO is the worst choice." You should be ready for failures when dealing with IO. So, it seems to be the best choice.

Also, I'm not sure why it'll "leave the other actors in a broken state." Lexical scoping in structured concurrency is powerful exactly because it works nicely with lexically scoped resource management (Julia's do and Python's with).

I agree using resources as cancellation tokens is a good implementation strategy but I think it's rather a very low-level implementation detail. I think it's also hard to use in practice.


A case-study on using resources as cancellation tokens

In #34543, I wondered how to make Threads.foreach(f, channel::Channel) behave nicely with errors. There are a few nice properties to have:

  1. When one task throws, all other tasks terminate reasonably soon (structured concurrency).
  2. Don't close the input channel.
  3. All items take!n from the input channel are processed.

Current implementation in #34543 is, roughly speaking, something like this

function tforeach1(f, channel::Channel; ntasks=Threads.nthreads())
    stop = Threads.Atomic{Bool}(false)
    @sync for _ in 1:ntasks
        Threads.@spawn try
            for item in channel
                f(item)
                stop[] && break
            end
        catch
            stop[] = true
            rethrow()
        end
    end
    return nothing
end

tforeach1 satisfies properties 2 and 3 but not 1. This is because a task can be stuck on take!(channel) in the line for item in channel.
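The stuck-on-take! behavior can be seen in isolation. In this sketch (mine, not from the PR), a task blocks on take! from an empty channel and only wakes up when the channel is closed:

```julia
# A task blocking on `take!` from an empty Channel stays blocked until
# someone either `put!`s an item or `close`s the channel; closing makes
# the pending `take!` throw an InvalidStateException.
ch = Channel{Int}(Inf)
t = @async try
    take!(ch)      # blocks here; the channel is empty
    :took
catch e
    e isa InvalidStateException ? :closed : rethrow()
end
sleep(0.1)         # give the task time to block on take!
close(ch)          # wakes the task by making take! throw
@assert fetch(t) === :closed
```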

Initially, I thought it might be better to use the channel as the resource that is closed to propagate the cancellation:

function tforeach2(f, channel::Channel; ntasks=Threads.nthreads())
    @sync for _ in 1:ntasks
        Threads.@spawn try
            for item in channel
                f(item)
            end
        catch
            close(channel)
            rethrow()
        end
    end
    return nothing
end

tforeach2 satisfies properties 1 and 3 but not 2. That is to say, we shouldn't use a resource we do not own as the cancellation token. So how about creating a resource that we "own"?

function tforeach3(f, channel::Channel; ntasks=Threads.nthreads())
    owned_channel = Channel() do owned_channel
        for item in channel
            put!(owned_channel, item)
        end
    end
    @sync for _ in 1:ntasks
        Threads.@spawn try
            for item in owned_channel
                f(item)
            end
        catch
            close(owned_channel)
            rethrow()
        end
    end
    return nothing
end

Unfortunately, tforeach3 satisfies properties 1 and 2 but not 3. The owned_channel may be closed after an item is take!n from the input channel.

If we have something like Go's select (i.e., we have a way to say "exactly one of those effect happens"), it'd be possible to do this. But it's rather cumbersome:

function tforeach_safe(f, channel::Channel; ntasks = Threads.nthreads())
    done = Channel{Nothing}(ntasks)
    try
        @sync for _ in 1:ntasks
            Threads.@spawn while true
                @select begin
                    item = take!(channel) begin
                        # `take!(channel)` happened (but `take!(done)` didn't)
                        try
                            f(item)
                        catch
                            put!(done, nothing)
                            rethrow()
                        end
                    end
                    take!(done) begin
                        # `take!(done)` happened (but `take!(channel)` didn't)
                        break
                    end
                end
            end
        end
    finally
        close(done)
    end
end

(Note: more precisely we need to use maybe_take!(::Channel{T}) :: Union{Nothing,Some{T}} instead of take!(channel) to handle the case where channel is closed. But it's pseudo-code anyway.)
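For concreteness, here is a hedged sketch of that hypothetical maybe_take! helper (not part of Base); it converts the InvalidStateException thrown by a closed, drained channel into nothing:

```julia
# Hypothetical helper: like `take!`, but returns `nothing` instead of
# throwing once the channel has been closed and drained, so callers can
# distinguish "no more items" from a real error.
function maybe_take!(ch::Channel{T}) where {T}
    try
        return Some{T}(take!(ch))
    catch err
        err isa InvalidStateException && return nothing
        rethrow()
    end
end
```

Note that a closed channel can still be drained of buffered items, so callers see every item that was put! before getting nothing.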

Actually, tforeach_safe is not enough because we don't control which take! is preferred in @select (at least if it is Go-like)... I guess we need something like

function tforeach_safe2(f, channel::Channel; ntasks = Threads.nthreads())
    taskref = Ref{Task}()
    done = Channel{Nothing}(ntasks)
    @sync try
        request = Channel(taskref = taskref) do request
            while true
                response = take!(request)
                @select begin
                    item = take!(channel) begin
                        put!(response, item)
                    end
                    take!(done) begin
                        close(response)
                        break
                    end
                end
            end
        end
        Base.@sync_add taskref[]
        for _ in 1:ntasks
            Threads.@spawn begin
                response = Channel(1)
                while true
                    try
                        put!(request, response)
                    catch
                        break  # closed by other worker tasks (or no more items)
                    end
                    item = try
                        take!(response)
                    catch
                        break  # closed by the `request` task
                    end
                    try
                        f(item)
                    catch
                        # Allow no more request (but still process
                        # requests that are already sent):
                        close(request)
                        put!(done, nothing)
                        rethrow()
                    end
                end
            end
        end
    finally
        close(done)
    end
    return nothing
end

Combine CapturedException, CompositeException, and TaskFailedException into one thing similar to trio's MultiError. Need a name for this! PropagatedExceptions? ExceptionTree?

Also, just wanted to chime in regarding this point, and point out this package we ended up writing now that we're using more concurrency inside our code at RelationalAI:
https://github.com/NHDaly/ExceptionUnwrapping.jl

It's scary that some internal decisions to multithread some code can change the kinds of exceptions that are being thrown, so we're now using has_wrapped_exception(e, T) everywhere instead of e isa T in our catch blocks.

This seems relevant, though i think orthogonal, to the current discussion, so I just wanted to note it :)

I continue to try to find counter-examples to structured concurrency, or at least cases where it would uncomfortably constrain programming style. Recently I thought I found one but it turned out to be a false alarm. It's an interesting case to think through though.

The setting

For resource lifecycle handling, the open(thing) do resource ; ... end pattern is extremely useful because

  1. The user can't forget to call close
  2. The resource type owns the stack because it owns the open implementation; it can use block constructs like try ... finally, @sync, and control state can be kept in the stack. This seems closely related to the benefits of foldl based loops which @tkf has been emphasizing for a while.

The former is important for ergonomics, but the latter is what matters for structured concurrency, as the open() may need to start async background tasks (e.g., a resource which communicates with a remote server). In strict structured concurrency these async background tasks can't escape the call to open(), so the pattern of passing a closure to be invoked with resource in the inner scope is very compatible with this.
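As a purely illustrative sketch of that point (the open_ticker name and its heartbeat task are assumptions, not an existing API): with the do-block form, open can start a background task whose lifetime is strictly contained within the call, so the caller never observes a leaked task.

```julia
# Sketch: the do-block form lets `open_ticker` start a background task
# whose lifetime is strictly contained in the call. When `open_ticker`
# returns, `@sync` guarantees the heartbeat task has exited.
function open_ticker(f)
    stop = Channel{Nothing}(1)
    ticks = Threads.Atomic{Int}(0)
    @sync begin
        @async begin                 # background "heartbeat" task
            while !isready(stop)
                Threads.atomic_add!(ticks, 1)
                sleep(0.01)
            end
        end
        try
            f(ticks)                 # user code runs with the resource
        finally
            put!(stop, nothing)      # child task cannot outlive this call
        end
    end
    return ticks[]
end
```

Usage would be open_ticker(ticks -> do_stuff()); the design choice is that the caller never sees the child task at all, only the resource handed to the closure.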

For the same reason, the classic producer-consumer pattern of resource management with paired open/close calls, familiar from file IO, can't be used in structured concurrency because it leaves no context in which to manage the child tasks:

resource = open(thing)
do_stuff(resource)
close(resource)

(That is, not without making the task nursery an explicit part of the open API's parameters, which is no good for composability as it essentially colors open() via calling convention.)

Why not simply transition to always using do blocks and closures for resource handling?

Well we should for most things, even just for the ergonomic benefits of writing code which always correctly closes resources.

But it's actually pretty inconvenient in the REPL! In the REPL you want to enter a context where resource is available and use it interactively. So the scoped resource management is pretty inconvenient in this important case.

I thought this was a problem but it actually isn't if the REPL itself maintains a nursery. One way out is to have a macro which, roughly speaking, turns the nested interface

open(thing) do resource
    do_stuff(resource)
end

into something which looks like the producer-consumer interface by introducing a task explicitly into the REPL nursery to manage the resource lifetime. Schematically,

ready = Channel()
done = Channel()
@async_in_ctx repl_nursery open(thing) do r
    global resource = r  # Or whatever is required to set this in the REPL context
    put!(ready, true)
    take!(done)
end

# REPL work ...
do_stuff(resource)

put!(done, true)

Working sketch

Here's a sketch that actually works in current Julia (though I resorted to bare @async because we don't have explicit nurseries, or a REPL which maintains one):

struct AsyncResource
    done::Channel
end

Base.close(r::AsyncResource) = put!(r.done, true)

macro asyncdo(ex)
    @assert ex.head == :(=)
    var = ex.args[1]
    call = ex.args[2]
    doblock = :(placeholder() do x
        # This `global` is problematic!
        # Should be an `outer` variable!?
        global $var = x
        put!(ready_done, true)
        @info "Ready $(current_task())"
        take!(ready_done)
        @info "Done $(current_task())"
        nothing
    end)
    doblock.args[1] = call
    quote
        ready_done = Channel()
        @async $doblock
        take!(ready_done)
        AsyncResource(ready_done)
    end
end

Kinda ugly usage example

julia> resource = @asyncdo io = open("resources.jl") # internally, calls do-based form of open()
[ Info: Ready Task (runnable) @0x00007ff3843549d0
AsyncResource(Channel{Any}(sz_max:0,sz_curr:0))

julia> read(io, 1)
1-element Array{UInt8,1}:
 0x23

julia> close(resource)
[ Info: Done Task (runnable) @0x00007ff3843549d0
true

julia> isopen(io)
false

Of course it looks pretty weird to use this for normal file streams as in this example. But I hope it makes sense if you imagine a world in which there's no such function open("file") but only the scoped form open(f, "file").

Any thoughts on how that might mesh with the io = open(path)! syntax proposal in https://github.com/JuliaLang/julia/issues/7721?

Any thoughts on how that might mesh with the io = open(path)! syntax proposal in #7721?

Aha, thanks for the very relevant link. I was hoping to re-read that discussion last week but I just couldn't find it. (More complete XRef: The relevant discussion of postfix-! syntax starts here https://github.com/JuliaLang/julia/issues/7721#issuecomment-170942938; Jeff suggested it could relate to a more general defer form here https://github.com/JuliaLang/julia/issues/7721#issuecomment-171004109.)

One interesting thing about defer-like approaches is that they avoid excessive nesting syntax (and, perhaps, strict nesting of resource lifetimes) and still provide guarantees about when cleanup will run. There's a lot to like about it, especially with the ! shorthand syntax.

On the other hand, having open(path)! surface syntax suggests that resources would be implemented as state machines with open and closed states, rather than as a higher-order function accepting the user's callback. That's relevant because the following seem incompatible:

  1. Using explicit state machines for resource handling, where the resource implementer returns the resource from open().
  2. Structured concurrency with child task lifetimes which do not outlive function calls
  3. Ensuring that the use of concurrency within open() is an implementation detail. (AKA, avoiding colored functions; colored by the calling convention of whether open() accepts a nursery or not.)

I think there's various ways to attack this apparent problem.

  • Don't use explicit state machines for resource handling; invent some lowering for ! (or other convenient surface syntax) which lets the user use them like a state machine but the implementer implement them as a normal higher order function. FLoops.jl is some interesting prior art for this kind of approach. (The AsyncResource sketch shows one ugly and inefficient way that a callback-based open() can be transformed into a state machine. Perhaps there's something less bad!)
  • Naturally, one could ditch structured concurrency. Seems like a pity!
  • Colored functions; make the calling convention for "resource functions" include a nursery in some way. Could possibly be viable with the right lowering for !, as the appearance of ! in the source code seems like it would induce an explicit chain of custody. (Cf. the way that Kotlin's coroutine context is passed)

No doubt there's other options.

My current thinking is that the higher-order function pattern does have a limitation in that it is _too flexible_.

For example, in FGenerators.jl (the "sister project" of FLoops.jl), I added FGenerators.@with so that open(...) do pattern can be used with good performance-oriented iterations over "data collections with resources". This is required to avoid Boxing in the closures.

The closure problem itself may be solved at some point. But I think it's valuable to encode the basic property that the "do block" is executed _exactly once_. For example, Rust has a similar FnOnce trait (although Rust's resource management is usually RAII). Python's with block satisfies this property exactly. OTOH, in FGenerators.@with, the user has to manually make sure this is the case. This is obviously not ideal (and doing so rigorously ATM is virtually impossible because of multiple dispatch).

the open() may need to start async background tasks

3. Ensuring that the use of concurrency within open() is an implementation detail.

These points are interesting. In a way, Trio is not doing structured concurrency super strictly if you view it from this standpoint. That is to say, it allows you to leak resources since you can call __aenter__ manually. So, you can define open/close as a state machine quite easily even when you need a nursery as a sub-resource:

class ResourceWithNursery:

    async def __aenter__(self):
        self.manager = trio.open_nursery()
        nursery = await self.manager.__aenter__()
        nursery.start_soon(trio.sleep, 0.1)
        print("Enter")

    async def __aexit__(self, exc_type, exc_value, traceback):
        print("Exit")
        await self.manager.__aexit__(exc_type, exc_value, traceback)

Of course, most of the time Python programmers do not write resource managers by hand. Instead, it'd be derived from the generator syntax:

from contextlib import asynccontextmanager


@asynccontextmanager
async def resource_with_nursery():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(trio.sleep, 0.1)
        print("Enter")
        yield
        print("Exit")

I think this is close to the spirit of "the implementer implement them as a normal higher order function." From my experience of playing with IRTools.jl a bit, I think something like @asynccontextmanager is totally possible.

But, the "lesson" I'd derive from this is rather that it is OK to provide "unsafe" APIs as long as the API has a clear indication of unsafeness to nudge the implementer to be very careful. Even Rust can't save you from a mis-implemented drop anyway. The open/close state machine would be one such "unsafe" API. To explicitly denote that certain APIs are low-level, I think Python does a good job with its "dunder" methods. I'm not suggesting dunder as the only solution but, since we already have __init__, it may not be so crazy to have __open__ and __close__. It gives you some alarm if you see code calling these functions manually.

A bit aside, but I think one of the designing challenges of open(...)!/defer-like approach is what to do when you want to control the lexical scope where the close must happen. For something like a mutex it matters a lot.

Other options are to disallow defer calls in global scope, ignore them (let finalizers handle it), or implicitly turn them into finalizers in global scope. Not sure if any of those approaches simplifies things.

At global scope it would seem logical to turn defer handlers into finalizers as there's no static lifetime information. I'm unsure how defer would interact with closure captures, but we should probably discuss that in #7721.

OTOH, in FGenerators.@with, the user has to manually make sure this is the case. This is obviously not ideal (and doing so rigorously ATM is virtually impossible because of multiple dispatch).

Interesting point. Maybe we could have a wrapper which dynamically asserts the call-once property, but can be optimized away when enough type information is available.
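For instance, the dynamic check could be as simple as this sketch (hypothetical, not an existing API); a sufficiently smart compiler could in principle elide the flag when the call-once property is provable:

```julia
# Sketch of a wrapper that dynamically asserts the call-once property.
# The flag is a runtime check; with enough type information it could
# in principle be optimized away.
mutable struct CallOnce{F}
    f::F
    called::Bool
    CallOnce(f::F) where {F} = new{F}(f, false)
end

function (c::CallOnce)(args...; kwargs...)
    c.called && error("call-once callback invoked a second time")
    c.called = true
    return c.f(args...; kwargs...)
end
```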

Some interesting discussion of Go's context here:

https://news.ycombinator.com/item?id=24323564#24327585

Swift now has a set of WIP proposals for concurrency in general, including one on structured concurrency. From a fairly quick read, their proposal seems very much similar to the Trio approach. However it's an interesting and very relevant read in its own right, as it's quite concise and discusses structured concurrency as a language feature rather than at the library level:

https://github.com/DougGregor/swift-evolution/blob/structured-concurrency/proposals/nnnn-structured-concurrency.md

I found the section on cancellation interesting. The proposed model seems quite standard: it's fully cooperative cancellation where they expect IO operations to be cancellable. To quote:

With that said, the general expectation is that asynchronous functions should attempt to respond to cancellation by promptly throwing or returning. In most functions, it should be sufficient to rely on lower-level functions that can wait for a long time (for example, I/O functions or Task.Handle.get()) to check for cancellation and abort early. Functions which perform a large amount of synchronous computation may wish to periodically check for cancellation explicitly.
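Translated into Julia terms, that cooperative model amounts to compute loops polling a cancellation token. A minimal sketch, where the Atomic{Bool} flag is an illustrative stand-in for whatever token a runtime would provide:

```julia
# Cooperative cancellation sketch: a compute kernel that periodically
# polls a cancellation flag and aborts early, as the Swift proposal
# suggests for synchronous-compute-heavy functions.
function checked_sum(n::Int, cancelled::Threads.Atomic{Bool})
    acc = 0
    for i in 1:n
        if i % 1024 == 0 && cancelled[]   # periodic explicit check
            return nothing                # abort promptly on cancellation
        end
        acc += i
    end
    return acc
end
```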

Somewhat related to all this, it seems Swift will go down the path of colored functions — see the WIP async/await proposal.

I find these "standard" cancellation proposals (based on Trio, inherited from C?) to not be very helpful. They are supposed to give you implicit cancellation after a timeout, yet provide neither. They still only cancel after specific points explicitly check for the condition (when reaching IO operations) and they thus still require you to have resources—they just don't let you clean up resources properly (close is also a cancellation point in C, but also probably will leak memory and break other state since any cleanup is skipped). And worse, they seem to add yet another "color" of functions (yellow?): there are those that are canceled (nursery) and those that aren't (detached).

The cancellation that I'm arguing we already have is thus equivalent in functionality but better on each axis: we define that the cancellation happens when interacting with a resource ✔️, except the check really is implicit ✔️(via close state checking) and fully scoped and doesn't require passing along additional handles ✔️(just using the ones it already has). This wasn't possible in C since there's no resource management of file descriptors, so pthreads has this sort of cancellation instead. But we don't need to keep inheriting the limitations of that past, if we have an IO, Channel, and Event system planned to handle this.

Right, I know you've said you don't like other cancellation systems on multiple occasions. To be clear, I'm not arguing that in this swift proposal Swift is doing it "the right way", only pointing out that it's interesting to see. BTW, did you read it yourself — I'd be curious if you think my summary is accurate?

The cancellation that I'm arguing we already have

My problem with the cancellation we already have is that using @sync correctly seems to be very tricky and manual — it's extremely easy for a task to accidentally crash and not have its siblings be cancelled. It's necessary to manually pass around cancellation tokens if there's no resource naturally available. For cancelling compute (and memory access), I wish there was a way to check "should this task be cancelled?" without creating a Channel or some other resource and passing it by hand.

Experimental.@sync intentionally leaks tasks which may still be running, but I don't think this is a good solution unless the leaked tasks truly have no observable side effects. In current Julia there are many ways a task can have side effects visible to other tasks.
