Rfcs: A `yield` construct in the vein of C#

Created on 12 Oct 2014 · 75 Comments · Source: rust-lang/rfcs

Issue by bstrie
_Friday Jul 12, 2013 at 13:46 GMT_

_For earlier discussion, see https://github.com/rust-lang/rust/issues/7746_

_This issue was labelled with: A-an-interesting-project, E-hard, I-wishlist in the Rust repository_


@thestinger has mentioned this several times as a wishlist item. This is a feature that greatly simplifies the implementation of external iterators by allowing them to be compiled to state machines at compile time (did I get that right?). See http://msdn.microsoft.com/en-us/library/vstudio/9k7k7cf0.aspx for reference.
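To illustrate the transformation being asked for (a rough sketch in Python, since the proposal predates any Rust syntax for this): a compiler-implemented `yield` would turn the generator function below into something like the hand-written state machine class next to it, at compile time rather than at run time.

```python
# A generator written with `yield`...
def count_to(n):
    i = 0
    while i < n:
        yield i
        i += 1

# ...and a hand-rolled equivalent: an explicit state machine object,
# roughly what a compiler-implemented `yield` would generate.
class CountTo:
    def __init__(self, n):
        self.n = n
        self.i = 0   # the saved "state" between resumptions

    def __iter__(self):
        return self

    def __next__(self):
        if self.i < self.n:
            v = self.i
            self.i += 1
            return v
        raise StopIteration

assert list(count_to(3)) == list(CountTo(3)) == [0, 1, 2]
```

The point of doing this at compile time is that the resulting state machine is ordinary code: it needs no separate stack or context switches, and the optimizer can see straight through it.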

While we shouldn't expect this for 1.0, it might be prescient to reserve the keyword right now.

Nominating for far-future milestone.

T-lang


All 75 comments

I believe yield conveys pretty much the same meaning in other languages, including Python, Ruby and Lua. I think it's very appropriate to remove "C#" from the title.

@errordeveloper: It's not implemented with a compile-time state machine in Python and Ruby. C# in the title is definitely appropriate, because (AFAIK) there isn't another language with a comparable feature. The proposal is not about generators / coroutines built on context switches.

@thestinger sure, I didn't fully realise that.

A decently in-depth discussion of yield support in C++: http://isocpp.org/files/papers/N4134.pdf

@sfackler great find!

Am I understanding correctly that someone needs to make a pull request with an actual RFC write-up, and if so, has anyone here already started on that, or is at least intending to do so?

@errordeveloper I have started thinking about the implications of a 'yield' construct for borrow checking, which is something that past proposals have sort of glossed over. I haven't been too anxious to write up an RFC, because there is no chance that it will be considered before 1.0.

Is the fact that it's implemented as a compile-time state machine significant? The usage pattern to the developer is the same as in any other language, albeit with explicit typing.

It is, as it makes this feature most useful. State machines generated at build time can be validated/safety-checked as well as optimised by the compiler plugin that implements it.


That is an interesting bunch of libraries there! It looks like your monad macro has bind as a primitive like Haskell's. I wonder if performance could be improved by using map and join alone as primitives to remove some "administrative lambdas".

Either way (as pipes will contain FnOnces even if the core monad parts don't), it would be nice if there were some way to add an #[inline] to the function the once-closure sugar desugars into. Maybe that might help with code locality.

Here is the latest Kohlhoff paper, "Resumable Expressions", a new lower-level C++ alternative to await that allows implementing yield at the library level and combines stackless and stackful coroutines:
http://open-std.org/JTC1/SC22/WG21/docs/papers/2015/n4453.pdf

I can't claim to be much of a fan of this proposal. He completely ignores what I think is the major advantage of the await/async (in Python: yield from/coroutine) model: explicit control flow shifts. One of the major side benefits to single-threaded asynchronous design, as opposed to multithreaded synchronous, is that the control flow is only ever in one place at once. This means that a lot of the state-sync and atomicity issues in multithreaded design can be avoided. Compounding this advantage is the fact that, with await/yield from, all context shifts are explicit. In effect this means the developer can guarantee that between this await and that one, no other coroutine will touch any shared state.

As an example, I used this to my advantage when writing an IRC-like chat room using Python asyncio, where I had the list of logged-in users as a dict mapping user ids to connection objects. Because I could guarantee exclusive access to this object in a given coroutine between yield froms, I didn't have to worry about locking it, creating a local copy of it, or anything like that.
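The pattern in the anecdote above can be sketched with asyncio (hypothetical names; a minimal sketch, not the original chat server):

```python
import asyncio

# Shared mutable state, touched without any lock: within a single event
# loop, no other coroutine can run between one await and the next.
users = {}

async def handle_login(user_id, conn):
    # Everything up to the next `await` is effectively atomic with
    # respect to the other coroutines on this loop.
    users[user_id] = conn
    await asyncio.sleep(0)   # explicit suspension point: others may run here

async def main():
    await asyncio.gather(handle_login(1, "conn-a"),
                         handle_login(2, "conn-b"))
    return sorted(users)

print(asyncio.run(main()))   # [1, 2]
```

Nothing here enforces the invariant; it follows purely from the single-threaded loop plus the visible `await` points, which is exactly the property the resumable-expressions model gives up.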

Unfortunately, in the proposed resumable model, this advantage is lost. Any function could conceivably result in a context switch. The situation is slightly improved by the fact that such functions are explicitly tagged "resumable", but because merely calling the function can cause it to begin running or suspending, there's no way to separate the act of instantiating the function from that of stepping through it.

Worth noting: Python's PEP process just accepted an async/await PEP: https://www.python.org/dev/peps/pep-0492/

PEP 492 is pretty cool, I'm excited to try the next pre-release of Python 3.5.

In addition to generators and async functions in Javascript, there's also some work being done on async generators as well: https://github.com/jhusain/asyncgenerator. The author gave a talk about some related concepts in this talk, though the syntax is hidden on the last slide and not really explained: https://www.youtube.com/watch?v=gawmdhCNy-A.

@Lucretiel I find that being able to reuse existing functions _as is_ is a huge advantage of the "resumable expressions" proposal. The other proposals split the world, and that is in my opinion a very bad thing.

The trouble is that you then lose what I think is one of the primary benefits of async programming: explicit context switch points. When literally any function can cause a context switch, you're basically no better off than with multithreading: constantly having to use locks and other primitives to ensure all your state is safe. God help you if you get it wrong and have to try to debug it. With explicit async, you know that no other coroutine will execute between one yield and the next, so you're safe to do whatever needs to be done, and be guaranteed that it will appear atomic to other coroutines.

I'll grant that being able to reuse everything has some surface appeal, but I think it's a problem waiting to happen. The reality is that concurrency requires a fundamentally different model of thinking (look at all the classic multithreading problems to see why), and explicit async reflects that model.

Just my two cents, though. It could be that rust, with its emphasis on static memory safety, doesn't need to worry as much as, say, C++.

Just to provide a counterpoint, introducing async/await keywords is (IMO) a growing language smell. It means that you're effectively partitioning all code into two competing classes (blocking/async), and that really causes some significant problems for large codebases. Please see http://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/ for more.

Just to provide a counterpoint, introducing async/await keywords is (IMO)
a growing language smell. It means that you're effectively partitioning all
code into two competing classes (blocking/async)

Yes, that basically splits the world.


Just to provide a counterpoint, introducing async/await keywords is (IMO) a growing language smell. It means that you're effectively partitioning all code into two competing classes (blocking/async)

I completely understand your point of view, having debated it a great deal myself, but I have to disagree. It is a totally different model with different semantics, and you cannot reasonably pretend to do async work synchronously and have it be something that a mediocre programmer, such as myself, can reason about.

While a simple function is just a function, an async function isn't a function at all. The flow is _like_ a function, but the async and await keywords allow us to reason about what race conditions may happen in this function in a sane way, because we know where suspension will happen. The similarity to a function is on purpose, but they aren't the same thing, and for very good reason.
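Python's version of this distinction makes the point concrete: calling an async function does not run its body at all; it merely constructs a suspended state machine that an event loop must drive.

```python
import asyncio

async def fetch():
    # An "async function": the body only runs when an event loop drives it.
    return 42

c = fetch()                  # calling it does NOT execute the body...
print(type(c).__name__)      # coroutine -- it built a coroutine object
print(asyncio.run(c))        # 42 -- only the event loop runs it to completion
```

So the call syntax is deliberately function-like, but the thing being called is a different kind of object with different semantics.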

The fact that the longer we wait for explicit async and await, the more libraries will end up with two versions of different functions, one synchronous and one asynchronous, is an indication that it's important to solve this sooner rather than later, so that library authors can have the tools they need to make that decision. Then maybe they can choose to have only one version, whichever is appropriate for their work.

Because of Rust's memory safety guarantees, it's easy to conclude that we don't need explicit suspension points, and that Rust's ownership rules will sort out this mess, as it does for memory access. But there are plenty of race conditions that aren't memory-access data races, and programmers need as much help with those as they do with memory races.

So far, I haven't seen how Rust's ownership could make that part of explicit async programming unnecessary. If somebody can solve that, then I'd probably consider further whether explicit async and await keywords were really necessary. For now, I'm convinced that they are.

While a simple function is just a function, an async function isn't a function at all. The flow is like a function, but the async and await keywords allow us to reason about what race conditions may happen in this function in a sane way, because we know where suspension will happen. The similarity to a function is on purpose, but they aren't the same thing, and for very good reason.

Because of Rust's memory safety guarantees, it's easy to conclude that we don't need explicit suspension points, and that Rust's ownership rules will sort out this mess, as it does for memory access. But there are plenty of race conditions that aren't memory-access data races, and programmers need as much help with those as they do with memory races.

This seems like a surprising position to take. This doesn't strike me as a very good reason, but more akin to rationalization. As long as there's other threads running, you definitely don't have explicit suspension points. Your argument against yourself here is a good one. Rust is designed to make explicit suspension points unnecessary, which is a good thing. Single-threaded event loops don't make good use of multi-core parallelism.

The fact that the longer we wait for explicit async and await, the more libraries will end up with two versions of different functions, one synchronous and one asynchronous, is an indication that it's important to solve this sooner rather than later, so that library authors can have the tools they need to make that decision. Then maybe they can choose to have only one version, whichever is appropriate for their work.

Again, a surprising position to take. If the language handles async IO for you, then _every_ library always works for you, and there's no bifurcation. Imagine the following scenario: you're writing some small app at low scale. It's a killer app, it does well, and you need to scale it up. Bam, you need to switch to an async IO model (as indicated by https://news.ycombinator.com/item?id=10232131). Well shoot, that sucks. Okay, so you start porting everything, but you start hitting the long-tail of libraries. There's only one Rust UPNP library perhaps, and its author didn't need to scale like you do, and it blocks. Great, so now you either get to write a new library or make a threadpool. But it's not just the UPNP library, there's 20 other little libraries like this. You switch to the async version of some database library and everything starts looking good. A few weeks later you notice your app just keeps blocking up and no longer responding. Turns out one small error case in the database library uses the blocking version of the library and now your entire event loop is hosed.

This entire problem could be avoided completely.
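The failure mode described above can be demonstrated in miniature (hypothetical function names; a sketch, not a real library): one stray blocking call serializes every coroutine sharing the loop, while the properly awaiting version overlaps.

```python
import asyncio
import time

async def accidental_block():
    time.sleep(0.05)          # blocking call: stalls the whole event loop

async def well_behaved():
    await asyncio.sleep(0.05) # yields to the loop; others run concurrently

async def timed(coro_fn, n=2):
    # Run n copies concurrently and measure wall-clock time.
    start = time.monotonic()
    await asyncio.gather(*(coro_fn() for _ in range(n)))
    return time.monotonic() - start

blocking_time = asyncio.run(timed(accidental_block))  # ~0.10s: serialized
async_time = asyncio.run(timed(well_behaved))         # ~0.05s: overlapped
print(blocking_time > async_time)  # True
```

One such call buried in an error path of a dependency is enough to hose the loop, which is the long-tail scenario above.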

Implementation-wise, it is not hard to convert an async function back to a blocking one with little to no overhead. It may be too late to do anything like this in Rust, but it would be interesting to make _everything_ async, and then let typical blocking calls use some syntactically lightweight (perhaps even invisible) way to force them to be synchronous.

This would give people the benefits of Go-like concurrency, without taking away the control and flexibility of async/await.

Heh, parts of this rant have been a long time coming :)

To those that want to combine everything, I am sympathetic, but it's very important to remember that these different implementation strategies have different tradeoffs, and that matters for a language like Rust. Much talk of continuations, threads, etc. misses that. For example, we can use:

The solution is not to try to paper over these distinctions, but to provide ways to be agnostic/polymorphic over which is used. As a first step, it's good to make higher-order functions agnostic in their arguments. For example, a suspended thread in my forked libfringe can be treated as a normal closure if the current continuation is prepended as the first argument. Even better, however, would be to write polymorphic code that itself can be specialized to one of those types. We can go either the monadic or the algebraic effects route for this.

For those that care, I'd argue the root of the problem is that Rust (and C) implicitly manipulate the call stack, unlike a hypothetical typed assembly language. In such a typed assembly language, a contiguous call stack, a segmented call stack, webs of dynamically dispatched closures, finite state machines, etc, can all be on equal footing. Then everything we want here could be left to libraries and not the compiler, and all would be well.

It's infeasible to ditch LLVM and get really fancy dependent types in the at-all-near future (let me not suggest otherwise), so we will need compiler extensions/plugins. But I think with enough cleverness, we can integrate that with language-level abstractions like a monad trait / effect system. I have some hunches on how that might be done, but they're pretty WIP, and I've already said some odd stuff, so I'll call it quits here.

@rpjohnst:

Implementation-wise, it is not hard to convert an async function back to a blocking one with little to no overhead. It may be too late to do anything like this in Rust, but it would be interesting to make everything async, and then let typical blocking calls use some syntactically lightweight (perhaps even invisible) way to force them to be synchronous.

This ability to run async code from synchronous code is somewhat assumed. The split-the-world issue still exists, though, and this doesn't do anything to solve it, unfortunately. The link (shared twice above by different people) titled _What Color is Your Function?_ explains that this correspondence exists. Though it could be a neat language feature if we did end up with async / await keywords.


@jtolds:

As long as there's other threads running, you definitely don't have explicit suspension points. Your argument against yourself here is a good one. Rust is designed to make explicit suspension points unnecessary, which is a good thing.

Using threads implies implicit suspension points. That doesn't mean there _aren't_ some explicit suspension points in there too, but you have to consider both complexities together when you do that, so it's something you likely wouldn't want to do without serious consideration.

Rust isn't designed to make explicit suspension points unnecessary AFAIK, it simply _doesn't have the feature_. A post-1.0 feature. Perhaps there's some future enhancement that would make it unnecessary, but I haven't heard of it yet.

Single-threaded event loops don't make good use of multi-core parallelism.

You are correct that single threaded event loops can't use multi-core parallelism. That's the whole point. And if you're using them, they don't need to, because the CPU isn't your bottleneck, IO is.

Again, a surprising position to take. If the language handles async IO for you, then every library always works for you, and there's no bifurcation.

_bifurcation_. And I learn a new word, thank you. :smiley:

It depends, of course, on what you mean by "handles it for you". Rust's type system can make things memory-safe, but there are other conditions that you have to solve yourself, because Rust _does not_ solve it for you.

Consider the dual lock deadlock. In single-threaded explicit-suspension-points world, handling such a situation is trivial (python-ish pseudo-code, I'm overlooking some problems Rust saves you from to show you one it _doesn't_):

import get_big_thing  # A big IO bound task.

# Both locks are required simultaneously to safely run get_big_thing
mutex1, mutex2 = Mutex(), Mutex()

async def this_way():
    lock1 = mutex1.lock()
    lock2 = mutex2.lock()
    return (await get_big_thing())
    # Locks released at end of scope

async def that_way():
    lock2 = mutex2.lock()
    lock1 = mutex1.lock()
    return (await get_big_thing())
    # Locks released at end of scope

# Some run loop runs these two coroutines

The problem here is a deadlock. If a synchronous version of this code is run on multiple threads to make it parallel, it would be easy for the first lock to be obtained by this_way, then the second lock to be obtained by that_way, and then neither would be able to continue, because both of their locks are already taken.

Using a single thread to run these coroutines (because our bottleneck is IO, not CPU) works really well, because the explicit await points mean that the programmer can easily identify the places where something else might happen first.

It's important to note that the purpose of this is not to make it easy to convert synchronous code to async code. The purpose is to allow the author to _reason about_ async code. The explicit suspension points help the author to do that.

This point is so important that Python chose _not_ to implement the awaitable protocol on the futures module, which uses threads / processes to resolve them. The async / await model only really works well when you're dealing _inside a single thread_, but it can be invaluable in that context.

Imagine the following scenario [snip long demonstration of the long-tail effect]

Yes, the long-tail effect really exists. The longer we wait before providing the choice, the worse it gets, because there's more code that has to be considered retroactively. I expect this to be a big burden in the Python community going forward. They still added the feature.

Unless there's another alternative that can make these deadlocks and race conditions that the borrow checker does not solve easier to reason about, this solution does indeed help programmers reason about what will happen in their program. Unless there is a better solution to the problem, all we would do by waiting is exacerbate the problem.

You are correct that single threaded event loops can't use multi-core parallelism. That's the whole point. And if you're using them, they don't need to, because the CPU isn't your bottleneck, IO is.

Not necessarily. Event loop/promise processing adds overhead, and there's a maximum amount of overhead a single core can handle, which limits the number of (possibly small) IO operations that can be handled. This isn't academic; it's the reason my company switched from Python+Twisted to Go. We were CPU bound when we should have been IO bound. That said, it was Python + Twisted, so I admit the overhead was unnecessarily large.

async def this_way():
    lock1 = mutex1.lock()
    lock2 = mutex2.lock()
    return (await get_big_thing())
    # Locks released at end of scope

In this example, I assume you mean to have lines like lock1 = await mutex1.lock()? Surely you don't mean to use actual mutexes, but instead mean async mutexes in your event loop code. Even with your description, I admit I am baffled how async helps with deadlocks here. Either you mean real mutexes, in which case you have a blocked event loop waiting to happen and you didn't need the mutexes to begin with (because there's no actual parallel multi-core contention), or you mean async mutexes, in which case you still get an application-level deadlock.
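For what it's worth, the application-level deadlock does survive the switch to async mutexes. A runnable sketch using `asyncio.Lock` (names are illustrative, and the timeout stands in for "waits forever"):

```python
import asyncio

m1 = None
m2 = None

async def this_way():
    async with m1:
        await asyncio.sleep(0)   # suspension point: the other coroutine runs
        async with m2:
            pass

async def that_way():
    async with m2:
        await asyncio.sleep(0)
        async with m1:
            pass

async def main():
    global m1, m2
    m1, m2 = asyncio.Lock(), asyncio.Lock()
    try:
        # Each coroutine grabs one lock, suspends, then waits forever
        # for the lock the other is holding: a classic lock-order deadlock.
        await asyncio.wait_for(asyncio.gather(this_way(), that_way()),
                               timeout=0.2)
        return "completed"
    except asyncio.TimeoutError:
        return "deadlocked"

print(asyncio.run(main()))  # deadlocked
```

So explicit suspension points tell you _where_ the interleaving can happen, but they don't make the deadlock itself go away.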

It's important to note that the purpose of this is not to make it easy to convert synchronous code to async code. The purpose is to allow the author to reason about async code. The explicit suspension points help the author to do that.

Not to belabor the point but so far I've only seen more confusion.

Yes, the long-tail effect really exists. The longer we wait before providing the choice, the worse it gets, because there's more code that has to be considered retroactively. I expect this to be a big burden in the Python community going forward. They still added the feature.

The initial release of Twisted was _12 years ago_. I know they're finally adding deferreds and event loops to the standard lib, but there's been prior art and experience with this stuff for a _long_ time.

Unless there's another alternative that can make these deadlocks and race conditions that the borrow checker does not solve easier to reason about...

I know this isn't news or anything, but it is impossible to solve all deadlocks, in a mathematical proof sort of way.

Anyway, I guess I should point out I understand the _reasons_ why Rust decided to not do green threads and why that results in people discussing async IO, but it's just super disappointing. When something is so close to perfect you really care about those last few blemishes.

Also, I want to point out that I totally give you that the async style does allow the programmer to see when she has full control of the thread of execution. That is a benefit. But everything is tradeoffs, and if I had to order my priorities that benefit would be very far down the priority list, certainly underneath being able to use all the third party libraries I need.

Additionally, I have seen subtle synchronization bugs introduced when someone has some ordered logic depending on complete control of the thread of execution, and then some later refactor comes in and turns a synchronous CPU-bound call into an async call, and then the synchronization logic isn't correspondingly updated. I think relying on having full control of the CPU is the wrong kind of thing to rely on, and personally I'd rather not fool people into thinking about it.

But yes, it is all about tradeoffs.

@jtolds

Event loop/promise processing adds overhead, and there's a maximum amount of overhead a single core can handle, which limits the number of (possibly small) IO operations that can be handled. This isn't academic; the reason my company switched from Python+Twisted to Go was this reason. We were CPU bound when we should have been IO bound.

Good point. One thought that I had was possibly using multiple threads to run multiple loops, which can bring benefits of both if done with care. But yes, there are definitely good reasons to use threads (or perhaps processes), even with IO-heavy workloads and explicit async loops, and extra care has to be taken in those cases.

In this example, I assume you mean to have lines like lock1 = await mutex1.lock() ?

Facepalm. While I did not mistype, you are certainly correct that these would have to be async mutexes. I was trying to be clever, but ended up demonstrating most aptly how splitting the world can be confusing.

I know this isn't news or anything, but it is impossible to solve all deadlocks, in a mathematical proof sort of way.

Also, I want to point out that I totally give you that the async style does allow the programmer to see when she has full control of the thread of execution. That is a benefit.

But yes, it is all about tradeoffs.

True, we're discussing which trade-offs are best to aid the programmer in reasoning about their code. Explicit suspension points are one way to do that, and they have been effective in doing so in several languages already. However, that doesn't mean that they are right for Rust, and that the trade-off would be worth it in Rust code.

As I aptly demonstrated, explicit suspension points, while helpful, still have to manage a great deal of synchronization complexity. Having those be explicitly defined doesn't change that complexity, but may perhaps limit the areas that you have to look for it.

In a language like Rust, with its emphasis on type safety and safe abstractions, the added benefits of making the suspension points explicit may not be worth splitting the world for, even though it often is in languages where the programmer is left more to their own devices (such as Python). _Thank you_ for helping me understand that.

Anyway, I guess I should point out I understand the reasons why Rust decided to not do green threads and why that results in people discussing async IO, but it's just super disappointing.

I think that I understand the reasons too, but I'd be interested in reading more about that if you have some links to point to.

When something is so close to perfect you really care about those last few blemishes.

Indeed. Thank you for that care, and for the experience that you bring to the conversation. I've found it most valuable.

Back to the original topic of generators, is there any work done figuring out how the ownership rules might work? Generators are a great tool for making iterators easily, and I'd love to push the work forward.

When something is so close to perfect you really care about those last few blemishes.

Giving the programmer explicit control over allocation and synchronization in a language like Rust is not a blemish, even if it "splits the world." Rust already has several splits that could be unified if we were willing to take away that control.

First off, I just want to pause and say that @ryanhiebert is truly setting the example of how to have a discussion on the internet. What a gracious and kind guy! Thank you Ryan, talking has truly been a pleasure.

I think that I understand the reasons too, but I'd be interested in reading more about that if you have some links to point to.

I actually thought I understood better than I did until Steve Klabnik pointed out they tried a pluggable fiber system on this thread: https://lobste.rs/s/cvdbal/asynchronous_io_in_rust/comments/mpyfhf#c_mpyfhf
He suggested I read https://github.com/rust-lang/rfcs/blob/master/text/0230-remove-runtime.md, which I haven't done yet, but I suspect will be the best place to go check out.

Yup, you beat me to it!

I realise this might be little more than a "me too"/+1 comment, but the type of issue addressed by this RFC is something I deal with frequently. My main interest in Rust is as a replacement for C & C++ when writing kernel code. So far, I've not really made it beyond hello world with Rust in a kernel, but I'm slowly getting there with making it do real things. (Tangent: the main problem I seem to keep running into is that lots of useful functionality is in large libraries which have dependencies I can't fulfil in the parts of the library I'm not interested in.)

OS kernel code tends to be callback-heavy; C and pre-lambda C++ style callbacks are extra confusing, but even lambda/closure syntax only moves the code around a little bit, it doesn't make the control flow intent any clearer. (e.g. what would be a loop in blocking I/O looks nothing like it in callback-style async) Although thread stacks are tiny compared to userspace, they use non-pageable/wired memory, so idle threads consume more than just trivial resources. Meanwhile, simply creating a new thread isn't possible just anywhere. Lots of areas of code have strict rules about not blocking, and any (reliable) memory allocation might block, including a thread allocation. So you usually end up pre-allocating a context struct that tracks all the necessary state for an operation across callbacks, and pass that around. Compiler support for extracting that state struct and turning suspension points in (much clearer) blocking-style code into the appropriate callback function pointers would be amazing, assuming the memory allocation semantics can be controlled.

I suspect embedded developers might have similar requirements.

In summary, I don't care too much about the syntax of such a feature; I've been using C & C++ for the last 15 years, so it can't really get any worse. What I do care about, and what I hope might be taken into account when designing the feature, is that it:

  1. Doesn't mess with the stack.
  2. Doesn't depend on a large amount of library baggage to use productively.
  3. Gives the programmer control over memory allocation.
  4. Can be turned into C-compatible callback function pointers in some way where necessary.

(I've got an I/O-heavy kernel extension side project that I'm hoping to port to Rust & open source when it's done, which I'd be happy to use/provide as a testbed/demo for this sort of thing.)

Relevant article from Facebook about a template they made to simplify the use of async to hand off work that would normally block the thread. https://code.facebook.com/posts/1661982097368498/futures-for-c-11-at-facebook/

What's the state of this?

It's a feature that would be nice to have. There's no specific plan or RFC about this yet.

I've started prototyping this out in a plugin in https://github.com/erickt/stateful. It currently somewhat supports conditionals and loops, but not yet mutable variables :)

@erickt: Looks pretty cool, I'm glad to see that there's work happening on this. I'll be very interested to see how it evolves, and if there's any discussions on that repo, I'm following it, and am interested in participating. I'm not sure if I'll find time to help more than discussions, but I'll certainly participate in those.

Very interesting libraries by @freebroccolo ! And links by others. I'm sure folks here know about these, but..

There are a variety of interesting libraries employing mio, including mioco, which uses coroutines, and rotor, which makes specifying explicit state machines nicer.

In the abstract, I'm partial to the rotor approach of trying to make it pleasant to explicitly specify the state machine. It should be easier to debug, write tests for, quickcheck, etc., although obviously depends upon many factors. It'd be really cool if this stuff eventually coalesced into a sensible way to intermix explicit state machines and more implicit ones like coroutines.

@pmj most of my programming experience, especially with asynchronous patterns and io, has been web programming in dynamic languages like javascript, python and ruby; and I never realized that systems programming is also callback heavy. So thank you for joining the discussion.

  1. Doesn't mess with the stack.
  2. Gives the programmer control over memory allocation.

Can you please explain the above two points in more detail? I'm particularly curious to know what you mean by "messing with the stack".

@mikeyhew Well, systems programming tends to involve a fair amount of I/O.

At the OS kernel-to-peripheral device interface layer, you tend to have memory-mapped device memory & registers, DMA, and interrupts as your main communication methods. Interrupts are inherently asynchronous, and are typically routed to device drivers as callbacks (primary or secondary interrupt handlers), although there's normally not a clear 1:1 request:response ratio. Once you get to higher level interfaces, such as block device I/O, it's usually quite similar. Modern kernels also include a considerable amount of network protocol implementations, at both the relatively low level (Ethernet, IP, TCP) and the high level end of the spectrum (network-based file systems and block devices spring to mind).

  1. Doesn't mess with the stack.
  2. Gives the programmer control over memory allocation.

Can you please explain the above two points in more detail? I'm particularly curious to know what you mean by "messing with the stack".

  1. Many common OS kernels use the stack pointer register ($esp/$rsp on x86/64) to identify the currently running thread/task/process. They do this by defining a thread layout convention, e.g. each thread gets 16KiB, aligned to 16KiB virtual memory addresses, with a struct containing various information for identification, scheduling, etc. at the end or beginning of that block, and the rest of the memory is used as the kernel-space stack for that thread. So if the currently running thread needs to be identified at any point, you just round the stack pointer up or down to the nearest 16KiB boundary, and look for the thread struct there. This also means that if you mess around with the stack pointer (setting it to outside those 16KiB for example) this will cause a whole load of things to go wrong.
  2. Control over memory allocation - I'm specifically talking about heap memory here. Stack memory is in short supply anyway, so that's not much use. Kernel heap allocations are usually pooled. But if the pool for the requested size runs out, the allocator will need to find a spare physical memory page, map it into the kernel virtual address space, and return an address within that. Spare physical pages will also run out of course, so often this means a trip to the pager, which steals a physical page from some process's pageable memory to fulfil the request. This is fine, but requires a lot of coordination and synchronisation, especially if multiple CPUs are involved. (Which these days is the norm, not the exception.) As a result, there are situations where allocations that invoke the pager are dangerous as they could cause a system deadlock.

For example, let's take a block device driver. The job of this driver is to accept I/O requests (disk reads, disk writes, etc.) from whoever the client is (file systems, etc.), pass them to the device in question, and when the device is done processing the request, notify the client that it has completed. If as part of this process the driver were to allocate heap memory, this could cause deadlock. Why? Most modern OSes will page memory to disk (swap file or volume). Let's say the system is low on memory, so the pager needs to save some memory pages to disk. It requests a write from the disk driver. The disk driver maintains a list of requests, and needs to add this request, so it allocates some heap memory for it. Turns out the pool is empty, so let's look for a spare page. Nope, system is low on memory, so it invokes the pager to evict some pages. The pager decides it needs to write some pages to disk, which means generating a disk write request. So it notifies the block device driver… etc. Boom!

So you have to write this type of code in a way that doesn't trigger paging. Which is fine, as long as your programming language makes it clear what code will or will not generate allocations. Some higher level language features will intrinsically require memory - which is absolutely fine, as long as the code can control the allocations. Many higher level languages will allocate heap memory implicitly, which rules them out entirely, but if allocation and use times can be split, there's no reason to not offer the language feature. So for example, if the device driver can determine how much memory will be required for each request, it can pre-allocate memory for a fixed number of requests.

I hope that explanation makes sense and gives you more of an idea why such an async feature would be interesting to system code, but what the extra requirements are to make it useful in that situation. In particular, if there's no way to tell how big an async continuation context will be until you actually try to create one, then that's too late for preallocating that memory.

I think one requirement here is that the continuation context must be Sized – otherwise it cannot be preallocated.

As far as generating C-style function pointers: the best I can see is for the compiler to either generate an iterator (that's what C# does) or to generate something like an iterator but heterogeneously typed. The user would need to convert it into C function pointers manually.

@pmj Can you give an example of code that you would like to be able to write?

What is the status of this? Also, should we close this in favor of #1081?

If not, I might be interested in working on it. I've got ideas, but this doesn't seem too hard. That said, I haven't read the Rust compiler at all or even done compiler work, so my viewpoint on how to do it is as a pure source transformation in a Rust-like language that allows unsafe gotos.

Rust likes iterators, as do I. But writing your own iterator is currently kind of a mess.
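To illustrate the mess: even a trivial sequence forces you to hand-manage the state that a yield-based generator would capture automatically. A minimal sketch in today's Rust (all names illustrative):

```rust
// A hand-written Fibonacci iterator. The fields `curr` and `next` are
// exactly the state a hypothetical `yield`-based generator would capture
// implicitly; here they must be threaded through by hand.
struct Fib {
    curr: u64,
    next: u64,
}

impl Iterator for Fib {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        let out = self.curr;
        self.curr = self.next;
        self.next = out + self.next;
        Some(out)
    }
}

fn fib() -> Fib {
    Fib { curr: 0, next: 1 }
}
```

With a yield construct, the same iterator would be a few lines of straight-line code, with the state machine generated by the compiler.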

@camlorn: stateful is coming along quite nicely with the help of @NSil. We are getting pretty close to a 0.1 release, we just need to implement a few more things on dealing with temporaries across yield points. I'd love any help getting it ready :)

@camlorn FWIW, an official implementation would have to be done on MIR to be optimal - where it should be easier, too, as the MIR is already a CFG.

@eddyb: I know you already know this, but for everyone else: it should be pretty straightforward to port Stateful's state machine transformations to MIR, since Stateful is heavily inspired by MIR. The tricky thing is really waiting on impl trait in order to implement this without boxing the iterator type.

@erickt There are two reasons why a port is unnecessary/undesirable:

  • the "state machine transform" on the MIR is _just_ changing Yield(next) to Return and pointing a branch of the entry point to next, no AST is involved _at any point_
  • generating the AdtDef for the state enum is tricky - it depends on when you do the state machine transform, if any optimizations are involved, then the layout of the type may depend on compiler options and any introspection of it would introduce hidden dependencies and compatibility hazards; if you do it too early, you risk being suboptimal (@alexcrichton ran into some perf issues with nested enums in futures-rs)

How are recursive generators handled?

In languages that aren't Rust, the generator can make another instance of itself and then yield from it using a for loop. I think lifetimes can be used to prevent it, but maybe it's something we want to allow.

Python has a yield from, but I think they added it for I/O, so perhaps it isn't necessary here. It might also be a good use for the reserved become keyword (the implication being that you "become" another generator), though become could also be used for tail calls in the future, so possibly not, given the double meaning.

@erickt
I'll take a look. Not sure my Rust is good enough to help, but it might be.

@camlorn Such recursion would become a type recursion and require boxing to avoid the state having an "infinite" size.

@eddyb
O, so it would.

@camlorn: stateful doesn't have anything like yield from yet. We could implement some basic sugar where yield_from!(some_generator()) expands to for item in some_generator() { yield_!(item) }. This adds O(k) jumps per item to sub-iterators, where k is the depth of the generator stack. That should be the same cost as implementing these generators by hand, where each generator is opaque.

I bet a sufficiently smart optimizer could inline these generators to get you down to O(1) jumps.

Considering that Rust is built on LLVM, have you been following the coroutine discussion on LLVM-Dev?

https://reviews.llvm.org/rL276513

@erickt
My question, sufficiently answered by @eddyb, was what happens in the case of that stack being infinite. You end up with iterators inside their own iterator structs.

Sorry to double comment, but this page might be more useful as to LLVM 4.0 coroutines: http://llvm.org/docs/Coroutines.html

@l0calh05t / @camlorn: Yeah, I'm somewhat aware of LLVM's proposal. As best I can tell, under the covers they're doing effectively the same transformation that stateful does: compiling functions down into state machines with a reified stack to store local variables. One thing that looks slightly dodgy at first glance is that they seem to require that the coroutine frame might need to be heap-allocated, which isn't required for stateful (once we get impl trait).

Really it comes down to where we'd get the best advantage. If LLVM implements some crazy optimizations that we don't want to duplicate, it might make sense to use LLVM's implementation. However, it might be faster, or integrate better with our type system and borrow checker, if we do this transformation on the Rust side. I'm not sure what'd be best.

One thing that's slightly dodgy from my first glance is that they seem to require that the coroutine frame might need to be allocated

The proposal includes a "heap allocation elision" optimization pass (CoroElide) that stores the coroutine in its parent stack frame when possible. So if you only create such coroutines, you never get heap allocations. It is also planned to allow this kind of optimization across ABI boundaries.

If anything, it could be worth chiming in there to clarify a bit more under which conditions heap allocation might be necessary.

@gnzlbg That only works for OS stuff if you can guarantee that the heap allocation will never happen (and get a compile-time error when it cannot be elided).

Do we want to be able to put DSTs on a coroutine's stack frame? (I want to) If yes, can that be done without heap allocation in general?

Do we want to store coroutine stack frames in e.g. a Vec? (I want to) If yes, there must at least be a way to opt into heap allocation.

So while I agree it is useful to be able to ensure that in some cases coroutines do not heap-allocate, I want to be able to use them also in those situations in which you might need to heap-allocate.


@gnzlbg
You can always use Box for a DST. With the state machine transformation you can always box the generator and stick it in a Vec that way, should it be dynamically sized.

Boxing the generator is as efficient as having the generator be a pointer to a stack frame, as far as I'm aware. Boxing the DSTs shouldn't be that much overhead either. Can you provide an example of code that puts a DST on the stack that you think might be both important and problematic?

To be honest, I find the points about avoiding allocation to be more convincing than otherwise; if you don't care about such things, use a higher level language with GC and whatnot and call it a day.

If being able to allocate dsts next to each other for cache friendliness is important, I can think of ways to write libraries that try to do this, as well. They are slightly less convenient than it just working, but maintain the same sorts of advantages.
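For the Vec-of-generators case, the boxing approach already works today with ordinary iterators, which are roughly what a lowered generator would be. A sketch (illustrative only):

```rust
// Illustrative sketch: storing heterogeneously typed iterator "generators"
// in a Vec by boxing each one behind a trait object, as described above.
fn boxed_gens() -> Vec<Box<dyn Iterator<Item = i32>>> {
    vec![
        Box::new(0..3),                  // yields 0, 1, 2
        Box::new((0..3).map(|x| x * 2)), // a differently-typed iterator
    ]
}
```

The trade-off being discussed is exactly this Box: one heap allocation per stored generator.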

In kernel code, or hard real-time code, you might not be allowed to allocate


You can always use Box for a dst.
[...]
Boxing the dsts shouldn't be that much of an overhead either.

If I need to Box the generator I do not need to box the DST as well. Having to box _both_ is not a zero-cost abstraction. I think this is bad.

If I cannot use DSTs inside a coroutine, you are essentially splitting the language into two parts, the part that can be used within a coroutine and the part that cannot be used within a coroutine. I think this is also bad.

To be honest, I find the points about avoiding allocation to be more convincing than otherwise

I think that avoiding allocation is very important but in case somebody has a coroutine that must allocate, that should be a zero-cost abstraction as well. I think it would be bad to have to come up with a completely different coroutine system to address this use case, and hence why I am raising this issue.


@DemiMarie

In kernel code, or hard real-time code, you might not be allowed to allocate

Then don't use language features that allocate memory (like box), libraries that allocate memory, or coroutines _that allocate memory_.

I am sure we can find ways of disabling coroutines that allocate memory from kernel code such that coroutines that do not allocate can be used safely.

@DemiMarie @gnzlbg
The point about DSTs may be moot anyway, because according to the book you can't have one in a variable. I don't know for sure, but I can't figure out how you'd get one on the stack in the first place, and the general consensus online is that it's impossible, or at least very difficult, in safe Rust. DSTs don't make sense as stack variables because they don't have compile-time sizes, and I'm basically certain you can't return them from functions directly. Can you show code that works now that you think might not work in a generator?

The thing is that someone has to know the size eventually. Even with Box, the thing that goes into Box::new is always sized. Assuming there is no counterexample, therefore, the generator has the same limitations as the stack anyway, and you can always convert it to a struct with a state field and a known size.
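The "struct with a state field" conversion can be written out by hand. A sketch of what a generator body like `yield 1; yield 2;` might lower to, with an enum recording the resume point (all names illustrative):

```rust
// Each yield point becomes a variant; `next` advances the state machine
// and returns the value that would have been yielded.
enum State {
    Start,
    Yielded1,
    Done,
}

struct TwoValues {
    state: State,
}

impl Iterator for TwoValues {
    type Item = i32;

    fn next(&mut self) -> Option<i32> {
        match self.state {
            State::Start => {
                self.state = State::Yielded1;
                Some(1) // corresponds to `yield 1`
            }
            State::Yielded1 => {
                self.state = State::Done;
                Some(2) // corresponds to `yield 2`
            }
            State::Done => None,
        }
    }
}

fn two_values() -> TwoValues {
    TwoValues { state: State::Start }
}
```

The size of TwoValues is known at compile time, which is the point: the generator is just a struct.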

I'm not understanding why you can't know the size of a generator/coroutine at compile time? Assuming it's non-recursive, of course. If it makes function calls, those can just live on the normal call stack since they'll only be invoked when control is inside the generator. If it invokes other generators/coroutines, those would just live in the static memory of the coroutine itself. Circular dependencies would be detected the same way they're detected in structs. The size of a coroutine would just be pessimistically the minimum size needed to hold all the variables in the local control flow stack, excluding function calls.

My understanding is it'd look something like this:

// Pseudocode
gen fn my_generator(flag: bool) -> i32 {
    let val1: i32 = 10;
    let val2: i32 = 20;
    if flag {
        let val3: i64 = 30;
        yield val1; yield val2; yield val3 as i32;
    } else {
        let val3: i16 = 30;
        yield val1; yield val2; yield val3 as i32;
    }
    // The stack space for this function call isn't part of my_generator's
    // storage, because it can only be invoked when control flow is here.
    let val4 = function_returning_i32();
    yield val4;

    // The storage required for this generator IS part of the storage, though.
    let mut sub_gen = my_sub_generator();
    for value in sub_gen {
        yield value;
    }
}

In this example, the storage required for this generator (before compiler optimizations, and ignoring alignments) would be:

i32 + i32 + max(i64, i16) + i32 + sub_gen + value

Plus a small, fixed amount of extra space to store where the last yield was so we can resume iteration.

So I don't understand the argument that we can't know the size of a generator at compile time. It's true that the size of a generator might be _large_ at compile time, to account for deeply-nested but non-recursive generators, but that's no different than the deeply nested struct that would be required to implement the same pattern.

I've elided ownership & lifetime concerns here, but I don't see any obvious reasons in principle that it would be different than a lambda or a struct.
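The max(i64, i16) term works the same way as enum layout: branch-local variables that are never live at the same time can share storage, and the combined size is the largest variant plus a discriminant. A quick sketch to check that intuition (the type names are illustrative):

```rust
use std::mem::size_of;

// The two branch-local `val3` variables from the example can share
// storage, like enum variants: the combined size is governed by the
// larger payload (i64), plus space for a tag.
enum Val3 {
    IfBranch(i64),
    ElseBranch(i16),
}

fn val3_storage() -> (usize, usize, usize) {
    (size_of::<Val3>(), size_of::<i64>(), size_of::<i16>())
}
```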

@Lucretiel
That's it, yes. That's why I'm kind of confused as well. The only thing which can break it is if you actually can get a DST on the stack somehow. Admittedly there's alloca (the libc function, not the LLVM instruction), but other than that the size is known and the transformation is safe.

Now, the really fun part: work is going on to eliminate unused variables. I think your example might actually only have to include one of val1 through val3. At the very least, val4 shouldn't ever exist in the storage. So the best case is even better than that, potentially by a whole lot in some cases.

I'm starting to think I should throw an RFC out which does what we can do and leaves room for streaming in future whenever we get HKT.

I'm also looking at LLVM, and probably it's best to do this as LLVM coroutines if we can. There's a pretty long todo list, however, including a few things which might be deal breakers (specifically alignment being ignored by the allocation/freeing intrinsics). The documentation isn't currently very good, but it looks like the pointer for the dynamically allocated frame is supplied by you, not just made by LLVM. So probably it could be worked out such that Rust generates a struct with a [u8; sizeof(coroutine's frame)] in it for that case. Alternatively, maybe Rust would need a new kind of type internally to make this work.

Actually I take that back. I bet you can't move the frame once you allocate it, which would sadly make this heap only in some cases. The only saving grace is that you can tell beforehand.

I bet you can't move the frame once you allocate it

Sure, because of dangling pointers, but that's how everything in Rust is, right? Like, that's the whole point of the ownership / borrow / mutable borrow system. It's no different than being unable to move a stack-allocated struct once you've allocated it on the stack; you can move it by bitwise copy, but only when there are no living borrows of that struct in existence. The point is that a generator "frame" is essentially the same thing as a struct, in terms of ownership, lifetime, and sizing constraints.

The only problem I can see is that there might be some interior references, like this:

gen fn my_gen() -> i32 {
    let i = 10;
    let ri = &i;
    ...
}

This would make it impossible to move the struct in a way the borrow checker will accept (again, before compiler optimizations or a potentially more relaxed borrow checker).

But it's still essentially syntax sugar over a struct. The challenge will be communicating borrow checker violations to the user in a useful way, but I don't see any reasons in principle that it's different than a struct frame.

Yikes. Sounds like any borrows of stack-allocated variables at a yield site would cause the transformation to fail.

Obviously, this would be a major usability problem.


I'm betting that you can't even move LLVM coroutines by bitwise copy safely. The implementation assumes that you allocate the pointer for it in some cases, but doesn't seem to let you update the pointer if it changes. So if you use alloca for it, directly or indirectly, the coroutine can only live as long as the stack frame that made it, even in cases where the struct would otherwise live longer. That means you maybe can't return a generator from a function under Rust's model if you use the LLVM coroutine intrinsics, because doing so invalidates a pointer you can't set. Nor could you store one in a struct that gets passed around, for that matter. This needs further investigation.

This is fixable if you just allocate the frame on the heap, but having Rust language features allocate implicitly is way against the language's philosophy, imho.

In terms of references, what doesn't work is specifically borrowing a variable on the "stack" of the generator and then yielding the reference. You can borrow them all you want, you just can't yield the reference. This is the streaming iterator problem, and the latest RFC is #1598.

As for borrow checker errors, you do that first as well. If the coroutine tries to yield anything with a lifetime less than the return type of the coroutine, you error. And then you make the lifetime elision rules apply like it's a function. I.e. you're allowed to yield references to/from any argument to the generator, but not from inside the generator.

In general, you do any safety checks before lowering the generator to a struct and iterator implementation.

We might need to wait for #1598, though, if only because hitting the non-streaming case and the streaming case at the same time might be necessary. I sort of favor dedicated syntax for a streaming generator: we can implicitly detect that it needs to be streaming but the difference would be one tiny change in the code, anywhere at all. This would lead to cryptic errors that are hard to fix.

offtopic: The cppcon2016 video about LLVM coroutines is up: https://www.youtube.com/watch?v=8C8NnE1Dg4A

I want to breathe some fresh air into this RFC with my concept for compiling a yielding function into a state machine (as an iterator). Gist. It's not very good (it was quickly hacked together), but it actually compiles, so you can play around with it!

Explanation: The commented-out function foo at the top is our yielding function / coroutine. All the other code below, is what could be generated by the compiler. It's similar to what Roslyn (C#) would do, I think. The return type of the coroutine becomes the data type of the values which the coroutine yields.
FooStack is a container for all the stack variables declared in the coroutine.
YieldsFnFoo is the actual iterator generated by the compiler.
The main function shows how the coroutine can be called.


I didn't follow this RFC much, but it appears that the current discussion is mostly about LLVM's built in coroutine support. I guess my concept could be considered an opposite to LLVM coroutines, since my concept could be implemented at a very high compiler level, possibly even procedural macros.

Feedback very welcome!

Experimental RFC #2033 has been accepted and implemented in nightly.

Closing in favor of rust-lang/rust#43122 (https://github.com/rust-lang/rust/issues/43122).

+1
