The TPL has a few performance features that cause reentrancy. Here are the ones I know of:
- `TaskCompletionSource.SetXxx` can cause continuations to be executed synchronously. AFAIK this can happen both with and without `ExecuteSynchronously`. Starting with .NET 4.6 there is `RunContinuationsAsynchronously` to disable this behavior.
- `Task.Wait()` and `Task.Result` can execute an unstarted CPU-based task inline. I do not know of any good workaround from the perspective of the caller of `Wait`/`Result`. Probably, static `Task.Wait(Task)` and related methods have the same problem.
- `CancellationTokenSource.Cancel()` can execute registered notifications inline.
- `TaskAwaiter` uses `TaskContinuationOptions.ExecuteSynchronously` if there is no synchronization context set. This forces reentrancy for whatever code completes the task being awaited.

In all of these situations arbitrary code can run as part of invoking the respective TPL method. You might not control that code. There might be no way for you to know what code that is because you are writing a library. In big apps it might be architecturally impossible or unwanted to know what code might run.
This is arbitrary, non-deterministic reentrancy in multi-threaded code. This is very dangerous behavior.
Concrete problems (examples):
- `TaskCompletionSource.SetResult()` under a lock: now arbitrary code might run under your lock as well. This can deadlock (see the sketch below).
- `lock (x) { lock (x) { } }` does not block/deadlock. Inlining code onto the current thread opens the gates for any locks held by this thread.
- Ambient, thread-affine state (`HttpContext.Current` null or not null, ...) is now exposed to arbitrary code. It is made available for mutation as well.

This is very unsafe behavior by default. The number of questions that come up about it on Stack Overflow is noticeable. The bugs are insidious (racy). Please disable _all_ of these behaviors by default. Make unsafe performance features opt in. I doubt there will be compatibility impact from doing this because all of the behaviors above are, to my knowledge, non-deterministic.
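A minimal sketch of the first problem (illustrative names only; assumes a console app with no `SynchronizationContext`, where the await continuation is typically inlined onto the thread that calls `SetResult`):

```C#
using System;
using System.Threading;
using System.Threading.Tasks;

class SetResultUnderLockSketch
{
    private static readonly object Gate = new object();
    private static readonly TaskCompletionSource<int> Tcs = new TaskCompletionSource<int>();

    static void Main()
    {
        // Register an await continuation before the task is completed.
        Task waiter = WaiterAsync();

        lock (Gate)
        {
            // With default options and no SynchronizationContext, the continuation in
            // WaiterAsync typically runs right here, on this thread, while Gate is held.
            // If that (possibly third-party) code blocks on work that needs Gate on
            // another thread, the process deadlocks.
            Tcs.SetResult(42);
            Console.WriteLine($"SetResult returned on thread {Environment.CurrentManagedThreadId}");
        }

        waiter.Wait();
    }

    private static async Task WaiterAsync()
    {
        int value = await Tcs.Task;
        Console.WriteLine($"Continuation ran on thread {Environment.CurrentManagedThreadId}, value {value}");
    }
}
```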
Completion callbacks should always be queued to the thread pool (or to any scheduler if specified) by default. The TPL should be safe by default. Reentrancy should be opt-in behavior.
This is a problem both for monolithic apps as well as for library authors. See this high-profile case that led to the inclusion of `RunContinuationsAsynchronously` in .NET 4.6: http://stackoverflow.com/questions/22579206/how-can-i-prevent-synchronous-continuations-on-a-task
Thanks for the write-up. While I appreciate the thought that went into this, we have no plans to change the default behaviors here. Doing so would be a significant breaking change, with lots of code actually depending on these behaviors for functionality and/or performance, and much of this behavior was thoughtfully added for various reasons. I understand the choices have caused some problems, and we've tried to address those where possible.
OK. Then it should at least be possible to disable all these behaviors case by case. I believe there is at least one such place missing (`TaskAwaiter`).
Just noticed Roslyn commit 85c4c37e9d79ce04cb3b73cad31926d6c3bef1f5: Continuation inlining caused a deadlock in some AsyncQueue class.
Really, a lot of people are bitten by the issues outlined above. Nondeterministic reentrancy is poison.
Quoting the commit message:
> Fix: Fix the AsyncQueue.EnqueueCore method to invoke SetResult on the waiter task completion source on a separate task. This ensures that no external code executes while attempting to Enqueue a compilation event, and hence cannot deadlock.
Searching the Roslyn repo for "inline" turns up commit 150de4281f394feecd1f5bfe66bae0f30c559dfc which is another instance of this.
Apparently, WCF has a bug caused by TPL reentrancy.
In the following code the await continuation is inlined causing reentrancy in WCF:
```C#
var s = new MyWcfClient();
await s.LockUserAsync(id); // Continuation inlined
s.IsUserLocked(id);        // Deadlocks inside of WCF
```
This is a WCF bug (not a TPL or user bug) but it shows how insidious reentrancy is.
+1, this behavior is very unintuitive and "releases Zalgo".
Looks like we need an API proposal here.
The default should not change, as @stephentoub said above, due to compat.
An interesting case is SignalR. They have wrapped `TaskCompletionSource` to always complete the task on the thread pool. This is, I assume, to avoid reentrancy issues. `Dispatch` just calls `Task.Run`. And they have a bug there because `SetUnwrappedException` forgets to call `Dispatch`. This bug will very rarely show itself, with unpredictable symptoms, when random code is pulled into unrelated code paths unexpectedly.
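A sketch of that wrapping pattern (illustrative only, not SignalR's actual code; the type and member names here are made up): every completion is dispatched to the thread pool so continuations never run on the completer's stack.

```C#
using System;
using System.Threading.Tasks;

// Illustrative only: a TaskCompletionSource wrapper that always completes the
// underlying task from the thread pool, so no continuation is inlined on the caller.
public sealed class DispatchingTaskCompletionSource<T>
{
    private readonly TaskCompletionSource<T> _tcs = new TaskCompletionSource<T>();

    public Task<T> Task => _tcs.Task;

    public void TrySetResult(T result) => Dispatch(() => _tcs.TrySetResult(result));

    public void TrySetException(Exception exception) => Dispatch(() => _tcs.TrySetException(exception));

    public void TrySetCanceled() => Dispatch(() => _tcs.TrySetCanceled());

    // "Dispatch just calls Task.Run": the completion delegate runs on a pool thread.
    private static void Dispatch(Action complete) =>
        System.Threading.Tasks.Task.Run(complete); // fully qualified because `Task` above is a property
}
```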
I see so many projects hacking around this issue, presumably after the bugs started coming in.
I, personally, wouldn't want to see this changed. Writing correct unoptimized code is pretty easy in C# and that's what most people should do.
When you start using the TPL primitives (e.g. `TaskCompletionSource`), that's when you care about performance the most. In those kinds of scenarios I wouldn't want to sacrifice performance for safety.
Most people shouldn't be using `TaskCompletionSource` to begin with, but those that do should know about `RunContinuationsAsynchronously` (and they will when they encounter their first deadlock :)).
@i3arnon completing an async method calls TCS.SetResult. Everyone has this issue. Please also see the long list of disastrous consequences in the opening post!
You don't have to sacrifice performance at all. There should be opt-in ways to get inlining. Most places already accept a flags enum so that's easy to do. I want opt-in instead of opt-out.
> (and they will when they encounter their first deadlock :)
When my production app freezes (for all requests, not just the current one) and causes an outage at 4AM I will make the following smiley: :(
This is not the usual ASP.NET sync context deadlock that's deterministic. This is random code running and it's possibly different code each time.
This is a good overview of issues. Tagging @AArnott as well.
An additional thought: `TaskCompletionSource` requires you to pass `TaskCreationOptions` during construction, but you might call `SetResult()` on that particular `TaskCompletionSource` from both synchronized-continuation-safe contexts (no locks held on the thread pool) and unsafe ones (inside a `lock`). Maybe the way awaiters work requires the continuation style to be known when the awaiter starts awaiting, so maybe this is impossible, but it might be nice if you could decide per `SetCanceled()`/`SetException()`/`SetResult()` call whether the continuation should be synchronous or not. This would have the effect of making it more obvious where in your code the synchronous continuations would be launched from.
If synchronous continuations were made opt-in and they could be opted in at the `SetResult()` call, then this code would more obviously look wrong (with the current API, you have to find the initialization of `completionSource` to figure out if it is correct or not):
```C#
lock (someResourceLock)
{
    completionSource.SetResult(someResource.GetValue(), runContinuationsSynchronously: true);
}
```
Thanks, @sharwell.
@GSPP, I can sympathize with how you're feeling. I felt similarly when I first discovered this behavior. But as @stephentoub said, it would be a breaking change to change default behavior at this point and there are significant performance wins from this pattern, which yes does sometimes bite folks but most of the time it doesn't. And when it does bite, there are always ways to fix it, even without additional APIs from .NET, as I'll demonstrate:
Many APIs you call that inline continuations at an undesirable location can be wrapped with Task.Run:
```C#
Task.Run(() => tcs.SetResult(x));
```
There was one time where this was not acceptable to me because I had a requirement that `tcs.Task` had itself completed before my method returned, yet I couldn't afford to inline continuations anywhere in my method. The new `RunContinuationsAsynchronously` flag made this much better.
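For reference, a small sketch of that flag in use:

```C#
using System;
using System.Threading.Tasks;

class RunContinuationsAsynchronouslySketch
{
    static async Task Main()
    {
        var tcs = new TaskCompletionSource<int>(TaskCreationOptions.RunContinuationsAsynchronously);
        Task<int> consumer = ConsumeAsync(tcs.Task);

        // The continuation inside ConsumeAsync is queued to the thread pool rather than
        // inlined here, yet tcs.Task is already completed by the time SetResult returns.
        tcs.SetResult(42);
        Console.WriteLine(tcs.Task.IsCompleted); // True

        Console.WriteLine(await consumer);       // 42
    }

    static async Task<int> ConsumeAsync(Task<int> task) => await task;
}
```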
Moving `Task.Wait()` into a `Task.Run()` is not good:
```C#
Task.Run(() => task.Wait()).Wait(); // <-- DON'T DO THIS
```
This is bad because if `task` indeed has continuations to be inlined, you'll end up waiting for those continuations to execute. They won't run on your callstack and within your lock, but you could still deadlock. Instead, I wrote this extension method called `WaitWithoutInlining()` which does it correctly. (EDIT: in looking at the code to prepare this post, I noticed a bug. I guess I couldn't get it right the first time myself.)
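One way such an extension method can be written (a sketch of the general technique, not necessarily the code referenced above): block on a fresh continuation of the task instead of on the task itself, so the wait never offers the task for inline execution on the current thread.

```C#
using System.Threading;
using System.Threading.Tasks;

internal static class WaitWithoutInliningExtensions
{
    // Sketch: wait for `task` to complete without ever executing it (or its
    // continuations) inline on the waiting thread.
    internal static void WaitWithoutInlining(this Task task)
    {
        if (!task.IsCompleted)
        {
            // Block on a continuation rather than on `task` itself. The continuation
            // isn't queued anywhere yet, so Wait() cannot pull work onto this thread;
            // ExecuteSynchronously just makes the (empty) continuation cheap to run.
            task.ContinueWith(
                _ => { },
                CancellationToken.None,
                TaskContinuationOptions.ExecuteSynchronously,
                TaskScheduler.Default).Wait();
        }

        // Observe the result so exceptions/cancellation from `task` are rethrown.
        task.GetAwaiter().GetResult();
    }
}
```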
You mentioned how `TaskAwaiter` inlines continuations so long as there is no `SynchronizationContext`, and that you want a way to opt out of that behavior. You can, at the site you await at, and we use it in a few places ourselves, with just one line of code:
```C#
await task;
await Task.Yield(); // avoid being inlined by the completer of `task`
MoreCodeThatShouldNotBeInlined();
```
If you'd rather control it from `task` itself, such that folks who await on it do not inline continuations, you could write your own awaitable type, which isn't too hard, but in all cases so far I simply go with the above step.
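For completeness, here is a sketch of what such an awaitable wrapper might look like (illustrative only; the names are made up): its awaiter registers the continuation via `ContinueWith` without `ExecuteSynchronously`, so the completer of the task never runs it inline.

```C#
using System;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public struct NoInlineAwaitable
{
    private readonly Task _task;
    public NoInlineAwaitable(Task task) => _task = task;
    public Awaiter GetAwaiter() => new Awaiter(_task);

    public struct Awaiter : ICriticalNotifyCompletion
    {
        private readonly Task _task;
        public Awaiter(Task task) => _task = task;

        public bool IsCompleted => _task.IsCompleted;
        public void GetResult() => _task.GetAwaiter().GetResult();

        // The continuation is queued to the default scheduler when _task completes,
        // instead of being inlined on the completing thread.
        public void OnCompleted(Action continuation) =>
            _task.ContinueWith(_ => continuation(), CancellationToken.None,
                TaskContinuationOptions.None, TaskScheduler.Default);

        public void UnsafeOnCompleted(Action continuation) => OnCompleted(continuation);
    }
}

public static class NoInlineAwaitableExtensions
{
    // Usage: await someTask.WithoutInlining();
    public static NoInlineAwaitable WithoutInlining(this Task task) => new NoInlineAwaitable(task);
}
```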
Does that help?
I will add one particularly nasty problem though that we've hit due to inlining await continuations, just for the record:
It creates an unwanted dependency between unrelated code when multiple folks await on the same task. We've seen very subtle deadlocks because the inlining behavior is such that when multiple continuations are added to a task and that task completes, the continuations are executed in order on the original completing thread. This means that the second continuation cannot run till the first continuation completes, and when the second one will free a resource that the first one requires (and synchronously blocks for) that means a deadlock will happen. It's particularly hard to discover because the second continuation isn't on any callstack. Only deep heap inspection reveals the problem.
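A minimal sketch of that failure mode (illustrative; running it is expected to hang): two awaiters of the same task, where the first blocks on a resource that only the second would release.

```C#
using System;
using System.Threading;
using System.Threading.Tasks;

class OrderedContinuationsDeadlockSketch
{
    static void Main()
    {
        var tcs = new TaskCompletionSource<bool>();
        var gate = new SemaphoreSlim(0);

        // First awaiter: when tcs.Task completes it blocks until `gate` is released.
        async Task FirstAsync()
        {
            await tcs.Task;
            gate.Wait(); // blocks the completing thread
            Console.WriteLine("First finished.");
        }

        // Second awaiter: would release `gate`, but its continuation is queued behind
        // the first one and runs on the same (now blocked) completing thread.
        async Task SecondAsync()
        {
            await tcs.Task;
            gate.Release();
            Console.WriteLine("Second finished.");
        }

        Task first = FirstAsync();
        Task second = SecondAsync();

        // With no SynchronizationContext, both continuations are typically inlined,
        // in registration order, on this thread: First blocks, Second never runs.
        tcs.SetResult(true); // expected to never return => deadlock
        Task.WaitAll(first, second);
    }
}
```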
It is extremely difficult in code inspection to predict these bugs in our code as well. We're developing Roslyn analyzers to help us, but their effectiveness is limited and I'm afraid we're somewhat bug driven to add `await Task.Yield();` after awaiting on shared Tasks and before we do synchronously blocking things in order to avoid the problem.
@stephentoub: if I could change anything in the Framework, it would be that only the first continuation can be executed inline and all subsequent ones would execute on their appropriate TaskScheduler or SynchronizationContext. I believe this would preserve most of the perf wins that are so important because most scenarios (that I know of) have just one continuation, but it would avoid creating subtle dependencies between unrelated tasks.
@AArnott thanks for those contributions. It is very helpful to have a set of workarounds documented.
It seems these workarounds are mostly for the code that's waiting. I was more concerned with the code that is completing. If I'm waiting I'm already expecting other code to run (on any stack). This means that I cannot hold conflicting locks and my exposed state must be clean. Here, there is an explicit intention to run code.
But if I'm completing (as opposed to waiting) I do not necessarily expect other code to run at all. And if I do I expect that code to respect the locks that I guard my state with through an API.
It seems like a subtle point and I hope it makes sense.
It seems `await Task.Yield();` fixes the issue on the wrong side. It requires a fix for all callers. But if it's a good API, you cannot build it like that, because you don't know your callers! I found so far that the right location for a fix usually was the code completing a task (although that cannot solve the multiple-continuations issue, which, I believe, has no clean fix at the moment).
In your posts you mention multiple further cases where a (sophisticated!) Microsoft team was bitten. It seems everyone has one of these stories to tell... "somewhat bug driven" sounds very alarming to me, and I know what you mean.
Maybe the framework could be put into a "sanitization mode" where inlining, if possible, occurs randomly (based on actual random numbers). That way developers would more likely notice the issue during testing. There also could be a mode that disables inlining.
Unfortunately, it seems the finest possible granularity for this is the app domain. This makes it hard to use this for production since all code is affected. But it's great for testing. Test hosts could by default use those flags or have an easy way to opt in (e.g. `[assembly: SomeMSTestAttribute]`).
I truly believe that the reentrancy discussed here is so nasty that it's worth taking the breaking change. Let's not live with this for the next 15 years of .NET. The compat impact will only rise over time. These days, a lot of code is TPL based due to async. The instability tax for .NET apps will rise disproportionately over time as code patterns change. Almost all .NET code in the world is super boring business code that does not care very much about task overhead.
Would an `app.config` flag somehow be feasible? Or could this be incorporated, default-on, into the next major .NET version? Maybe a prototype could show that the compat impact is very small. (I don't know about the outcome, but it might be the case, and it might enable the team to take the change with confidence.)
If it's going to change the behavior of the API, I think it should change per assembly depending on the framework version being targeted at compile time instead of through `app.config`. Like having a different default parameter value for a particular method in the reference assemblies, so that when you run it on a different framework version the behavior doesn't change.
@GSPP Speaking for SignalR and ASP.NET in general, we learnt this all the hard way when we did SignalR ~4 years ago. Now we know all of the tricks, but it took a bunch of bugs to get there. We also have to educate developers because the behavior is pretty non-obvious. We just need to make sure options like `RunContinuationsAsynchronously` can be passed to all of the things (even cancellation token callbacks).
Here's a similar issue that bit Kestrel recently: https://github.com/dotnet/corefx/issues/14903.
PS: Pipelines has similar issues because the default scheduler is "inline" (configurable of course). Channels, on the other hand, defaults to thread pool dispatch to avoid these problems (also configurable).
@GSPP Yes, what you say makes sense. But solving this at the completing side is straightforward: just wrap your code that completes the task in a `Task.Run` delegate. That will ensure any continuations don't inline within your locks. And for cancellation you can do the same thing: `Task.Run(() => cts.Cancel());`
@stephentoub will have to comment on whether anything can or will change here. I'm an interested party, but I'm rather reserved when it comes to behavioral breaking changes.
@binki An assembly-level switch I don't see working out. Because at runtime, all the `Task` or `CancellationTokenSource` class has access to is appdomain or CLR host switches. It has no idea which assembly to consider attributes of.
@AArnott
> @GSPP Yes, what you say makes sense. But solving this at the completing side is straightforward: just wrap your code that completes the task in a Task.Run delegate. That will ensure any continuations don't inline within your locks. And for cancellation you can do the same thing: Task.Run(() => cts.Cancel());
But that's inefficient. If there are no callbacks you still end up using a thread pool thread just to call cancel. CancellationToken itself, on the other hand, knows whether it needs to dispatch or not and can do it at the right level.
> But that's inefficient

True. But then again, so is always using the thread pool to start continuations when, most of the time, running them inline is OK. Not to argue; I'm just pointing out it cuts both ways.
@karelz why is this marked up for grabs?
See my explanation above: https://github.com/dotnet/corefx/issues/2454#issuecomment-253654240
See our triage rules, they explain why we mark up-for-grabs everything we can.
Does it answer your question?
> why is this marked up for grabs?
Just speculating, I'm assuming it's based on the earlier discussion about potentially adding new knobs/APIs, and the "up for grabs" part is a proposal for that.
Overall, the discussion can of course continue, and good points have been raised, but the inlining behavior (whether for synchronous waits asking the scheduler whether they can inline the thing they're waiting on, or for a completing task checking with both the scheduler and its continuations whether it should run each of those continuations synchronously or asynchronously) has been around for a long time, and at this point APIs generally exist to opt out of the behaviors if you don't like them. Making a change to this behavior would be hugely breaking, so much so that I don't think we could even do it for desktop under a quirk. And I think the impact of having the behavior be different for this across runtimes would be much worse than the current state of things today. For the cases where there aren't easy opt-out mechanisms, we can and should consider additional APIs to make that possible; this was already discussed previously, and the issue was left open in case someone wanted to make specific proposals. Finally, I get that there are dragons lurking in some of the default choices, but they were made for good reasons, with real impact on performance; it's possible the wrong tradeoffs were made, but had we decided on something else, we would find ourselves now having a conversation about how onerous it is to need to call this and that API to get good perf. You can see this even with the evolution from ContinueWith to await: ContinueWith by default creates asynchronous continuations, whereas await by default creates synchronous ones.
My $.02.
> if I could change anything in the Framework, it would be that only the first continuation can be executed inline and all subsequent ones would execute on their appropriate TaskScheduler or SynchronizationContext
Out of all of the changes discussed in this thread, this one is I think the most viable. There are still possible breaks from it, but I expect they would be relatively rare. Something to think further about...
(I could be misreading the direction of this thread. Please let me know if I should delete this comment/open a new issue instead).
> For the cases where there aren't easy opt-out mechanisms, we can and should consider additional APIs to make that possible; this was already discussed previously, and the issue was left open in case someone wanted to make specific proposals.
Could such a new API be the option to choose whether continuations may run synchronously or not by providing the option at the `SetResult`/`SetException`/`SetCanceled` level? As I stated before, you often know when calling one of those three methods whether it would be OK to run continuations synchronously. But you don't always know when instantiating the `TaskCompletionSource` (and that instantiation could be in "unrelated" code).
Such an API change would be to add these members to `TaskCompletionSource`:
```C#
public class TaskCompletionSource<TResult>
{
    public void SetCanceled(bool runContinuationsAsynchronously);
    public void SetException(Exception exception, bool runContinuationsAsynchronously);
    public void SetException(IEnumerable<Exception> exceptions, bool runContinuationsAsynchronously);
    public void SetResult(TResult result, bool runContinuationsAsynchronously);
    public bool TrySetCanceled(bool runContinuationsAsynchronously);
    public bool TrySetCanceled(CancellationToken cancellationToken, bool runContinuationsAsynchronously);
    public bool TrySetException(Exception exception, bool runContinuationsAsynchronously);
    public bool TrySetException(IEnumerable<Exception> exceptions, bool runContinuationsAsynchronously);
    public bool TrySetResult(TResult result, bool runContinuationsAsynchronously);
}
```
I do not understand the direction shown by the existing API: the `CancellationToken` overload only exists for the `Try` variant of `SetCanceled`. Does that mean that this new option should only ever be added to the `Try` variants too?
And, also, after typing out all of these new overloads, I can see that just the number of new overloads required for this approach could be a problem in and of itself...
A more drastic approach to actually change defaults as the OP suggested, and just for `TaskCompletionSource`, would be to create a whole new class and recommend people use it instead of the old one, such as by marking `TaskCompletionSource` obsolete. This would be ugly because then you'd suddenly have warnings in a lot of perfectly good code and you'd have a class with a name like `TaskCompletionSource2`.
(to reply to earlier)
@AArnott
> @binki An assembly-level switch I don't see working out.
I realize now this is impossible without breaking the API and ABI, because C#/.NET does not allow members in an assembly to be hidden at compile time and visible at runtime. Also, casting to delegates does not bind default values, so this would even be an API break. So baking defaults at compile time using different default parameter values in different reference assemblies is not a usable technique for this sort of thing.
@binki `TaskCompletionSource` has an overload that takes `TaskCreationOptions`, so that may be preferable also for the overloads?
e.g.
```C#
tcs.TrySetResult(result,
    TaskCreationOptions.RunContinuationsAsynchronously | TaskCreationOptions.PreferFairness);
```
or
```C#
tcs.TrySetResult(result, TaskCreationOptions.HideScheduler);
```
@benaadams I thought about that, and to enable further options to be specified in the future, using an enum instead of a `bool` would definitely be a better choice. But I do not like `TaskCreationOptions` specifically because there is a lot in there that you cannot even use with `new TaskCompletionSource<TResult>(TaskCreationOptions)` (things like `HideScheduler` only make sense for "old" `new Task(Action)`-style tasks) and some of the options can only take effect during instantiation, when the `TaskCompletionSource.Task` property's value is set.
Should an enum be used instead of a bool for new `SetResult`/`SetCanceled`/`SetException` overloads to enable future expansion? If so, should `TaskCreationOptions` be used or a new enum made?
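For concreteness, a purely hypothetical shape for such a dedicated enum (not an existing or officially proposed API) could look like this:

```C#
using System;

// Hypothetical only: completion-time options, separate from TaskCreationOptions,
// so only flags that make sense at SetResult/SetException/SetCanceled time exist.
[Flags]
public enum TaskCompletionOptions
{
    None = 0,
    RunContinuationsAsynchronously = 1,
}

// A corresponding (hypothetical) overload would then be:
// public bool TrySetResult(TResult result, TaskCompletionOptions options);
```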
This issue is still unresolved. I keep finding code on the web or questions on Stack Overflow where unsuspecting people run into this. It happens all the time. Few people suspect that any wait can execute arbitrary code. And few people are able to write safe code under these circumstances. I can't. I have illustrated this in comments above where I pointed to qualified Microsoft programmers who made mistakes of this kind. These mistakes were hard to find and very critical to the correctness of their code.
Please find a way to address this. I said:
> Please disable all of these behaviors by default. Make unsafe performance features opt in. I doubt there will be compatibility impact from doing this because all of the behaviors above are, to my knowledge, non-deterministic.
There should not be correctness impact, only fairly minor performance impact. Is this argument not correct? Can we maybe do a compatibility switch to re-enable this? But reentrancy must be off by default. Maybe it can be on by default for old target frameworks and off by default for new targets. Then, eventually all code will transition away from being under a broken model. All individual APIs should have a way to opt-in and/or opt-out in code. Let's ponder this and find a way.
Pinging the knowledgeable and wise @stephentoub for guidance!
> There should not be correctness impact, only fairly minor performance impact. Is this argument not correct?
It is not correct. For example, remove the inlining support from Result, and the following will almost certainly deadlock:
```C#
using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main() => Console.WriteLine(Fib(20));

    private static int Fib(int n)
    {
        if (n <= 1) return 1;
        Task<int> t1 = Task.Factory.StartNew(() => Fib(n - 2));
        Task<int> t2 = Task.Factory.StartNew(() => Fib(n - 1));
        return t1.Result + t2.Result;
    }
}
```
and even if it didn't, the perf impact would be _enormous_.
Without changing the runtime, you can see the impact with a small modification to the repro like the following, which uses a custom scheduler to turn off inlining:
```C#
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main() => Console.WriteLine(Task.Factory.StartNew(() => Fib(20), CancellationToken.None, TaskCreationOptions.None, new NoInliningScheduler()).Result);

    private static int Fib(int n)
    {
        if (n <= 1) return 1;
        Task<int> t1 = Task.Factory.StartNew(() => Fib(n - 2));
        Task<int> t2 = Task.Factory.StartNew(() => Fib(n - 1));
        return t1.Result + t2.Result;
    }

    private sealed class NoInliningScheduler : TaskScheduler
    {
        protected override void QueueTask(Task task) => ThreadPool.QueueUserWorkItem(_ => TryExecuteTask(task));
        protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued) => false;
        protected override IEnumerable<Task> GetScheduledTasks() => null;
    }
}
```
Interesting. If I understand correctly, the deadlock would arise when the thread pool reaches its maximum thread count because all of the `.Result` are blocking threads and preventing them from being returned to the thread pool.
If the issue is only found in waits which block the current thread, could `await` be changed to avoid inlining by default with something like `.ConfigureAwait(allowInlineExecution: true)` to get old behavior?
> If I understand correctly, the deadlock would arise when the thread pool reaches its maximum thread count because all of the .Result are blocking threads and preventing them from being returned to the thread pool.
Yes. Even before then, though, performance degrades terribly, as once you get to Environment.ProcessorCount threads, the thread pool's going to start throttling how quickly additional threads are injected.
We did https://github.com/dotnet/corefx/issues/2454#issuecomment-303475181 for 3.0. Beyond that, I'm not seeing anything else here that's actionable, so I'm going to close this. If there's a specific API proposal for something further, please open a separate issue. Thanks.