Roslyn: DCR: Roslyn should throttle parallelizable work and not flood the threadpool

Created on 11 May 2017 · 43 comments · Source: dotnet/roslyn

I've learned from @CyrusNajmabadi that Roslyn tends to flood the threadpool with Tasks whenever highly parallelizable work needs to be done. This is bad for the rest of the application because (for example) if you have several seconds' worth of work in the ThreadPool's queue, any more urgent request for the Threadpool that is added to the queue has to wait several seconds before getting attention. This can make the IDE feel sluggish (at best) or have long UI delays (at worst) when the UI thread needs a threadpool thread as part of its work.

@CyrusNajmabadi tells me Roslyn has itself felt the pain of this and has established a policy that UI commands to Roslyn shouldn't need the threadpool. But this is an untenable policy for the rest of VS to honor as well. So the rest of VS features will feel sluggish and suffer UI delays while Roslyn is flooding the thread pool.

Instead, anyone with lots of work should throttle their concurrency level to a reasonable number (usually the number of CPU cores on the machine is a great number). This won't slow down your work because all cores stay busy, and it also keeps the threadpool queue short, which ensures that any new request will be satisfied very quickly, providing a better VS experience.

Please prioritize fixing this. FWIW, I find Dataflow ActionBlock and/or AsyncSemaphore (or SemaphoreSlim) make it relatively easy to throttle to a given level.
Added later: Actually, in the case of async work, if you really want to keep all CPU cores working hard, the best way to throttle is probably to use ConcurrentExclusiveSchedulerPair, since it doesn't "see" the asynchronous portions of tasks and thus only occupies its concurrent "slots" with the synchronous portions of async work.
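For illustration (not code from Roslyn or this issue), here is a minimal sketch of what ActionBlock-based throttling might look like; `ProcessItemAsync` and the item type are placeholders, and it assumes the System.Threading.Tasks.Dataflow package is referenced.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class ThrottledProcessor
{
    // Bound the number of items processed concurrently instead of queuing
    // one threadpool work item per element up front.
    private readonly ActionBlock<string> _block = new ActionBlock<string>(
        item => ProcessItemAsync(item),                 // hypothetical per-item worker
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        });

    public void Enqueue(string item) => _block.Post(item);

    public Task CompleteAsync()
    {
        _block.Complete();
        return _block.Completion;
    }

    private static Task ProcessItemAsync(string item)
    {
        // Placeholder for the real per-item work (parse, analyze, etc.).
        return Task.CompletedTask;
    }
}
```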

Area-IDE Feature Request

All 43 comments

But this is an untenable policy for the rest of VS to honor as well.

This is going to be a problem no matter what. Any component/extension/module/whatever may be using the threadpool in precisely this manner. Indeed, it's the purpose of the threadpool: to efficiently give you high throughput for work that is not latency-sensitive.

If you need latency (i.e. because you're on the UI thread), then the threadpool is not an appropriate place to go to precisely because it is shared among everyone and it will be saturated with tasks.

Any component/extension/module/whatever may be using the threadpool in precisely this manner.

Yes, any VS extension can ruin the experience for the rest of VS. But that's not a good reason for Roslyn to do the same. We file bugs against poorly performing 3rd party extensions too.

Indeed, it's the purpose of the the threadpool.

The purpose of the threadpool is so folks can schedule CPU work on another thread without paying the costs of spinning up a thread. It also minimizes the cost of maintaining too many dedicated threads where a dedicated thread is not required.

I suspect @stephentoub can offer guidance or links to docs affirming the guidance that apps should avoid flooding the threadpool queue with tasks such that the app can't access the threadpool in a timely way.

The threadpool is a shared resource. Please use it with respect for others that also need it. And regardless of your original goals, UI delays in VS are a reality and so is our customers' burning desire that we fix them. We need a responsive threadpool for many of our features in VS.

Yes, any VS extension can ruin the experience for the rest of VS.

This should not be ruining the experience for the rest of VS. AFAICT the threadpool has never given any sort of latency guarantees. When you schedule work in it there is no expectation that your work will happen any time soon. It's simply a place for BG threads to churn on tasks and whatnot. I've never seen a single piece of documentation in VS that says that it's inappropriate to use the threadpool in this manner. And this is what Roslyn has been doing since our inception (and we're tracking perf pretty heavily). I can't see how any component could be written to be on the UI thread, be latency sensitive, and then use the threadpool and expect there to not be issues here.

Tagging @Pilchie

I've never seen a single piece of documentation in VS that says that it's inappropriate to use the threadpool in this manner.

In VS Platform we have dealt with threadpool exhaustion for years and yes, we even recently publicly documented the problem. Most of this documentation has been around for years, and we tend to point people at it as we identify offenders who flood the threadpool. Most of the time we see it, it's because folks are simply blocking individual threads. In your case you're flooding the threadpool queue instead of simply blocking threads, but the effect is the same.

MSDN has a couple of topics that review the threadpool, including this one, which even includes this caution (emphasis added):

You can use the SetMinThreads method to increase the minimum number of idle threads. However, unnecessarily increasing these values can cause performance problems. If too many tasks start at the same time, all of them might appear to be slow. In most cases the thread pool will perform better with its own algorithm for allocating threads.

That same article goes on to explain perf characteristics of the threadpool and makes some pretty solid guarantees about how it performs under several circumstances. And it describes the "ideal", which Roslyn is detracting from at present.

So yes, I daresay general .NET guidance and certainly VS platform guidance has been to avoid flooding the threadpool. I'm sorry it's taken so long for us to realize this and to redirect your design. Customers are very clear that they want VS to be more responsive, and threadpool exhaustion is a very real contributor to some of our top UI delays, and it's actually one of the harder causes to pin blame for when working backward from a delay. But it's easy as a feature owner to know you're causing it, and in this case we know. So please fix it.

which even includes this caution (emphasis added):

The caution is about what you can observe if you use SetMinThreads and set the value too high. We're not doing that.

but the effect is the same.

It shouldn't be. Again, the threadpool is not, and has never been, a resource for latency-sensitive purposes. That's why, for example, it mentioned "You require a thread to have a particular priority". Effectively, when you're using the threadpool, you're saying I'm OK with my work being done in cooperation with everyone else using it. As you have stated, it's a shared resource.

If you have a circumstance where sharing is not ok, i.e. because you have heavy latency characteristics you need to maintain, then using the threadpool is not advisable because there is no way you could adequately ensure that the threadpool will ever be able to get around to processing your tasks in the time slice you require**

--

** Note: you could use 'LongRunning' to get a dedicated thread. But if you're just creating standard tasks and using the threadpool, you should not have any latency expectations.

The caution is about what you can observe if you use SetMinThreads and set the value too high. We're not doing that.

Yes, I guess that's true. But the rest of the caution is true even if you omit the first sentence, because scheduling too many tasks at once makes most of them appear slow.

Note: here's what your own documentation says:

In general the CLR thread pool is designed for short running tasks that don't block the threads, and the CLR will adjust active worker thread counts to maximize throughput. Because of this, long running tasks that block on network IO or other async tasks can cause problems in how the CLR allocates threads. In order to avoid these issues, it is recommended to always use async waits, even in thread pool threads, to avoid blocking a thread with no actual work being done. There are async APIs available already to do file IO and network IO that should make this easier.

That's how we do things. :)

As I said before, that doc was written when the only starvation cause we'd seen was folks blocking threads. You're one of the first teams to simply flood the queue, so we didn't talk about that directly.

I don't actually see how making fewer, longer running tasks would actually make the situation any better (modulo the small per-task overhead). As long as the tasks are yielding their threads instead of blocking, I don't see a problem here.

@stephentoub - what are your thoughts on this sort of practice?

But if you're just created standard tasks and using the threadpool, you should not have any latency expectations.

Where is your documentation to back that up?

The threadpool has lower latency than creating a dedicated thread, when the threadpool is available.

because scheduling too many tasks at once makes most of them appear slow.

A task is not meant to be 'fast'. It is meant to represent 'asynchronous' work, i.e. work that may take a large amount of time, precisely because any number of factors may be impacting it. The point of tasks is to be able to say "I want this work to happen, and I accept that it may take time" and to design the rest of your system appropriately. For example, for a feature like FAR, we know that searching may take a long time. So we use async-tasks to do the work, and we populate the result with the values as the async work makes progress. We do not block the UI thread on this. Indeed, we release the UI thread almost immediately, precisely so that nothing is sluggish and so that the user can continue interacting with it.

Now, there's also the issue of thread-starvation when threadpool threads get blocked. But Roslyn tries very hard to never do that. Indeed, the only times we must (and they're super rare) are when we must call into some library that absolutely has no async codepath for us to use. For example, sqlite has no async codepath and yet may do IO. There's really no way for us to completely ensure that we don't block a thread in this case. But, otherwise, we try to use async everywhere to make sure that threadpool thread starvation is not caused by us.

I don't actually see how making fewer, longer running tasks would actually make the situation any better (modulo the small per-task overhead). As long as the tasks are yielding their threads instead of blocking, I don't see a problem here.

@Pilchie I'm not asking for bigger tasks. I'm not asking to throttle the number of tasks. I'm asking that you throttle the rate at which you add tasks to the threadpool queue. And the reason that makes the situation heaps better is that with a shorter threadpool queue, any individual scheduled task completes more quickly. And when the task is necessary for high priority work (like the user is waiting on it), that's gold.

Where is your documentation to back that up?

It's the reverse. The Task and Threadpool APIs do not give any latency guarantees. Without guarantees, you should not expect them to behave in this manner.

I mean, the entire basis of Task is that it represents async work. The point of it was to allow you to say "this may take some unknown amount of time, go do it, and allow me to resume once you are complete". If you need it to represent bounded work, then I don't know how you'd ever get that.

I'm asking that you throttle the rate at which you add tasks to the threadpool queue ... And when the task is necessary for high priority work (like the user is waiting on it), that's gold.

I don't see how that helps. So we throttle the rate that we add tasks. But we still add the tasks, and the TPL still runs our tasks before your high-pri task. You're still not going to run for a while.

Note: the documentation you did link to said precisely this:

When Not to Use Thread Pool Threads

You require a thread to have a particular priority.

If you need priority work, then the recommendation is to not use the threadpool and to use your own scheduler. Note: this is what we did in TypeScript so that it could prioritize work without being at the mercy of the threadpool.

Now, there's also the issue of thread-starvation when threadpool threads get blocked. But roslyn tries very hard to never do that.

Thank you for that.

The Task and Threadpool APIs do not give any latency guarantees.

Neither does the UI thread. The windows kernel is not (typically) a real-time scheduler. So guarantees are rarely written down. Nevertheless, much more code in VS is (thankfully) being written to be async where possible now. But blocking the UI thread on some of this work is still sometimes necessary. And if you have a background operation going in Roslyn and VS wants to update some UI that needs a background task to finish first, whether or not the UI is blocking on that background work, you're slowing down a high priority UI display with a low priority long running operation. That's not ideal.

Cyrus, we're reading the same docs and interpreting them differently. I'm OK with that, but I don't have more time to spend right now trying to find more documentation to back this up.

What I can tell you from ample experience though is that what you're doing is unhealthy for VS. I haven't seen UI delays in VS that we've pinned to Roslyn's flooding the threadpool because as I've said, identifying threadpool exhaustion offenders from the UI delay traces is usually impossible. But when we do have more data, we have seen over and over again that threadpool exhaustion contributes to large UI delays.

And we have a lot of UI delays that are inexplicably slow for a small % of cases, so threadpool exhaustion is a plausible explanation for some of them.

All your points that this might be a bad idea notwithstanding, most of the rest of VS has code that is at odds with your ideals. And Roslyn is the only component in VS that I know of that would aggravate this problem. Other threadpool exhaustion contributors (those that block threads) we actively chase and get them to fix and UI responsiveness gets markedly better as we do this. I'm confident that fixes to Roslyn will win similar gains because again, the symptom is exactly the same, whether you flood the threadpool with tasks that block, or you flood the threadpool queue.

As an authority on threading in VS, I'm asking you to please fix this for the betterment of VS customers. I'm happy to offer guidance and/or assistance in making the changes necessary to throttle. I believe it can be done with no impact to your callers or extensions, and quite possibly with relatively low cost to yourselves.

Do you have a suggestion on how we could do this? Roslyn is dozens of components across several languages. It's unclear to me how we would actually coordinate things such that the rate at which tasks were added to the threadpool was throttled.

Tagging @jasonmalinowski

I'd be happy to suggest options. My initial issue section has several suggestions of resources .NET offers that can help you. The most relevant since we're focused on the threadpool here is ConcurrentExclusiveSchedulerPair, which should keep your scenario running just as fast, while still being a good VS citizen.

Can you point me to a sample piece of source code in Roslyn that has unbounded concurrency today? I can try to whip up a "fix" that hopefully can be equally applicable elsewhere.

Can you point me to a sample piece of source code in Roslyn that has unbounded concurrency today?

We do not bound anything. :) We create tasks everywhere. Everything, afaict, is unbounded. Note: as mentioned previously, we're also a disparate set of components. Would the ConcurrentExclusiveSchedulerPair ensure throttling across all our components? Or will you just get several dozen throttled components, which still end up putting so much onto the threadpool that any of your tasks may still wait a long time to run?

Also, doesn't this fall apart at the first set of awaits, which will allow new tasks to start, and then await, but all the continuations will just end up on the thread pool anyway?

I admit I don't see how ConcurrentExclusiveSchedulerPair helps. Either:

  1. It reduces the number of simultaneous tasks we're scheduling to the thread pool, which means we're increasing latency of our own features (which is bad.)
  2. It doesn't reduce the number of simultaneous tasks we're scheduling, so it does nothing (which doesn't address @AArnott's concern.)

I would be curious to understand this statement:

@CyrusNajmabadi tells me Roslyn has itself felt the pain of this and has established a policy that UI commands to Roslyn shouldn't need the threadpool. But this is an untenable policy for the rest of VS to honor as well

What is different about the rest of VS that means that Roslyn's own approaches to this problem don't apply elsewhere?

when the UI thread needs a threadpool thread as part of its work.

We have avoided this and work to do even more of that, because even moving to a different set of threads is still bad: there's no guarantee that the Windows thread scheduler will run the thread you want any sooner than the threadpool would get to your work. Hence if an operation starts on the UI thread we keep it there, because jumping threads will never get the result to the user any faster than not. (In fact, possibly a bit slower because of the thread priority bump Windows gives to the thread associated with the active window.)

And if you have a background operation going in Roslyn and VS wants to update some UI that needs a background task to finish first, whether or not the UI is blocking on that background work, you're slowing down a high priority UI display with a low priority long running operation. That's not ideal.

It's interesting because in many cases, we don't know which "background" work might suddenly become high priority: if I'm typing in the editor we're doing completion in the background and light bulbs in the background, both of which we would block on depending upon which key the user presses next. A system where we start different features on a set of threads and adjust priorities accordingly might be something interesting to experiment with, although it would require some non-trivial problems to be solved that would have widespread impacts on code. Unfortunately tracking that would be challenging. (And it also has other impacts on VM usage which need to be addressed.)

@AArnott, it would be good (no matter what comes of this discussion) to have some good ways to monitor [a] thread pool queue length, [b] thread pool queue latency, and [c] some distribution of how different producers of the thread pool work are consuming it. Do we have any way to monitor this today?

If the UI needs to run some task of its own and needs it to be run sooner rather than later, shouldn't it be using a dedicated thread pool instead of trusting that the shared one will be responsive?

@CyrusNajmabadi said:

We do not bound anything. :) We create tasks everywhere. Everything, afaict, is unbounded.

Understandable. Creating tasks is normally fine, as the threadpool can usually keep up with relatively low latency. It becomes a problem when you have a loop that generates so many tasks that even a single producer is sure to outrun the threadpool.

Would the ConcurrentExclusiveSchedulerPair ensure throttling across all our components? or will you just get several dozen throttled components, which still end up putting so much onto the threadpool, that any of your tasks still may wait a long time to run?

You have control over that by who shares your instance of ConcurrentExclusiveSchedulerPair. Each instance maintains its own throttled scheduling. If sharing the instance is convenient, you could go ahead and do that. If it's not, using isolated instances in each feature would still be a significant win because each feature's contribution is curbed.

To @Pilchie's question:

Also, doesn't this fall apart at the first set of awaits, which will allow new tasks to start, and then await, but all the continuations will just end up on the thread pool anyway?

That depends on your .ConfigureAwait(bool) policy. Awaiting a task by default resumes your async method on the SynchronizationContext or TaskScheduler it was running on before. So this works great if you don't explicitly specify .ConfigureAwait(false) everywhere. In fact, not specifying that suffix everywhere helps in another case you might like: Cyrus's description of staying on the UI thread for UI commands.
If you use .ConfigureAwait(false) everywhere and want to keep doing that, the only effective throttling mechanism is an async-aware one such as Dataflow blocks or semaphores. And those aren't guaranteed to keep the CPU busy, since you'll have to pick a number of concurrent logical operations to allow instead of concurrent synchronous tasks the way a TaskScheduler would. So while this isn't as ideal, you could still likely pick a 'sweet spot' concurrency level that would tend to keep the CPU busy without building up a long threadpool queue.
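As a rough sketch of that async-aware option (not code from this thread), a SemaphoreSlim can cap the number of logical operations in flight; `runOneAsync` and the item type here are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class Throttled
{
    // Run one async operation per item, but never more than maxConcurrency at once.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items,
        Func<T, Task> runOneAsync,     // hypothetical per-item operation
        int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                // Waiters don't hold a threadpool thread; they just delay
                // the item's continuation until a slot frees up.
                await gate.WaitAsync().ConfigureAwait(false);
                try
                {
                    await runOneAsync(item).ConfigureAwait(false);
                }
                finally
                {
                    gate.Release();
                }
            });

            await Task.WhenAll(tasks.ToList()).ConfigureAwait(false);
        }
    }
}
```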
Also, FWIW, the AnyCode team was swamping all cores on the CPU for indexing (via the threadpool) and quickly found out that the UI responsiveness of VS suffered greatly. We'd have to ask them to find where they settled, but it might have been on 1-2 threads dedicated to indexing, leaving 1-2 cores (on a 4 core machine anyway) available to serve the user's interactions with the IDE.

To @jasonmalinowski's points:

I admit I don't see how ConcurrentExclusiveSchedulerPair helps. Either:

  1. It reduces the number of simultaneous tasks we're scheduling to the thread pool, which means we're increasing latency of our own features (which is bad.)
  2. It doesn't reduce the number of simultaneous tasks we're scheduling, so it does nothing (which doesn't address @AArnott's concern.)

I think you misunderstand this TaskScheduler. Assume you specify a concurrency level equal to the number of CPU cores (let's suppose you have 4). Suppose you have 80 tasks to get done. You can throw them all onto the threadpool queue at once or you can give it the first 4 or 5. As soon as one is finished, you enqueue another. The threadpool is busy the entire time your 80 tasks are getting done either way. So they get done in equal time with either strategy.
But this strategy alleviates my concern because if anyone else needs the threadpool (usually something to serve what the user is interactively doing), any item they add to its queue will be executed very quickly because the queue is short. In this scenario, your own next task will be behind this interactive-supporting task in the queue, but that's probably appropriate and improves the UI responsiveness. At the very worst, you're sharing the pool with other background work in a more equitable fashion.
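A minimal sketch of that scheme (illustrative numbers, not Roslyn code): 80 work items are handed to a ConcurrentExclusiveSchedulerPair whose concurrency level matches the CPU count, so only that many runners ever sit in the real threadpool queue at once.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // Only 'ProcessorCount' runners are ever queued to the real threadpool;
        // the remaining work items wait in the pair's own queue, keeping the
        // shared threadpool queue short for everyone else.
        var pair = new ConcurrentExclusiveSchedulerPair(
            TaskScheduler.Default,
            maxConcurrencyLevel: Environment.ProcessorCount);

        var factory = new TaskFactory(pair.ConcurrentScheduler);

        Task[] work = Enumerable.Range(0, 80)
            .Select(i => factory.StartNew(() => DoWorkItem(i)))
            .ToArray();

        Task.WaitAll(work);
    }

    static void DoWorkItem(int i)
    {
        // Placeholder for a CPU-bound chunk of work.
    }
}
```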

What is different about the rest of VS that means that Roslyn's own approaches to this problem don't apply elsewhere?

It might. Except that for all the massive code bases in VS besides Roslyn, redesigning them to follow Roslyn's solution would be extremely expensive and risky. And ultimately, it would probably lead to a slower experience. Consider CPS (which I call on because I happen to be very familiar with it): it uses the threadpool to help load multiple projects concurrently while loading a solution. The UI thread can't load projects in parallel, so multiple other threads are necessary, and dedicated threads are not, since we have thousands of little tasks to do and can scale out to whatever the CLR's tuned threadpool can do. But solution load can become UI-thread-blocking when we need to load all projects at once. Suddenly, threadpool latency becomes critical to user responsiveness and perceived performance in VS. So honestly I don't know how Roslyn's solution even could be applied here.
I also am not yet sure how your solution works for Roslyn either, since UI behaviors including opening and closing solutions also require Roslyn, and Roslyn does work on background threads (I think and hope) as part of initializing its language service, and I expect that ends up blocking the UI thread at times (like close solution).
And in particular, if you use .ConfigureAwait(false) in code you call from the UI thread and then block (I'm pretty sure you do for things like completion lists), then in fact you do depend on threadpool latency, don't you?

Hence why if an operation starts on the UI thread we keep it there because jumping threads will never get to the user any faster than not.

If the operation can be parallelized, then moving the operation off the UI thread can absolutely get the job done faster than restricting oneself to the UI thread. But it depends on a responsive threadpool. And in fact major VS components do this. Roslyn doesn't?

It's interesting because in many cases, we don't know which "background" work might suddenly become high priority: if I'm typing in the editor we're doing completion in the background and light bulbs in the background, both of which we would block on depending upon which key the user presses next.

Yes! Exactly. And it's particularly painful when the UI requirement is short yet has to wait a long time for a very long background operation to finish, just because that operation put 8 seconds' worth of work into the threadpool queue before the UI task got enqueued. By helping to keep the queue short, we mitigate this risk significantly.

This is very similar to another problem we had a few years back with design-time builds. CPS would schedule dozens of design-time builds and "flood" the build queue. This was fine for the async requests, but when a project needed a design-time build because the References were being displayed or modified by the user, we needed that build done now -- not 30 seconds later. So CPS had to either use short queues or learn to interrupt and reschedule the work. IIRC we went with throttling for short queues because that was much easier to implement than recognizing priority, interrupting and rescheduling low priority work.

it would be good (no matter what comes of this discussion) to have some good ways to monitor [a] thread pool queue length, [b] thread pool queue latency, and [c] some distribution of how different producers of the thread pool work are consuming it. Do we have any way to monitor this today?

I very much agree. I'm going to talk with Ashok Kamath about adding to PerfWatson's UI latency monitoring another monitor that measures threadpool responsiveness, and perhaps even collects callstacks of the threadpool threads when it is sluggish. I may write a VS extension myself that shows a graph over time and perhaps shows an alert when it's flooded. This might help users understand that the UI delay they're seeing is related to threadpool exhaustion.
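Not the PerfWatson integration itself, but as a sketch of how such a monitor could estimate threadpool responsiveness: periodically queue a trivial work item and measure how long it waits before running. All names here are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class ThreadPoolLatencyProbe
{
    // Periodically measures how long a trivial work item sits in the
    // threadpool queue before it starts executing.
    public static async Task RunAsync(TimeSpan interval, Action<TimeSpan> report, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var stopwatch = Stopwatch.StartNew();
            var started = new TaskCompletionSource<TimeSpan>();

            ThreadPool.QueueUserWorkItem(_ => started.TrySetResult(stopwatch.Elapsed));

            TimeSpan latency = await started.Task.ConfigureAwait(false);
            report(latency);    // e.g. log, chart, or alert when latency is long

            await Task.Delay(interval).ConfigureAwait(false);
        }
    }
}
```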

@binki:

If the UI needs to run some task of its own and needs it to be run earlier than later, shouldn’t it be using a dedicated thread pool instead of trusting that the shared one will be responsive?

Last I checked, .NET doesn't offer a constructor on the ThreadPool class. And implementing one as efficient as the built-in one is far from trivial. If we were to have dedicated instances of pools, I'd suggest that the heavy-concurrency features should own their own instead of using the shared one and pushing the thousands of random bits of code that just want it for little things off of it.
A process works best with very few threadpools (or just one). The .NET ThreadPool is tuned to maximize the user's CPU without "overscheduling". If you have multiple threadpools, any combination of them could jointly overschedule the CPU, reducing overall performance.
Simply throttling one's work avoids these problems and the complexities associated with them.

(previous comment I made retracted and deleted. I forgot about some other things we do.)

Then perhaps this isn't as big of an issue as I feared. :) It would be interesting for folks familiar with the code to suggest potential problem areas, and to get that VS monitoring system set up to call out when and if it ever is a problem to catch the rest.

I think only the Compiler and Formatter actually use the actual Parallel.For API. But we certainly have tons of concurrent tasks processing and producing new tasks all over the place.

As mentioned earlier, we're a loosely coupled set of components. And they often work independently in response to the environment (and other triggers) to do work. Take, for example, 'Tagging'. We have many different taggers, and each listens to a myriad of independent events to trigger work. This work happens concurrently, and may itself lead to calling into many different subsystems (with their own boundaries) which then operate in their own concurrent manner. Tasks abound. For example, the 'highlight'-tagger will call into the subsystem responsible for finding references (which can happen in a parallel fashion). Find references may call into the indexing subsystem (which may be processing in a parallel fashion). While this is happening, we have the diagnostic subsystems and the analyzer subsystems doing analysis which will then show up through things like the squiggle-tagger. At the same time we'll be doing the concurrent work to determine if there are lightbulbs on the current line, or suggestion tags elsewhere in the file. While this is happening, dozens of other features are running concurrently against this.

As part of this, we are trying to make many of these subsystems now work both out of process, and in a streaming fashion. Meaning we'll have parts of VS concurrently calling OOP to compute values, and several of those requests will be sending back results as individual pieces are computed. i.e. we pass a callback to the server side, and we allow the server to communicate back to VS incrementally in response to a single request to it.

In the non-oop scenario, this works well, without us having seen this be an issue for VS. We're heavily sensitive to UI thread latency ourselves, so when the UI thread can't respond effectively, it's a big deal for us.

Where this came up as a problem was in the bug you had me open, where we actually just took the same approach as above, but moved to using OOP, and then we found a lot of wasted CPU usage in VS, despite the fact that we were actually moving most of the computation out of VS and were only remoting data between VS and the OOP server. This was not expected on our part. We did not realize that concurrent transmission of bidirectional data to/from VS/OOP would end up with a high amount of CPU used, and that's what started this whole discussion :)

I think only the Compiler and Formatter actually use the actual Parallel.For API

I think that API has a throttle built in. I could be wrong though.

We did not realize that concurrent transmission of bidirectional data to/from VS/OOP would end up with a high amount of CPU used

I didn't get to analyze the ETL myself. I'd be very interested in it. The part that you shared shows the high CPU waste from contention in writing to a single stream concurrently and in creating connections. I believe both of those can be resolved. Was there a lot of other CPU overhead in the transmission as well?

BTW, I just whipped up a sample to demonstrate throttling work and its benefits. It shows perf of the heavy work not suffering (in fact, it improved in my testing) and a huge improvement in app responsiveness. Please check it out.

That depends on your .ConfigureAwait(bool) policy.

Our ConfigureAwait policy (simplified) is "everything ConfigureAwait(false), unless it must return back to the calling SyncContext". This is explicit precisely because Roslyn itself acts as a library, and we do not want to impact a caller SyncContext.

I just checked and we have roughly 30 ConfigureAwait(true) calls, and more than 100 times as many ConfigureAwait(false) calls. So 99% is ConfigureAwait(false).

Was there a lot of other CPU overhead in the transmission as well?

Sorry, I just meant 'transmission' in the sense of 'this happened during sending data'. If that's all in writing data to the stream, then that's all I meant. We did not consider that simply writing our data to the stream would incur this sort of CPU overhead in highly concurrent scenarios.

We did not consider that simply writing our data to the stream would incur this sort of CPU overhead in highly concurrent scenarios.

Neither did I. I'm sorely disappointed that SemaphoreSlim.WaitAsync is so expensive in concurrent scenarios where you might want to use a semaphore in the first place. I am considering changing the AsyncSemaphore implementation to not depend on SemaphoreSlim. But that won't likely make it for 15.3.

Long thread... reading through it and responding as I go... sorry if I respond to something that's later addressed :smile:

I don't actually see how making fewer, longer running tasks would actually make the situation any better (modulo the small per-task overhead). @stephentoub - what are your thoughts on this sort of practice?

It won't. The thread pool's thread-management heuristics work best for smaller rather than larger work items. That's for a few reasons:

  • The hill-climbing algorithm does sampling every N items that complete, at which point it may experiment with adding/removing a thread to see how that impacts throughput. If you have long work items, it can't adapt as quickly.
  • The starvation algorithm checks once or twice a second to see if any items have been taken from the queue. If you have longer work items, it's more likely to think things are being starved and introduce more threads than are actually desirable.

I don't see how that helps. So we throttle the rate that we add tasks. But we still add the tasks, and the TPL still runs our tasks before your high-pri task. You're still not going to run for a while.

I believe what Andrew is asking for is a fairness scheme. He's saying that rather than creating 1 large work item and adding it to the queue, or N smaller work items and adding them to the queue all at the same time, effectively add just one of those smaller work items, and then when that one completes, add the next, and when that one completes add the next; or rather than a concurrency level of 1, choose some other concurrency level M < N, and add work items to the queue in M-sized chunks. Effectively that's what the ConcurrentExclusiveSchedulerPair he mentions does. You add your N work items to it, and it in turn queues M runners to the underlying scheduler, where those runners pull your N work items from its queues and run them (and where each of those runners can be configured with a max number of work items to run before it requeues itself). This approach creates some level of fairness between entities scheduling work, because each one a) doesn't necessarily have access to the full concurrency level supported by the thread pool, and b) creates a sort of FIFO where rather than my N items getting queued ahead of your N items, only one of mine does and only one of yours, such that we might be able to better share the pool's resources. At least that's how I interpret Andrew's comments.

Also, doesn't this fall apart at the first set of awaits, which will allow new tasks to start, and then await, but all the continuations will just end up on the thread pool anyway?

By default await task; will look for either the current SynchronizationContext or the current TaskScheduler, so if you use await task; while running on a ConcurrentExclusiveSchedulerPair, the continuation will be queued back to it. If you use await task.ConfigureAwait(false);, though, you explicitly tell the await not to look for a current scheduler/context, and the continuation will typically end up running as part of the antecedent's completion, or, if it can't, getting queued to TaskScheduler.Default (i.e. the thread pool).
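A small sketch of that difference (assuming no SynchronizationContext is present, as on a threadpool thread): the current scheduler is the pair's concurrent scheduler up until the ConfigureAwait(false) await, after which the continuation runs under TaskScheduler.Default.

```csharp
using System;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        var pair = new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, maxConcurrencyLevel: 2);
        var factory = new TaskFactory(pair.ConcurrentScheduler);

        factory.StartNew(async () =>
        {
            Console.WriteLine(TaskScheduler.Current);      // the pair's concurrent scheduler

            await Task.Delay(10);                          // default: capture the current scheduler
            Console.WriteLine(TaskScheduler.Current);      // still the pair's concurrent scheduler

            await Task.Delay(10).ConfigureAwait(false);    // don't capture
            Console.WriteLine(TaskScheduler.Current);      // TaskScheduler.Default (the thread pool)
        }).Unwrap().Wait();
    }
}
```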

it would be good (no matter what comes of this discussion) to have some good ways to monitor [a] thread pool queue length, [b] thread pool queue latency, and [c] some distribution of how different producers of the thread pool work are consuming it. Do we have any way to monitor this today?

For (a) and (b) there are ThreadPool perf counters / ETW events that provide this information. (c) would be app-specific.

Last I checked, .NET doesn't offer a constructor on the ThreadPool class

It doesn’t. We’ve considered this multiple times in the past (either a way to construct ThreadPool instances or some other custom pool type), but have shied away from doing so. It’s one of those constructs that’s valuable in the 5% case but in the 95% case can easily be misused leading to worse problems. It’s something that could be considered again, though. The ConcurrentExclusiveSchedulerPair is a sort of compromise: different components can target their own “pool”, but from an execution resources perspective, they all end up sharing and cooperating.

Neither did I. I'm sorely disappointed that SemaphoreSlim.WaitAsync is so expensive in concurrent scenarios where you might want to use a semaphore in the first place. I am considering changing the AsyncSemaphore implementation to not depend on SemaphoreSlim. But that won't likely make it for 15.3.

What data is this statement based on? I’ve yet to see any issues opened in the coreclr repo for example concerning such an expense of WaitAsync. And if there is a problem, let’s please have our first instinct be “let’s fix the problem in the existing implementation so that everyone using it benefits and so that we don’t have competing choices” rather than “let’s go off and write a different one”.

Thanks so much for wading through this long thread, @stephentoub, and for your comments. Yes, I was proposing throttling as a fairness scheme.

What data is this statement based on? I’ve yet to see any issues opened in the coreclr repo for example concerning such an expense of WaitAsync.

@CyrusNajmabadi has an ETL with the evidence. A screenshot is available in the bug he filed here:
https://github.com/Microsoft/vs-streamjsonrpc/issues/37

let’s please have our first instinct be “let’s fix the problem in the existing implementation so that everyone using it benefits and so that we don’t have competing choices” rather than “let’s go off and write a different one”.

I fully support that. I haven't yet looked at the code in SemaphoreSlim to see if it's doing something that we can improve on. But giving it the benefit of the doubt, I was expecting it to be something we couldn't fix without a breaking change. But I'm happy to give it more of a look over (or to file a bug in coreclr for the same by folks more familiar with it).

Although thinking about it a bit more: fixing SemaphoreSlim would be great, but it might not alleviate our need to abandon it for now, unless .NET Framework takes the fix and ships a servicing release soon enough that we don't have to alleviate the pain another way in the meantime.

has an ETL with the evidence. A screenshot is available in the bug he filed here

That first trace shows the contention in Monitor, not specifically SemaphoreSlim. Unless you're planning to build a lock-free asynchronous semaphore, I expect you'd hit similar contention in any sync primitive you implemented that utilized lock / Monitor. But I'd still suggest looking at what could be done in SemaphoreSlim:

  • Maybe locks are being held for longer than they need to be, and code could be restructured to avoid the number of times locks are taken or the size of those locked regions, so as to reduce contention.
  • Maybe MonReliableContention itself in the runtime needs to be addressed. I believe @vancem has had some related thoughts in the recent past from other investigations he's done.
  • It's possible some more level of lock-freedom could be employed, though I somewhat doubt that.

I've had similar thoughts. I've never written a semaphore before, but I figured at least if I constrain the case to capacity=1, that a lock-free mechanism is likely possible. And for most cases I use SemaphoreSlim, it's with capacity=1 and used to "lock" async code.

But yes, restructuring to lock for less time or lock fewer times would be very interesting too. I just took a look at the code and it's really intense. It would take me a while to ramp up on all the internal .NET code it's calling and multiple fast paths it tries to take for perf reasons.
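For reference, the capacity=1 pattern being described is roughly the following (a generic sketch, not VS code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Cache
{
    // SemaphoreSlim(1, 1) used as an async-compatible "lock": only one caller
    // at a time runs the guarded section, and waiters don't block a thread.
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private object _value;

    public async Task<object> GetOrComputeAsync(Func<Task<object>> computeAsync)
    {
        await _gate.WaitAsync().ConfigureAwait(false);
        try
        {
            if (_value == null)
            {
                _value = await computeAsync().ConfigureAwait(false);
            }
            return _value;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```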

Unfortunately, I've checked our codebase regarding ConfigureAwait. We require every task to be ConfigureAwait'ed. And there is a >99:1 ratio of ConfigureAwait(false) to ConfigureAwait(true). We did this as Roslyn is intended to be used as a library, and there were definite issues in the past with people calling into us and our tasks getting scheduled onto their SyncContexts.

This approach creates some level of fairness between entities scheduling work

My primary concern is:

  1. Roslyn itself is 'many entities scheduling work'.
  2. The code that plugs into us is also composed of many entities scheduling work.
  3. The rest of VS is filled with many other entities scheduling work.

A fairness scheme seems undermined by the issue that we'll still end up with lots of work going into the queue (since all of these entities are not going to coordinate), and the UI thread effectively being at the mercy of this if the UI thread is written to block on some threadpool thread completing.

--

Aside: Andrew, I looked at your gist. One thing you're doing is setting the concurrency level to 2xProcessorCount. When I set the ThreadPool to use that same number of threads as its minimum, I get the same output time:

(not throttled) FINISHED in 00:00:04.4155004 vs
(throttled) FINISHED in 00:00:04.6857670

If I change your concurrent scheduler to only use ProcessorCount concurrency, then I get:

(not throttled) FINISHED in 00:00:06.0017420 vs
(throttled) FINISHED in 00:00:09.3727364

@CyrusNajmabadi When you set the concurrent scheduler to ProcessorCount, are you keeping the ThreadPool minimum at 2×ProcessorCount? I don't know how the thread pool works, so the following is conjecture: if the idle threads are first in line for the next job and a busy thread finishes its work and, when looking for more work, gets in line behind one of those previously idle threads, maybe it's forcing a context switch. I.e., that thread will never get more work until it sleeps/blocks, allowing another thread to consume a job. Whereas when the pool is saturated with work (concurrency level ≥ the ThreadPool minimum), any thread which finishes a job is able to keep working because a job becomes available for it right then, allowing it to fully use its time slice. So keeping the pool at least saturated is necessary for performance reasons, interesting…

@CyrusNajmabadi Just FYI we should not change the threadpool min size in VS. If that is the right thing to do (and we're open to discuss if you want), we should do it within VS Platform rather than have individual extensions/components of VS setting/resetting it since it's a shared resource.

I set my concurrency level to 2xProcessorCount in my demo because that seemed to give a good balance between threadpool queue and performance. And it makes sense because it ensures that there is always work to do, even if every threadpool thread finished an item at the same time (because the queue will have that much work in it, at least).

A fairness scheme seems undermined ...

What you're pointing out, I think, is that the fairness scheme in its most trivial application is limited to being "fair" across a lot of individual components, when perhaps it would be even "more fair" if some components got together in clusters to share a concurrency limit. And I'd agree, that might make sense for some clusters of components. But short of having the time and ability to arrange for clusters, I wouldn't let that stop me for a second from limiting the concurrency of each individual component. If you have 500 components and 1 of them is a bad actor, that hurts the user experience. But if that 1 bad actor limits concurrency, it fixes the problem.
The clustering will fix a different problem, where individual components aren't themselves flooding the queue, but collectively end up flooding them together. If we can't solve that problem, let's not let that be a reason to not solve the easier and likely more common problem. :)

@AArnott How do you feel about:

  1. Posting the notes from our in-person meeting yesterday here for external people to see
  2. Closing this issue, and we'll file individual issues for the problematic areas we identify as a result of that diagnostics we're planning on adding from that meeting

?

Sure:

Meeting minutes
In attendance: @Pilchie, @sharwell, @jasonmalinowski, @CyrusNajmabadi, @BertanAygun, @aarnott

  • There are no known cases of UI delays that come from Roslyn flooding the threadpool queue, although identifying such would be very difficult given our tooling limitations today.
  • Roslyn team believes there are several places in their code that may introduce unbounded threadpool queuing. Some of these are candidates for moving to their ServiceHub process already and may remain unbounded as they would no longer impact VS’s ThreadPool. Their process runs as low priority, but they automatically adjust to Normal priority when the OOP service is responding to a user command from the VS process.
  • Other components (besides Roslyn) may also contribute to the problem. This is a blind spot for VS performance monitoring that we should address from the VS Platform side.
  • The TaskScheduler approach is inappropriate for Roslyn given their .ConfigureAwait(false) policy, which we agree is reasonable in their design. JTF.Run helps avoid blocking multiple threads when async code is called from synchronously blocking code. Roslyn’s AsyncLazy class solves this by having sync and async delegates and invoking the appropriate one.
  • AsyncSemaphore/SemaphoreSlim would make a reasonably small fix that we expect to be largely effective at mitigating the risk of UI delays by limiting these features' footprint in the threadpool queue.
  • Feature A may call Feature B, both with individual throttling mechanisms, but together they may still flood the threadpool. We think taking the step to throttle individual features will be an inexpensive and largely effective first step. We will build the monitoring and telemetry we need to measure its effectiveness and consider multi-feature throttling only if necessary to resolve what should then be confirmable UI delays.
  • @sharwell suggested that custom awaiters may help coordinate throttling across features, but we believe and hope there are simpler options to consider first.
  • Bertan suggests that activating ETW from a VS extension may be fairly cheap. Andrew has offered to write the extension to monitor threadpool latency and may add @BertanAygun’s code if it proves reasonably inexpensive to do so as that would collect more data that we expect to need later.
  • We identified several hypotheses and ways we can validate. We believe we should take them one at a time, in this order, to invest in increasingly expensive work as it is warranted by prior steps.

| Hypothesis | Validation |
|--|--|
| The threadpool queue sometimes grows too long for reasonable latency | Author a VS extension that monitors and reports on threadpool latency |
| High threadpool latency leads to UI delays | Correlate threadpool latency data with UI delays |
| Throttling threadpool queuing for individual features will reduce UI delays | Collect traces of threadpool usage in high latency conditions to identify contributing code and inspect for opportunities to reduce queue flooding. Then re-measure. |

Next steps

  • @aarnott will author a VS extension that we will circulate internally to validate the first hypothesis.
  • @BertanAygun will supply Andrew with ETW activating code to include in that extension.
  • Roslyn team will review their code to find unbounded threadpool queuing, especially in code that is not about to move to their OOP service, and apply throttling to it. This includes Dataflow, Parallel.For, list.Select(Task.Run), etc.
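For illustration (not from the meeting notes), the unbounded list.Select(Task.Run) shape named in the last bullet, next to one bounded alternative using ParallelOptions.MaxDegreeOfParallelism; the names here are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class Example
{
    // Unbounded fan-out: one threadpool work item per element, all queued at once.
    static Task ProcessAllUnbounded(IReadOnlyList<string> items, Action<string> process)
        => Task.WhenAll(items.Select(item => Task.Run(() => process(item))));

    // Bounded: Parallel.ForEach with an explicit MaxDegreeOfParallelism caps how
    // many work items execute concurrently, so the shared queue stays short.
    static void ProcessAllBounded(IReadOnlyList<string> items, Action<string> process)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
        Parallel.ForEach(items, options, process);
    }
}
```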