When describing tables, there are a few levels of locks which force single-threaded operation
https://github.com/aws/aws-sdk-net/blob/b691e46e57a3e24477e6a5fa2e849da44db7002f/sdk/src/Core/Amazon.Runtime/Internal/Util/SdkCache.cs#L421 https://github.com/aws/aws-sdk-net/blob/b691e46e57a3e24477e6a5fa2e849da44db7002f/sdk/src/Services/DynamoDBv2/Custom/DataModel/ContextInternal.cs#L125 https://github.com/aws/aws-sdk-net/blob/b691e46e57a3e24477e6a5fa2e849da44db7002f/sdk/src/Services/DynamoDBv2/Custom/DataModel/InternalModel.cs#L608
And inside those locks we synchronously call an async method
https://github.com/aws/aws-sdk-net/blob/189abfb1d51c5300e68c741c536f4144d5f2c4fd/sdk/src/Core/Amazon.Runtime/Pipeline/HttpHandler/_mobile/HttpRequestMessageFactory.cs#L499
This is a problem as it creates a new task which may not be able to be run by the ThreadPool for quite some time. Also, because the locks in these three places do not pre-check their condition before locking - they make the problem worse by consuming more threads than necessary.
We run 100 requests, and after the initial describe table call, the cache is loaded, and the program quickly completes the 100 requests
Cold start - no caches are filled.
We run 100 dynamodb LoadAsync requests, creating 100 tasks.
The threadpool begins processing the tasks, starting with threads equal to the number of cores we have to work with, say, 4.
The first thread gets through the locks and begins a synchronous call of an async http method, creating our 101th task.
Deadlock. All 4 threads are either stuck at a lock, or waiting for the 101st task to continue
The threadpool adds one more thread to the pool each second to help handle load.
97 (101-4) seconds go by, and the 101st thread picks up the http response task, resolving the deadlock
All the threads enter the lock, use the cache and complete shortly after.
To prevent build-up on the locks (and greatly improve performance) all three locks above should be using the double-checked locking pattern to remove unnecessary calls to Monitor.Enter.
https://en.wikipedia.org/wiki/Double-checked_locking#Usage_in_C#
Table.DescribeTable should be made asynchronous, calling DDBClient.DescribeTableAsync. Locks should be replaced with SemaphoreSlims as they are awaitable, and async should be brought all the way up above the locking code, to prevent blocking within locked sections: https://stackoverflow.com/questions/7612602/why-cant-i-use-the-await-operator-within-the-body-of-a-lock-statement/7612714#7612714
[DynamoDBTable("dynamo_table")]
class Entry
{
[DynamoDBHashKey]
public int id { get; set; }
}
static void Main(string[] args)
{
var context = new DynamoDBContext(new AmazonDynamoDBClient());
var tasks = new List<Task<Entry>>();
for (var i = 0; i < 30; i++)
tasks.Add(Task.Run(() => context.LoadAsync<Entry>(0)));
Console.WriteLine("Waiting for tasks to complete");
try
{
Task.WaitAll(tasks.ToArray());
} catch (Exception){}
}
The above snippet is overly simple and doesn't show what is happening with the threadpools like this reproduction solution does:
https://drive.google.com/open?id=1UXItp4aRlGLBMjbuYm94WU_2TUoT4dKf
This issue caused a production incident because the ECS container locked up under initial load and couldn't reply to health-checks - causing the ELB to kill the container and repeat.
Sorry we haven't responded to this issue yet. Due to re:Invent we have been doing a bit of catchup. Thanks so much for going deep on this issue and providing so much detail. I agree with your conclusion and proposed solutions. Let me discuss with the rest of the team to figure out how we can this work done.
I have been trying to remove the sync call to DescribeTable for the last couple days which is not going well. The problem is this high level abstraction was first written when when we only supported synchronous methods in the SDK and it was designed with that knowledge. The problem is so many of the synchronous methods can trigger a lazy load to DescribeTable. I'm having to create a lot of duplicate async methods but I can't remove the sync methods as that would be a breaking change. This is causing a lot of code duplication, which is extra complicated due to the AWSSDK.DynamoDBv2 still supporting .NET 3.5 which doesn't know about Tasks, and you would have to know to opt into the new methods that you didn't realize could trigger a call to DescribeTable for example FromDocuments.
I'm wondering if instead of trying to put the band aid here if we should bite the bullet and be a bit more radical. We have been debating for awhile about pulling the Document and DataModel abstractions in their own separate repo and NuGet package. In the process of doing that we could do some more breaking changes to get the behavior correct from the start. For example in the new version I would like to drop .NET 3.5 Framework and support async only to so we can avoid getting these types of threads. Once we have the new package I think we would mark the Amazon.DynamoDBv2.DocumentModel.Table and Amazon.DynamoDBv2.DataModel.Context in AWSSDK.DynamoDBv2 as obsolete and whenever we do a major version 4 of the SDK we would drop them from the package.
Moving these abstractions out of the SDK repo should also make working on these abstractions much easier then the SDK repo which is very large and is hard to make large changes as it has to ship basically every day.
To alleviate this issue would a helper method added to the Context to preload the cache for all of the types help? You could call it a the start of the application and then the sync DescribeTable call should not be called when the app is in full stress mode.
I would really appreciate the community thoughts here. What does everybody prefer, continuing on the path of patching the existing version or start spinning up a project to create new version of these abstractions in a new repo.
@costleya
I was conflicted about which approach I would personally go with until I broke things down a bit.
Regardless of whether we try to duplicate methods to allow async within the same assembly, or we pull the base models out and create a newer, modern dynamoDb package - the required .net 3.5 compatibility will require two implementations of similar code 'somewhere'. This complicates support for the dynamo package - but this is simply not possible to avoid in order to support async operation correctly, while also supporting non async.
Having async and legacy methods in the same assembly creates technical debt for when the old non-async code is eventually made obsolete. The only benefit is that required changes to the library can have both changes made in the same project.
The band-aid approach also bugs me in the same way you describe - it won't assist users in upgrading their implementation as their existing code will still continue to work, and they won't know to use the async versions. By contrast, a new async-aware package would force a major code update on how people are using dynamoDb in their solutions
In the short term, double-checked locking will help to alleviate build-up on the locks, and a command to fill those caches will fix the issue for us. Just uuh - please don't make it cache the models in parallel 馃槄
Until an inbuilt workaround is in, for those who come across this thread and just want a quick win, use
ThreadPool.SetMinThreads
For us we are working around this issue by starting with 32 worker threads which gets us through the initial few requests enough for the cache to fill. This is fine for our application but may not work for all applications, so, use due diligence.
I've encountered this issue in my .NET lambdas. Occassionally (~2-4) times a month, the lambda locks on the first call to a DynamoDB table. In my case, I'm not using the DynamoDBContext. I'm using the DynamoDBClient. After adding detailed logging, we've found that the lock is invariably on the first call using the DynamoDBClient.
Most helpful comment
Sorry we haven't responded to this issue yet. Due to re:Invent we have been doing a bit of catchup. Thanks so much for going deep on this issue and providing so much detail. I agree with your conclusion and proposed solutions. Let me discuss with the rest of the team to figure out how we can this work done.