SqlClient: System.Data.SqlClient 4.7.0 tends to deadlock on async

Created on 13 Oct 2019 · 40 comments · Source: dotnet/SqlClient

I am afraid I cannot provide a clean reproduction, but I want to share my experience so that you can decide whether this is an issue or not.

It turns out that with the bump to 4.7.0 (from 4.6.1), our CI build deadlocks on almost every run during unit tests. There is no problem on development environments or in production.

The setup: an xUnit test suite starts a .NET Core 2.1 service host as a shared fixture to run some integration tests. Part of the application stack is ASP.NET Core Identity. The whole environment is started via docker-compose so that the dependent SQL Server is also available. A fairly complete test database seeding is done during test initialization. Build agents are cheap Azure VMs of B2s size (2 vCPUs, 4 GB RAM) running Ubuntu 18.04 with Docker.

Since I was unable to dump/debug the locked process on the VM inside Docker, I spent some time reproducing the issue locally (i7-8850H, Arch Linux, JetBrains Rider). I finally managed it by running `stress --cpu 12 --io 10 --vm 5 --hdd 1`, starting the SQL Server process inside Docker and limiting it via `taskset -p 0x01 nnnn` to one CPU, and starting the integration test restricted to the same CPU directly via `taskset 0x01 dotnet test`.

Attaching a debugger to the locked process reveals the following stack trace (besides various others waiting on reset events, like the xUnit runner, test diagnostics, and Application Insights):

(screenshot: debugger stack trace)

the last frame on the stack resulting from our application code is this line:

await userManager.GetAuthenticationTokenAsync(user, TokenManager.TokenProviderName, "refreshToken"); 

Note that there is not a single `Task.Wait()` or `Task.Result` in the whole code base; everything is "async all the way down". After suspecting the advanced parallelization features of xUnit as part of the problem, I disabled all test parallelization and lifted all restrictions on allowed threads, but without any effect. The SQL Server is only hit by one open connection, state "await command".
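By "async all the way down" I mean that nothing in the call chain blocks on a task. A minimal sketch contrasting the two styles; the class and method names here are hypothetical, not taken from the code base in question:

```csharp
using System.Threading.Tasks;

// Hypothetical service, for illustration only.
class TokenService
{
    // Sync-over-async: blocks the calling thread until the task finishes.
    // Under a context that resumes continuations on the blocked thread,
    // or under thread-pool starvation, this is the classic deadlock shape.
    public string GetTokenBlocking() => GetTokenAsync().Result;

    // "Async all the way down": the thread is released at every await.
    public async Task<string> GetTokenAsync()
    {
        await Task.Yield(); // stand-in for the real database round trip
        return "refreshToken";
    }
}
```

The point of the report is that only the second shape exists in the code base, which rules out the most common application-level cause of async deadlocks.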

Reverting the solution to System.Data.SqlClient 4.6.1 made all problems go away; there has not been a single locking build since then, so I am pretty sure it's related.

Note: The solution is on a private GitHub Repo, but I'd be able to share an export or more details, when needed.

Bug!

Most helpful comment

Was experiencing random deadlock of various EF queries on a ASP.NET core 3.1 project when deployed to a docker container with mcr.microsoft.com/dotnet/core/aspnet:3.1 as base. Upgrading to Microsoft.Data.SqlClient 1.1.1 appears to have resolved the issue.

All 40 comments

Unless you're relying on MARS, can you try turning it off? It significantly increases internal complexity in the managed implementation, which is what's used on Unix.
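For reference, MARS is toggled from the connection string. A sketch of disabling it; the server, database, and authentication values are placeholders:

```csharp
using Microsoft.Data.SqlClient; // or System.Data.SqlClient

// Placeholder connection details; only MultipleActiveResultSets matters here.
var builder = new SqlConnectionStringBuilder
{
    DataSource = "localhost",
    InitialCatalog = "TestDb",
    IntegratedSecurity = true,
    MultipleActiveResultSets = false // disable MARS, bypassing the managed MARS path
};
using var connection = new SqlConnection(builder.ConnectionString);
```

The equivalent raw connection-string fragment is `MultipleActiveResultSets=False`.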

@Wraith2 would disabling MARS be useful to find a possible bug and to develop a fix? Asking because I am happy with version 4.6.1 for now, and reproducing the issue is hard and would take some time.

I think the application is using MARS, but it could be changed. AFAIR there were constructs like this:

foreach (var entity in dbContext.Set<T>().Where(/* predicate */)) // <-- no "ToArray()" or "ToList()"!
{
    // do something with entity, resulting in another db hit
}
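A construct like the one above keeps the outer data reader open while the loop body issues further commands, which is exactly what MARS enables on a single connection. A sketch of the MARS-free alternative, assuming EF Core and a hypothetical `ProcessAsync` helper:

```csharp
// Materialize the outer query first so its reader is closed before any
// further commands run; the loop then needs only one active result set.
var entities = await dbContext.Set<TEntity>()
    .Where(e => e.IsActive) // placeholder predicate
    .ToListAsync();

foreach (var entity in entities)
{
    // Subsequent database hits no longer overlap with an open reader.
    await ProcessAsync(dbContext, entity);
}
```

The trade-off is that the whole result set is buffered in memory up front, which is usually acceptable for the page-sized queries typical of this pattern.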

Ideally I'd like to fix the issue. MARS is just a suggestion; if you need that functionality then clearly you can't just give it a try, because it'll require code rework. Without a stable repro it'll be hard to investigate, so if you're happy with another version it's probably worth staying with it.

@marcwittke

This sounds like a regression introduced in 4.7.0. Could you also try testing with Microsoft.Data.SqlClient (latest version) and let us know whether this driver is also impacted? That would tell us whether the PR that caused this was ported over and merged here, or whether it only impacts System.Data.SqlClient.

We'd need to dig through the history to find the buggy PR, but also to understand the impacted areas.

@Wraith2, @cheenamalhotra
I'm experiencing a similar deadlock with a .NET Core 3.0 app (Microsoft.Data.SqlClient version 1.0.19269.1) running in a Linux container with MARS enabled. Please see the original issue: https://github.com/aspnet/EntityFrameworkCore/issues/18480
I posted the link to a small demo project to reproduce the issue.
The stack trace for the deadlocked thread is identical to what @marcwittke had posted earlier:

        "Microsoft.Data.SqlClient.SNI.SNIMarsConnection.SendAsync",
        "Microsoft.Data.SqlClient.SNI.SNIMarsHandle.InternalSendAsync",
        "Microsoft.Data.SqlClient.SNI.SNIMarsHandle.SendPendingPackets",
        "Microsoft.Data.SqlClient.SNI.SNIMarsHandle.SendAsync",
        "Microsoft.Data.SqlClient.SNI.SNIProxy.WritePacket",
        "Microsoft.Data.SqlClient.TdsParserStateObject.SNIWritePacket",
        "Microsoft.Data.SqlClient.TdsParserStateObject.WriteSni",
        "Microsoft.Data.SqlClient.TdsParserStateObject.WritePacket",
        "Microsoft.Data.SqlClient.TdsParserStateObject.WriteBytes",
        "Microsoft.Data.SqlClient.TdsParser.WriteString",
        "Microsoft.Data.SqlClient.TdsParser.WriteUnterminatedValue",
        "Microsoft.Data.SqlClient.TdsParser.TdsExecuteRPC",
        "Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds",
        "Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader",
        "Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQuery",
        "Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryInternal",
        "Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryAsync",
        "System.Threading.Tasks.TaskFactory<System.Int32>.FromAsyncImpl",
        "Microsoft.Data.SqlClient.SqlCommand.ExecuteNonQueryAsync",
        "Microsoft.EntityFrameworkCore.Storage.RelationalCommand+<ExecuteNonQueryAsync>d__13.MoveNext",
        "System.Runtime.CompilerServices.AsyncMethodBuilderCore.Relational    ]]",
        "Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteNonQueryAsync",
        "Microsoft.EntityFrameworkCore.RelationalDatabaseFacadeExtensions+<ExecuteSqlCommandAsync>d__13.MoveNext",
        "System.Runtime.CompilerServices.AsyncMethodBuilderCore.Relational]]",
        "Microsoft.EntityFrameworkCore.RelationalDatabaseFacadeExtensions.ExecuteSqlCommandAsync",
        "FlexLabs.EntityFrameworkCore.Upsert.Runners.RelationalUpsertCommandRunner+<RunAsync>d__20<System.__Canon>.MoveNext",
        "System.Runtime.CompilerServices.AsyncMethodBuilderCore.Upsert]]",
        "FlexLabs.EntityFrameworkCore.Upsert.Runners.RelationalUpsertCommandRunner.CoreLib]]",
        "FlexLabs.EntityFrameworkCore.Upsert.Runners.RelationalUpsertCommandRunner.IL_STUB_InstantiatingStub",
        "FlexLabs.EntityFrameworkCore.Upsert.UpsertCommandBuilder<System.__Canon>.RunAsync",
        "Data.Live.Persistence.MatchDataStore+<PersistData>d__7.MoveNext",
        "System.Threading.ExecutionContext.RunInternal",
        "System.Runtime.CompilerServices.AsyncTaskMethodBuilder+AsyncStateMachineBox<System.Threading.Tasks.VoidTaskResult,Data.Live.Persistence.MatchDataStore+<PersistData>d__7>.MoveNext",
        "System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction",
        "System.Threading.Tasks.Task.RunContinuations",
        "System.Threading.Tasks.Task<System.Int32>.TrySetResult",
        "FlexLabs.EntityFrameworkCore.Upsert.Runners.RelationalUpsertCommandRunner+<RunAsync>d__20<System.__Canon>.MoveNext",
        "System.Threading.ExecutionContext.RunInternal",
        "System.Runtime.CompilerServices.AsyncTaskMethodBuilder+AsyncStateMachineBox<System.Int32,FlexLabs.EntityFrameworkCore.Upsert.Runners.RelationalUpsertCommandRunner+<RunAsync>d__20<System.__Canon>>.MoveNext",
        "System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction",
        "System.Threading.Tasks.Task.RunContinuations",
        "System.Threading.Tasks.Task<System.Int32>.TrySetResult",
        "Microsoft.EntityFrameworkCore.RelationalDatabaseFacadeExtensions+<ExecuteSqlCommandAsync>d__13.MoveNext",
        "System.Threading.ExecutionContext.RunInternal",
        "System.Runtime.CompilerServices.AsyncTaskMethodBuilder+AsyncStateMachineBox<System.Int32,Microsoft.EntityFrameworkCore.RelationalDatabaseFacadeExtensions+<ExecuteSqlCommandAsync>d__13>.MoveNext",
        "System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction",
        "System.Threading.Tasks.Task.RunContinuations",
        "System.Threading.Tasks.Task<System.Int32>.TrySetResult",
        "Microsoft.EntityFrameworkCore.Storage.RelationalCommand+<ExecuteNonQueryAsync>d__13.MoveNext",
        "System.Threading.ExecutionContext.RunInternal",
        "System.Runtime.CompilerServices.AsyncTaskMethodBuilder+AsyncStateMachineBox<System.Int32,Microsoft.EntityFrameworkCore.Storage.RelationalCommand+<ExecuteNonQueryAsync>d__13>.MoveNext",
        "System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction",
        "System.Threading.Tasks.Task.RunContinuations",
        "System.Threading.Tasks.Task<System.Int32>.TrySetResult",
        "System.Threading.Tasks.UnwrapPromise<System.Int32>.TrySetFromTask",
        "System.Threading.Tasks.UnwrapPromise<System.Int32>.ProcessInnerTask",
        "System.Threading.Tasks.Task.RunContinuations",
        "System.Threading.Tasks.Task.FinishSlow",
        "System.Threading.Tasks.Task.ExecuteWithThreadLocal",
        "System.Threading.ThreadPoolWorkQueue.Dispatch",
        "DebuggerU2MCatchHandlerFrame"

This gave us headaches the whole last night...
As soon as our ASP.NET Core 3 project ended up having some load some connections got stuck...
Occasionally we got

```
Microsoft.Data.SqlClient.SqlError: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__164_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Update.ReaderModificationCommandBatch.ExecuteAsync(IRelationalConnection connection, CancellationToken cancellationToken)
```

At some point Kestrel stops handling any connections at all, and the database together with the app stays locked until the whole application is restarted.

After disabling MARS everything is working fine again!

EDIT: We also didn't have any issue when changing the connection string to target a SQL Server 2017 (Developer) instance instead of SQL Server 2016 (Standard). Maybe it is SQL Server version related.

Hi @alex-zyl

I was able to reproduce the issue using your repro provided here: https://github.com/alex-zyl/EFCore3OnLinuxIssue on Azure Database.

But in order to find the PR that caused this between 4.6.1 and 4.7.0, I'd need a repro using EF Core 2.2.6 and System.Data.SqlClient 4.7.0 if possible, as the issue was introduced somewhere in between. Would you be able to provide one?

@cheenamalhotra
Glad to hear it helped.
I'll downgrade the demo project to 2.2.6 and try to repro the issue using 4.7.0. Will keep you updated.

@cheenamalhotra
I've added another project with EF Core 2.2.6 and SqlClient 4.7.0. I was able to reproduce the issue almost immediately after the container started.
Please pull the changes from my repository.

Hi @alex-zyl

It looks like the problem starts from the "4.7.0-preview.19073.11" version, before which everything seems to be fine. The culprit PR for this issue is likely https://github.com/dotnet/corefx/pull/34184.

I'd like a second opinion on this one, to confirm investigation, so please test it for me at your end too. The NuGet package with reverted changes is available here:

Microsoft.Data.SqlClient.1.2.0-dev-debug.nupkg.zip
(Contains reverted changes for .NET Core)

Use steps from PR https://github.com/alex-zyl/EFCore3OnLinuxIssue/pull/1 to verify fix with this NuGet package.

Thanks!

That PR was reviewed by a couple of people very experienced with async programming, and it removes a use of async void, which is definitely a bug. I don't think reverting it is the best course of action; it's fine for proving it's the source of the error, but not long term.

Is it easy to reproduce using the example project? I'm a bit short on time at the moment, but if it is, I can take a look into it and see if I have any ideas.

Hi Wraith,

Yes, it's easy to reproduce the issue with the 'master' branch. Once reproduced, you can build your own version of the Microsoft.Data.SqlClient NuGet package with the assembly file version set to the same one in use with EF Core, and use it with the PR 1 changes.

I don't intend to revert the entire PR; I would prefer to look for and fix the bug in there. Please feel free to provide any ideas/suggestions that I can start with tomorrow.

@cheenamalhotra
Thank you for your PR.
I built a new image and gave it some time to run (it's been running for more than an hour already). No deadlocks so far.

@stephentoub @benaadams @saurabh500

Would like to grab your attention to identify issue in PR https://github.com/dotnet/corefx/pull/34184 that causes deadlocks in Async Operations.

We have repros available, and reverting the methods ReadFromStreamAsync and WriteToStreamAsync to their old design does not cause deadlocks. Any ideas/directions for fixing the async design properly would be highly appreciated!

There is a fix for one of the paths in https://github.com/dotnet/SqlClient/pull/295; however, that might need to be rolled out to the other methods changed in that PR dotnet/corefx#34184?

Also, as an aside, the streams in SqlClient don't look to override the newer async methods on Stream which these methods use (like SslOverTdsStream.cs / SqlSequentialStream.cs):

virtual ValueTask<int> ReadAsync (Memory<byte> buffer, System.Threading.CancellationToken cancellationToken = default);
virtual ValueTask WriteAsync (ReadOnlyMemory<byte> buffer, System.Threading.CancellationToken cancellationToken = default);

Also SqlStream/SqlCachedStream don't override any async methods.

Not sure how important these are in the scheme of things though and what they are used for.
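To illustrate what overriding the Memory-based overloads looks like, here is a generic delegating stream; a sketch, not the actual SqlClient streams. Without such overrides, the base Stream implementation falls back to renting an array and wrapping the older Task-returning methods:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// A delegating stream that overrides the Memory<byte>-based async methods,
// forwarding them directly instead of going through the array-based fallback.
class ForwardingStream : Stream
{
    private readonly Stream _inner;
    public ForwardingStream(Stream inner) => _inner = inner;

    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
        => _inner.ReadAsync(buffer, cancellationToken);

    public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken = default)
        => _inner.WriteAsync(buffer, cancellationToken);

    // Remaining abstract members delegate straight through.
    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => _inner.CanSeek;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => _inner.Length;
    public override long Position { get => _inner.Position; set => _inner.Position = value; }
    public override void Flush() => _inner.Flush();
    public override int Read(byte[] buffer, int offset, int count) => _inner.Read(buffer, offset, count);
    public override long Seek(long offset, SeekOrigin origin) => _inner.Seek(offset, origin);
    public override void SetLength(long value) => _inner.SetLength(value);
    public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
}
```

Whether the missing overrides matter for correctness here is an open question in the thread; at minimum they matter for allocation behaviour.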

The ssl stream passthrough is on my list of perf things to look into when I get chance.

There is a fix for one of the paths in #295; however that might need to be rolled out to the other changed methods in that PR dotnet/corefx#34184 ?

Tried that, doesn't seem to fix the issue.

SslOverTdsStream / SqlSequentialStream / SqlStream / SqlCachedStream not overriding async methods that return ValueTask.

These could be related, as the change PR https://github.com/dotnet/corefx/pull/34184 brings also replaces Task with ValueTask (introduced in 2.1) in the async calls. I also tried reverting that to match how it's implemented in SNIPacket.NetStandard.cs, but with no luck.

_I have a doubt:_ is it just not possible to use Task<int> here in .NET Core instead of ValueTask<int>? When doing so, my repro app simply cannot connect in a Linux container and fails with the error: "Server provided routing information, but timeout already expired." Reverting back to ValueTask<int> brings back the current deadlock behaviour.

I'm wondering if SNIPacket.NetStandard.cs is working fine in .NET Standard applications, unless I'm missing something.

I can't recreate it with the current build and EFCore3Demo when running via Visual Studio (not via Docker).

The output is as follows:

[03:57:40 DBG] Updating data for 2019-0466-LS009
[03:57:41 DBG] Updating data for 2019-1017-RS020
[03:57:41 DBG] Updating data for 2019-0703-LS007
[03:57:41 DBG] Updating data for 2019-0466-LS009
[03:57:41 DBG] Updating data for 2019-1017-RS020
[03:57:42 DBG] Updating data for 2019-0703-LS007
[03:57:42 DBG] Updating data for 2019-0466-LS009
[03:57:42 DBG] Updating data for 2019-1017-RS020
[03:57:43 DBG] Updating data for 2019-0703-LS007
[03:57:43 DBG] Updating data for 2019-0466-LS009
[03:57:43 DBG] Updating data for 2019-1017-RS020
[03:57:43 DBG] Updating data for 2019-0703-LS007
[03:57:44 DBG] Updating data for 2019-0466-LS009
[03:57:44 DBG] Updating data for 2019-1017-RS020
[03:57:44 DBG] Updating data for 2019-0703-LS007
[03:57:44 DBG] Updating data for 2019-0466-LS009
[03:57:45 DBG] Updating data for 2019-1017-RS020
[03:57:45 DBG] Updating data for 2019-0703-LS007
[03:57:45 DBG] Updating data for 2019-0466-LS009
[03:57:46 DBG] Updating data for 2019-1017-RS020
[03:57:46 DBG] Updating data for 2019-0703-LS007
[03:57:46 DBG] Updating data for 2019-0466-LS009
[03:57:46 DBG] Updating data for 2019-1017-RS020
[03:57:47 DBG] Updating data for 2019-0703-LS007
[03:57:47 DBG] Updating data for 2019-0466-LS009
[03:57:47 DBG] Updating data for 2019-1017-RS020
[03:57:47 DBG] Updating data for 2019-0703-LS007
[03:57:48 DBG] Updating data for 2019-0466-LS009

Using ;MultipleActiveResultSets=True; in the connection string.

Is there another configuration I need to do?

Note this is via Windows; maybe Linux is different?

Ah, the fix in #295 would only affect .NET 5 with state machine reuse, so it's unlikely to be this.

Hi @benaadams ,

Thanks for taking a look! The code modified in the PR belongs to managed SNI, applicable only to Unix machines, so the issue won't reproduce on a Windows client.

@benaadams to get managed SNI running on Windows, you can either set the environment variable (if it's a debug build) or use private reflection to set the property.

https://github.com/dotnet/SqlClient/blob/60fd1da55e296c8b938217be86c66e8b1c462317/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParserStateObjectFactory.Windows.cs#L23-L30

Thanks @cheenamalhotra, @Wraith2, I'll try that. I was having trouble getting it to override the dll when going via WSL 2 and Docker.

Hi @benaadams

If you're trying to replace the SqlClient dll with your changes, you can take a look at alex-zyl/EFCore3OnLinuxIssue#1, where I've listed the steps to do so.

For building the driver, you may use `msbuild /p:AssemblyFileVersion=1.0.19249.1` to replace the EF Core tooling's referenced SqlClient dll.

Once you have the NuGet package available in the SqlClient repo, place it in the packages folder of the repro as I documented in the linked PR. The repro will then use your built SqlClient NuGet package.

Still can't recreate the issue (perhaps my latency is too long, or I'm not experiencing connection issues); however, I have added some tweaks to the various paths in https://github.com/dotnet/SqlClient/pull/335

So if someone who can recreate the issue could try that, that would be great.

Hi @benaadams

I tested with your branch but still see the issue being reproduced.
I'm wondering if more changes are needed, because I see your changes are targeted towards splitting classes into .NetCoreApp (ValueTask) and .NetStandard (Task).

However, there are still a few instances of Task in the main classes that also apply to NetCoreApp, e.g. SNITCPHandle.cs and SSRP.cs.

Is the intent to use either of the both overall or only ValueTask for NetCore?

Is the intent to use either of the both overall or only ValueTask for NetCore?

Only ValueTask for NetCore, as it has the ValueTask and Memory/Span overloads in the BCL and further enhancements for the type, whereas netstandard/netfx doesn't.

Also some of the other methods do more complicated things with the Task that ValueTask doesn't support

Looking at the stack traces, they seem to be stuck on

Microsoft.Data.SqlClient.SNI.SNIMarsConnection.SendAsync

and looking at the method, that would suggest they were stuck waiting for the lock?
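For illustration of the hazard being discussed: a monitor-style lock on the send path is only safe if it is never held while completion depends on another thread that needs the same lock. A minimal sketch of the async-friendly alternative, where the serialization primitive itself can be awaited; this is not the actual SNI code, and the class name is hypothetical:

```csharp
using System.Threading;
using System.Threading.Tasks;

class PacketSender
{
    // SemaphoreSlim supports asynchronous waiting, so concurrent senders can
    // queue without blocking thread-pool threads the way a Monitor-based
    // lock would while a pending write completes on another thread.
    private readonly SemaphoreSlim _mutex = new SemaphoreSlim(1, 1);
    public int PacketsSent { get; private set; }

    public async Task SendAsync()
    {
        await _mutex.WaitAsync();
        try
        {
            await Task.Yield(); // stand-in for the actual network write
            PacketsSent++;      // serialized by the semaphore
        }
        finally
        {
            _mutex.Release();
        }
    }
}
```

Awaiting, say, eight concurrent `SendAsync` calls with `Task.WhenAll` completes with `PacketsSent == 8`, with no thread ever parked inside the critical section.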

Dependabot just tried a bump to System.Data.SqlClient 4.8.0. Result on the first build:

(screenshot: CI build output)

Deadlocked :disappointed: well, somewhat expected, since no fix has shipped so far.

Microsoft.Data.SqlClient will contain the fix starting with the next preview release, _v2.0.0-preview1_, planned for mid-January, while for System.Data.SqlClient we've backported the fix and it should hopefully be available in upcoming releases.

Since this is still occurring, is there any chance of getting a memory dump of the deadlocked process? Sysinternals procdump could get one. It's a long shot and difficult, but someone might be able to identify the waits that are stuck.

What would be the Linux equivalent? On the Ubuntu build agent it locks quite predictably. I could give it a try.

https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-dump, I think. I've not used it, and I'm not sure the output would be compatible with WinDbg, which is what I'd use to look at it.

Is there any workaround to fix this issue? We are having trouble in production.

You can try one of the recent 2.0.0 builds, which contain this PR https://github.com/dotnet/SqlClient/pull/349, and see if it resolves the issue.

you can try one of the recent 2.0.0 builds which contain this PR #349 and see if it resolves the issue.

Where can I find the artifacts? They're not available on NuGet. I think I should compile it myself.

If this is an issue in 4.7.0; presumably going back to 4.6.1 should solve the issue as a workaround?

If this is an issue in 4.7.0; presumably going back to 4.6.1 should solve the issue as a workaround?

I tried downgrading from Microsoft.Data.SqlClient 1.1.0 to System.Data.SqlClient 4.6.1; the deadlocks seem to be resolved.

where I can find artifacts? Not available in nuget. I think i should compile it myself.

@oguzhaneren

Currently we don't trigger nightly builds; you can build it yourself using BUILDGUIDE.md until the targeted release milestone completes and the release is published.

Was experiencing random deadlock of various EF queries on a ASP.NET core 3.1 project when deployed to a docker container with mcr.microsoft.com/dotnet/core/aspnet:3.1 as base. Upgrading to Microsoft.Data.SqlClient 1.1.1 appears to have resolved the issue.
