SqlClient: Async reader may return wrong results on macOS, Linux, WSL, Docker

Created on 21 Jul 2020 · 16 comments · Source: dotnet/SqlClient

Describe the bug

Under high load, a query may sometimes return wrong results on non-Windows clients.

To reproduce

  • .NET Core
  • MSSQL instance
  • Client is not running on Windows (reproduced on macOS or Linux)
  • Query uses async reader methods
  • High load that saturates the connection pool and causes timeouts.

Bug is seen more often when MultipleActiveResultSets=True (MARS) is in the connection string.

The issue is rare and hard to reproduce, so this project was created with code that attempts to simulate the situation.

This program starts 2000 concurrent connections, each running a select @Id as Id
statement. Each Id is different, and the query should always return the
id that was queried, yet under certain circumstances that is not the case. It looks like
the reader sometimes returns stale data from a random previous connection that experienced a timeout.
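
For illustration, the per-connection check is roughly the following (a sketch only; method and variable names are assumptions, not the repro project's actual code):

    // Requires: using System; using System.Threading.Tasks; using Microsoft.Data.SqlClient;
    // Sketch of the kind of verification each concurrent task performs.
    static async Task QueryAndVerify(string connectionString, int id)
    {
        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        using var command = new SqlCommand("select @Id as Id", connection);
        command.Parameters.AddWithValue("@Id", id);

        using var reader = await command.ExecuteReaderAsync();
        while (await reader.ReadAsync())
        {
            var returned = reader.GetInt32(0);
            if (returned != id)
            {
                // The bug: the value read back belongs to an earlier, timed-out
                // connection instead of echoing @Id.
                Console.WriteLine($"Wrong result: expected {id}, got {returned}");
            }
        }
    }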

There are more details in the linked project.

Network latency can be simulated more easily using the approach here: https://github.com/trainline/SqlClient659

Expected behavior

Under no circumstances should a select statement return a different result.
___This can cause application crashes, personal data leaks, users purchasing products on behalf of other users, and security breaches.___

Further technical details

Microsoft.Data.SqlClient version: 4.8.1 or earlier
.NET target: netcoreapp2.0 - netcoreapp3.1
SQL Server version: SQL Server Web Edition 13.00.5426.0.v1, SQL Server Express Edition 14.00.3281.6.v1
Operating system: macOS 10.15.5, Docker container

Workaround

Do not use this SQL client on Linux/Docker.

Always check the id of the returned result and retry or crash if it is wrong.

To decrease the likelihood of this, do not use the MARS feature: if you are using .NET Core with an MSSQL database, ensure that the connection string does not contain MultipleActiveResultSets=True.
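
For example, a connection string with MARS left disabled could look like this (values are placeholders, not taken from the repro):

    // Illustrative only: MARS is off by default; omitting the keyword from the
    // connection string (or setting it to False explicitly) avoids the MARS code path.
    var connectionString =
        "Server=tcp:example-server,1433;Database=ExampleDb;User ID=app;Password=<secret>;" +
        "MultipleActiveResultSets=False;";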

Labels: Bug!, Managed SNI


All 16 comments

@Nercury thanks for reaching out. Issue #422 is similar to the problem you have and is under investigation.

Hi @Nercury,

Thank you for producing the sample app and describing the issue well.
I ran your sample app multiple times on a virtual machine running Ubuntu 18.0, .NET Core 3.1, and SQL Server 2019 with MARS enabled and a connection pool size of 100. I didn't get any wrong results as you described.

Do you have any comments?

> I didn't get any wrong results as you described.
>
> Do you have any comments?

Yes, this matches my experience. As I mentioned, the issue seems to be related to timeouts, and those cannot happen if the DB server and the code are on the same machine and the query is doing basically nothing.

Try to introduce some latency between the db server and the application.

I reproduced it using a somewhat crude approach: running the DB server in the cloud and the application on my laptop, with a router in another room (so there is considerable latency between the two). Maybe there are tools to do it more cleanly.

In the production environment, AWS starts throttling read operations when they exceed the allowed IOPS limit; that is how it originally happened when running on AWS.

By replacing the data source with a slower connection over the network, I am able to reproduce the wrong result.
I checked the code and added a local variable inside the QueryAndVerify method to hold the index parameter value, and used this local variable instead of the parameter inside the function body; this change solved the issue. Can I ask you to give it a try and let me know?
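
The described change would be roughly the following (a sketch only, not the actual changeset):

    // Hypothetical sketch: keep the parameter's value in a local variable and use
    // only the local inside the method body, to rule out the parameter being
    // observed with a stale value.
    static async Task QueryAndVerify(string connectionString, int index)
    {
        var localIndex = index;

        // ... open the connection, run "select @Id as Id" with localIndex,
        // and compare the returned value against localIndex ...
        await Task.CompletedTask; // placeholder for the body sketched above
    }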

Thanks for checking it out. I am not sure I completely understand what you mean by "added a local variable inside the QueryAndVerify method to hold the index parameter value, and used this local variable instead of the parameter inside the function body", because the int index is already local to the QueryAndVerify function. Can you provide a changeset (in the form of a PR) so that I am sure I am doing the same thing?

While I was preparing the requested changes, I found that it is happening again. The network conditions may have coincidentally affected my test with the last change.
The issue happens intermittently with both System.Data.SqlClient and Microsoft.Data.SqlClient when ExecuteReaderAsync is mixed with exceptions.

During the investigation, I found these mitigations in addition to the workaround already mentioned (disabling MARS):

  • Set Pooling = false
  • Removing a connection that encountered an exception from the connection pool: SqlConnection.ClearPool(cn)
  • Attempting to execute a SQL command without attaching a transaction to the command (remove the BeginTransaction & CommitAsync):

    var command = new SqlCommand(sql, cn);
    command.Transaction = null;

  • Using ExecuteScalarAsync instead of ExecuteReaderAsync
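
For the last item, a minimal sketch of the ExecuteScalarAsync variant (illustrative only, assuming the repro's single-value query):

    // Requires Microsoft.Data.SqlClient. Sketch: fetch the single value with
    // ExecuteScalarAsync instead of an async reader.
    static async Task<int> QueryIdScalar(string connectionString, int id)
    {
        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        using var command = new SqlCommand("select @Id as Id", connection);
        command.Parameters.AddWithValue("@Id", id);

        // ExecuteScalarAsync returns the first column of the first row (or null).
        return (int)await command.ExecuteScalarAsync();
    }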

@Nercury Can you give them a try one by one?

> Set Pooling = false

With MultipleActiveResultSets=True;Pooling=false; it is painfully slow with many timeouts, but there are no wrong results, which is good.

> Removing the connection that encountered an exception

            SqlConnection cnp = null;
            try
            {
                using (var cn = new SqlConnection(connectionString))
                {
                    cnp = cn;

                    // ...
                }
            }
            catch (Exception e)
            {
                if (cnp != null)
                {
                    SqlConnection.ClearPool(cnp);
                }

                // ...
            }

Got wrong results on a third attempt.

> Attempting to execute a SQL command without attaching a transaction to the command

This was attempted in production when it was still unclear whether this was a MARS issue. It makes wrong results rarer, but they still happen. I was not able to reproduce it in this test project with a simplistic select statement.

> Using ExecuteScalarAsync instead of ExecuteReaderAsync

Tried it 5 times and wrong results did not happen.

@DavoudEshtehari I hope this helps.

Same bug experienced when MARS is disabled:

We are experiencing a similar problem in our code base, although we have MARS disabled explicitly in the connection string and still see the bug.

We run this method across multiple incoming requests from an API Controller:

public async Task<T> GetAsync(string id)
{
    var sql = $"SELECT Data FROM { _configuration.TableName } {(_configuration.InMemory ? "WITH (SNAPSHOT)" : "")} WHERE Id=@id";

    using (var connection = _connectionFactory.Create())
    {
        var record = await connection.QueryAsync<string>(sql, new { id }).ConfigureAwait(false);

        var data = record.FirstOrDefault();

        return data == null ? default(T) : _serializer.Deserialize<T>(data);
    }
}

The connectionFactory simply returns a new SqlConnection:
return new SqlConnection(_connectionString);

id is a GUID
data will be an encrypted blob of data that decrypts to a JSON payload.
_serializer handles the decryption (if needed) then serialisation.

Things will run normally, then a single instance will experience a SQL exception; previously there was an unexpected (and denied) request to perform a SHOWPLAN, and sometimes it is a SQL timeout exception. Following that exception, serialisation bugs will continue to appear on that instance until it is taken out of service.

The source of the serialisation failure is that the value stored in data is an id from an earlier request, rather than the expected value stored on the SQL DB. We've had this happen both on the in-memory DB and without it.

Once the instance has been out of service for a while, subsequent attempts to access the DB via this method complete successfully, so it seems to have the ability to resolve itself. Even when the instance is misbehaving, it appears that not all requests get jumbled up, as some manage to succeed.

Previously this service was hosted on Windows without any issues. Since the port to Linux this problem crops up every 7-30 days or so. We've also seen it on more than one service (the above code is in a shared library). This bug seems to be the closest thing to what we are experiencing, except that we have explicitly disabled MARS in the connection string and are still seeing it from time to time.

Libraries/Infrastructure involved:
NetCore 3.1
Dapper 2.0.35
Microsoft.Data.SqlClient 2.0.1

This is being run on an AWS T3 instance hosting Ubuntu 18.04

We are going to change the call from QueryAsync to ExecuteScalarAsync and are trying to capture more information to help us debug the issue. If the change to ExecuteScalarAsync resolves the problem for at least a month or so, I'll add an extra comment here indicating that it's a possible fix.
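
For illustration, a Dapper-free variant of the GetAsync method above using ExecuteScalarAsync could look like the sketch below (the WITH (SNAPSHOT) branch is omitted for brevity; member names come from the snippet above, everything else is an assumption):

    // Sketch only, not the team's actual change. Requires Microsoft.Data.SqlClient.
    public async Task<T> GetAsync(string id)
    {
        var sql = $"SELECT Data FROM {_configuration.TableName} WHERE Id=@id";

        using (var connection = _connectionFactory.Create())
        {
            await connection.OpenAsync().ConfigureAwait(false);

            using (var command = new SqlCommand(sql, connection))
            {
                command.Parameters.AddWithValue("@id", id);

                // First column of the first row, or null if no row matched.
                var data = (string)await command.ExecuteScalarAsync().ConfigureAwait(false);

                return data == null ? default(T) : _serializer.Deserialize<T>(data);
            }
        }
    }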

Hi @Nercury @gyphysics

Could you please try your use cases with PR #796? I ran the provided repro and the issue appears fixed.
For me, 10k iterations also complete successfully now, while the same run failed with the current driver.

Kindly confirm.


Edit: This issue remains unfixed for now, due to colliding changes in different async paths.

> This issue remains unfixed for now, due to colliding changes in different async paths.

@cheenamalhotra All right, it makes sense to wait until all fixes are in.

Following on from @gyphysics (same company, different team):
Upgraded to 2.1.0 and still seeing this issue in multiple services, all on Linux.
Definitely no MARS in connection string (though we haven't actively disabled it).
We are using Dapper and will try to remove it from one project to prove it's not related.
In the meantime we're using guard clauses to prevent wrong data from being used:

if (!thingId.Equals(thing.Id))
{
    throw new ThingConflictException($"Asked for thing [{thingId}], got thing [{thing.Id}]");
}

We have also seen some problems on INSERT, where no exception was raised, but checking later the data was not written to the table. There were gaps in the IDENTITY column, suggesting a possible silent transaction rollback (perhaps the exception did not come back to the originating async context?). Possibly unrelated, but it seems close enough to note here.

Hello!

I've created a new repro that puts together items from this discussion using docker-compose, so it can be easily executed locally while also simulating network failures.

https://github.com/trainline/SqlClient659

I can confirm the problem can be reproduced even without MARS, although it is less frequent.

| .NET version | SQL Server version | MARS | Failure rate |
| ------------ | ------------------ | ---- | ------- |
| .NET Core 3.1 | 2017 Enterprise | Yes | High |
| .NET Core 3.1 | 2017 Enterprise | No | Low |
| .NET 5.0 | 2017 Enterprise | Yes | High |
| .NET 5.0 | 2017 Enterprise | No | Low |
| .NET Core 3.1 | 2019 Enterprise | Yes | High |
| .NET Core 3.1 | 2019 Enterprise | No | Low |
| .NET 5.0 | 2019 Enterprise | Yes | High |
| .NET 5.0 | 2019 Enterprise | No | Low |

@cheenamalhotra Any suggestion on how to try to pinpoint the origin of the problem? Is there a diagnostic source available in the SqlClient?

Hi; I see a mention of Dapper here. It is entirely possible that this is a Dapper issue, using an unintended and unsupported feature that only kinda works by accident (and only when MARS is enabled). Ultimately, "MARS enabled" is not the same as "performing evilness via overlapped unawaited operations", and I will be the first to say: any time you don't immediately await something, you're effectively doing the same thing as multi-threading, and race conditions become obvious.
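
To illustrate the kind of overlapped, unawaited usage being described (a sketch, not Dapper's actual code):

    // Illustrative only: two commands on one connection where the first is not
    // awaited before the second starts. Even with MARS enabled, this amounts to
    // concurrent use of a single connection, and race conditions follow.
    static async Task OverlappedQueries(SqlConnection connection)
    {
        using var cmd1 = new SqlCommand("select 1", connection);
        using var cmd2 = new SqlCommand("select 2", connection);

        var first = cmd1.ExecuteScalarAsync();   // not awaited yet
        var second = cmd2.ExecuteScalarAsync();  // overlaps with the first
        await Task.WhenAll(first, second);
    }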

So, if there's a problem and this API doesn't work (or even claim to work) the way in which Dapper has used (abused) it, maybe we should just deprecate that feature from Dapper, or at least see if we can rework it to be less evil.

(I'm one of the Dapper maintainers)

The system where the bug was discovered was using Dapper too. But the bug remained after the code was rewritten without Dapper. We will know for sure once the underlying issue is fixed.

@Nercury is it worth updating the issue title as it is not only happening with MARS enabled?
