SqlClient. Inappropriate behavior after a commit timeout

Created on 28 Jun 2019  路  18Comments  路  Source: dotnet/SqlClient

I've initially created this bug in the CoreFX repository: https://github.com/dotnet/corefx/issues/38900, but it does not look like anyone is looking there. So I'm repeating it here. A small addition: I've now tested this using both System.Data.SqlClient (4.6.1) and Microsoft.Data.SqlClient (1.0.19128.1-Preview), compiled for both .NET Core 2.2 & .NET Framework 4.8. In each of these four cases I was able to reproduce this bug.

=============================================================

I have a SQL Azure database and I use System.Data.SqlClient (4.6.1) to talk to it. From time to time (sometimes several times a day, sometimes no incidents for weeks), it would produce a SqlException with Number = -2 (Timeout) during a SqlTransaction.Commit call. I assume that's happening because SQL Azure databases have mirroring set up and there is some transient network condition happening around Commit() which prevents that from finishing. That's fine with me - i know those are transient exceptions and would retry them after some interval. The problem is that the connection which caused that exception is returned to the pool. The next time it's got picked from the pool, it seems to be in some invalid state. Then, when i do "SELECT * FROM MyTable WHERE Id = @Id" on that connection, it would return an empty set, even though that record is 100% there in the table. That is one really disturbing bug.

I've tried to build a small reproduction for this issue. The hardest thing was to have SqlTransation.Commit() to throw a timeout exception. I invented a trick to do that (see below, ForceCommitTimeout() function). It's a bit clumsy one and does not always reproduce. If it does not repoduce in your environment, try increasing RowNumber and running the code several times. From time to time it would produce an exception. Note, that in the connection string I've set "Connection Timeout=1". That timeout value is also used as commit timeout by SqlTransaction. In any case, if someone knows a better way to force a timeout from SqlTransation.Commit() - please share it with me.

The second important point to build a stable reproduction was to set "Max Pool Size=1". This is to make sure that the same connection is returned from the pool when you open a new connection after a commit timeout has happened.

Before you run the code, you need to create a database named "TestBed" on your DB server and then create a test table in it:

CREATE TABLE TestTable
(
    Id INTEGER NOT NULL IDENTITY(1, 1) PRIMARY KEY,
    Name NVARCHAR(MAX)
)

Now, when you are all set up, here's the code to reproduce the problem:

```c#
using System;
using System.Data.SqlClient;
using System.Diagnostics;

namespace SqlClientTimeoutIssue
{
internal class Program
{
private const string ConnectionString = "Server=(local); Initial Catalog=TestBed; Integrated Security=True; Connection Timeout=1; Max Pool Size=1";
private const int RowNumber = 10000000;

    private static void Main()
    {
        try
        {
            SelectOne();
            ForceCommitTimeout();
            SelectOne();
        }
        catch (Exception ex)
        {
            Console.WriteLine("Unexpected exception: {0}", ex);
        }
    }

    private static void SelectOne()
    {
        using (var conn = new SqlConnection(ConnectionString))
        {
            conn.Open();
            Console.WriteLine("SelectOne. ConnectionId={0}", conn.ClientConnectionId);

            var result = 0;

            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText = "SELECT 1";
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        result = reader.GetInt32(0);
                }
            }

            Console.WriteLine("SelectOne. Result: {0}", result);
        }
    }

    private static void ForceCommitTimeout()
    {
        using (var conn = new SqlConnection(ConnectionString))
        {
            conn.Open();
            Console.WriteLine("ForceCommitTimeout. ConnectionId={0}", conn.ClientConnectionId);

            using (var tran = conn.BeginTransaction())
            {
                var timer = Stopwatch.StartNew();

                for (var i = 0; i < RowNumber; ++i)
                {
                    using (var cmd = conn.CreateCommand())
                    {
                        cmd.CommandText = "INSERT TestTable (Name) VALUES(N'Some value')";
                        cmd.Transaction = tran;
                        cmd.ExecuteNonQuery();
                    }
                }

                Console.WriteLine("ForceCommitTimeout. Insert duration: {0} sec", timer.Elapsed.TotalSeconds);

                try
                {
                    timer.Restart();
                    tran.Commit();
                    Console.WriteLine("ForceCommitTimeout. Commit duration: {0} sec", timer.Elapsed.TotalSeconds);
                }
                catch (SqlException ex) when (ex.Number == -2)
                {
                    Console.WriteLine("ForceCommitTimeout. Commit timeout has occurred");
                }
            }
        }
    }
}

}


When i run it and i'm lucky to hit a timeout, i get output like this:

SelectOne. ConnectionId=21106a9d-08db-43b5-b2e5-d6f9b751bebb
SelectOne. Result: 1
ForceCommitTimeout. ConnectionId=21106a9d-08db-43b5-b2e5-d6f9b751bebb
ForceCommitTimeout. Insert duration: 1799.231 sec
ForceCommitTimeout. Commit timeout has occurred
SelectOne. ConnectionId=21106a9d-08db-43b5-b2e5-d6f9b751bebb
SelectOne. Result: 0
```

Notice, that the second call to SelectOne() outputs "0". Even though it does "SELECT 1"! And that is the essence of this (really weird) bug. I've been able to reproduce it on both SQL Azure and my local SQL Server 2017 Dev Edition.

Bug!

Most helpful comment

You might be able to trigger a commit timeout by suspending the sqlservr.exe process using Process Explorer. Open 100 transactions concurrently all under write load. Then pause the process. Unpause. You should get lucky on a few of those 100 transactions.

You also can set up mirroring or AG and pause the network.

Or, put the transaction log on storage that you can pause. You can pause an SMB file share by interrupting the network.

All 18 comments

I responded in corefx and with initial triage thoughts, copied here for continuity.

I think this may be the same issue as dotnet/SqlClient#84 whereby the connection is in a transaction when it is returned to the pool and the state isn't detected and resolved. There isn't a clear resolution or way to doom that specific connection so it isn't restored to the pool that i'm aware of currently but it might be worth investigating.

Can you close the corefx issue if you want to track this here so that the other issues doesn't get moved and become a duplicate please.

@Wraith2 Yeah, you responded. But:

1) how do i know if you are an MS employee or just a stranger? It's not clear from your github profile at all. Yes, you have some PRs accepted in MS repositories, but that does not mean you are an employee and that you answer is "official".

2) The bug is super crazy. SqlClient lies to me about the data in my database. And that is the official reaction from MS i'm getting? Really? How am I supposed to trust my data to this library in the future?

Can you close the corefx issue if you want to track this here so that the other issues doesn't get moved and become a duplicate please.

That is not a problem. I can close that issue or I close this one. However you say. What is important is that this bug is fixed. At least recognized and a real workaround is provided until the fix is ready.

I'm a contributor and I focus on SqlClient quite a lot but I don't work for Microsoft. Nothing I say is in any sense official. I looked at the details you provided and based on my experience I think it's the same or a similar issue to the one linked so I provided that information for others. If I get chance I'll investigate.

The request to have one issue is just so we don't end up with fragmented conversations.

Okay, this means that there is no official reaction. And that's what bothers me here. I have presented a critical bug along with reproduction and i'm getting zero official reaction. This is not how i expect this to happen. And this is precisely why I opened this issue again here. To get some attention from the appropriate team who can schedule and resolve this.

If you can investigate - this is fine. I even can do it myself, i'm pretty sure the fix is not that complex. The question is if this fix will be accepted or not. For this to be accepted, it should be first recognized as a bug. And that's not what happening here.

For the workaround, I've already found one. It is to catch the timeout exception from Commit and call SqlConnection.ClearPool(https://docs.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlconnection.clearpool?view=netframework-4.8). However, this method "Empties the connection pool associated with the specified connection". This may have undesirable performance effects in a highly parallel server app with hundreds of connections.

So, for now i've applied this workaround in our apps, but by no means it is a permanent solution.

@David-Engel @divega

Thanks for looking into it @Wraith2. I agree it looks similar to the other issue.
@dmytro-gokun Thanks for the submission. It looks like a bug. If you have time to debug System.Data.SqlClient from corefx to pinpoint the issue for us, great. Otherwise we'll look into it as soon as we can.

CC: @cheenamalhotra

@David-Engel Thanks for prioritizing it. If I wanted to look, what would be the right place to do that and submit a PR? This repository does not have any code as far as i can see.

The code in the process of being forked. For the moment the version in corefx can be used. This is likely to be a very complex change and not a quick fix.

okay, i will leave it to someone who knows what they are doing then

@dmytro-gokun I'd like to clarify that GitHub is not replacement for support.
While we obviously pay attention to highly impactful bugs, we do not scale to treat every single issue as high pri or critical. Even if the filer says so. We do not have any SLAs on responses (sometimes it can take a looong time for us to respond).
Importance of bug typically changes if either someone clearly recognizes it as a critical problem with serious impact on important scenarios, or if there are multiple customers hitting the problem without a workaround.
Having a workaround is considered as mitigation even if it is not pretty. If the workaround is not acceptable, it has to be clearly described why (not just mentioned that it may have perf impact).

If you have immediate problem and you need immediate support, we recommend to contact official Microsoft support - also linked here - product support.

@karelz Thanks for the clarification. I have workaround for now. So, for the company i work at, everything is under control (hopefuly).

But think about this. These days, there are thousands of companies running their stuff on SQL Azure. This means, they hit these commit timeouts from time to time (I described in my start post why). Now, even if they implemented proper transient exception handling, they will hit this super ugly thing which causes SqlClient to return empty sets when there is some data in the DB. That is super-crazy and is super hard to find the reason of if you face it. No respectful DB provider has to behave this way. I caught it almost by accident. This is why my perception was that the bug is critical and has to be fixed as soon as possible. People may run into it all the time, they just do not realize that. But they will face consequences later - in form or corrupt data, wrong decision by the code and so on. That is scary as hell.

I hope this explains my reaction and my surprise for the lack of attention. In any case, I might be over-dramatizing here. The ball is on you side of the field and you decide what to do with it.

Hi @dmytro-gokun

Triggering a Commit Timeout with errorcode -2 isn't hard as one can call WAITFOR DELAY '00:05:00' and run this SqlCommand with SqlCommand.CommandTimeout set to let's say 1.

The problem is I don't see the issue happening with this scenario so it might not be related to Commit Timeout exception with errorcode -2. To me it looks like more related to the time it took to throw exception [In your case 1799.231 sec], and the server unable to clear incomplete transaction on the connection.

I would request you to try setting SET XACT_ABORT to ON on this connection after opening which will force termination and rollback of the entire transaction when a run-time error such as commit timeout occurs. Let me know how it goes!

Hi @cheenamalhotra

Triggering a Commit Timeout with errorcode -2 isn't hard as one can call WAITFOR DELAY '00:05:00' and run this SqlCommand with SqlCommand.CommandTimeout set to let's say 1.

This will cause a command timeout and not a commit timeout. Those are different things and are handled differently by SqlClient.

Have you run my code? Were you able to reproduce the error?
In any case, plz read the code and my comments. It does not look like you fully understand the problem here. I do not think that you should start doing any fixes/giving advises before you understand the problem/are able to reproduce it.

If you are not able to reproduce it and my description is not clear enough - please let me know and I will try to explain it better.

You might be able to trigger a commit timeout by suspending the sqlservr.exe process using Process Explorer. Open 100 transactions concurrently all under write load. Then pause the process. Unpause. You should get lucky on a few of those 100 transactions.

You also can set up mirroring or AG and pause the network.

Or, put the transaction log on storage that you can pause. You can pause an SMB file share by interrupting the network.

@GSPP Thanks. These are very clever ideas (Y).

Hi @dmytro-gokun

We added a change to mark active connection as not poolable and doomed when an exception is captured during transaction commit, ensuring corrupted connection is dropped on Dispose.

The fix will be available as part of GA release of 1.0.
Keeping the issue open for you to test and confirm in your scenarios.

Hi @dmytro-gokun

Stable version 1.0.19239.1 for Microsoft.Data.SqlClient 1.0 is now available to download that contains the fix. Let us know if you need any more support.

@cheenamalhotra Thanks a lot! I've started using this release in our production environment. If I see the problem happening again, i will open a new issue.

Was this page helpful?
0 / 5 - 0 ratings