Sqlclient: .NET 5.0 | Implement SqlException.IsTransient

Created on 12 Jul 2020 · 14Comments · Source: dotnet/SqlClient

https://github.com/dotnet/runtime/issues/34817 introduces DbException.IsTransient for .NET 5.0, database-agnostic transient error detection. Note that this is orthogonal to SqlClient's own resilient connection mechanism - IsTransient is only about surfacing error transience to external retry strategies/policies such as Polly.

Note that EF Core's SQL Server provider has a transient exception detector, this could be a starting point for implementing this within SqlClient. The goal would eventually be for EF Core to rely on the new IsTransient property rather than contain specific knowledge on SqlClient error codes.

Enhancement

Source

roji

👍4

Most helpful comment

I want to sidestep the question of what should be considered a transient error for now to focus on the biggest advantage of this API. In the past SQL Azure has changed the list of transient errors requiring changes to the application, EF Core or SqlClient to change the retry policy accordingly. If SqlClient implements SqlException.IsTransient by delegating to the server (when that's where the error originated) then such behavioral changes won't be breaking anymore.

A related useful API would be a TimeSpan RetryAfter property that provides a baseline for the recommended delay before retrying depending on the error, but this would need to be provided by the server to give the most benefit.

AndriySvyryd on 13 Jul 2020

👍3

All 14 comments

Please consider adding some of these that EF Core's SqlServerTransientExceptionDetector does not recognize:

| Number | Severity | Message | Reasoning |
|---|---|---|---|
|601|12|"Could not continue scan with NOLOCK due to data movement."|Advice in Hints (Transact-SQL) - Table
|617|20|"Descriptor for object ID %ld in database ID %d not found in the hash table during attempt to unhash it. A work table is missing an entry. Rerun the query. If a cursor is involved, close and reopen the cursor."|"Rerun"|
|669|22|"The row object is inconsistent. Please rerun the query."|"Rerun"|
|921|14|"Database '%.ls' has not been recovered yet. Wait and try again."|"Try again"|
|1203|20|"Process ID %d attempted to unlock a resource it does not own: %.ls. Retry the transaction, because this error may be caused by a timing condition. If the problem persists, contact the database administrator."|"Retry"|
|1204|19|"The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions."|"Rerun"|
|1221|20|"The Database Engine is attempting to release a group of locks that are not currently held by the transaction. Retry the transaction. If the problem persists, contact your support provider."|"Retry"|
|1222|16|"Lock request time out period exceeded."|"Time out"|
|3935|16|"A FILESTREAM transaction context could not be initialized. This might be caused by a resource shortage. Retry the operation. Error code: 0x%x."|"Retry"|
|3960|16|"Snapshot isolation transaction aborted due to update conflict. You cannot use snapshot isolation to access table '%.ls' directly or indirectly in database '%.ls' to update, delete, or insert the row that has been modified or deleted by another transaction. Retry the transaction or change the isolation level for the update/delete statement."|"Retry"|
|3966|17|"Transaction is rolled back when accessing version store. It was earlier marked as victim when the version store was shrunk due to insufficient space in tempdb. This transaction was marked as a victim earlier because it may need the row version(s) that have already been removed to make space in tempdb. Retry the transaction"|"Retry"|
|8628|17|"A time out occurred while waiting to optimize the query. Rerun the query."|"Rerun"|
|8645|17|"A timeout occurred while waiting for memory resources to execute the query in resource pool '%ls' (%ld). Rerun the query."|"Rerun"|
|8651|17|"Could not perform the operation because the requested memory grant was not available in resource pool '%ls' (%ld). Rerun the query, reduce the query load, or check resource governor configuration setting."|"Rerun"|
|9515|16|"An XML schema has been altered or dropped, and the query plan is no longer valid. Please rerun the query batch."|"Rerun"|
|10922|16|"%ls failed. Rerun the statement."|"Rerun"|
|14355|16|"The MSSQLServerADHelper service is busy. Retry this operation later."|"Retry"|
|17197|16|"Login failed due to timeout; the connection has been closed. This error may indicate heavy server load. Reduce the load on the server and retry login.%.*ls"|"Retry"|
|20041|16|"Transaction rolled back. Could not execute trigger. Retry your transaction."|"Retry"|

Also, if SqlClient does not recognize the specific error code, perhaps it can still treat some of the Database Engine Error Severities as transient; like 13 (deadlock) or 17 (out of resources).

KalleOlaviNiemitalo on 13 Jul 2020

👍2

EF Core's SqlServerTransientExceptionDetector returns true if it finds at least one transient error in the SqlException.Errors collection. Should SqlException.IsTransient instead treat some errors as always permanent, so that it returns false if the collection contains at least one always-permanent error, regardless of whether the collection also contains transient errors? Severity 15 (T-SQL syntax error) could be a candidate for that.

KalleOlaviNiemitalo on 13 Jul 2020

👍1

@KalleOlaviNiemitalo that's great, thanks for this info! Your suggestion below also makes sense to me - if there's any non-transient error in the SqlException, that means there's no use to retry since that error will happen again, so we may as well return false from IsTransient.

/cc @ajcvickers for the error codes missing in the EF Core SQL Server provider.

roji on 13 Jul 2020

To be clear, I do not suggest returning false from SqlException.IsTransient if SqlException.Errors contains two errors, of which one is recognized as transient and the other is not recognized at all. The suggestion was for errors that SqlClient specifically recognizes as always-permanent.

For example, if an SqlException contains the following two errors, then SqlException.IsTransient would be true, because there is at least one transient error and no always-permanent errors:

921 recognized as transient "Database '%.*ls' has not been recovered yet. Wait and try again."
4060 not recognized "Cannot open database "%.*ls" requested by the login. The login failed."

KalleOlaviNiemitalo on 13 Jul 2020

The suggestion was for errors that SqlClient specifically recognizes as always-permanent.

I tend to see it the other way - if there's a chance that an error could be indicating a transient problem, then that's a good reason for surfacing it as transient. From this discussion on MySQL:

Which specific errors are classified as transient is of course a provider decision. However, the way I think about it is to err on the side of false positives, rather than conservatively erring on the side of false negatives. A false positive would typically mean a needless retry (which doesn't seem critical), whereas a false negative would mean throwing where a retry may have result in success.

See also the note in the IsTransient proposal about the property being "optimistic".

So I'd say, if there's a single error in the exception which is known to not be transient by nature, the entire exception should report IsTransient=false; otherwise, it should return true. A pragmatic approach may make sense here - we should keep in mind that the use of this is to trigger retrying by an upper layer.

/cc @bgrainger @fransbouma @ajcvickers

roji on 13 Jul 2020

So I'd say, if there's a single error in the exception which is known to not be transient by nature, the entire exception should report IsTransient=false; otherwise, it should return true.

In that case, SqlException won't need a list of transient error numbers at all, as it will only recognize errors that are "known to not be transient by nature".

KalleOlaviNiemitalo on 13 Jul 2020

At least for PostgreSQL (which I maintain), the (vast) majority of errors are still non-transient, so it seems to make sense to have a list for errors that are known to be (possibly) transient. That may not be the case for SQL Server, though I'd be surprised.

roji on 13 Jul 2020

@AndriySvyryd has traditionally been the steward of which codes we treat as transient.

ajcvickers on 13 Jul 2020

Hi @KalleOlaviNiemitalo @roji

So I'd say, if there's a single error in the exception which is known to not be transient by nature, the entire exception should report IsTransient=false; otherwise, it should return true. A pragmatic approach may make sense here - we should keep in mind that the use of this is to trigger retrying by an upper layer.

I agree with this too!

In the initial phase, the list of transient errors that the driver would recognize as "transient" is driven by Azure's documented error numbers: Transient Fault Error Codes as currently SqlClient retries on.

The error messages documented above are good recommendations, but as client drivers work close to SQL Server/Azure Standards, only errors listed "transient" by server are handled implicitly in SqlClient. ORM frameworks like EF Core posses the flexibility to implement their own retry policy based on experiences and application errors, as it's an external extender.

We will take them into discussion when we work on this activity internally in order to provide best experience to customers, but we also would need to abide by driver policies to reflect server-side transient policies.

cheenamalhotra on 13 Jul 2020

It doesn't sound like there is consensus on what a transient exception is and if that's the case adding the feature without a way to configure it sounds like setting ourselves up for years of issues of people saying one thing or another should be transient. To avoid that could we add something at a process level (so at provider level in the api) which lets users add whatever we end up needing to determine if an error is transient? If other providers end up needing the same functionality the implementations could be promoted to a common api in .net 6.

Wraith2 on 13 Jul 2020

AndriySvyryd on 13 Jul 2020

👍3

But if we delegate to the server to decide if something is transient how do we know if inability to connect to the server is transient... 😀
More seriously is there something in sql that we can query to get a list of transient exceptions?

Wraith2 on 13 Jul 2020

@Wraith2

It doesn't sound like there is consensus on what a transient exception is and if that's the case adding the feature without a way to configure it sounds like setting ourselves up for years of issues of people saying one thing or another should be transient.

I think that's giving up a bit quickly here; this is definitely something we need to discuss thoroughly in order to arrive at a good definition, but I think that's certainly possible. I don't believe transience is a concept that really varies that much, and pragmatically speaking I think most applications are going to have the same needs around this.

EF Core's transient error detector could be considered a pretty good proof of this; the built-in retrying strategy identifies the codes it identifies, and we've rarely had users arguing for a different concept of transience. What we have seen, as @AndriySvyryd wrote, is changes in actual error codes. Note also that the EF Core detector has been copied and pasted for use outside of EF Core (e.g. https://gist.github.com/hyrmn/ce124e9b1f50dbf9d241390ebc8f6df3#file-sqlservertransientexceptiondetector-cs), which is exactly why I think this kind of knowledge should go into SqlClient.

To avoid that could we add something at a process level (so at provider level in the api) which lets users add whatever we end up needing to determine if an error is transient?

The whole point of this API is for the DB provider to report to an upper layer (Polly, EF Core, LLBLGen Pro) what it considers to be transient for the general, common case. If an application wants to customize which errors are considered transient, it can simply do that at that upper layer, i.e. defining a Polly/EF Core retrying strategy that does whatever the user wants. In other words, I don't think it makes any sense for users to call a SqlClient API which changes what SqlClient reports in IsTransient, when that IsTransient API is itself called by the user (via Polly/EF Core). If you want to control what's considered transient and what isn't, that's totally fine - you can just customize your Polly/EF Core policy and leave SqlClient out of it.

I also very much agree with @AndriySvyryd on the problem of changing error codes in SQL Azure.

roji on 14 Jul 2020

Ok,

Wraith2 on 14 Jul 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Microsoft.Data.SqlClient.SNI.pdb not found with V2.0.0

genifycom · 8Comments

Microsoft.Data.SqlClient.SNINativeMethodWrapper.UnmanagedIsTokenRestricted can't read sni.dll on IIS with custom apppool Identity

chromigo · 8Comments

Entity Framework Core 3.1.0 System.TypeInitializationException: The type initializer for 'Microsoft.Data.SqlClient.SNINativeMethodWrapper' threw an exception. ---> System.ComponentModel.Win32Exception

dmuruges · 4Comments

A network-related or instance-specific error occurred while establishing a connection to SQL Server. On AWS Lambda

pnquest · 5Comments

SqlClient's transient error handling strategy going forward?

Plasma · 6Comments