Runtime: Database-agnostic way to detect transient database errors

Created on 10 Apr 2020  路  50Comments  路  Source: dotnet/runtime

Provider tracking issues

This new API has been merged, following are tracking issues for implementation in the different providers:

  • SqlClient: https://github.com/dotnet/SqlClient/issues/649
  • Microsoft.Data.Sqlite: https://github.com/dotnet/efcore/issues/21598
  • Npgsql: https://github.com/npgsql/npgsql/issues/3058
  • MySqlConnector: https://github.com/mysql-net/MySqlConnector/issues/849

Summary

A recurring theme for database programming has been building automated retry policies, but System.Data doesn't currently allow database drivers to expose whether a database error can be expected to be transient or not. This leads to database-specific logic for detecting transient exceptions being built again and again outside of the driver.

API Proposal

c# class DbException { public virtual bool IsTransient => false; }

Notes

  • IsTransient should be optimistic, i.e. return true in all cases where a simple retry of the operation - without any other change - may be successful. We prefer to err on the side of transience, since the downside is only a few extra retries.
  • In at least some cases, if an operation transiently fails within a transaction, the entire transaction must be rolled back and retried (including all previous operations). This would be out of scope of this proposal: DbException would only inform the consumers that the exception is transient, but the actual handling (rolling back, reissuing previous commands) would be the consumer's responsibility.

Resources and examples

Old notes on SupportsTransienceDetection (removed from proposal)

  • SupportsTransienceDetection can be used to signal that IsTransient has been implemented by a driver, and can be expected to be true where relevant. Note that in general it should be fine to just check IsTransient directly, as it would default to false in any case.
  • It's a bit problematic for to be on DbException:

    • This makes it hard to do a simple check on startup (one needs an actual DbException instance thrown from the driver).

    • This feature flag would ideally be in a general metadata/feature discovery facility (see #28763), but we don't have that at the moment.

    • We could put this flag on DbProviderFactory, but that would turn it into a general place for feature flags as opposed to a factory (current flags such as CanCreateDataAdapter are about factory methods on the type, so much more appropriate).

    • As the problem isn't very acute, having this property on DbException is probably OK.

/cc @David-Engel @cheenamalhotra @bgrainger @bricelam @ajcvickers @stephentoub @terrajobst @mgravell @FransBouma

Edit history

| Date | Modification |
| ----- | ------------ |
| 2020-04-10 | API proposal
| 2020-04-18 | Removed SupportsTransienceDetection

api-approved area-System.Data

Most helpful comment

While working on SqlClient, I realized that the classification of Transient errors is a combination of recommendations offered by the database and the nature of the customer workload and the customer's own understanding of their infrastructure running the applications and databases.

A driver can classify an error as transient, if it is recommended. And such recommendations are usually made for a very small subset of errors. At least this is what my observation is, while interacting with SqlClient customers.

For many other errors, it is the application that decides whether it should be transient or not.

Consider SqlConnection.Open() in SqlClient, it retries on a bunch of errors while connecting to Azure SQL DB. But those errors are returned by the server. However there are many cases in which customers want to retry on networking specific errors, which may be caused by broken connections or caused by failure in TCP paths.
Some customers want to have retries on such errors and some customers want to detect these ASAP and have no retries.

Similar arguments go for Command Execution where the customers want to retry on some errors and fail fast on some errors. E.g. Deadlock is a problem that some customer applications want to retry on, because in their workload it is transient and for some customer applications, deadlocks need investigation.

Basically the known transient errors in SqlClient are much fewer and this proposal makes more sense with a mechanism via which the customers can tell us if they want a particular error to be treated as transient and retried on, vs errors that we fail fast with.

Otherwise the client's may not be authoritatively able to classify errors as transient and in case of client considering errors as transient may be undesirable.

One of the asks from a few customers of SqlClient is to be able to specify what errors the applications want to consider transient and have the client retry internally on those errors.

I wonder if the above situations apply for other drivers, they probably might.

An API ecosystem where customers have the ability to specify what errors should be treated transient would be lot more powerful.

All 50 comments

Is SupportsTransienceDetection really necessary? All I'm interested in as a consumer of the exception is whether it's a transient error. Now I have a list of transient error codes per database in my transient error recovery strategy classes and that's not very flexible (if error codes are added I have to maintain this list plus I have to look up these codes for all databases supported). So I just want to know 'is this exception a transient error?' If so, use the strategy, if not, throw the exception further up like a normal exception.

If IsTransient is false, whether that's because it's not a transient error or because the driver/ado.net provider doesn't support this, the net end result is the same: the recovery strategy is bypassed and the exception is rethrown as a normal exception.

Perhaps I'm overlooking a use case tho :)

You're right that it's not terribly useful (i.e. "Note that in general it should be fine to just check IsTransient directly, as it would default to false in any case.").

The main case I had in mind was to be able to surface that information to a user, or possibly to fail at startup if transience isn't supported by a provider. I'm not sure how relevant that really is, if people think it's useless we can also drop it.

Ah now I understand it. :) Not sure how that information can be used in a practical way however: a recovery strategy of some sort has to be implemented regardless: a re-entrant aware system that can deal with retrying the commands in a given form, and if that strategy is built around 'IsTransient' it will simply bubble up all exceptions (as they're never 'transient') which is effectively the same as having code checking SupportsTransienceDetection and switching based on that between no recovery strategy and a recovery strategy. Except perhaps a tiny performance gain in the 'no recovery strategy' scenario perhaps.

[...] and switching based on that between no recovery strategy and a recovery strategy. Except perhaps a tiny performance gain in the 'no recovery strategy' scenario perhaps.

Yeah, I wouldn't really expect anyone to switch on SupportsTransienceDetectionfor recovery/no-recovery (and I definitely don't think there's any real perf to be gained). But in theory someone could try to do something at a higher level if they know there's no transience detection. I admit I don't really have a compelling use-case - it's more a general idea of allowing consumers to know whether it's been implemented or not, for any reason they might have.

I'm definitely open to removing this - let's see what other people think.

While working on SqlClient, I realized that the classification of Transient errors is a combination of recommendations offered by the database and the nature of the customer workload and the customer's own understanding of their infrastructure running the applications and databases.

A driver can classify an error as transient, if it is recommended. And such recommendations are usually made for a very small subset of errors. At least this is what my observation is, while interacting with SqlClient customers.

For many other errors, it is the application that decides whether it should be transient or not.

Consider SqlConnection.Open() in SqlClient, it retries on a bunch of errors while connecting to Azure SQL DB. But those errors are returned by the server. However there are many cases in which customers want to retry on networking specific errors, which may be caused by broken connections or caused by failure in TCP paths.
Some customers want to have retries on such errors and some customers want to detect these ASAP and have no retries.

Similar arguments go for Command Execution where the customers want to retry on some errors and fail fast on some errors. E.g. Deadlock is a problem that some customer applications want to retry on, because in their workload it is transient and for some customer applications, deadlocks need investigation.

Basically the known transient errors in SqlClient are much fewer and this proposal makes more sense with a mechanism via which the customers can tell us if they want a particular error to be treated as transient and retried on, vs errors that we fail fast with.

Otherwise the client's may not be authoritatively able to classify errors as transient and in case of client considering errors as transient may be undesirable.

One of the asks from a few customers of SqlClient is to be able to specify what errors the applications want to consider transient and have the client retry internally on those errors.

I wonder if the above situations apply for other drivers, they probably might.

An API ecosystem where customers have the ability to specify what errors should be treated transient would be lot more powerful.

Thanks for these thoughts @saurabh500.

One of the asks from a few customers of SqlClient is to be able to specify what errors the applications want to consider transient and have the client retry internally on those errors.

I think there may be a bit of confusion here between the transient retrying implemented inside of SqlClient, and external retrying implemented outside of any specific driver, and which would leverage the proposed IsTransient property. Anything happening inside SqlClient is specific to that driver, and although you could add an API to make its retry mechanism more flexible, I don't think that would be applicable for a database-agnostic approach. Npgsql (as well as most other drivers AFAIK) doesn't consider retrying to be within its scope of responsibility and doesn't contain any logic internally for doing so - NpgsqlConnection.Open (almost) always immediately throws if any I/O error occurs. The user is expected to handle retrying according to their needs (typically using something like Polly), which obviates any need for an API to tell the driver what to retry and what not to retry. The proposed IsTransient property is meant to help with writing such an external retrying strategy.

It definitely makes sense that different customers may have different needs around retrying, and around which errors should be classified as transient. From a database-agnostic perspective, doing anything here would require some sort of database-agnostic classification/categorization system: networking errors, transaction deadlock errors, etc. If such a classification existed, driver users would be able to react differently to types of errors in a database-agnostic way. However, I'm skeptical of whether such a system can work; as I wrote in https://github.com/dotnet/SqlClient/issues/518#issuecomment-611994146, database errors and error codes in general seem very divergent across databases, and I'd be very wary for us to go into the business of standardizing database error categories for the purpose of transience.

(As a aside note, the standard SQLSTATE is precisely an attempt to categorize database errors in a database-agnostic way (although SQLSTATE isn't really concerned with transience as a first-level concept). As I wrote in https://github.com/dotnet/SqlClient/issues/518#issuecomment-611994146, I'm not aware of this being implemented in SQL Server, for example.)

Some further thoughts:

  • There is one definition of transience which seems clear and unambiguous to me, regardless of specific customer needs - whether retrying the same operation may succeed without doing any other change. Under this definition, all network errors are transient, as are any locking issues which could disappear on a second attempt.
  • It seems to me that this definition has enough value for the general/default case; at the end of the day, consumers are always free to ignore the proposed IsTransient property and define whatever retry policy they want. However, if in the typical case customers really do need to configure and tweak exactly which error type they want to retry, then I guess there's little value in this proposal and a non-agnostic approach is the only possibility (as today). It would be good to see what others think about this.

Let me know what you think or if I've misunderstood your points.

I am not asking for ADO.Net base classes to be database aware or to categorize errors. That would not be desirable and may not serve all databases and their ways of categorizing errors.
I should have not used too many references to SqlClient in my explanation :)

For a driver to say whether DbException.IsTransient should be set true or false is prior knowledge. So a driver knows that there are errors that are transient since those errors are documented. It is only for those errors that the driver can definitely set IsTransient to true. For all other errors, there should be a way for the consumer of ado.net driver to add to the list of transient errors.

If the customer can feed in any more errors codes that they want to consider transient, then the driver can set the IsTransient property to true for those additional errors along with the errors it knows about and any transient error handling framework like Polly can simply retry since the information is plumbed into the exception from the driver.

Does this make sense?

I had a chat with Shay:

The fundamental problem is that every driver / database has errors which need to be interpreted differently, they may be in different formats, an error code may be overloaded with lot more information than just being a code.

Hence providing a DB agnostic way of getting these errors from the ADO.Net layer, may not satisfy all providers.

The recommendation would be to have the driver expose a mechanism (say via Connection String) to take in a list of errors that the driver user wants to consider transient. Then the driver can simply make use of DbException.IsTransient to express that the error is to be considered transient.

Basically the mechanism of ingestion of the custom error will be dependent on the driver and ADO.Net DbException will provide a way for the driver to express if the error is transient or not.

@roji, I hope this captures the discussion. Please add to it if I missed anything.

Yeah, I think that makes sense.

I admit that for Npgsql (where IsTransient already exists) I haven't received requests to tweak its meaning. However, if for a particular driver it makes sense to expose a configuration interface (via connection strings or via a special API), which affects what the driver returns via InTransient, then it's by all means free to do so (again, this would be specific to that driver since errors aren't agnostic).

I'll just remark that in that case, instead of configuring the driver to modify its IsTransient values, a user could just implement their logic wherever the IsTransient is being consulted (since things like Polly are almost always configurable). That is, say I want to treat error code X as transient, although by default the driver doesn't treat it as such; I could modify my external retry strategy (which is under my direct control) to recognize both IsTransient and code X. But it may indeed still make sense to influence IsTransient in the driver in some scenarios.

(so all good)

Thanks everyone for the conversation on this. I've been reading through and this is the first time a suggestion of mine has made it to an API proposal. What is the next step for this to take this from a proposal to an actionable item? This looks to appear to be adding a property to DbException with a default implementation so this shouldn't be a breaking change. I wouldn't think this would require any heavy testing because this is just a property on the abstract class.

Assuming the conversation regarding this API proposal is for the most part agreed upon.

@rgarrison12345 let's give this a bit more time to let other people comment and post their views; it's always good to have a consensus from multiple people involved in .NET database programming and drivers. Having said that, this indeed is a small non-breaking change, which I think we'll be able to get in to .NET 5.0.

A couple more thoughts on in-driver and external resilience from the conversation with @saurabh500 (this doesn't have a direct bearing on this issue):

  • Azure SQL seems to require clients to implement retrying strategies, because apparently transient errors there are a normal occurrence, as opposed to other scenarios. I'm not aware of the same thing with other cloud database services (e.g. Citus for PostgreSQL), and am interested if anyone has more info on this.
  • As a result, and in order to provide a good experience for Azure SQL out of the box, SqlClient implements a retrying strategy inside the driver, rather than pushing the problem to the user. Given the Azure SQL behavior, this of course makes total sense, but the general model raises some problems:

    • Retrying policies are very open-ended and configurable (how many times to retry, what kind of back-off...) - this is why products such as Polly exist solely to handle this. Internalizing retrying into the driver introduces all this complexity/configurability inside the driver, and likely can never be as configurable/extensible as a dedicated product etc.

    • This also forces all language drivers to implement retrying, which seems like quite a burden. I'm curious, do non-.NET SQL Server drivers implement the same kind of resilience logic as SqlClient because of Azure SQL? @David-Engel do you have any info here?

It's new to me that SQLClient does the retrying itself on Azure, is this behavior introduced in Microsoft.Data.SqlClient? As I had to deliberately implement retry strategies for azure for System.Data.SqlClient.

I think the relevant info is in https://docs.microsoft.com/en-us/azure/sql-database/sql-database-connectivity-issues#net-sqlconnection-parameters-for-connection-retry, but @saurabh500, @David-Engel and @cheenamalhotra can provide more info.

@FransBouma The behavior was introduced in System.Data.SqlClient.SqlConnection.Open() in 4.6.1 and has been carried over to Microsoft.Data.SqlClient. The link above from @roji provides the connection string parameters that can be used to control this feature.

@roji
Whether this specific feature is available in other drivers or not, I am not sure about it. @David-Engel and @cheenamalhotra will definitely have more information

However the various driver teams try to keep the feature offerings in different drivers the same, so that all languages and platforms get the SQL Server/Azure SQL DB related benefits and they are equally easy to use in all the cases. E.g. If ODBC driver offers Always Encrypted feature, then all other drivers would, or if ADO.Net offers Azure Active Directory auth, then all other drivers would eventually try to offer the feature.

I am not sure why you think this is a burden? Knowing that all drivers on all platforms where SQL Server drivers are available, are trying to offer the same features is quite nice.

Retrying policies are very open-ended and configurable (how many times to retry, what kind of back-off...) - this is why products such as Polly exist solely to handle this. Internalizing retrying into the driver introduces all this complexity/configurability inside the driver, and likely can never be as configurable/extensible as a dedicated product etc.

I agree that drivers shouldn't go to the extent of offering what Polly or other transient fault handling frameworks offer, I do believe that offering some kind of default strategy to allow for common failure use cases in drivers is a great way of not bringing in another dependency in the application framework to make sure that a connection.Open() succeeds. It made sense for SqlClient and we did see reduction in cases where failures were reported to us, because legacy applications didn't have a retry policy in place while being moved to Azure.
There would be many cases where the driver doesn't offer the retry feature which can satisfy the needs to the application, then the developers can turn off the driver feature (in case of SqlClient usign a connection string parameter) and move to transient handling frameworks that they would prefer, or use transient fault handling framework in conjunction with Driver's retry handling if that serves the purpose.

I am not sure why you think this is a burden? Knowing that all drivers on all platforms where SQL Server drivers are available, are trying to offer the same features is quite nice.

Don't get me wrong, given that Azure SQL DB works this way, I think it does make sense for SqlClient (and other drivers) to implement this, otherwise it seems that it would be unusable with Azure on its own. However, what is basically happening is that Azure SQL DB is passing a mandatory resiliency handling burden onto its clients (i.e. onto drivers), where it would have been better to it to simple handle them internally, exposing a reliable service. There may be technical architectural reasons why things have to be this way, but I'm not aware of other cloud data services where things work this way (granted, I don't know a huge deal about this).

Let's put it this way - setting aside Azure SQL DB for a moment (where apparently resilience is particularly problematic), would it make sense to have in-driver retrying for regular, on-premise where networking issues are likely to be much more rare? I'm asking this because I've never received requests for such a feature for Npgsql, and I'm not aware of other drivers doing this apart from SqlClient (I may simply be unaware though).

One last reservation I have about this, is that in-driver resiliency can only go so far. For example, as per the documentation, if a command has started executing and fails, no retry attempt will happen. This is reasonable behavior to expect from the driver (because re-executing commands can be tricky, considering double execution, transactions...), but if transient failures really are common then users must deal with this as well. In other words, driver resiliency coverage seems like it would necessarily be limited, no?

In any case, this discussion is quite academic. I do think SqlClient is doing things right given the Azure SQL DB resiliency situation, it's just that in an ideal world Azure SQL DB would be more reliable and not shift this burden to the driver, that's all. And as you pointed out, users can out-in (or out) of in-driver retrying, in any case.

I am curious as to where this discussion is going. What are you planning to achieve from this discussion about retries?
I have put in a bunch of historical data points here about what how we handled it with Azure SQL DB and I feel that I am giving the wrong perception that Azure SQL DB is full of problems with with transient errors. Certainly not the picture I was trying to paint :)

I know about SqlClient and I am only trying to tell you what problems we have faced in the past, my recommendation about how such a common theme can be incorporated into ADO.Net programming.
However your paraphrasing of my comments are a bit discouraging and painting what I work on or I have worked on, in the wrong light. :)

Sorry @saurabh500, that's not at all what I wanted to express...

I really think SqlClient is doing the right thing here, and am not criticizing any of it. I also admit most of the above isn't very relevant in a concrete way to this proposal (which is just about adding IsTransient to DbException).

The above comes only from my surprise in learning that Azure SQL DB passes the burden of resilience checks to the client (i.e. driver), that's all - I'm simply not used to this from a cloud data service. Please accept my apologies if any of the above cast your work in a negative light - that was really the last thing I wanted to do. Your comments helped me personally understand the landscape and customer needs better.

I guess at this point I should stop posting general thoughts that don't make this issue move forward (unless you feel there's additional value in continuing this conversation). Sorry once again.

I'm curious, do non-.NET SQL Server drivers implement the same kind of resilience logic as SqlClient because of Azure SQL? @David-Engel do you have any info here?

@roji SqlClient is the most feature-rich in terms of this type of functionality, but yes, other drivers do implement resilience logic. ODBC has automatic connection open retry logic, for example and automatic reconnection of idle connections which have been disconnected by network devices. Saurabh touched on what I believe is the biggest use case for these features: existing/legacy applications which were built on the assumption of local resources where "blips" rarely occurred (whether that's a network hiccup, fail over event, provisioning delay, etc) and where it is burdensome to make code changes. Enabling those to be migrated to cloud databases without suffering from those cloud "blips" has very high value to customers.

Thanks @David-Engel, makes complete sense.

Let me know if you guys think that the IsTransient property proposed here makes sense as-is (as well as SupportsTransienceDetection).

Transient errors and retries aren't only connection open / reconnect oriented, i.e. if you have issued 3 commands and the 4th fails due to error 1205 (result streaming) you have to retry all 4, if they're in the same transaction. I have no idea if SqlClient's connectionstring retry logic takes care of that (I doubt it).

Hence it's a nice idea, but to get full resilience you really have to perform the retry logic at the app level. (IMHO)

You are right @FransBouma There is no retry around command execution in SqlClient.

What would a retry framework do in case IsTransient is set to false and there is no feature detection (i.e. no SupportsTransienceDetection property)? Will the retry framework fallback to the user configured retry strategy?

The things that SqlClient can retry on are necessarily limited based on the limited context that the ADO.NET provider has about what the application is doing--as @FransBouma points out. This is why, in my opinion, resiliency is something that should, in general, happen at a higher level than the ADO.NET provider.

EF (both EF6 and EF Core) have more of this context available, and hence are able to do a better job of retrying. For example, EF can retry a failure in getting results back by buffering the results and never giving anything to the application until all the data has been obtained from the database. This ensures that the application won't get duplicates or other inconsistent data if the query has to be re-run. Likewise, EF is has integration with transactions, such that entire transaction can also be retried on transitive failures.

I don't know the full details of why transient error handling was added to SqlClient, especially given that it was already available in EF6 at the time it was added to SqlClient. I _suspect_ the reasoning was something like this:

  • SQL Azure was unusable without retries. (I'd like to think this isn't the case now--5 years ago, it certainly was.)
  • Not everyone is using EF
  • Therefore SqlClient needs to add these retries for SQL Azure to be usable outside of EF.

It has also become clear over the years that there is no rigid definition of what is a "transient error", and we have updated EF several times based on new data from SQL Azure. We also have a mechanism for applications to modify the the list of errors that are considered transient.

The value of this feature, to me, is to allow the list of errors considered transient to be coupled to the ADO.NET provider, rather than something that the application or EF Core needs to create. Likewise, it no longer becomes a SQL Server-specific list in EF Core. So, to me, this is more about providing some provider-specific information to EF or the application so that it can choose what to do.

I like @saurabh500 's suggestion:

The recommendation would be to have the driver expose a mechanism (say via Connection String) to take in a list of errors that the driver user wants to consider transient. Then the driver can simply make use of DbException.IsTransient to express that the error is to be considered transient.

Basically the mechanism of ingestion of the custom error will be dependent on the driver and ADO.Net DbException will provide a way for the driver to express if the error is transient or not.

And as @ajcvickers mentioned:

We also have a mechanism for applications to modify the the list of errors that are considered transient.

Since transient errors are owned by Azure and could change with time, driver would have to be kept up-to-date with that info. But having said that, we should consider client application as well along with what driver does so they can freely update their list whenever the list changes from Azure side.

SupportsTransienceDetection seems unwanted to me, as, if driver can customize it's behavior with provided list, IsTransient is all that matters.

Would it make sense to have in-driver retrying for regular, on-premise where networking issues are likely to be much more rare?

@roji I don't think this should be touched upon by other drivers/use-cases if not needed. Also I'm not sure if we want to introduce this DbExcpetion at all, and not SqlError instead, as this is clearly Azure SQL specific. Unless I'm missing info about other .NET Data providers handling transient errors.

However giving precise information to users from DbException like ErrorCode holds more value IMO so client apps can decide which error codes to flag as transient along with the list of "transient" errors that driver would maintain (ref #34798).

@cheenamalhotra

@roji I don't think this should be touched upon by other drivers/use-cases if not needed. Also I'm not sure if we want to introduce this DbExcpetion at all, and not SqlError instead, as this is clearly Azure SQL specific. Unless I'm missing info about other .NET Data providers handling transient errors.

From my point of view, transient error recognition and resilience/retry is something that affects all databases, not just Azure SQL - any database can have an intermittent networking issue, or some other transient problem causing it to momentarily fail. I've definitely received requests for this for Npgsql, and IsTransient is in fact implemented there. The goal here is simply to surface the transience information to the user (and/or to any resilience layer they're using 脿 la Polly) so that operations can be retried.

My point above was that including resilience logic inside an ADO driver (as opposed to just exposing IsTransient) makes sense to me mainly for Azure SQL (at least back in the day), since as @ajcvickers said, "SQL Azure was unusable without retries".

However giving precise information to users from DbException like ErrorCode holds more value IMO so client apps can decide which error codes to flag as transient along with the list of "transient" errors that driver would maintain (ref #34798).

The problem with exposing ErrorCode on DbException (as opposed to SqlException) is that there isn't a database-agnostic concept of what an error actually is (see this comment)... SQL Server's error code is an int, but PostgreSQL's is a string. Even if the CLR type were the same, the actual error codes are completely different, so an API consumer wouldn't be able to actually do anything with the code: the moment they'd want to test for anything, they'd be forced to recognize database-specific error codes, and so they may as well just code against SqlException, NpgsqlException, etc.

Let me know if you guys think that the IsTransient property proposed here makes sense as-is (as well as SupportsTransienceDetection).

@roji I think IsTransient is a good first step. I'm on the same page as everyone else here. SupportsTransienceDetection adds a small amount of value. I could take it or leave it.

For reference, we have SqlClient issue #416 that is related to transient errors and retry functionality at the SqlClient level. There's some good background information. The main goal is to reduce the code cost for existing/legacy applications to reliably migrate to a cloud model.

Thanks @David-Engel, and thanks for the link too. I posted a comment there with insights from here on why attempting full resiliency within the database driver may not be feasible, hopefully that's helpful...

Is it possible that I can work on a pull request for IsTransient property or do we need to wait for more discussion?

OK, as I haven't been able to come up with a credible scenario for SupportsTransienceDetection, I'm removing it from the proposal for now. We can rediscuss this if someone comes up with a good scenario/argument.

@rgarrison12345 this still has to get official sign-off from API review, but as this is a tiny addition I'd say go for it if you want to submit a PR. Ping me if you need any guidance.

@roji I created a pull request for this update. This is my first pull request and I hope to contribute more in the future. I didn't create a test for this update, however I have a default value of false on IsTransient, I didn't create a test because I believe this would ultimately be up to the derived class to determine.

Thanks for the PR @rgarrison12345, it may indeed be better to wait until this API proposal is approved before continuing work. I agree that tests here aren't needed, as this is a simple abstraction only and they wouldn't add any value.

@roji I'll hold off until API review is complete. Is there a time frame for when API reviews are completed in general?

Hopefully we'll be able to get this signed off quickly along with #33397.

/cc @terrajobst @stephentoub

Wouldn't it be possible to implement this on a interface, separate from the DbException? I find my self building blocks catching transient errors but some transient errors are better detected by the application itself and not by the official detection. There may be cases where the error is not a transient exception by itself, but it can easily be categorized as such by the situation. It can be really useful to decorate other exceptions with this interface, and not restricting it to DbException as base class.

Example:
Error 208 is not an transient error. But I know that this error in combination with the message "Invalid object name 'sys.sysschobjs'" when using temporary tables is an transient error in Azure. It will never make its way up to the official transient implementation, but this is one of the reasons I need a custom detection strategy today. If I could decorate an Exception with this interface I would be good to go.

@olle-j if you want to implement your custom logic on what you consider transient, how would such an interface be helpful to you? You're going to have to implement a transience detector in any case, outside of the driver, and implement your custom logic there. See EF Core's current transient exception detector for SQL Server for an example - the idea would be to move a lot of that logic (possibly even all of it) into SqlClient.

Error 208 is not an transient error. But I know that this error in combination with the message "Invalid object name 'sys.sysschobjs'" when using temporary tables is an transient error in Azure. It will never make its way up to the official transient implementation, but this is one of the reasons I need a custom detection strategy today.

If indeed this error is transient on Azure when using temporary tables, then I'd expect SqlClient to flag it as such...

If I could decorate an Exception with this interface I would be good to go.

Once I again, if you're implementing your custom strategy, then why would you need to decorate anything? Your strategy would contain the logic to identify whatever you consider as transient.

The framework that handles the retry (any framework really, EF, Polly etc.) can handle IsTransient. It pushes the logic down in the stack. I can of cause make my own transient detector and handle all kinds of errors, but that should not be needed if I only can check IsTransient. Polly would as an example have one line of code, regardless if its created in the SqlClient or not.

I don't want a custom strategy. I only want to handle IsTransient. But its nice to be able to throw these by other frameworks than SqlClient. Like you say. It push the implementation down in the stack.

Does your suggestion take into account TransactionScope? That is where I have to implement custom detection like Polly today, since they span over multiple SqlConnections.

@olle-j I'm still not really understanding what you're asking for - would you have us change other .NET BCL exceptions to implement the interface you're proposing? I don't see how that would make sense, since a SocketException, for example, isn't transient or non-transient on its own - that's up to the specific application to decide based on context. The reason DbException is different, is that it's an abstraction over multiple database drivers, and it's important for these drivers to pass information on transience through that exception.

I don't want a custom strategy. I only want to handle IsTransient.

Once again, I simply don't think that makes sense. Different exceptions are transient to different applications in different contexts.

Does your suggestion take into account TransactionScope? That is where I have to implement custom detection like Polly today, since they span over multiple SqlConnections.

What exact relationship do you see between an exception being transient and TransactionScope? Once again, the point is only to expose information on DbException which could be used by Polly (in addition to other logic). What Polly (or any other retry strategy) should do when a transient (or non-transient) exception is thrown is out of scope here, in my mind.

Video

  • Makes sense as proposed.

C# namespace System.Data.Common { public partial class DbException { public virtual bool IsTransient => false; } }

Now I have a list of transient error codes per database in my transient error recovery strategy classes and that's not very flexible (if error codes are added I have to maintain this list plus I have to look up these codes for all databases supported).

@FransBouma Would you be willing to share the list of MySQL error codes you consider to be transient for MySQL? I think MySqlConnector could adopt them, instead of coming up with a different list (that you'd have to augment because you couldn't trust that IsTransient is sufficient for your needs).

https://github.com/mysql-net/MySqlConnector/issues/849

@bgrainger
I don't have such a list, we only ship a transient error recovery class for SQL Server with boilerplate code for users if they want to add support for their database for that.
Basically any error that doesn't kill the transaction / connection is usable for transient error recovery, however to figure out which ones that are is a painful process with some databases.

Basically any error that doesn't kill the transaction / connection is usable for transient error recovery [...]

I tend to be a bit more conservative on this, e.g. a unique constraint violation isn't something I'd consider transient, since the exact same operation would in principle produce the same failure (even if the same operation but with a different ID would succeed). The way I'm thinking about this is for the development of automatic retry strategies (Polly, EF Core, LLBLGen Pro).

It's indeed a good idea for us to end up with the same concept of transience across providers (that's the whole point of the feature), so let's continue having this conversation.

Doesn't a UC violation terminate the transaction? For SQL Server we use these:

switch(Convert.ToInt32(errorNumber))
{
    // Time out. 
    case -2:
    // DBNETLIB Error Code: 20
    // The instance of SQL Server you attempted to connect to does not support encryption.
    case 20:
    // SQL Error Code: 64
    // A connection was successfully established with the server, but then an error occurred during the login process. 
    // (provider: TCP Provider, error: 0 - The specified network name is no longer available.) 
    case 64:
    // SQL Error Code: 233
    // The client was unable to establish a connection because of an error during connection initialization process before login. 
    // Possible causes include the following: the client tried to connect to an unsupported version of SQL Server; the server was too busy 
    // to accept new connections; or there was a resource limitation (insufficient memory or maximum allowed connections) on the server. 
    // (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
    case 233:
    // SQL Error Code: 1205
    // Dead lock situation. 
    case 1205:
    // SQL Error Code: 10053
    // A transport-level error has occurred when receiving results from the server.
    // An established connection was aborted by the software in your host machine.
    case 10053:
    // SQL Error Code: 10054
    // A transport-level error has occurred when sending the request to the server. 
    // (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
    case 10054:
    // SQL Error Code: 10060
    // A network-related or instance-specific error occurred while establishing a connection to SQL Server. 
    // The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server 
    // is configured to allow remote connections. (provider: TCP Provider, error: 0 - A connection attempt failed 
    // because the connected party did not properly respond after a period of time, or established connection failed 
    // because connected host has failed to respond.)"}
    case 10060:
    // SQL Error Code: 10928
    // Resource ID: %d. The %s limit for the database is %d and has been reached. For more information,
    case 10928:
    // SQL Error Code: 10929
    // Resource ID: %d. The %s minimum guarantee is %d, maximum limit is %d and the current usage for the database is %d.
    // However, the server is currently too busy to support requests greater than %d for this database.
    case 10929:
    // SQL Error Code: 40197
    // The service has encountered an error processing your request. Please try again.
    case 40197:
    // SQL Error Code: 40143
    // The service has encountered an error processing your request. Please try again.
    case 40143:
    // SQL Error Code: 40501
    // The service is currently busy. Retry the request after 10 seconds.
    case 40501:
    // SQL Error Code: 40613
    // Database XXXX on server YYYY is not currently available. Please retry the connection later. If the problem persists, contact customer 
    // support, and provide them the session tracing ID of ZZZZZ.
    case 40613:
        return true;

so basically everything related to creating a connection and keeping it alive. SQL Server documentation is pretty good on this, but I found other database documentation pretty vague on which errors do and which don't terminate a transaction for instance.

The PG logic is pretty simple (and somewhat extreme) - any error puts the ongoing transaction in a failed state, and you can't do anything aside from rolling the transaction back (or rolling back to a savepoint).

But I'm not sure that IsTransient should mean the same thing as "does it terminate the transaction"... I was only thinking of this as an indication that a retry may be successful (even if that retry involves rolling back the transaction and replaying statements). In other words, I'd expect an ORM/retry policy to check this flag, and if it's true, roll back the ongoing transaction (if any) and re-attempt.

It's true that an error may be transient but not cause transaction termination, and so rolling back the entire transaction to retry may be overkill (it could be possible to just retry the latest command). That doesn't seem terribly important to me as transient exceptions aren't supposed to be that common for it to matter (am I wrong?).

Does this make sense to everyone? Am I missing something?

Our application also needs to know whether a different command on a new database connection could succeed, or the entire database or server is down. The application currently recognizes error codes for this purpose but we could perhaps make it instead check which method threw the exception.

@KalleOlaviNiemitalo detecting that the database/server is down is notoriously difficult (e.g. a network issue could indicate the server is down (or not)), am curious what you're current logic for this is...

But either way this seems to be orthogonal (and so out of scope) for this feature, which is only about identifying transient errors - or do you see things otherwise?

@roji Our application does not need to distinguish between a network issue, the server being down, and the database being down. It needs to know that those errors do not depend on the database commands that it was running in the transaction.

I agree that such a feature is not in scope for this issue, but I mentioned it as an example of difficulty in migrating to DbException.IsTransient from application-side recognition of error numbers.

The PG logic is pretty simple (and somewhat extreme) - _any_ error puts the ongoing transaction in a failed state, and you can't do anything aside from rolling the transaction back (or rolling back to a savepoint).

But I'm not sure that IsTransient should mean the same thing as "does it terminate the transaction"... I was only thinking of this as an indication that a retry may be successful (even if that retry involves rolling back the transaction and replaying statements). In other words, I'd expect an ORM/retry policy to check this flag, and if it's true, roll back the ongoing transaction (if any) and re-attempt.

It's true that an error may be transient but _not_ cause transaction termination, and so rolling back the entire transaction to retry may be overkill (it could be possible to just retry the latest command). That doesn't seem terribly important to me as transient exceptions aren't supposed to be that common for it to matter (am I wrong?).

Does this make sense to everyone? Am I missing something?

I think what you describe is very sensible and in practice it likely leads to redoing the whole transaction again anyway, but I think the key difference with e.g. a UC violation in the middle of a transaction is that the error is outside of the scope of the data, i.e. there's nothing to 'correct' on the client / app side; the only thing it can do is retry the whole unit of work.

An error in the data, violating a UC, an FK, retrying the same Unit of work will lead to the same error, as the error is inside the scope of the data.

Would that help determining when the IsTransient flag should be set or not?

@ KalleOlaviNiemitalo

I agree that such a feature is not in scope for this issue, but I mentioned it as an example of difficulty in migrating to DbException.IsTransient from application-side recognition of error numbers.

Yeah, IsTransient unfortunately won't solve everything requiring applications to look at provider-specific error numbers. Note that there's SqlState which is also being added to DbException (https://github.com/dotnet/runtime/issues/35601), which could also help (e.g. for recognizing unique constraint violations across databases).

@FransBouma

An error in the data, violating a UC, an FK, retrying the same Unit of work will lead to the same error, as the error is inside the scope of the data.

There could conceivably be some variation here across products... If retrying the UoW means retrying the exact same statements (with the same ID values), then yeah, the same error should be reproduced; that's why I think that UC/FK violations shouldn't be treated as transient. But it's possible that for some ORM, retrying would also mean generating new IDs, in which case the UC/FK violations would no longer occur.

I do think that in the general case data issues shouldn't be treated as transient - am interested in any other views though.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

omajid picture omajid  路  3Comments

btecu picture btecu  路  3Comments

aggieben picture aggieben  路  3Comments

bencz picture bencz  路  3Comments

jzabroski picture jzabroski  路  3Comments