Since we migrated to 3.1 we've been seeing one of our heavily-trafficked ASP.NET REST API services, which uses EF Core, dropping offline about once or twice a week. We have App Insights enabled, and whenever an outage occurs the Available Memory metric for the service shows the same pattern: a precipitous drop from ~10 GB to under 100 MB over the course of about four or five minutes, followed by degraded servicing of requests (which we've seen last anywhere from a few seconds to about 30 minutes) while the Available Memory trace stays flatlined below 100 MB, followed by the service recycling.
The Event Viewer system trace at this point shows:
An ISAPI reported an unhealthy condition to its worker process. Therefore, the worker process with process id of '' serving application pool '' has requested a recycle.
We have enabled a lot of debug-level logging in an effort to pinpoint which requests might be causing this. That identified some query patterns that were less than efficient in their memory usage (i.e. loading too much data for client-side processing), and we've made changes to resolve those. Unfortunately we're still seeing occasional outages under heavy load.
We're now working on the theory that we have one or more query patterns that are causing deadlocks, since we occasionally (but not always) see deadlock errors in the logs in the periods leading up to the crashes. Analyzing our logs, we've noticed one endpoint in particular that always seems to be "in-flight" at the time of the outages, and it appears to be waiting on SaveChanges the entire time, from the beginning of the "memory leak" to the point where the service dies (i.e. our logs show that SaveChanges never returned). This endpoint is one of our most complex operations in terms of database updates, so we think we've found the culprit, but we're not sure why it would exhibit as out-of-control memory usage instead of run-of-the-mill deadlock errors.
Anyone have any ideas for further troubleshooting, other than analyzing the SQL query patterns for obvious deadlock possibilities (which we're in the process of doing right now)?
Well, we added SQL query logging and found something rather surprising. At the point at which our server falls over due to lack of memory, we're in the middle of the SaveChanges call of our complex operation. After the expected queries that do the updates and inserts for our data, we see queries being logged that appear to be loading all rows, from every table in our database, in alphabetical order... We obviously don't have any code in our solution that attempts to do that, so where could this be coming from?
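(For reference, the SQL logging here is just EF Core's built-in command logging surfaced through Serilog. A rough sketch of the kind of configuration involved; the sink and other settings below are illustrative only, not our exact setup:)

// EF Core logs every executed DbCommand at Information level under the
// "Microsoft.EntityFrameworkCore.Database.Command" category, so lowering the
// minimum level for that one category is enough to capture the SQL shown below.
Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Warning()
    .MinimumLevel.Override("Microsoft.EntityFrameworkCore.Database.Command",
        LogEventLevel.Information)
    .Enrich.WithExceptionDetails()   // Serilog.Exceptions enricher we already had enabled
    .WriteTo.File("api-.log", rollingInterval: RollingInterval.Day)
    .CreateLogger();

Here's the relevant excerpt from the log around the crash: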
--2020-10-01 14:43:30.370 -05:00 [INF] [159] Microsoft.EntityFrameworkCore.Database.Command Executed DbCommand (19ms) [Parameters=...]
SET NOCOUNT ON;
UPDATE [Document] SET ...
< THESE ARE OUR EXPECTED UPDATES AND INSERTS FOR OUR OPERATION >
--2020-10-01 14:43:30.413 -05:00 [INF] [159] Microsoft.EntityFrameworkCore.Database.Command Executed DbCommand (1ms) [Parameters=[], CommandType='"Text"', CommandTimeout='30']
SELECT [a].[ActionId], [a].[Description], [a].[Name]
FROM [Action] AS [a]
--2020-10-01 14:43:30.430 -05:00 [INF] [159] Microsoft.EntityFrameworkCore.Database.Command Executed DbCommand (1ms) [Parameters=[], CommandType='"Text"', CommandTimeout='30']
SELECT [a].[AssociationTypeId], [a].[Description], [a].[Value]
FROM [AssociationType] AS [a]
--2020-10-01 14:43:30.439 -05:00 [INF] [159] Microsoft.EntityFrameworkCore.Database.Command Executed DbCommand (1ms) [Parameters=[], CommandType='"Text"', CommandTimeout='30']
SELECT [a].[AttributeId], [a].[Description], [a].[Name]
FROM [Attribute] AS [a]
--2020-10-01 14:43:30.493 -05:00 [INF] [159] Microsoft.EntityFrameworkCore.Database.Command Executed DbCommand (46ms) [Parameters=[], CommandType='"Text"', CommandTimeout='30']
SELECT [c].[CaptureDocumentId], [c].[CreateDate], [c].[Path], [c].[UserId]
FROM [CaptureDocument] AS [c]
< ETC.. UNTIL WE RUN OUT OF MEMORY >
SELECT [a].[ActionId], [a].[Description], [a].[Name]
FROM [Action] AS [a]
All of the queries like the one quoted above load an entire table. Generally this happens when you catch an exception and try to serialize it while the exception's object graph still contains a reference to the DbContext. Since the DbContext exposes DbSet properties, the serializer enumerates each of them in order to serialize them, which is essentially calling ToList on every DbSet and generates exactly the queries above. It also causes all of the data in the database to be loaded into the client, and depending on the size of the data it is entirely possible to run out of memory.
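A minimal sketch of that failure mode (the catch block below is illustrative, not your actual code): DbUpdateException.Entries exposes EntityEntry objects, each EntityEntry exposes the owning DbContext via its Context property, so a reflection-based destructurer walking the exception graph eventually reaches the context's DbSet properties and enumerates them.

try
{
    context.SaveChanges();
}
catch (DbUpdateException ex)
{
    // ex.Entries -> EntityEntry.Context -> DbContext -> DbSet<T> properties.
    // A DbSet<T> implements IEnumerable, so a serializer that reflects over the
    // exception graph and enumerates each DbSet executes "SELECT ... FROM [Table]"
    // with no filter, table after table, materializing rows until memory runs out.
    Log.Error(ex, "SaveChanges failed");
    throw;
}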
Seen with older versions of Serilog
@smitpatel I understand _what_ is happening, I just don't understand why, or from where.
@ErikEJ We are using Serilog, and the first thing I checked was whether we destructure the context anywhere in a Serilog call (we don't appear to be doing that anywhere in our code). Also, this is happening _inside_ the call to SaveChanges, not from our code. I've added some tracing to the context so we should be able to tell exactly what's trying to "evaluate" it.
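The tracing is roughly along these lines (a sketch of what I added, with only one DbSet shown; the rest of the context is unchanged): each DbSet auto-property is replaced with a getter that logs the current call stack before returning the set.

public class ECIContext : DbContext
{
    private static readonly Serilog.ILogger Tracer = Serilog.Log.ForContext<ECIContext>();

    // "Action" is our existing entity type in DataAccess.ECI.Models.
    public virtual DbSet<Action> Action
    {
        get
        {
            // Environment.StackTrace tells us exactly who touched the DbSet.
            Tracer.Debug("Action DbSet was accessed: {StackTrace}", Environment.StackTrace);
            return Set<Action>();
        }
    }

    // ... remaining DbSets, constructor, and OnModelCreating unchanged ...
}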
Here's a stack trace of the rogue access:
2020-10-05 13:23:39.760 -05:00 [DBG] [38] DataAccess.ECI.Models.ECIContext Action DbSet was accessed:
at DataAccess.ECI.Models.ECIContext.get_Action()
at lambda_method(Closure , Object )
at Serilog.Exceptions.Destructurers.ReflectionBasedDestructurer.DestructureObject(Object value, Type valueType, Int32 level, IDictionary`2 destructuredObjects, Int32& nextCyclicRefId)
at Serilog.Exceptions.Destructurers.ReflectionBasedDestructurer.DestructureValue(Object value, Int32 level, IDictionary`2 destructuredObjects, Int32& nextCyclicRefId)
at Serilog.Exceptions.Destructurers.ReflectionBasedDestructurer.DestructureObject(Object value, Type valueType, Int32 level, IDictionary`2 destructuredObjects, Int32& nextCyclicRefId)
at Serilog.Exceptions.Destructurers.ReflectionBasedDestructurer.DestructureValue(Object value, Int32 level, IDictionary`2 destructuredObjects, Int32& nextCyclicRefId)
at Serilog.Exceptions.Destructurers.ReflectionBasedDestructurer.DestructureValueEnumerable(IEnumerable value, Int32 level, IDictionary`2 destructuredObjects, Int32& nextCyclicRefId)
at Serilog.Exceptions.Destructurers.ReflectionBasedDestructurer.DestructureValue(Object value, Int32 level, IDictionary`2 destructuredObjects, Int32& nextCyclicRefId)
at Serilog.Exceptions.Destructurers.ReflectionBasedDestructurer.AppendProperties(Object value, ReflectionPropertyInfo[] reflectionPropertyInfos, Action`2 addPropertyAction, IDictionary`2 destructuredObjects, Int32 level, Int32& nextCyclicRefId)
at Serilog.Exceptions.Destructurers.ReflectionBasedDestructurer.Destructure(Exception exception, IExceptionPropertiesBag propertiesBag, Func`2 destructureException)
at Serilog.Exceptions.Core.ExceptionEnricher.DestructureException(Exception exception)
at Serilog.Exceptions.Core.ExceptionEnricher.Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
at Serilog.Core.Enrichers.SafeAggregateEnricher.Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
at Serilog.Core.Logger.Dispatch(LogEvent logEvent)
at Serilog.Extensions.Logging.SerilogLogger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter)
at Microsoft.Extensions.Logging.LoggerMessage.<>c__DisplayClass7_0`3.<Define>g__Log|0(ILogger logger, T1 arg1, T2 arg2, T3 arg3, Exception exception)
at Microsoft.EntityFrameworkCore.Diagnostics.CoreLoggerExtensions.ExecutionStrategyRetrying(IDiagnosticsLogger`1 diagnostics, IReadOnlyList`1 exceptionsEncountered, TimeSpan delay, Boolean async)
at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementation[TState,TResult](Func`3 operation, Func`3 verifySucceeded, TState state)
at Microsoft.EntityFrameworkCore.DbContext.SaveChanges(Boolean acceptAllChangesOnSuccess)
at DataAccess.ECI.Models.ECIContext.SaveChanges()
at BusinessLogic.ECI.WorkItemCompletion.WorkItemCompleter.ReleaseAsync(WorkItem synced)
@lesscodetxm That stack trace confirms that this is a Serilog issue.
Are you using the Serilog.Exceptions.EntityFrameworkCore package, see #15214
@ErikEJ Nope, I only just found out that it exists, and that it's required when using EF. Thanks!
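For anyone else who lands here: the fix is to install Serilog.Exceptions.EntityFrameworkCore and register its destructurer, so DbUpdateException is destructured explicitly instead of being reflected into. A configuration sketch based on the Serilog.Exceptions documentation (adapt to your own logger setup):

// With the EF-specific destructurer registered, DbUpdateException is handled
// without walking into the DbContext, so the DbSet properties are never enumerated.
Log.Logger = new LoggerConfiguration()
    .Enrich.WithExceptionDetails(new DestructuringOptionsBuilder()
        .WithDefaultDestructurers()
        .WithDestructurers(new[] { new DbUpdateExceptionDestructurer() }))
    .WriteTo.File("api-.log", rollingInterval: RollingInterval.Day)
    .CreateLogger();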