NLog: Deadlock under load

Created on 19 Apr 2017 · 19 comments · Source: NLog/NLog

Hi! Thanks for reporting this feature/bug/question!

Please keep / fill in the relevant info from this template so that we can help you as best as possible.

For .NET Core users, please check the platform support: https://github.com/NLog/NLog/wiki/platform-support

Type (choose one):

  • Bug

NLog version: 5.0.0-beta06 (and 5.0.0-beta03)

Platform: .NET Core

Current NLog config (xml or C#, if relevant)

<?xml version="1.0" encoding="utf-8" ?>
<nlog xmlns="http://www.nlog-project.org/schemas/NLog.xsd"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      autoReload="true"
      internalLogLevel="Warn"
      internalLogFile="/tmp/internal-nlog.txt">

  <extensions>
    <add assembly="PlayAuth.Common" />
  </extensions>

  <variable name="correlationLayout" value="${when:when=length('${correlationId}') &gt; 0:inner= Id=${correlationId},}" />
  <variable name="messageLayout" value="Message=${replace-newlines:Message=${truncated:Limit=10000:inner=${message}}}" />
  <variable name="exceptionLayout" value="${when:when=length('${exception}') &gt; 0:inner= Type=${exception:Format=Type} Exception=${exception:format=ToString,StackTrace}${newline}}" />

  <targets>
    <target name="splunkfileAsync" xsi:type="AsyncWrapper" queueLimit="20000" overflowAction="Discard">
      <target xsi:type="File" name="file"
              layout="[${date:universalTime=true:format=O}] Level=${level}, Category=${logger},${correlationLayout} ${messageLayout} ${exceptionLayout}"
              fileName="/var/log/starz/playauth.log" archiveFileName="/var/log/starz/log-own.log" archiveAboveSize="52428800" archiveNumbering="Sequence" maxArchiveFiles="2" />
    </target>
    <target xsi:type="Null" name="blackhole" />
  </targets>

  <rules>
    <logger name="Microsoft.*" minlevel="Trace" writeTo="blackhole" final="true" />
    <logger name="NewRelic.*" minlevel="Trace" writeTo="blackhole" final="true" />
    <logger name="Enyim.*" levels="Trace,Debug" writeTo="blackhole" final="true" />
    <logger name="*" minlevel="Trace" writeTo="splunkfileAsync" />
  </rules>
</nlog>

In case of a BUG:

  • What is the current result? Under heavy load, we experience an apparent deadlock and threadpool exhaustion due to a lock somewhere inside NLog
  • What is the expected result? No deadlocks
  • Did you check the internal log? Yes
  • Please post full exception details (message, stacktrace, inner exceptions)
  • Are there any workarounds? No
  • Is there a version in which it worked? Not with .NET Core
  • Can you help us by writing a unit test? Unfortunately, not yet. Attempts to reliably reproduce the problem have not succeeded.

  • Details:
    We are using NLog 5 with .NET Core in an ASP.NET MVC web API application and are encountering what appear to be occasional deadlock issues in production. The problem manifests as a sudden “lock up” of the server: HTTP clients are still able to connect to the socket, but the requests are not serviced. We can see the number of threads in the application spike to roughly 30,000 when the problem occurs. Thus far, we have not been able to reproduce it anywhere besides production. We have encountered this problem with 5.0.0-beta03 as well as with 5.0.0-beta06.

We are running in Docker on Linux. After enlisting some help from some folks at Microsoft, we were finally able to get some thread dumps on one of the locked processes. In each case, it appears that all but a handful of system threads are waiting on a monitor, and that monitor was entered by the nlog AsyncTargetWrapper.Write or PrecalculateVolatileLayouts method.

We’ve looked at the nlog source but are not really able to see anything that would explain the deadlock… hoping that the experts in the NLog project would be able to shed some light. :-)

Any details you have would be greatly appreciated. Thank you very much!

  • Thread dump
(lldb) dumpstack        
OS Thread Id: 0x4515 (6754)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libpthread.so.0!__pthread_cond_wait + 0xbf
Child-SP         RetAddr          Caller, Callee
00007FE140F27310 00007fe6f61e04e2 libcoreclr.so!CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 0x142, calling libcoreclr.so!pthread_cond_wait
00007FE140F27370 00007fe6f61e01a0 libcoreclr.so!CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 0x210, calling libcoreclr.so!CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*)
00007FE140F273D0 00007fe6f61e5051 libcoreclr.so!CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int) + 0x751
00007FE140F27570 00007fe6f5e3254b libcoreclr.so!Thread::DoAppropriateWaitWorker(int, void**, int, unsigned int, WaitMode) + 0x22b, calling libcoreclr.so!WaitForMultipleObjectsEx
00007FE140F27610 00007fe6f5e2d2f0 libcoreclr.so!Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*) + 0x50, calling libcoreclr.so!Thread::DoAppropriateWaitWorker(int, void**, int, unsigned int, WaitMode)
00007FE140F27650 00007fe6f5f1f985 libcoreclr.so!CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 0x65, calling libcoreclr.so!Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*)
00007FE140F276B0 00007fe6f6c5a27d libc.so.6!__clock_gettime + 0x1d
00007FE140F276F0 00007fe6f5e2cc01 libcoreclr.so!AwareLock::EnterEpilogHelper(Thread*, int) + 0x141, calling libcoreclr.so!CLREventBase::Wait(unsigned int, int, PendingSync*)
00007FE140F27780 00007fe6f5e2c22a libcoreclr.so!AwareLock::Enter() + 0xca, calling libcoreclr.so!AwareLock::EnterEpilogHelper(Thread*, int)
00007FE140F277D0 00007fe6f5e2c932 libcoreclr.so!AwareLock::Contention(int) + 0x402, calling libcoreclr.so!AwareLock::Enter()
00007FE140F27880 00007fe6f5eb3c55 libcoreclr.so!JIT_MonContention_Helper(Object*, unsigned char*, void*) + 0x175, calling libcoreclr.so!AwareLock::Contention(int)
00007FE140F27948 00007fe67d71202b (MethodDesc 00007fe67cdb8d20 + 0x6b NLog.Targets.Target.PrecalculateVolatileLayouts(NLog.LogEventInfo)), calling 00007fe6f5eb4530 (stub for System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef))
00007FE140F279F0 00007fe6f5eb3bed libcoreclr.so!JIT_MonContention_Helper(Object*, unsigned char*, void*) + 0x10d, calling libcoreclr.so!LazyMachStateCaptureState
00007FE140F27A40 00007fe6f5eb4656 libcoreclr.so!JIT_MonReliableEnter_Portable + 0x126, calling libcoreclr.so!JIT_MonContention_Helper(Object*, unsigned char*, void*)
00007FE140F27A60 00007fe67d71202b (MethodDesc 00007fe67cdb8d20 + 0x6b NLog.Targets.Target.PrecalculateVolatileLayouts(NLog.LogEventInfo)), calling 00007fe6f5eb4530 (stub for System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef))
00007FE140F27AD0 00007fe67d711c4c (MethodDesc 00007fe67d4aa958 + 0x3c NLog.Targets.Wrappers.AsyncTargetWrapper.Write(NLog.Common.AsyncLogEventInfo)), calling (MethodDesc 00007fe67cdb8d20 + 0 NLog.Targets.Target.PrecalculateVolatileLayouts(NLog.LogEventInfo))
00007FE140F27B10 00007fe67d711b8f (MethodDesc 00007fe67d4aa960 + 0x3f NLog.Targets.Wrappers.AsyncTargetWrapper.WriteAsyncThreadSafe(NLog.Common.AsyncLogEventInfo))
00007FE140F27B50 00007fe67d7119e8 (MethodDesc 00007fe67cdb8d38 + 0x118 NLog.Targets.Target.WriteAsyncLogEvent(NLog.Common.AsyncLogEventInfo))
00007FE140F27BB0 00007fe67d71173d (MethodDesc 00007fe67d4549f8 + 0xcd NLog.LoggerImpl.WriteToTargetWithFilterChain(NLog.Internal.TargetWithFilterChain, NLog.LogEventInfo, NLog.Common.AsyncContinuation)), calling (MethodDesc 00007fe67cdb8d38 + 0 NLog.Targets.Target.WriteAsyncLogEvent(NLog.Common.AsyncLogEventInfo))
00007FE140F27BF0 00007fe67d7115f6 (MethodDesc 00007fe67d454998 + 0x156 NLog.LoggerImpl.Write(System.Type, NLog.Internal.TargetWithFilterChain, NLog.LogEventInfo, NLog.LogFactory)), calling (MethodDesc 00007fe67d4549f8 + 0 NLog.LoggerImpl.WriteToTargetWithFilterChain(NLog.Internal.TargetWithFilterChain, NLog.LogEventInfo, NLog.Common.AsyncContinuation))
00007FE140F27C00 00007fe67d711378 (MethodDesc 00007fe67cdaaa78 + 0xc8 NLog.Logger.WriteToTargets(NLog.LogLevel, System.IFormatProvider, System.String)), calling (MethodDesc 00007fe67cdaa9a8 + 0 NLog.Logger.PrepareLogEventInfo(NLog.LogEventInfo))
00007FE140F27C40 00007fe67db48f14 (MethodDesc 00007fe67d3a9730 + 0x4b4 PlayAuth.Api.Middleware.HealthCheckMiddleware+<GetInternal>d__5.MoveNext()), calling (MethodDesc 00007fe67cdaabb8 + 0 NLog.Logger.Trace(System.String))
00007FE140F27EB0 00007fe67c7787f2 (MethodDesc 00007fe67c33d9e0 + 0x52 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, System.Private.CoreLib]].Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef))
00007FE140F27EC0 00007fe67c778795 (MethodDesc 00007fe67c3640c0 + 0x45 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, System.Private.CoreLib]].Create()), calling libcoreclr.so!JIT_ByRefWriteBarrier
00007FE140F27EF0 00007fe67db485c4 (MethodDesc 00007fe67d38da38 + 0xf4 PlayAuth.Api.Middleware.HealthCheckMiddleware.GetInternal(System.Collections.Generic.IList`1<PlayAuth.Common.Health.IHealthIndicator>)), calling (MethodDesc 00007fe67c33d9e0 + 0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, System.Private.CoreLib]].Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef))
00007FE140F27F70 00007fe67db476d1 (MethodDesc 00007fe67d3a9308 + 0x311 PlayAuth.Api.Middleware.HealthCheckMiddleware+<Invoke>d__3.MoveNext()), calling (MethodDesc 00007fe67d38da38 + 0 PlayAuth.Api.Middleware.HealthCheckMiddleware.GetInternal(System.Collections.Generic.IList`1<PlayAuth.Common.Health.IHealthIndicator>))
00007FE140F28250 00007fe67c77b712 (MethodDesc 00007fe67c33ce98 + 0x52 System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef))
00007FE140F28260 00007fe67c77b6b5 (MethodDesc 00007fe67c364560 + 0x45 System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Create()), calling libcoreclr.so!JIT_ByRefWriteBarrier
00007FE140F28290 00007fe67db4732f (MethodDesc 00007fe67d38da18 + 0x11f PlayAuth.Api.Middleware.HealthCheckMiddleware.Invoke(Microsoft.AspNetCore.Http.HttpContext, System.Collections.Generic.IEnumerable`1<PlayAuth.Common.Health.IHealthIndicator>, Microsoft.Extensions.Options.IOptions`1<PlayAuth.Common.Configuration.EnvironmentConfig>)), calling (MethodDesc 00007fe67c33ce98 + 0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef))
00007FE140F28320 00007fe67db25d6e (MethodDesc 00007fe67d520e20 + 0x22e Microsoft.AspNetCore.StaticFiles.StaticFileMiddleware.Invoke(Microsoft.AspNetCore.Http.HttpContext))
00007FE140F28410 00007fe67db24db4 (MethodDesc 00007fe67dcbb4b0 + 0x204 Microsoft.AspNetCore.Hosting.Internal.RequestServicesContainerMiddleware+<Invoke>d__3.MoveNext())
00007FE140F28450 00007fe67db24b25 (MethodDesc 00007fe67dcbb618 + 0x45 System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.AspNetCore.Hosting.Internal.RequestServicesContainerMiddleware+<Invoke>d__3, Microsoft.AspNetCore.Hosting]](<Invoke>d__3 ByRef)), calling (MethodDesc 00007fe67dcbb4b0 + 0 Microsoft.AspNetCore.Hosting.Internal.RequestServicesContainerMiddleware+<Invoke>d__3.MoveNext())
00007FE140F28460 00007fe67d7a97b7 (MethodDesc 00007fe67da502e8 + 0x57 DomainBoundILStubClass.IL_STUB_InstantiatingStub(Microsoft.Extensions.Logging.LogLevel, Microsoft.Extensions.Logging.EventId, System.Object, System.Exception, System.Func`3<System.Object,System.Exception,System.String>))
00007FE140F28490 00007fe67db24aaa (MethodDesc 00007fe67d4042c0 + 0x7a Microsoft.AspNetCore.Hosting.Internal.RequestServicesContainerMiddleware.Invoke(Microsoft.AspNetCore.Http.HttpContext)), calling (MethodDesc 00007fe67dcbb618 + 0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.AspNetCore.Hosting.Internal.RequestServicesContainerMiddleware+<Invoke>d__3, Microsoft.AspNetCore.Hosting]](<Invoke>d__3 ByRef))
00007FE140F28520 00007fe67db24a11 (MethodDesc 00007fe67da4d530 + 0x11 Microsoft.AspNetCore.Hosting.Internal.HostingApplication.ProcessRequestAsync(Context))
00007FE140F28530 00007fe67db12226 (MethodDesc 00007fe67dc9d890 + 0x736 Microsoft.AspNetCore.Server.Kestrel.Internal.Http.Frame`1+<RequestProcessingAsync>d__2[[Microsoft.AspNetCore.Hosting.Internal.HostingApplication+Context, Microsoft.AspNetCore.Hosting]].MoveNext())
00007FE140F28620 00007fe67db10d25 (MethodDesc 00007fe67dc9ddf0 + 0x45 System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.AspNetCore.Server.Kestrel.Internal.Http.Frame`1+<RequestProcessingAsync>d__2[[Microsoft.AspNetCore.Hosting.Internal.HostingApplication+Context, Microsoft.AspNetCore.Hosting]], Microsoft.AspNetCore.Server.Kestrel]](<RequestProcessingAsync>d__2<Context> ByRef)), calling (MethodDesc 00007fe67dc9d890 + 0 Microsoft.AspNetCore.Server.Kestrel.Internal.Http.Frame`1+<RequestProcessingAsync>d__2[[Microsoft.AspNetCore.Hosting.Internal.HostingApplication+Context, Microsoft.AspNetCore.Hosting]].MoveNext())
00007FE140F28660 00007fe67db10534 (MethodDesc 00007fe67dc852e0 + 0x94 Microsoft.AspNetCore.Server.Kestrel.Internal.Http.Frame`1[[Microsoft.AspNetCore.Hosting.Internal.HostingApplication+Context, Microsoft.AspNetCore.Hosting]].RequestProcessingAsync()), calling (MethodDesc 00007fe67dc9ddf0 + 0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Microsoft.AspNetCore.Server.Kestrel.Internal.Http.Frame`1+<RequestProcessingAsync>d__2[[Microsoft.AspNetCore.Hosting.Internal.HostingApplication+Context, Microsoft.AspNetCore.Hosting]], Microsoft.AspNetCore.Server.Kestrel]](<RequestProcessingAsync>d__2<Context> ByRef))
00007FE140F28730 00007fe67c76bf8c (MethodDesc 00007fe67c362458 + 0x9c System.Threading.Tasks.Task`1[[System.__Canon, System.Private.CoreLib]].InnerInvoke())
00007FE140F28760 00007fe67c7649c6 (MethodDesc 00007fe67c361800 + 0x46 System.Threading.Tasks.Task.Execute())
00007FE140F28790 00007fe67c714e71 (MethodDesc 00007fe67c357a40 + 0x71 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
00007FE140F287A0 00007fe67c76501c (MethodDesc 00007fe67c361880 + 0x2ec System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)), calling libcoreclr.so!JIT_CheckedWriteBarrier
00007FE140F287E0 00007fe67c764fc7 (MethodDesc 00007fe67c361880 + 0x297 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)), calling (MethodDesc 00007fe67c357a40 + 0 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
00007FE140F288A0 00007fe67c764ce8 (MethodDesc 00007fe67c361870 + 0x98 System.Threading.Tasks.Task.ExecuteEntry(Boolean)), calling (MethodDesc 00007fe67c361880 + 0 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef))
00007FE140F288C0 00007fe67c77ad25 (MethodDesc 00007fe67c3643a0 + 0x1a5 System.Threading.ThreadPoolWorkQueue.Dispatch())
00007FE140F28920 00007fe6f5f53a37 libcoreclr.so!CallDescrWorkerInternal + 0x7c
00007FE140F28940 00007fe6f5e684bc libcoreclr.so!MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 0x36c, calling libcoreclr.so!CallDescrWorkerInternal
00007FE140F28960 00007fe6f5e684c8 libcoreclr.so!MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 0x378, calling libcoreclr.so!NativeExceptionHolderBase::~NativeExceptionHolderBase()
00007FE140F289A0 00007fe6f61c254f libcoreclr.so!UnicodeToUTF8 + 0xcf, calling libcoreclr.so!UTF8Encoding::GetBytes(char16_t*, int, unsigned char*, int)
00007FE140F28A00 00007fe6f5e212a8 libcoreclr.so!MetaSig::GetReturnTypeNormalized(TypeHandle*) const + 0x58, calling libcoreclr.so!SigPointer::PeekElemTypeClosed(Module*, SigTypeContext const*) const
00007FE140F28A20 00007fe6f5e212a8 libcoreclr.so!MetaSig::GetReturnTypeNormalized(TypeHandle*) const + 0x58, calling libcoreclr.so!SigPointer::PeekElemTypeClosed(Module*, SigTypeContext const*) const
00007FE140F28B00 00007fe6f5ff4b16 libcoreclr.so!QueueUserWorkItemManagedCallback(void*) + 0xc6, calling libcoreclr.so!MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)
00007FE140F28C20 00007fe6f5e384ee libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 0x1ae
00007FE140F28C30 00007fe6f5e385a3 libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 0x263, calling libcoreclr.so!Frame::PopIfChained()
00007FE140F28C40 00007fe6f5db02c1 libcoreclr.so!CExecutionEngine::GetTlsData() + 0x11, calling libcoreclr.so!__tls_get_addr
00007FE140F28D40 00007fe6f5e38ca6 libcoreclr.so!ManagedThreadBase::ThreadPool(ADID, void (*)(void*), void*) + 0x46, calling libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)
00007FE140F28D90 00007fe6f5fd5287 libcoreclr.so!ManagedPerAppDomainTPCount::DispatchWorkItem(bool*, bool*) + 0x127, calling libcoreclr.so!ManagedThreadBase::ThreadPool(ADID, void (*)(void*), void*)
00007FE140F28DE0 00007fe6f6c5a27d libc.so.6!__clock_gettime + 0x1d
00007FE140F28DF0 00007fe6f61cbbe3 libcoreclr.so!GetTickCount + 0x23, calling libcoreclr.so!clock_gettime
00007FE140F28E40 00007fe6f5fd4ae7 libcoreclr.so!PerAppDomainTPCountList::GetAppDomainIndexForThreadpoolDispatch() + 0x47
00007FE140F28E80 00007fe6f5e58a2c libcoreclr.so!ThreadpoolMgr::WorkerThreadStart(void*) + 0x4bc
00007FE140F28EF0 00007fe6f61eb1a9 libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void*) + 0x129
00007FE140F28FB0 00007fe6f6c4d62d libc.so.6!clone + 0x6d

We are also using a custom renderer:

    [LayoutRenderer("correlationId")]
    public class CorrelationIdLayoutRenderer : LayoutRenderer
    {
        private IHttpContextAccessor _httpContextAccessor;

        private IHttpContextAccessor HttpContextAccessor =>
            _httpContextAccessor ?? (_httpContextAccessor = ServiceLocator.ServiceProvider?.GetService<IHttpContextAccessor>());

        protected override void Append(StringBuilder builder, LogEventInfo logEvent)
        {
            var context = HttpContextAccessor?.HttpContext?.RequestServices?.GetService(typeof(LogContext)) as LogContext;

            if (context != null)
            {
                builder.Append(context.CorrelationId.ToString("N"));
            }
        }
    }


All 19 comments

Sounds like a difficult problem. It means one of the 30,000 threads (why so many?) is holding the Target SyncRoot but is blocked by something else and cannot continue, thus blocking all the other 29,999 threads. Like finding a needle in a haystack.

Minor observation: it is the AsyncTargetWrapper SyncRoot lock that is being blocked on, and that lock is only used for layout rendering and for initializing/closing. Unless an AppDomain is being created/destroyed, it is most likely a thread stuck in a layout renderer.

Yeah, it's been a very difficult problem to solve. The application normally runs at about 30-40 threads (oftentimes for multiple days, though the issue occasionally occurs within a few hours) until this issue shows up, and then it jumps to ~30,000 threads.

No AppDomains are being created/destroyed that I am aware of, unless Kestrel is somehow trying to do that. I will have a look at the layout renderers we're using and see if anything comes up there.

We are facing a similar issue with, I believe, the same root cause. Fortunately, I was able to reproduce it on my dev machine sometimes. I was about to create a bug here, because using layout renderers does not seem to be thread safe in all cases. Our scenario:

Thread A:

  • thread A tries to get an object using ServiceProvider.GetService()
  • down the road, thread A locks ServiceProvider.ResolvedServices in Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitScoped()
  • thread A is going to log some message, so it tries to lock SyncRoot in NLog.Targets.Target.PrecalculateVolatileLayouts() but has to wait for thread B

Thread B:

  • thread B is going to log some message, so it locks SyncRoot in NLog.Targets.Target.PrecalculateVolatileLayouts()
  • in AspNetLayoutRendererBase, thread B tries to get the HttpContextAccessor using ServiceProvider.GetService() (in the first comment above it might be CorrelationIdLayoutRenderer instead of AspNetLayoutRendererBase)
  • down the road, thread B tries to lock ServiceProvider.ResolvedServices in Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitScoped() but has to wait for thread A = DEADLOCK
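The two-thread scenario above is a classic lock-order inversion, and it can be sketched as a self-contained repro. In this sketch, `TargetSyncRoot` and `ResolvedServices` are hypothetical stand-ins for the NLog target's SyncRoot and the DI container's scoped-service dictionary (this is not NLog's actual code), and `Monitor.TryEnter` with a timeout is used so the demo reports the deadlock instead of hanging forever:

```csharp
using System;
using System.Threading;

class LockInversionDemo
{
    static readonly object TargetSyncRoot = new object();
    static readonly object ResolvedServices = new object();

    public static Tuple<bool, bool> Run()
    {
        bool aStuck = false, bStuck = false;
        // Ensures both threads hold their first lock before either
        // attempts the second, making the deadlock deterministic.
        var bothHoldFirstLock = new Barrier(2);

        // Thread A: resolve a service, then log (DI lock -> NLog lock).
        var a = new Thread(() =>
        {
            lock (ResolvedServices)
            {
                bothHoldFirstLock.SignalAndWait();
                aStuck = !Monitor.TryEnter(TargetSyncRoot, 200);
                if (!aStuck) Monitor.Exit(TargetSyncRoot);
            }
        });

        // Thread B: log, and the layout renderer resolves a service
        // while rendering (NLog lock -> DI lock): the opposite order.
        var b = new Thread(() =>
        {
            lock (TargetSyncRoot)
            {
                bothHoldFirstLock.SignalAndWait();
                bStuck = !Monitor.TryEnter(ResolvedServices, 200);
                if (!bStuck) Monitor.Exit(ResolvedServices);
            }
        });

        a.Start(); b.Start();
        a.Join(); b.Join();
        return Tuple.Create(aStuck, bStuck);
    }

    static void Main()
    {
        var result = Run();
        Console.WriteLine("A stuck: " + result.Item1 + ", B stuck: " + result.Item2);
    }
}
```

Running it prints `A stuck: True, B stuck: True`: each thread times out waiting for the lock the other holds. The general cure is to never take the second lock while holding the first, for example by resolving services once up front rather than on every rendered event.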

@Rosevlad Nice of you to drop by and explain why. Guess it is a bad idea to make use of the ServiceProvider inside a layout renderer, as layout renderers are executed while holding an NLog Target SyncRoot lock.

@natethegreat44 Maybe instead of using a custom layout renderer, inject the correlationId as a LogEvent property? (This happens outside the NLog Target SyncRoot and should not cause a deadlock.)

Example of how to create the raw LogEvent, and append additional properties:

https://github.com/NLog/NLog/wiki/EventProperties-Layout-Renderer

Guess one could make a special extension/helper method that makes it easier to include the CorrelationId by default everywhere.

Then it is just a matter of adjusting the NLog target to use the standard layout renderer for displaying NLog LogEvent properties.
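The linked wiki page boils down to something like the following sketch (hedged: `logger` and `correlationId` are assumed to already exist at the call site, and the property name is illustrative):

```csharp
// Build the LogEventInfo yourself and attach the correlation id as an
// event property before handing it to NLog. The value is captured on
// the caller's thread, outside the target's SyncRoot lock.
var logEvent = new LogEventInfo(LogLevel.Info, logger.Name, "Request handled");
logEvent.Properties["CorrelationId"] = correlationId;
logger.Log(logEvent);
```

The file target layout can then render it with `${event-properties:item=CorrelationId}` instead of invoking a custom renderer that resolves services under the lock.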

P.S. It could be neat if one could hook into an instance of the NLog Logger, so it would automatically call a delegate whenever a LogEventInfo is posted. This delegate could then add custom context metadata as LogEventInfo properties (or modify other things). The same delegate could probably also be used to perform custom filtering. Or something similar to #528, where layout renderers are executed on each created LogEvent and appended as LogEvent properties (this would need a warning label, as not all layout renderers are thread-safe and cannot be used in this context, but that part is deliberately left unsafe).

Yet another alternative is to make use of the MappedDiagnosticsLogicalContext (MDLC):

https://tpodolak.com/blog/2016/07/05/nlog-tracking-flow-of-requests-using-mappeddiagnosticslogicalcontext/

It is not available right now, but it should come with #2025 (will be released in 5.0 beta 7).

Then it should just be a matter of setting the request correlation id in the MDLC at request start; after that, one can refer to it in the NLog target layout.
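A sketch of what that might look like once the MDLC support from #2025 is available (the key name `CorrelationId` is an assumption for illustration):

```csharp
// At the start of each request (e.g. in an ASP.NET Core middleware),
// stash the correlation id in the logical context so it flows along
// the async call chain for that request.
NLog.MappedDiagnosticsLogicalContext.Set("CorrelationId", Guid.NewGuid().ToString("N"));
```

The target layout could then reference `${mdlc:item=CorrelationId}`, with no service resolution happening inside the target lock.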

Yet another option, that only works for ASP.NET Core is to use HttpContext.TraceIdentifier:

https://github.com/NLog/NLog.Web/issues/92

Thank you @snakefoot - will give it a try. Simpler is definitely better. :-)

@snakefoot I'm working with @natethegreat44 to debug this issue in our API. After changing the layout to work as described in NLog/NLog.Web#92, we are seeing the API get into a deadlocked state if we start making simple requests to the API as it is starting up.
To troubleshoot further, I ended up commenting out the body of the DoAppend method so it is just an empty override of AspNetLayoutRendererBase, and I am still seeing the API get deadlocked.

This isn't reproducible 100% of the time - more like 50% - but it is notably an issue. After removing the reference to this layout renderer from NLog.config, I can no longer reproduce the deadlocks.

Would a dump from WinDbg, or the stacktraces out of WinDbg be helpful?

@sdaub I just peeked at the code:

https://github.com/NLog/NLog.Web/blob/master/NLog.Web.AspNetCore/LayoutRenderers/AspNetTraceIdentifierLayoutRenderer.cs

That inherits from:

https://github.com/NLog/NLog.Web/blob/master/NLog.Web.AspNetCore/LayoutRenderers/AspNetLayoutRendererBase.cs

That has the exact same bad behavior that you were trying to avoid:

        public IHttpContextAccessor HttpContextAccessor
        {
            get { return _httpContextAccessor ?? ServiceLocator.ServiceProvider?.GetService<IHttpContextAccessor>(); }
            set { _httpContextAccessor = value; }
        }

It seems the standard NLog ASP.NET Core renderers are not completely safe to use. And it seems difficult to assign AspNetTraceIdentifierLayoutRenderer.HttpContextAccessor manually unless you build the entire config from code.

I know very little about the ASP.NET Core platform, but I guess a solution would be to kick some life back into your original horse:

    [LayoutRenderer("correlationId")]
    public class CorrelationIdLayoutRenderer : LayoutRenderer
    {
        private IHttpContextAccessor _httpContextAccessor;

        private IHttpContextAccessor HttpContextAccessor
        {
            get
            {
                return _httpContextAccessor ?? (_httpContextAccessor = ServiceLocator.ServiceProvider?.GetService<IHttpContextAccessor>());
            }
        }

        protected override void InitializeLayoutRenderer()
        {
            if (HttpContextAccessor == null)
            {
                NLog.Common.InternalLogger.Warn("{{correlationId}} is missing IHttpContextAccessor. Has it been registered before loading NLog Configuration? Consider reloading NLog Configuration after having registered the IHttpContextAccessor.");
            }
        }

        protected override void Append(StringBuilder builder, LogEventInfo logEvent)
        {
            var context = HttpContextAccessor?.HttpContext;
            if (context != null)
            {
                builder.Append(context.TraceIdentifier);
            }
        }
    }

Know very little about the ASP.NET CORE platform, but I guess a solution would be to kick life into your original horse:

If this is the fix and it works correctly, then we could also patch it in NLog.Web.

I was not aware that ServiceLocator.ServiceProvider?.GetService was taking a lock.

Thank you for the feedback. We've been working on a few changes for it and here is the solution we arrived at:

[LayoutRenderer("correlationId")]
public class CorrelationIdLayoutRenderer : LayoutRenderer
{
    private IHttpContextAccessor _httpContextAccessor;

    protected override void InitializeLayoutRenderer()
    {
        _httpContextAccessor = ServiceLocator.ServiceProvider?.GetService<IHttpContextAccessor>();

        if (_httpContextAccessor == null)
        {
            NLog.Common.InternalLogger.Warn("{{correlationId}} is missing IHttpContextAccessor. Has it been registered before loading NLog Configuration? Consider reloading NLog Configuration after having registered the IHttpContextAccessor.");
        }
    }

    protected override void Append(StringBuilder builder, LogEventInfo logEvent)
    {
        var context = _httpContextAccessor?.HttpContext;

        if (context != null)
        {
            builder.Append(context.TraceIdentifier);
        }
    }
}

We tested this change and ran some load tests in our QA environment. It seemed to work well enough to promote it to production. It's very difficult to tell definitively from the load tests whether this will fix the issue, as we only see servers deadlock in production over a period of time. The change was deployed to production yesterday and has been fine so far. I will update next week after it has had some time to prove itself in the real world, but so far it looks promising.

On a related note, our production environment is running in AWS, with 10 Docker instances servicing requests. While researching a separate issue in our aggregated logs (in Splunk), I discovered that the TraceIdentifier is duplicated across instances, so it can be confusing when trying to examine the events logged for a single request to an instance of the API. It depends on how each application is set up, but for us that is not a good long-term solution. We are looking at the following change to get back to uniquely identifying requests across all instances:

public class TraceIdMiddleware
{
    private readonly RequestDelegate _next;

    public TraceIdMiddleware(RequestDelegate next)
    {
        _next = next;
    }

    public async Task Invoke(HttpContext context)
    {
        context.TraceIdentifier = Guid.NewGuid().ToString();
        await _next(context);
    }
}

@sdaub Happy to hear the updated layout renderer passed the QA test; let's hope it also survives production. Please keep us updated. It would be great to save others from this pitfall (by updating the NLog.Web layout renderers).

Amazed that the default GUID generated for TraceIdentifier actually produces duplicate values across instances. I thought the default would be a combination of details such as hostname, process id, thread id and timestamp (sounds like a bug in the Docker instance configuration).

I like the idea of having a GUID that actually works (kind of expected from the "Global" part). But as an alternative to re-assigning context.TraceIdentifier, you could "just" capture an additional detail like the instance name (or something else) and generate a combined primary key.

Just curious: any reason why you removed the safety net (for the case where the layout renderer is initialized before the IHttpContextAccessor is registered)? Or was it just to be completely sure no race condition would ever occur:

    private IHttpContextAccessor _httpContextAccessor;

    private IHttpContextAccessor HttpContextAccessor
    {
        get
        {
            return _httpContextAccessor ?? (_httpContextAccessor = ServiceLocator.ServiceProvider?.GetService<IHttpContextAccessor>());
        }
    }

I was surprised when I saw the duplication of trace ids in the logs. .NET Core is not using a Guid for these ids. In fact, it's just an encoding of DateTime.Now.Ticks.

We thought about doing something like you propose and adding an instanceId or other instance identifier to the TraceId. We have another project where we are trying out the middleware that sets the TraceId to a new Guid. If that continues to work OK, we'll deploy it next week.

As for the difference in code: the developer that worked on this went through a number of variations. In some cases the HttpContextAccessor was set, but its HttpContext property was always null. Through a number of iterations, this is where we landed. If this all works to resolve the current problem, we may revisit the code to clean it up a little.

I'll update next week with how things are working in Production.

Thanks again for your support.

Just following up with an update.

Since this change was deployed to production we had a significant increase in the volume of client requests to our API, and additional load over the weekend due to increased use of our system.

In spite of handling more than double the typical load, our instances had no problems at all! Typically we would have seen at least one instance become unresponsive due to deadlocks by now. I'm very optimistic that this latest change has resolved the deadlocking issue. I will learn more today, as we have three new client applications hitting their respective stores, but I'm confident we won't have any more issues.

Thanks for your support on this issue.

@snakefoot can we fix NLog.Web now?

@304NotModified Yes it should be an easy fix for AspNetLayoutRendererBase.cs

Just replace these lines:

public IHttpContextAccessor HttpContextAccessor
{
    get { return _httpContextAccessor ?? ServiceLocator.ServiceProvider?.GetService<IHttpContextAccessor>(); }
    set { _httpContextAccessor = value; }
}

With these lines:

public IHttpContextAccessor HttpContextAccessor
{
    get { return _httpContextAccessor ?? (_httpContextAccessor = ServiceLocator.ServiceProvider?.GetService<IHttpContextAccessor>()); }
    set { _httpContextAccessor = value; }
}

And then add these lines:

protected override void InitializeLayoutRenderer()
{
    if (HttpContextAccessor == null)
    {
        NLog.Common.InternalLogger.Warn("Missing IHttpContextAccessor. Has it been registered before loading NLog Configuration? Consider reloading NLog Configuration after having registered the IHttpContextAccessor.");
    }
}

thanks @snakefoot !!

Updated NLog.Web and NLog.Extensions.Logging packages have been released. This should resolve the issue by fixing the official NLog layout renderer: ${aspnet-traceidentifier}

https://github.com/NLog/NLog.Web/pull/133 has been merged and released in Web.AspNetCore 4.4.0, so closing this.
