Runtime: Random Thread Pool Starvation on ASP.NET Core 2.2.4

Created on 10 Jun 2019 · 14 Comments · Source: dotnet/runtime

We're experiencing random thread pool starvation on ASP.NET Core 2.2.4 on CentOS Linux. The .NET Core worker threads are steadily consumed and never freed up again.

The issue occurs completely at random: every two weeks or so, sometimes only once a month, sometimes every second day. Sometimes it happens under full production load, sometimes with nearly no load, or no load at all.

When it occurs, the thread pool worker count (we monitor with Dynatrace) constantly goes from 0 to 1000 (our current maximum).

When using lldb + SOS plugin to analyze the cause, we just see that the threads hang in SslState with an exception:

STACK (all the threads look the same):

(lldb) clrstack
OS Thread Id: 0x17f12 (74)
        Child SP               IP Call Site
00007F11DC7F7090 00007f1731cf74ed [HelperMethodFrame: 00007f11dc7f7090]
00007F11DC7F7210 00007F16BF0ADFD6 System.Net.Security.SslState.StartReadFrame(Byte[], Int32, System.Net.AsyncProtocolRequest) [/root/corefx-2518272/src/System.Net.Security/src/System/Net/Security/SslState.cs @ 879]
00007F11DC7F7250 00007F16BF0AF8DC System.Net.Security.SslState.PartialFrameCallback(System.Net.AsyncProtocolRequest) [/root/corefx-2518272/src/System.Net.Security/src/System/Net/Security/SslState.cs @ 1102]
00007F11DC7F7280 00007F16BF0ADB7E System.Net.FixedSizeReader+<ReadPacketAsync>d__1.MoveNext() [/root/corefx-2518272/src/System.Net.Security/src/System/Net/FixedSizeReader.cs @ 63]
00007F11DC7F7360 00007F16B78FEDDD System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/root/coreclr/src/mscorlib/shared/System/Threading/ExecutionContext.cs @ 167]
00007F11DC7F73C0 00007F16BF0AF3D0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[System.Net.FixedSizeReader+<ReadPacketAsync>d__1, System.Net.Security]].MoveNext() [/root/coreclr/src/mscorlib/src/System/Runtime/CompilerServices/AsyncMethodBuilder.cs @ 568]
00007F11DC7F73F0 00007F16BC50022F System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean) [/root/coreclr/src/mscorlib/src/System/Threading/Tasks/TaskContinuation.cs @ 798]
00007F11DC7F7430 00007F16B82D6B05 System.Threading.Tasks.Task.RunContinuations(System.Object) [/root/coreclr/src/mscorlib/src/System/Threading/Tasks/Task.cs @ 3275]
00007F11DC7F74D0 00007F16B82D6961 System.Threading.Tasks.Task.FinishContinuations()
00007F11DC7F7500 00007F16BF0CCB0A System.Threading.Tasks.Task`1[[System.Int32, System.Private.CoreLib]].TrySetResult(Int32) [/root/coreclr/src/mscorlib/src/System/Threading/Tasks/future.cs @ 424]
00007F11DC7F7530 00007F16BF0CCA41 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Int32, System.Private.CoreLib]].SetExistingTaskResult(Int32) [/root/coreclr/src/mscorlib/src/System/Runtime/CompilerServices/AsyncMethodBuilder.cs @ 646]
00007F11DC7F7550 00007F16BF952AC6 Microsoft.AspNetCore.Server.Kestrel.Core.Adapter.Internal.RawStream+<ReadAsyncInternal>d__24.MoveNext()
00007F11DC7F77B0 00007F16B78FEDDD System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/root/coreclr/src/mscorlib/shared/System/Threading/ExecutionContext.cs @ 167]
00007F11DC7F7810 00007F16BF958E60 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Int32, System.Private.CoreLib],[Microsoft.AspNetCore.Server.Kestrel.Core.Adapter.Internal.RawStream+<ReadAsyncInternal>d__24, Microsoft.AspNetCore.Server.Kestrel.Core]].MoveNext() [/root/coreclr/src/mscorlib/src/System/Runtime/CompilerServices/AsyncMethodBuilder.cs @ 568]
00007F11DC7F7840 00007F16B78FEDDD System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/root/coreclr/src/mscorlib/shared/System/Threading/ExecutionContext.cs @ 167]
00007F11DC7F78A0 00007F16B82D31F4 System.Threading.ThreadPoolWorkQueue.Dispatch() [/root/coreclr/src/mscorlib/src/System/Threading/ThreadPool.cs @ 588]
00007F11DC7F7C70 00007f17305e017f [DebuggerU2MCatchHandlerFrame: 00007f11dc7f7c70]

THE EXCEPTION:

(lldb) pe
Exception object: 00007f155142b4a8
Exception type:   System.IO.IOException
Message:          Authentication failed because the remote party has closed the transport stream.
InnerException:   <none>
StackTrace (generated):
<none>
StackTraceString: <none>
HResult: 80131620

Excerpt from the thread list:

 174  232 18109 00007F1220026E10  1021220 Preemptive  00007F1351AF3FD8:00007F1351AF5018 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1351af3ec0
 176  233 18117 00007F1220044550  1021220 Preemptive  00007F12512A1C78:00007F12512A28C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512a1b60
 177  234 18118 00007F16A4047BC0  1021220 Preemptive  00007F1351B036C8:00007F1351B05018 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1351b035b0
 179  235 1811b 00007F123C074EB0  1021220 Preemptive  00007F1351AFC1D8:00007F1351AFD018 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1351afc0c0
 180  236 1811c 00007F120C6A7050  1021220 Preemptive  00007F12512A5F08:00007F12512A68C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512a5df0
 183  237 1812b 00007F12341E97F0  1021220 Preemptive  00007F1451C10AA0:00007F1451C10D50 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1451c10988
 184  238 1812c 00007F1690066BE0  1021220 Preemptive  00007F1451C775B8:00007F1451C78DE0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1451c774a0
 188  239 18147 00007F120C0F72F0  1021220 Preemptive  00007F1351B06EC0:00007F1351B07018 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1351b06da8
 189  240 18148 00007F16A4035600  1021220 Preemptive  00007F1451C3E138:00007F1451C3ED80 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1451c3e020
 190  241 18149 00007F16A4023550  1021220 Preemptive  00007F1351B0CFB0:00007F1351B0D018 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1351b0ce98
 191  242 1814a 000000000469BCE0  1021220 Preemptive  00007F12512B11D0:00007F12512B28C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512b10b8
 193  243 1816a 00007F169007BDC0  1021220 Preemptive  00007F12512B6A58:00007F12512B88C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512b6940
 195  244 1816f 00007F11F011BB10  1021220 Preemptive  00007F12512B3AC8:00007F12512B48C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512b39b0
 196  245 18170 00007F16A0095C20  1021220 Preemptive  00007F15514E0410:00007F15514E0680 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f15514e02f8
 197  246 18171 00007F11EA060CD0  1021220 Preemptive  00007F1351B12248:00007F1351B13018 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1351b12130
 199  247 1817f 00007F16A002E700  1021220 Preemptive  00007F12512B6170:00007F12512B68C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512b6058
 201  248 18191 00007F16A40415F0  1021220 Preemptive  00007F155150FA08:00007F15515106B0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f155150f8f0
 205  249 181a3 00007F11F8038DE0  1021220 Preemptive  00007F12512B9C90:00007F12512BA8C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512b9b78
 206  250 181a4 00007F12040434F0  1021220 Preemptive  00007F15514F1EE8:00007F15514F2680 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f15514f1dd0
 207  251 181ae 00007F11E405F210  1021220 Preemptive  00007F12512BC2F0:00007F12512BC8C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512bc1d8
 208  252 181af 00007F121417AF20  1021220 Preemptive  00007F1451C85088:00007F1451C86DE0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1451c84f70
 209  253 181bb 00007F123C0526D0  1021220 Preemptive  00007F1451C73AF8:00007F1451C74DE0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1451c739e0
 210  254 181bf 00007F1238024FF0  1021220 Preemptive  00007F1351B14410:00007F1351B15018 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1351b142f8
 211  255 181c0 00007F1214094750  1021220 Preemptive  00007F1551553338:00007F15515546B0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1551553220
 212  256 181c1 00007F123C00F630  1021220 Preemptive  00007F1551529F20:00007F155152A6B0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1551529e08
 215  258 181cf 00007F169008D990  1021220 Preemptive  00007F155151B920:00007F155151C6B0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f155151b808
 216  257 181d0 00007F1694008570  1021220 Preemptive  00007F1351B20358:00007F1351B21018 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1351b20240
 220  259 181f1 00007F123C01A820  1021220 Preemptive  00007F1351B22310:00007F1351B23018 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1351b221f8
 221  260 181f2 00007F11E406BEE0  1021220 Preemptive  00007F12512C3968:00007F12512C48C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512c3850
 224  261 18201 00007F16A0028950  1021220 Preemptive  00007F1451C7FC70:00007F1451C80DE0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1451c7fb58
 225  262 18202 0000000004909F90  1021220 Preemptive  00007F12512D6A58:00007F12512D88C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512d6940
 228  263 18211 00007F11F00B4F80  1021220 Preemptive  00007F12512CBBE0:00007F12512CC8C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512cbac8
 229  264 18212 00007F11F45B37D0  1021220 Preemptive  00007F1451C9A590:00007F1451C9ADE0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1451c9a478
 230  265 18213 00007F123C027FB0  1021220 Preemptive  00007F1551536628:00007F15515366B0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1551536510
 231  266 18214 00007F123801D1F0  1021220 Preemptive  00007F1451C8F398:00007F1451C90DE0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f1451c8f280
 233  267 1822a 00007F11F41D29C0  1021220 Preemptive  00007F15515551D8:00007F15515566B0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f15515550c0
 232  268 18229 00007F120807F5D0  1021220 Preemptive  00007F12512D4400:00007F12512D48C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512d42e8
 236  269 1823b 00007F120C6DB140  1021220 Preemptive  00007F12512D2868:00007F12512D28C0 0000000001DB4F80 0     Ukn (Threadpool Worker) System.IO.IOException 00007f12512d2750

Up to now we have been unable to diagnose the cause of the problem. Can you help?

area-System.Threading

Most helpful comment

we monitor with Dynatrace

Dynatrace is a profiler. The hang may be inside Dynatrace. clrstack -f or bt would confirm or disprove this hypothesis.

All 14 comments

@rh78, for one of these hung threads, can you share the output of clrstack -f rather than just clrstack?

When it occurs, the thread pool worker count (we monitor with Dynatrace) constantly goes from 0 to 1000 (our current maximum).

To make sure I understand, when this happens you're saying you end up with 1000 threads, all with a call stack like the above, all stuck inside a helper frame from StartReadFrame, and you hang at 1000 threads because you've explicitly set the max number of thread pool threads to 1000?

Hello, sorry, I cannot do clrstack -f because I had to restart the process and the minidump failed.

Yes, that's right, we hardcoded the maximum to 1000. Before that we had no explicit limit, and the thread count kept growing until it ran into some internal limit.
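For reference, a minimal sketch of how such a cap can be applied at process startup, assuming it is done via ThreadPool.SetMaxThreads (where exactly the limit is set is not shown in this thread):

using System;
using System.Threading;

static class ThreadPoolCap
{
    // Hypothetical helper: cap the worker threads at 1000 while keeping the
    // completion-port maximum that is already configured.
    public static void Apply(int maxWorkers = 1000)
    {
        ThreadPool.GetMaxThreads(out int currentWorkers, out int currentIocp);

        // SetMaxThreads returns false if the request is rejected (for example
        // when it is below the processor count); the old limits then remain.
        if (!ThreadPool.SetMaxThreads(maxWorkers, currentIocp))
        {
            Console.WriteLine(
                $"Could not cap worker threads at {maxWorkers}; current max is {currentWorkers}.");
        }
    }
}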

@jkotas, @janvorli, any thoughts here? The line showing as current in the stack frame is https://github.com/dotnet/corefx/blob/7ff91c4ea9157d27456909e2387af9e1f0ba6b02/src/System.Net.Security/src/System/Net/Security/SslState.cs#L879 which is throwing the exact exception type and error message showing as current on the thread, and it's then apparently hung in some helper runtime call. That suggests to me that it's somehow blocked indefinitely while doing the exception throw, but I've never seen this before. Do you think I'm misinterpreting? Know of any issues like this? Suggestions on what else @rh78 could gather the next time this happens to help diagnose?

@rh78, are there any threads that don't have this same stack? It might be helpful to see the details of those, e.g. clrstack -f, the next time this happens. Maybe there's some lock that all these threads are blocking on and some other thread is erroneously holding it indefinitely?

It looks like a possible hang in the exception handling code. But to be able to tell more, I would need to get the clrstack -f (or just bt) command output at the time of failure.

we monitor with Dynatrace

Dynatrace is a profiler. The hang may be inside Dynatrace. clrstack -f or bt would confirm or disprove this hypothesis.

OK I will check next time. What does the -f flag do?

What does the -f flag do?

It includes native frames in addition to the managed ones.

Hello. It occurred again. But before I go into detail, here is what I've changed:

  • When I reported this issue, our Azure load balancer did TCP-based health checks against the HTTPS port. I thought this might be a problem, so I switched the Azure load balancer health checks to HTTP against an HTTP URL/port (a sketch of such an endpoint follows after this list)
  • I installed an updated version of LLDB on our servers; it is now LLDB 5.0.1
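For context, this is roughly what an HTTP health-check endpoint looks like in ASP.NET Core 2.2. The "/health" path and the surrounding Startup code are illustrative assumptions, not the configuration used in this issue:

// Startup.cs (ASP.NET Core 2.2)
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddHealthChecks();        // built-in health check services
        services.AddMvc();
    }

    public void Configure(IApplicationBuilder app)
    {
        app.UseHealthChecks("/health");    // answers 200/503 over plain HTTP
        app.UseMvc();
    }
}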

Now, what happened: it occurred on a different server than before, and my improved health checks discovered the issue while it was still ongoing. Currently the thread count is at ~700.

I think the constant thread count increase is due to the load balancer doing the HTTP health checks. This would also explain why the issue occurs at times when there is no normal traffic, and why the thread count grows so steadily.

Again we have the same stack for all the threads (except ~10 threads), with a different exception this time (this might be due to the change from TCP->HTTPS to HTTP->HTTP health checks).

I then analyzed the ~10 threads that did NOT have that common pattern, as @stephentoub suggested. There is one that has an ArgumentNullException but is blocked. I analyzed it using clrstack -f and found:

00007F1286FFA9B0 00007F17DD6DE4ED libpthread.so.0!__lll_lock_wait + 29
00007F1286FFA9C8 00007F17DD6D9DE6 libpthread.so.0!_L_lock_941 + 15
00007F1286FFAA48 00007F17DD6D9CDF libpthread.so.0!__GI___pthread_mutex_lock + 175
00007F1286FFAA50 00007F17DCC2036F libc.so.6!__dl_iterate_phdr + 47
00007F1286FFAAF0 00007F17D93DB42A liboneagentdotnet.so!___lldb_unnamed_symbol19967$$liboneagentdotnet.so + 176
00007F1286FFAC70 00007F17D93DCC9C liboneagentdotnet.so!___lldb_unnamed_symbol19985$$liboneagentdotnet.so + 190
00007F1286FFACA0 00007F17D93DE068 liboneagentdotnet.so!___lldb_unnamed_symbol20000$$liboneagentdotnet.so + 286
00007F1286FFAD70 00007F17D93DE2E6 liboneagentdotnet.so!___lldb_unnamed_symbol20001$$liboneagentdotnet.so + 43
00007F1286FFAF20 00007F17D93DA390 liboneagentdotnet.so!___lldb_unnamed_symbol19957$$liboneagentdotnet.so + 82
00007F1286FFAFF0 00007F17D8C7BCC5 liboneagentdotnet.so!___lldb_unnamed_symbol3022$$liboneagentdotnet.so + 229
00007F1286FFB040 00007F17DD6DF5D0 libpthread.so.0!__restore_rt
00007F1286FFB5E8 00007F17DD6D9CD9 libpthread.so.0!__GI___pthread_mutex_lock + 169
00007F1286FFB5F0 00007F17DCC2036F libc.so.6!__dl_iterate_phdr + 47
00007F1286FFB690 00007F17DCEC2BBF libgcc_s.so.1!_Unwind_Find_FDE + 447
00007F1286FFB6F0 00007F17DCEBFD2C libgcc_s.so.1!___lldb_unnamed_symbol14$$libgcc_s.so.1 + 172
00007F1286FFB780 00007F17DCEC0BD3 libgcc_s.so.1!_Unwind_RaiseException + 147
00007F1286FFBB30 00007F17DD427986 libstdc++.so.6!__cxa_throw + 102
00007F1286FFBB50 00007F17DC21AF66 libcoreclr.so + -1
00007F1286FFBB90 00007F17DC21AF01 libcoreclr.so!RaiseException + 369
00007F1286FFBBD0 00007F17DBE3C693 libcoreclr.so!RaiseTheExceptionInternalOnly(Object*, int, int) + 867
00007F1286FFBCA0 00007F17DBF37520 libcoreclr.so!IL_Throw(Object*) + 416
00007F1286FFBCB0                  [HelperMethodFrame: 00007f1286ffbcb0]
00007F1286FFBE30 00007F1763CD9542 System.Private.CoreLib.dll!System.Text.EncodingTable.GetCodePageFromName(System.String) + 258
00007F1286FFBE60 00007F1763CD92F7 System.Private.CoreLib.dll!System.Text.Encoding.GetEncoding(System.String) + 39
00007F1286FFBE80 00007F176CDF062A RestSharp.dll!RestSharp.Extensions.MiscExtensions.AsString(Byte[], System.String) + 26
00007F1286FFBEB0 00007F176CDF1AFD RestSharp.dll!RestSharp.RestResponseBase.get_Content() + 29

What I see here is Dynatrace OneAgent.

Now I see all those other threads:

36  113 100f3 00007F1288174520  1021220 Preemptive  0000000000000000:0000000000000000 0000000001084F40 0     Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f13fcab7310
  38  115 10160 00007F12B822D500  1021220 Preemptive  0000000000000000:0000000000000000 0000000001084F40 0     Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f12fcc015c0
  37  116 1015f 00007F12C00C9190  1021220 Preemptive  0000000000000000:0000000000000000 0000000001084F40 0     Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f14fc7fefb8
  39  111 10229 00007F12D04BC110  1021220 Preemptive  0000000000000000:0000000000000000 0000000001084F40 0     Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f14fc7f5158
  40  106 1027c 00007F12D0017D60  1021220 Preemptive  0000000000000000:0000000000000000 0000000001084F40 0     Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f12fcc14720
  43  108 10323 00007F12B81EB6C0  1021220 Preemptive  0000000000000000:0000000000000000 0000000001084F40 0     Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f13fcab7ef8
  44  105 10324 00007F12C407F130  1021220 Preemptive  0000000000000000:0000000000000000 0000000001084F40 0     Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f14fc7f5898
  45  102 10325 00007F12C8027700  1021220 Preemptive  0000000000000000:0000000000000000 0000000001084F40 0     Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f12fcc17a88
  46  103 10326 00007F12900C3B10  1021220 Preemptive  0000000000000000:0000000000000000 0000000001084F40 0     Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f15fcc18970

...and MANY more like these.

With a clrstack -f output of:

(lldb) clrstack -f
OS Thread Id: 0x1a099 (3389)
        Child SP               IP Call Site
00007F11412EDBC0 00007F17DD6DE4ED libpthread.so.0!__lll_lock_wait + 29
00007F11412EDBD8 00007F17DD6D9DE6 libpthread.so.0!_L_lock_941 + 15
00007F11412EDC58 00007F17DD6D9CDF libpthread.so.0!__GI___pthread_mutex_lock + 175
00007F11412EDC60 00007F17DCC2036F libc.so.6!__dl_iterate_phdr + 47
00007F11412EDD00 00007F17DCEC2BBF libgcc_s.so.1!_Unwind_Find_FDE + 447
00007F11412EDD60 00007F17DCEBFD2C libgcc_s.so.1!___lldb_unnamed_symbol14$$libgcc_s.so.1 + 172
00007F11412EDDF0 00007F17DCEC06ED libgcc_s.so.1!___lldb_unnamed_symbol15$$libgcc_s.so.1 + 157
00007F11412EDFB0 00007F17DCEC0B7E libgcc_s.so.1!_Unwind_RaiseException + 62
00007F11412EE360 00007F17DD427986 libstdc++.so.6!__cxa_throw + 102
00007F11412EE380 00007F17DC21AF66 libcoreclr.so + -1
00007F11412EE3C0 00007F17DC21AF01 libcoreclr.so!RaiseException + 369
00007F11412EE400 00007F17DBE3C693 libcoreclr.so!RaiseTheExceptionInternalOnly(Object*, int, int) + 867
00007F11412EE4D0 00007F17DBF37520 libcoreclr.so!IL_Throw(Object*) + 416
00007F11412EE4E0                  [HelperMethodFrame: 00007f11412ee4e0]
00007F11412EE660 00007F176B45750A Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.dll!Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketAwaitableEventArgs.<GetResult>g__ThrowSocketException|7_0(System.Net.Sockets.SocketError) + 42
00007F11412EE680 00007F176B373949 Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.dll!Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketAwaitableEventArgs.GetResult() + 41
00007F11412EE690 00007F176B3731BB Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.dll!Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketConnection+<ProcessReceives>d__24.MoveNext() + 731
00007F11412EE780 00007F17632DEDDD System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) + 141
00007F11412EE7E0 00007F176B457230 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketConnection+<ProcessReceives>d__24, Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets]].MoveNext() + 80
00007F11412EE810 00007F17632DEDDD System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) + 141
00007F11412EE870 00007F1767F7C725 System.Private.CoreLib.dll!System.Threading.QueueUserWorkItemCallback.ExecuteWorkItem() + 133
00007F11412EE8A0 00007F1763CB3044 System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch() + 436
00007F11412EE910 00007F17DBFC717F libcoreclr.so!CallDescrWorkerInternal + 124
00007F11412EE930 00007F17DBEE567A libcoreclr.so!MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 954
00007F11412EEB00 00007F17DC068D15 libcoreclr.so!QueueUserWorkItemManagedCallback(void*) + 181
00007F11412EEC20 00007F17DBEB67DF libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 431
00007F11412EEC70                  [DebuggerU2MCatchHandlerFrame: 00007f11412eec70]
00007F11412EED40 00007F17DBEB6F90 libcoreclr.so!ManagedThreadBase::ThreadPool(ADID, void (*)(void*), void*) + 64
00007F11412EED90 00007F17DC04CE47 libcoreclr.so!ManagedPerAppDomainTPCount::DispatchWorkItem(bool*, bool*) + 295
00007F11412EEE60 00007F17DBED5903 libcoreclr.so!ThreadpoolMgr::WorkerThreadStart(void*) + 1267
00007F11412EEEC0 00007F17DC254525 libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void*) + 309
00007F11412EEEE0 00007F17DD6D7DD5 libpthread.so.0!start_thread + 197
00007F11412EEF80 00007F17DCBE1EAD libc.so.6!__clone + 109

What do you think?

How many other threads are inside __dl_iterate_phdr? Are all of them stopped on __lll_lock_wait or is one of them making progress?

The problem may be that the exception handling on Linux is too slow to keep up with an exception being thrown for every request in this workload.

cc @janvorli

All of them are stopped, same stack trace, same situation. None is making any progress.

We now disabled Dynatrace monitoring and also created an issue with Dynatrace referencing this one.

What I think happens is:

Some issue occurs in our code, e.g. the exception above.

It occurs here: https://github.com/restsharp/RestSharp/blob/master/RestSharp/Extensions/MiscExtensions.cs line 92 - the exception is immediately caught and handled there, so this should normally be a non-issue.
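For readers without the RestSharp source open, the pattern at that line is roughly the following. This is a paraphrased sketch, not the library's exact code:

using System;
using System.Text;

static class MiscExtensionsSketch
{
    // Decode the response body using the advertised charset, falling back to
    // UTF-8 when the charset is missing or unknown.
    public static string AsString(byte[] buffer, string charset)
    {
        if (buffer == null) return string.Empty;

        Encoding encoding;
        try
        {
            // Throws ArgumentNullException/ArgumentException for a null or
            // unknown charset -- the exception seen on the blocked thread.
            encoding = Encoding.GetEncoding(charset);
        }
        catch (ArgumentException)
        {
            encoding = Encoding.UTF8;      // swallowed and handled locally
        }
        return encoding.GetString(buffer);
    }
}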

BUT: Dynatrace wants to report that exception and somehow blocks everything; the code never gets past line 92.

Now our load balancer continues to issue requests to our health check endpoint. I don't know what exactly causes the other exceptions, but maybe it's because of how the Azure LB crafts the requests. In any case, the handling of the exceptions is blocked, and threads keep piling up until we run out of threads.
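As a side note: if the per-request exception rate itself turns out to be a contributing factor, as suggested earlier in the thread, one application-side mitigation could be to cache charset lookups so the exception is thrown and unwound at most once per distinct charset instead of once per request. A hypothetical sketch:

using System;
using System.Collections.Concurrent;
using System.Text;

static class EncodingLookupCache
{
    // Remembers both successful and failed lookups.
    private static readonly ConcurrentDictionary<string, Encoding> Cache =
        new ConcurrentDictionary<string, Encoding>(StringComparer.OrdinalIgnoreCase);

    public static Encoding GetOrFallback(string charset, Encoding fallback)
    {
        if (string.IsNullOrEmpty(charset))
            return fallback;

        return Cache.GetOrAdd(charset, name =>
        {
            try { return Encoding.GetEncoding(name); }
            catch (ArgumentException) { return fallback; }
        });
    }
}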

So for you it seems this case is solved; it seems to be a Dynatrace issue.

For you, maybe it would be interesting to know why the SocketExceptions occur.

What do you think?

So for you it seems this case is solved; it seems to be a Dynatrace issue.

Thanks. If you don't mind, it'd be great to hear back from you after you hear from Dynatrace, in case others run into a similar situation and could learn from your efforts.

For you, maybe it would be interesting to know why the SocketExceptions occur.

In the HTTPS example, the exception is occurring because the remote side has closed the connection: we're expecting data related to establishing the SSL connection and instead we're getting back end-of-stream. I can't see details on the SocketException in the second example, but it's probably something similar, e.g. the connection getting closed unexpectedly by the remote side.
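To make that concrete: a probe that opens a raw TCP connection to the TLS port and closes it without sending anything produces exactly this IOException in the server-side handshake. A minimal sketch of such a probe; the host name and port are placeholders:

using System.Net.Sockets;

class TcpProbe
{
    static void Main()
    {
        // Connect to the TLS port and disconnect immediately, the way a
        // TCP-level load-balancer health check does. The server's pending
        // SslStream handshake then reads end-of-stream and fails with
        // "Authentication failed because the remote party has closed the
        // transport stream."
        using (var client = new TcpClient("server.example.com", 443))
        {
            // No data is sent; disposing closes the connection mid-handshake.
        }
    }
}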

Hello, yes, I will keep you updated if I learn something from Dynatrace. Thanks for helping!

Hello, Dynatrace seems to have released a fix just for me, which will become public as soon as I confirm that it fixed the issue:

"We have now published agent version 1.169.208 on your tenant - please update to that agent version and retry. As mentioned, a fix for deadlocking around fork/exec is included and the developers are pretty sure that you are affected by this issue as well."

I'll try early next week and keep an eye on it.

