We're experiencing random thread pool starvation on ASP.NET Core 2.2.4 on CentOS Linux. The .NET Core worker threads are constantly being consumed without being freed again.
The issue occurs completely at random: every two weeks or so, sometimes only once a month, sometimes every second day. Sometimes it happens under full production load, sometimes with nearly no load or no load at all.
When it occurs, the thread pool worker count (we monitor with Dynatrace) constantly goes from 0 to 1000 (our current maximum).
When analyzing the cause with lldb + the SOS plugin, we just see the threads hanging in SslState with an exception:
STACK (all the threads look the same):
(lldb) clrstack
OS Thread Id: 0x17f12 (74)
Child SP IP Call Site
00007F11DC7F7090 00007f1731cf74ed [HelperMethodFrame: 00007f11dc7f7090]
00007F11DC7F7210 00007F16BF0ADFD6 System.Net.Security.SslState.StartReadFrame(Byte[], Int32, System.Net.AsyncProtocolRequest) [/root/corefx-2518272/src/System.Net.Security/src/System/Net/Security/SslState.cs @ 879]
00007F11DC7F7250 00007F16BF0AF8DC System.Net.Security.SslState.PartialFrameCallback(System.Net.AsyncProtocolRequest) [/root/corefx-2518272/src/System.Net.Security/src/System/Net/Security/SslState.cs @ 1102]
00007F11DC7F7280 00007F16BF0ADB7E System.Net.FixedSizeReader+<ReadPacketAsync>d__1.MoveNext() [/root/corefx-2518272/src/System.Net.Security/src/System/Net/FixedSizeReader.cs @ 63]
00007F11DC7F7360 00007F16B78FEDDD System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/root/coreclr/src/mscorlib/shared/System/Threading/ExecutionContext.cs @ 167]
00007F11DC7F73C0 00007F16BF0AF3D0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[System.Net.FixedSizeReader+<ReadPacketAsync>d__1, System.Net.Security]].MoveNext() [/root/coreclr/src/mscorlib/src/System/Runtime/CompilerServices/AsyncMethodBuilder.cs @ 568]
00007F11DC7F73F0 00007F16BC50022F System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean) [/root/coreclr/src/mscorlib/src/System/Threading/Tasks/TaskContinuation.cs @ 798]
00007F11DC7F7430 00007F16B82D6B05 System.Threading.Tasks.Task.RunContinuations(System.Object) [/root/coreclr/src/mscorlib/src/System/Threading/Tasks/Task.cs @ 3275]
00007F11DC7F74D0 00007F16B82D6961 System.Threading.Tasks.Task.FinishContinuations()
00007F11DC7F7500 00007F16BF0CCB0A System.Threading.Tasks.Task`1[[System.Int32, System.Private.CoreLib]].TrySetResult(Int32) [/root/coreclr/src/mscorlib/src/System/Threading/Tasks/future.cs @ 424]
00007F11DC7F7530 00007F16BF0CCA41 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Int32, System.Private.CoreLib]].SetExistingTaskResult(Int32) [/root/coreclr/src/mscorlib/src/System/Runtime/CompilerServices/AsyncMethodBuilder.cs @ 646]
00007F11DC7F7550 00007F16BF952AC6 Microsoft.AspNetCore.Server.Kestrel.Core.Adapter.Internal.RawStream+<ReadAsyncInternal>d__24.MoveNext()
00007F11DC7F77B0 00007F16B78FEDDD System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/root/coreclr/src/mscorlib/shared/System/Threading/ExecutionContext.cs @ 167]
00007F11DC7F7810 00007F16BF958E60 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Int32, System.Private.CoreLib],[Microsoft.AspNetCore.Server.Kestrel.Core.Adapter.Internal.RawStream+<ReadAsyncInternal>d__24, Microsoft.AspNetCore.Server.Kestrel.Core]].MoveNext() [/root/coreclr/src/mscorlib/src/System/Runtime/CompilerServices/AsyncMethodBuilder.cs @ 568]
00007F11DC7F7840 00007F16B78FEDDD System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/root/coreclr/src/mscorlib/shared/System/Threading/ExecutionContext.cs @ 167]
00007F11DC7F78A0 00007F16B82D31F4 System.Threading.ThreadPoolWorkQueue.Dispatch() [/root/coreclr/src/mscorlib/src/System/Threading/ThreadPool.cs @ 588]
00007F11DC7F7C70 00007f17305e017f [DebuggerU2MCatchHandlerFrame: 00007f11dc7f7c70]
THE EXCEPTION:
(lldb) pe
Exception object: 00007f155142b4a8
Exception type: System.IO.IOException
Message: Authentication failed because the remote party has closed the transport stream.
InnerException: <none>
StackTrace (generated):
<none>
StackTraceString: <none>
HResult: 80131620
Excerpt from the thread list:
174 232 18109 00007F1220026E10 1021220 Preemptive 00007F1351AF3FD8:00007F1351AF5018 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1351af3ec0
176 233 18117 00007F1220044550 1021220 Preemptive 00007F12512A1C78:00007F12512A28C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512a1b60
177 234 18118 00007F16A4047BC0 1021220 Preemptive 00007F1351B036C8:00007F1351B05018 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1351b035b0
179 235 1811b 00007F123C074EB0 1021220 Preemptive 00007F1351AFC1D8:00007F1351AFD018 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1351afc0c0
180 236 1811c 00007F120C6A7050 1021220 Preemptive 00007F12512A5F08:00007F12512A68C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512a5df0
183 237 1812b 00007F12341E97F0 1021220 Preemptive 00007F1451C10AA0:00007F1451C10D50 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1451c10988
184 238 1812c 00007F1690066BE0 1021220 Preemptive 00007F1451C775B8:00007F1451C78DE0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1451c774a0
188 239 18147 00007F120C0F72F0 1021220 Preemptive 00007F1351B06EC0:00007F1351B07018 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1351b06da8
189 240 18148 00007F16A4035600 1021220 Preemptive 00007F1451C3E138:00007F1451C3ED80 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1451c3e020
190 241 18149 00007F16A4023550 1021220 Preemptive 00007F1351B0CFB0:00007F1351B0D018 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1351b0ce98
191 242 1814a 000000000469BCE0 1021220 Preemptive 00007F12512B11D0:00007F12512B28C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512b10b8
193 243 1816a 00007F169007BDC0 1021220 Preemptive 00007F12512B6A58:00007F12512B88C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512b6940
195 244 1816f 00007F11F011BB10 1021220 Preemptive 00007F12512B3AC8:00007F12512B48C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512b39b0
196 245 18170 00007F16A0095C20 1021220 Preemptive 00007F15514E0410:00007F15514E0680 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f15514e02f8
197 246 18171 00007F11EA060CD0 1021220 Preemptive 00007F1351B12248:00007F1351B13018 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1351b12130
199 247 1817f 00007F16A002E700 1021220 Preemptive 00007F12512B6170:00007F12512B68C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512b6058
201 248 18191 00007F16A40415F0 1021220 Preemptive 00007F155150FA08:00007F15515106B0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f155150f8f0
205 249 181a3 00007F11F8038DE0 1021220 Preemptive 00007F12512B9C90:00007F12512BA8C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512b9b78
206 250 181a4 00007F12040434F0 1021220 Preemptive 00007F15514F1EE8:00007F15514F2680 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f15514f1dd0
207 251 181ae 00007F11E405F210 1021220 Preemptive 00007F12512BC2F0:00007F12512BC8C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512bc1d8
208 252 181af 00007F121417AF20 1021220 Preemptive 00007F1451C85088:00007F1451C86DE0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1451c84f70
209 253 181bb 00007F123C0526D0 1021220 Preemptive 00007F1451C73AF8:00007F1451C74DE0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1451c739e0
210 254 181bf 00007F1238024FF0 1021220 Preemptive 00007F1351B14410:00007F1351B15018 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1351b142f8
211 255 181c0 00007F1214094750 1021220 Preemptive 00007F1551553338:00007F15515546B0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1551553220
212 256 181c1 00007F123C00F630 1021220 Preemptive 00007F1551529F20:00007F155152A6B0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1551529e08
215 258 181cf 00007F169008D990 1021220 Preemptive 00007F155151B920:00007F155151C6B0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f155151b808
216 257 181d0 00007F1694008570 1021220 Preemptive 00007F1351B20358:00007F1351B21018 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1351b20240
220 259 181f1 00007F123C01A820 1021220 Preemptive 00007F1351B22310:00007F1351B23018 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1351b221f8
221 260 181f2 00007F11E406BEE0 1021220 Preemptive 00007F12512C3968:00007F12512C48C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512c3850
224 261 18201 00007F16A0028950 1021220 Preemptive 00007F1451C7FC70:00007F1451C80DE0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1451c7fb58
225 262 18202 0000000004909F90 1021220 Preemptive 00007F12512D6A58:00007F12512D88C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512d6940
228 263 18211 00007F11F00B4F80 1021220 Preemptive 00007F12512CBBE0:00007F12512CC8C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512cbac8
229 264 18212 00007F11F45B37D0 1021220 Preemptive 00007F1451C9A590:00007F1451C9ADE0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1451c9a478
230 265 18213 00007F123C027FB0 1021220 Preemptive 00007F1551536628:00007F15515366B0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1551536510
231 266 18214 00007F123801D1F0 1021220 Preemptive 00007F1451C8F398:00007F1451C90DE0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f1451c8f280
233 267 1822a 00007F11F41D29C0 1021220 Preemptive 00007F15515551D8:00007F15515566B0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f15515550c0
232 268 18229 00007F120807F5D0 1021220 Preemptive 00007F12512D4400:00007F12512D48C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512d42e8
236 269 1823b 00007F120C6DB140 1021220 Preemptive 00007F12512D2868:00007F12512D28C0 0000000001DB4F80 0 Ukn (Threadpool Worker) System.IO.IOException 00007f12512d2750
Up to now we have been unable to diagnose the cause of the problem. Can you help?
@rh78, for one of these hung threads, can you share the output of clrstack -f rather than just clrstack?
When it occurs, the thread pool worker count (we monitor with Dynatrace) constantly goes from 0 to 1000 (our current maximum).
To make sure I understand, when this happens you're saying you end up with 1000 threads, all with a call stack like the above, all stuck inside a helper frame from StartReadFrame, and you hang at 1000 threads because you've explicitly set the max number of thread pool threads to 1000?
Hello, sorry, I cannot run clrstack -f because I had to restart the process and the minidump failed.
Yes, it is like that; we hardcoded it to 1000. Before, we had no explicit limit, and the count kept growing until it ran into some internal limits.
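For reference, capping the pool like we did can be sketched as follows (a minimal standalone sketch, not our actual startup code; the 1000 values mirror the limit described above):

```csharp
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        // Cap both worker and IOCP thread counts so a runaway leak plateaus
        // at a known ceiling instead of growing until internal limits are hit.
        bool ok = ThreadPool.SetMaxThreads(1000, 1000);

        // Read the limits back to confirm the cap was applied.
        ThreadPool.GetMaxThreads(out int workers, out int iocp);
        Console.WriteLine($"capped: {ok}, workers: {workers}, iocp: {iocp}");
    }
}
```

SetMaxThreads returns false (and leaves the limits unchanged) if the requested values are below the processor count or the configured minimums, so checking the return value is worthwhile.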
@jkotas, @janvorli, any thoughts here? The line showing as current in the stack frame is https://github.com/dotnet/corefx/blob/7ff91c4ea9157d27456909e2387af9e1f0ba6b02/src/System.Net.Security/src/System/Net/Security/SslState.cs#L879 which is throwing the exact exception type and error message showing as current on the thread, and it's then apparently hung in some helper runtime call. That suggests to me that it's somehow blocked indefinitely while doing the exception throw, but I've never seen this before. Do you think I'm misinterpreting? Know of any issues like this? Suggestions on what else @rh78 could gather the next time this happens to help diagnose?
@rh78, are there any threads that don't have this same stack? It might be helpful to see the details of those, e.g. clrstack -f, the next time this happens. Maybe there's some lock that all these threads are blocking on and some other thread is erroneously holding it indefinitely?
It looks like a possible hang in the exception handling code. But to be able to tell more, I would need to get the clrstack -f (or just bt) command output at the time of failure.
we monitor with Dynatrace
Dynatrace is a profiler. It may be hanging inside Dynatrace. clrstack -f or bt would confirm or disprove this hypothesis.
OK, I will check next time. What does the -f flag do?
What does the -f flag do?
It includes native frames in addition to the managed ones.
Hello. It occurred again. But before I go into detail, here is what I've changed:
Now, what happened. It occurred on a different server than before, and my improved health checks discovered the issue while it was still ongoing. Currently the thread count is at ~700.
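A health check along these lines can be sketched as follows (a hypothetical helper, not our actual code; the 1000 ceiling mirrors the pool cap mentioned earlier):

```csharp
using System;
using System.Diagnostics;

static class ThreadCountProbe
{
    // Report the process thread count so a load balancer or alert can flag
    // runaway growth well before the thread pool saturates. The ceiling is
    // the configured ThreadPool maximum; we flag at half of it.
    public static (int Threads, bool Healthy) Check(int ceiling = 1000)
    {
        int threads = Process.GetCurrentProcess().Threads.Count;
        return (threads, threads < ceiling / 2);
    }
}
```

Wiring this into an HTTP health endpoint lets the load balancer see the degradation instead of contributing to it with every probe.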
I think the constant thread count increase is due to the load balancer doing the HTTP health checks. That would also explain why the issue occurs at times with no normal traffic, and why the thread count grows so steadily.
Again we have the same stack for all the threads (except ~10 threads), with a different exception this time (this might be due to the change from TCP->HTTPS to HTTP->HTTP health checks).
I then analyzed the ~10 threads that did NOT show that common pattern, as @stephentoub suggested. One of them has an ArgumentNullException but is blocked. Analyzing it with clrstack -f, I found:
00007F1286FFA9B0 00007F17DD6DE4ED libpthread.so.0!__lll_lock_wait + 29
00007F1286FFA9C8 00007F17DD6D9DE6 libpthread.so.0!_L_lock_941 + 15
00007F1286FFAA48 00007F17DD6D9CDF libpthread.so.0!__GI___pthread_mutex_lock + 175
00007F1286FFAA50 00007F17DCC2036F libc.so.6!__dl_iterate_phdr + 47
00007F1286FFAAF0 00007F17D93DB42A liboneagentdotnet.so!___lldb_unnamed_symbol19967$$liboneagentdotnet.so + 176
00007F1286FFAC70 00007F17D93DCC9C liboneagentdotnet.so!___lldb_unnamed_symbol19985$$liboneagentdotnet.so + 190
00007F1286FFACA0 00007F17D93DE068 liboneagentdotnet.so!___lldb_unnamed_symbol20000$$liboneagentdotnet.so + 286
00007F1286FFAD70 00007F17D93DE2E6 liboneagentdotnet.so!___lldb_unnamed_symbol20001$$liboneagentdotnet.so + 43
00007F1286FFAF20 00007F17D93DA390 liboneagentdotnet.so!___lldb_unnamed_symbol19957$$liboneagentdotnet.so + 82
00007F1286FFAFF0 00007F17D8C7BCC5 liboneagentdotnet.so!___lldb_unnamed_symbol3022$$liboneagentdotnet.so + 229
00007F1286FFB040 00007F17DD6DF5D0 libpthread.so.0!__restore_rt
00007F1286FFB5E8 00007F17DD6D9CD9 libpthread.so.0!__GI___pthread_mutex_lock + 169
00007F1286FFB5F0 00007F17DCC2036F libc.so.6!__dl_iterate_phdr + 47
00007F1286FFB690 00007F17DCEC2BBF libgcc_s.so.1!_Unwind_Find_FDE + 447
00007F1286FFB6F0 00007F17DCEBFD2C libgcc_s.so.1!___lldb_unnamed_symbol14$$libgcc_s.so.1 + 172
00007F1286FFB780 00007F17DCEC0BD3 libgcc_s.so.1!_Unwind_RaiseException + 147
00007F1286FFBB30 00007F17DD427986 libstdc++.so.6!__cxa_throw + 102
00007F1286FFBB50 00007F17DC21AF66 libcoreclr.so + -1
00007F1286FFBB90 00007F17DC21AF01 libcoreclr.so!RaiseException + 369
00007F1286FFBBD0 00007F17DBE3C693 libcoreclr.so!RaiseTheExceptionInternalOnly(Object*, int, int) + 867
00007F1286FFBCA0 00007F17DBF37520 libcoreclr.so!IL_Throw(Object*) + 416
00007F1286FFBCB0 [HelperMethodFrame: 00007f1286ffbcb0]
00007F1286FFBE30 00007F1763CD9542 System.Private.CoreLib.dll!System.Text.EncodingTable.GetCodePageFromName(System.String) + 258
00007F1286FFBE60 00007F1763CD92F7 System.Private.CoreLib.dll!System.Text.Encoding.GetEncoding(System.String) + 39
00007F1286FFBE80 00007F176CDF062A RestSharp.dll!RestSharp.Extensions.MiscExtensions.AsString(Byte[], System.String) + 26
00007F1286FFBEB0 00007F176CDF1AFD RestSharp.dll!RestSharp.RestResponseBase.get_Content() + 29
What I see here is the Dynatrace OneAgent.
Now I see all those other threads:
36 113 100f3 00007F1288174520 1021220 Preemptive 0000000000000000:0000000000000000 0000000001084F40 0 Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f13fcab7310
38 115 10160 00007F12B822D500 1021220 Preemptive 0000000000000000:0000000000000000 0000000001084F40 0 Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f12fcc015c0
37 116 1015f 00007F12C00C9190 1021220 Preemptive 0000000000000000:0000000000000000 0000000001084F40 0 Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f14fc7fefb8
39 111 10229 00007F12D04BC110 1021220 Preemptive 0000000000000000:0000000000000000 0000000001084F40 0 Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f14fc7f5158
40 106 1027c 00007F12D0017D60 1021220 Preemptive 0000000000000000:0000000000000000 0000000001084F40 0 Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f12fcc14720
43 108 10323 00007F12B81EB6C0 1021220 Preemptive 0000000000000000:0000000000000000 0000000001084F40 0 Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f13fcab7ef8
44 105 10324 00007F12C407F130 1021220 Preemptive 0000000000000000:0000000000000000 0000000001084F40 0 Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f14fc7f5898
45 102 10325 00007F12C8027700 1021220 Preemptive 0000000000000000:0000000000000000 0000000001084F40 0 Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f12fcc17a88
46 103 10326 00007F12900C3B10 1021220 Preemptive 0000000000000000:0000000000000000 0000000001084F40 0 Ukn (Threadpool Worker) System.Net.Sockets.SocketException 00007f15fcc18970
(and many more with the same pattern)
With clrstack -f of:
(lldb) clrstack -f
OS Thread Id: 0x1a099 (3389)
Child SP IP Call Site
00007F11412EDBC0 00007F17DD6DE4ED libpthread.so.0!__lll_lock_wait + 29
00007F11412EDBD8 00007F17DD6D9DE6 libpthread.so.0!_L_lock_941 + 15
00007F11412EDC58 00007F17DD6D9CDF libpthread.so.0!__GI___pthread_mutex_lock + 175
00007F11412EDC60 00007F17DCC2036F libc.so.6!__dl_iterate_phdr + 47
00007F11412EDD00 00007F17DCEC2BBF libgcc_s.so.1!_Unwind_Find_FDE + 447
00007F11412EDD60 00007F17DCEBFD2C libgcc_s.so.1!___lldb_unnamed_symbol14$$libgcc_s.so.1 + 172
00007F11412EDDF0 00007F17DCEC06ED libgcc_s.so.1!___lldb_unnamed_symbol15$$libgcc_s.so.1 + 157
00007F11412EDFB0 00007F17DCEC0B7E libgcc_s.so.1!_Unwind_RaiseException + 62
00007F11412EE360 00007F17DD427986 libstdc++.so.6!__cxa_throw + 102
00007F11412EE380 00007F17DC21AF66 libcoreclr.so + -1
00007F11412EE3C0 00007F17DC21AF01 libcoreclr.so!RaiseException + 369
00007F11412EE400 00007F17DBE3C693 libcoreclr.so!RaiseTheExceptionInternalOnly(Object*, int, int) + 867
00007F11412EE4D0 00007F17DBF37520 libcoreclr.so!IL_Throw(Object*) + 416
00007F11412EE4E0 [HelperMethodFrame: 00007f11412ee4e0]
00007F11412EE660 00007F176B45750A Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.dll!Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketAwaitableEventArgs.<GetResult>g__ThrowSocketException|7_0(System.Net.Sockets.SocketError) + 42
00007F11412EE680 00007F176B373949 Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.dll!Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketAwaitableEventArgs.GetResult() + 41
00007F11412EE690 00007F176B3731BB Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.dll!Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketConnection+<ProcessReceives>d__24.MoveNext() + 731
00007F11412EE780 00007F17632DEDDD System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) + 141
00007F11412EE7E0 00007F176B457230 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketConnection+<ProcessReceives>d__24, Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets]].MoveNext() + 80
00007F11412EE810 00007F17632DEDDD System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) + 141
00007F11412EE870 00007F1767F7C725 System.Private.CoreLib.dll!System.Threading.QueueUserWorkItemCallback.ExecuteWorkItem() + 133
00007F11412EE8A0 00007F1763CB3044 System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch() + 436
00007F11412EE910 00007F17DBFC717F libcoreclr.so!CallDescrWorkerInternal + 124
00007F11412EE930 00007F17DBEE567A libcoreclr.so!MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 954
00007F11412EEB00 00007F17DC068D15 libcoreclr.so!QueueUserWorkItemManagedCallback(void*) + 181
00007F11412EEC20 00007F17DBEB67DF libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 431
00007F11412EEC70 [DebuggerU2MCatchHandlerFrame: 00007f11412eec70]
00007F11412EED40 00007F17DBEB6F90 libcoreclr.so!ManagedThreadBase::ThreadPool(ADID, void (*)(void*), void*) + 64
00007F11412EED90 00007F17DC04CE47 libcoreclr.so!ManagedPerAppDomainTPCount::DispatchWorkItem(bool*, bool*) + 295
00007F11412EEE60 00007F17DBED5903 libcoreclr.so!ThreadpoolMgr::WorkerThreadStart(void*) + 1267
00007F11412EEEC0 00007F17DC254525 libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void*) + 309
00007F11412EEEE0 00007F17DD6D7DD5 libpthread.so.0!start_thread + 197
00007F11412EEF80 00007F17DCBE1EAD libc.so.6!__clone + 109
What do you think?
How many other threads are inside __dl_iterate_phdr? Are all of them stopped on __lll_lock_wait or is one of them making progress?
The problem may be that the exception handling on Linux is too slow to keep up with an exception being thrown for every request in this workload.
cc @janvorli
All of them are stopped, same stack trace, same situation. None is making any progress.
We have now disabled Dynatrace monitoring and also opened an issue with Dynatrace referencing this one.
What I think happens is:
Some issue occurs in our code, e.g. the exception above.
It occurs here: https://github.com/restsharp/RestSharp/blob/master/RestSharp/Extensions/MiscExtensions.cs line 92 - it is immediately caught and handled, so this should be a non-issue.
BUT: Dynatrace wants to report that exception and somehow blocks everything; the code never gets past line 92.
Now our load balancer keeps issuing requests against our health check endpoint. I don't know exactly what causes the other exceptions - maybe it's how the Azure LB crafts the requests. Either way, the handling of the exception is blocked, and threads continuously pile up until we run out of them.
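The pattern at that RestSharp line is essentially the following (a simplified sketch, not the library's exact code): an invalid or missing charset makes Encoding.GetEncoding throw, the exception is caught immediately, and UTF-8 is used as a fallback. That is harmless in isolation, but every throw exercises the native unwinder that the profiler hooks.

```csharp
using System;
using System.Text;

static class MiscExtensionsSketch
{
    // Simplified sketch of RestSharp's AsString fallback: decode a byte
    // buffer with the response charset, falling back to UTF-8 when the
    // charset name is null or unknown. The catch makes the throw invisible
    // to callers, yet the native unwind still runs on every bad charset.
    public static string AsString(byte[] buffer, string charset)
    {
        Encoding encoding;
        try
        {
            encoding = Encoding.GetEncoding(charset);
        }
        catch (ArgumentException) // also covers ArgumentNullException
        {
            encoding = Encoding.UTF8;
        }
        return encoding.GetString(buffer);
    }
}
```

Checking the charset up front (string.IsNullOrEmpty plus a known-encodings lookup) would avoid the per-request throw entirely, which matters when a profiler intercepts every exception.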
So for you this case seems solved; it appears to be a Dynatrace issue.
For you, it might be interesting to know why the SocketExceptions occur.
What do you think?
So for you this case seems solved; it appears to be a Dynatrace issue.
Thanks. If you don't mind, it'd be great to hear back from you after you hear from Dynatrace, in case others run into a similar situation and could learn from your efforts.
For you, it might be interesting to know why the SocketExceptions occur.
In the https example, the exception is occurring because the server has closed the connection. We're expecting data related to establishing the SSL connection and instead we're getting back end-of-stream. I can't see details on the SocketException in the second example, but it's probably something similar, e.g. the connection getting closed by the server unexpectedly.
Hello, yes, I will keep you updated if I learn something from Dynatrace. Thanks for helping!
Hello, Dynatrace seems to have released a fix just for me, which will become public as soon as I confirm that it fixes the issue:
"We have now published agent version 1.169.208 on your tenant - please update to that agent version and retry. As mentioned, a fix for deadlocking around fork/exec is included and the developers are pretty sure that you are affected by this issue as well."
I'll try early next week and keep an eye on it.