Kestrelhttpserver: Kestrel randomly locked and stop responding.

Created on 21 Dec 2016  路  76Comments  路  Source: aspnet/KestrelHttpServer

I run ASP.NET Core 1.1 on Ubuntu 16.04 in Docker container (with Ubuntu image).
This works fine, but 3 times in two weeks I catch error, that Kestrel stop responding.
No errors in Exception filter, in logs - only last record with start Action Executing, but not completed.
Why problem in Kestrel ? - Because it locked and not processing new requests.
At locked state, CPU usage is 4-5% (per core). I wait 4 hours, but not result.

TOP command result:

PID    USER      PR    NI    VIRT       RES          SHR S       %CPU      %MEM      TIME+       COMMAND
10788   root      20   0     11.396g    185060       53704 S     4.7       2.3       14:17.56     dotnet

Restarting docker container is solve issue.
I can't attach and trace it, because all tools, that I use - works only on Windows.

duplicate

Most helpful comment

We're seeing similar. I can exec into the container and see the process still running. It's not responding to curl. It's sat at 4-6% CPU constantly.

I can 100% reproduce it. It happens randomly under heavy load from our protractor unit tests hitting our web app, so just takes anywhere from 0-20 minutes to trigger.

Our app listens on port 1924 and has nginx in front of it. Nginx returns a 502 or 504.

Curl inside container:
```[root@66e41d245901 tmp]# curl 127.0.0.1:1924 -v

  • Rebuilt URL to: 127.0.0.1:1924/
  • Trying 127.0.0.1...
  • Connected to 127.0.0.1 (127.0.0.1) port 1924 (#0)

GET / HTTP/1.1
Host: 127.0.0.1:1924
User-Agent: curl/7.50.1
Accept: /

Hangs and has to be killed.

netstat inside container:

[root@66e41d245901 vq]# netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 1976 0 localhost:xiip localhost:48872 CLOSE_WAIT
tcp 515 0 localhost:xiip localhost:48866 CLOSE_WAIT
tcp 0 0 localhost:49636 localhost:57050 ESTABLISHED
tcp 79 0 localhost:xiip localhost:48938 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48980 CLOSE_WAIT
tcp 32 0 66e41d245901:54064 192.168.1.11:https CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48934 CLOSE_WAIT
tcp 0 0 localhost:postgres localhost:56804 ESTABLISHED
tcp 2076 0 localhost:xiip localhost:48884 CLOSE_WAIT
tcp 2002 0 localhost:xiip localhost:48950 CLOSE_WAIT
tcp 236 0 localhost:xiip localhost:49010 CLOSE_WAIT
tcp 0 0 66e41d245901:54308 192.168.1.11:https ESTABLISHED
tcp 0 0 localhost:postgres localhost:56838 ESTABLISHED
tcp 0 0 localhost:xiip localhost:48754 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48958 CLOSE_WAIT
tcp 77 0 localhost:xiip localhost:48962 CLOSE_WAIT
tcp 0 0 66e41d245901:50248 mirrors.melbourne.:http TIME_WAIT
tcp 77 0 localhost:xiip localhost:48960 CLOSE_WAIT
tcp 0 0 localhost:56804 localhost:postgres ESTABLISHED
tcp 0 0 localhost:57050 localhost:54525 ESTABLISHED
tcp 0 0 localhost:56838 localhost:postgres ESTABLISHED
tcp 236 0 localhost:xiip localhost:49008 CLOSE_WAIT
tcp 0 0 localhost:57050 localhost:49636 ESTABLISHED
tcp 515 0 localhost:xiip localhost:48818 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48966 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48970 CLOSE_WAIT
tcp 489 0 localhost:xiip localhost:48848 CLOSE_WAIT
tcp 0 0 localhost:54525 localhost:57050 ESTABLISHED
tcp 2002 0 localhost:xiip localhost:48978 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48888 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48964 CLOSE_WAIT
udp6 0 0 localhost:56619 localhost:56619 ESTABLISHED
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags Type State I-Node Path
unix 3 [ ] STREAM CONNECTED 22431
unix 3 [ ] STREAM CONNECTED 24418 /tmp/kestrel_0a937d9e657e40f98d00f6ec9e1d1388
unix 3 [ ] STREAM CONNECTED 27652 /tmp/kestrel_ac09a04d99044ce686585e802a31f7ed
unix 3 [ ] STREAM CONNECTED 28605
unix 3 [ ] STREAM CONNECTED 28606
unix 3 [ ] STREAM CONNECTED 26748
```

All 76 comments

Just to be clear, are you running everything (host and containers) on Linux?

Are you doing remote access from a Windows console window? It could be this if so: https://github.com/aspnet/KestrelHttpServer/issues/1259

Yes, it's all running on Ubuntu. 2-core server with no server CPU (G4400) and 8GB RAM
I know problem with console locking in Windows, with mouse e.t.c, but this is not same issue, because docker run in background and I not access it directly. It stop responding without any user activity with server.
Kestrel doing something, but i don't understand what:
CPU load 4.7% per core (at locked state) is equals to 3-5 fast requests to main page in normal state.

It very hard to reproduce and i want help you to fix it, Can you advice trace tools for collect debug info at .NET Core in Linux environment ?

Issue always appear after few days without any user activity on server.

UPD. Just now server blocked again. 2 days after restarted container

@ZOXEXIVO are you passing a cancellation token to any of your response writing API calls (if you have any)

@davidfowl No. Only simple async requests to MongoDB without passing CancellationToken.

We're seeing similar. I can exec into the container and see the process still running. It's not responding to curl. It's sat at 4-6% CPU constantly.

I can 100% reproduce it. It happens randomly under heavy load from our protractor unit tests hitting our web app, so just takes anywhere from 0-20 minutes to trigger.

Our app listens on port 1924 and has nginx in front of it. Nginx returns a 502 or 504.

Curl inside container:
```[root@66e41d245901 tmp]# curl 127.0.0.1:1924 -v

  • Rebuilt URL to: 127.0.0.1:1924/
  • Trying 127.0.0.1...
  • Connected to 127.0.0.1 (127.0.0.1) port 1924 (#0)

GET / HTTP/1.1
Host: 127.0.0.1:1924
User-Agent: curl/7.50.1
Accept: /

Hangs and has to be killed.

netstat inside container:

[root@66e41d245901 vq]# netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 1976 0 localhost:xiip localhost:48872 CLOSE_WAIT
tcp 515 0 localhost:xiip localhost:48866 CLOSE_WAIT
tcp 0 0 localhost:49636 localhost:57050 ESTABLISHED
tcp 79 0 localhost:xiip localhost:48938 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48980 CLOSE_WAIT
tcp 32 0 66e41d245901:54064 192.168.1.11:https CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48934 CLOSE_WAIT
tcp 0 0 localhost:postgres localhost:56804 ESTABLISHED
tcp 2076 0 localhost:xiip localhost:48884 CLOSE_WAIT
tcp 2002 0 localhost:xiip localhost:48950 CLOSE_WAIT
tcp 236 0 localhost:xiip localhost:49010 CLOSE_WAIT
tcp 0 0 66e41d245901:54308 192.168.1.11:https ESTABLISHED
tcp 0 0 localhost:postgres localhost:56838 ESTABLISHED
tcp 0 0 localhost:xiip localhost:48754 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48958 CLOSE_WAIT
tcp 77 0 localhost:xiip localhost:48962 CLOSE_WAIT
tcp 0 0 66e41d245901:50248 mirrors.melbourne.:http TIME_WAIT
tcp 77 0 localhost:xiip localhost:48960 CLOSE_WAIT
tcp 0 0 localhost:56804 localhost:postgres ESTABLISHED
tcp 0 0 localhost:57050 localhost:54525 ESTABLISHED
tcp 0 0 localhost:56838 localhost:postgres ESTABLISHED
tcp 236 0 localhost:xiip localhost:49008 CLOSE_WAIT
tcp 0 0 localhost:57050 localhost:49636 ESTABLISHED
tcp 515 0 localhost:xiip localhost:48818 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48966 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48970 CLOSE_WAIT
tcp 489 0 localhost:xiip localhost:48848 CLOSE_WAIT
tcp 0 0 localhost:54525 localhost:57050 ESTABLISHED
tcp 2002 0 localhost:xiip localhost:48978 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48888 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48964 CLOSE_WAIT
udp6 0 0 localhost:56619 localhost:56619 ESTABLISHED
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags Type State I-Node Path
unix 3 [ ] STREAM CONNECTED 22431
unix 3 [ ] STREAM CONNECTED 24418 /tmp/kestrel_0a937d9e657e40f98d00f6ec9e1d1388
unix 3 [ ] STREAM CONNECTED 27652 /tmp/kestrel_ac09a04d99044ce686585e802a31f7ed
unix 3 [ ] STREAM CONNECTED 28605
unix 3 [ ] STREAM CONNECTED 28606
unix 3 [ ] STREAM CONNECTED 26748
```

I work with @philjones88 and to confirm this is a self-contained app running against NetCore 1.0.1 on a CentOS 7 container.

we enabled more verbose logging and captured:

2017-01-05 14:06:40.951911 Microsoft.AspNetCore.Hosting.Internal.WebHost Request starting HTTP/1.1 GET http://localhost/static/themes/base.css?cb=1483621285065 2017-01-05 14:06:40.951911 RequestId : 0HL1LEEL4VG15 Component : VQPortal Method : OWIN Pipeline Message : Request Start: /static/themes/base.css?cb=1483621285065 2017-01-05 14:06:40.951911 Microsoft.AspNetCore.StaticFiles.StaticFileMiddleware Sending file. Request path: '/themes/base.css'. Physical path: '/vq/portal/static/themes/base.css' 2017-01-05 14:06:40.977634 Microsoft.AspNetCore.Server.Kestrel Connection id "0HL1LEEL40UDH" communication error Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvException: Error -32 EPIPE broken pipe 2017-01-05 14:06:40.977634 Microsoft.AspNetCore.Server.Kestrel Connection id "0HL1LEEL40UDH" disconnecting.

After using NGINX before Kestrel, problem disappear (2 weeks uptime)

@ZOXEXIVO we are experiencing the same symptoms in the same containerized environment. Since deploying nginx before kestrel , have you had any locks?

@euclid47 behind nginx server, it works perfect! I
Here, https://github.com/aspnet/KestrelHttpServer/issues/1304 problem was in StaticFiles middleware.

Any idea when #1304 will be pushed out? I am running .net core 1.1 in an Azure web app and am similarly having the problem where it will crash and never start serving content again until the app is swapped or restarted

@ZOXEXIVO @runxc1 #1304 should be released in Feb.

Hi,
Do we have any plans to ship v1.1.1 as we are also getting the same error while deploying dot net core web app on Linux docker - ubntu ?

@tavisca-syedave we plan to have the patch out hopefully later this month.

Kestrel 1.1.1 is now released. Is anyone still running into this issue?

I think I've just got it with 1.1.2 using unix socket (behind nginx). Will enable verbose logging and come back if I can reproduce it.

I have StaticFiles middleware disabled in production (static files are served by nginx).

Just got it too with 1.1.2 begind nginx.

I also just got it on 1.1.2. Nginx reverse proxy to Kestrel in Debian. In my case it was working in production for more more than 1 month then suddenly this it stopped working. Now it stopped working every hour

Best advice -> upgrade to v2.0

@ZOXEXIVO Did upgrading to v2 fixed your problem?

@jeff-pang Problem disappeared when i host kestrel behind nginx

@ZOXEXIVO I am already hosting Kestrel behind Nginx, but I will give it a try anyway. Thanks

Upgrading to 2.0 fixed the issue

The same issue on 2.0.0, please check this
(kestrel hangs)
telnet port is oke, but curl http => hangs

(lldb) thread list
Process 4262 stopped
thread #1: tid = 4262, 0x00007f7cf6f106d5 libpthread.so.0`__pthread_cond_wait + 197, name = 'dotnet', stop reason = trace

  • thread #2: tid = 4263, 0x00007f7cf640fe2d libc.so.6, name = 'dotnet'
    thread #3: tid = 4264, 0x00007f7cf6f10a82 libpthread.so.0__pthread_cond_timedwait + 306, name = 'dotnet' thread #4: tid = 4265, 0x00007f7cf6f13c3d libpthread.so.0, name = 'dotnet' thread #5: tid = 4266, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #6: tid = 4267, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #7: tid = 4268, 0x00007f7cf6f10a82 libpthread.so.0__pthread_cond_timedwait + 306, name = 'dotnet'
    thread #8: tid = 4269, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #9: tid = 4270, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #10: tid = 4271, 0x00007f7cf6f1349d libpthread.so.0, name = 'dotnet'
    thread #11: tid = 4272, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #12: tid = 4275, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #13: tid = 4276, 0x00007f7cf6f106d4 libpthread.so.0__pthread_cond_wait + 196, name = 'dotnet' thread #14: tid = 4277, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #15: tid = 4278, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #16: tid = 4279, 0x00007f7cf6f1349d libpthread.so.0, name = 'dotnet' thread #17: tid = 4280, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #18: tid = 4281, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #19: tid = 4282, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #20: tid = 4283, 0x00007f7cf6f1349d libpthread.so.0, name = 'dotnet'
    thread #21: tid = 4284, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #22: tid = 4285, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #23: tid = 4286, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #24: tid = 4287, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #25: tid = 4288, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #26: tid = 4290, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #27: tid = 4291, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #28: tid = 4292, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #29: tid = 4293, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #30: tid = 4294, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #31: tid = 4295, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #32: tid = 4296, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #33: tid = 4297, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #34: tid = 4298, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #35: tid = 4299, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #36: tid = 4300, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #37: tid = 4301, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #38: tid = 4308, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #39: tid = 4309, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #40: tid = 4310, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #41: tid = 4311, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #42: tid = 4317, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #43: tid = 4326, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #44: tid = 4328, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #45: tid = 4332, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #46: tid = 4374, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #47: tid = 4375, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #48: tid = 4376, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #49: tid = 4377, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #50: tid = 4378, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #51: tid = 4379, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #52: tid = 4381, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #53: tid = 4382, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #54: tid = 4383, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #55: tid = 4384, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #56: tid = 4385, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #57: tid = 4386, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #58: tid = 4387, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #59: tid = 4392, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #60: tid = 4435, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #61: tid = 4436, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #62: tid = 4437, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #63: tid = 4438, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #64: tid = 4558, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
    thread #65: tid = 4559, 0x00007f7cf6f106d5 libpthread.so.0`__pthread_cond_wait + 197, name = 'dotnet'

(lldb) sos DumpStack
OS Thread Id: 0x10a7 (2)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libc.so.6 + 0xffffffff
Child-SP RetAddr Caller, Callee
00007F7CF4E4DDF0 00007f7cf5b70cb4 libcoreclr.so!CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(int, unsigned char, int) + 0x64, calling libcoreclr.so!poll
00007F7CF4E4DE40 00007f7cf5b70284 libcoreclr.so!CorUnix::CPalSynchronizationManager::WorkerThread(void
) + 0x84, calling libcoreclr.so!CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(int, unsigned char, int)
00007F7CF4E4DE60 00007f7cf733a430 ld-linux-x86-64.so.2!_dl_runtime_resolve + 0x50, calling ld-linux-x86-64.so.2!_dl_fixup
00007F7CF4E4DEF0 00007f7cf5b79342 libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void
) + 0x132
00007F7CF4E4DFB0 00007f7cf641a76d libc.so.6!__clone + 0x6d

@vinhhrv Could you upload your dump somewhere and share it with us?

@cesarbs
i will send my dump to your email is [email protected] ?
please confirm, (my email is [email protected])

Sounds good, though depending on the size you might not be able to email it.

@cesarbs, i sent to your email.
Thanks for support.

@vinhhrv About this:

telnet port is oke, but curl http => hangs

Can you share more details about the specific request that causes it to hang? Or does it hang on any request?

@pieceofsummer @ilyabreev @jeff-pang Can you share more details about the issues you've ran into with Kestrel 1.1.2? In @vinhhrv's case, it appears he is running into the connection limit that we introduced in 2.0. But that limit isn't there in 1.1.2, so that can't be the cause of your issues.

Could any of you share a memory dump captured when Kestrel hangs?

i got a same problem and i send you the dump zipped
please give me some advice
Thanks

I'm also having the problem, it appears the whole dotnet process 'stops' rather than just kestrel as I have some background threads that stop doing anything also. The dotnet memory usage goes down bigly when this happens also.

Sometimes I see this output

^[[40m^[[1m^[[33mwarn^[[39m^[[22m^[[49m: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "09/28/2017 18:00:58 +00:00".

But there does not seem to be a correlation between this message and the hangs.

Hello. Same problem. But my problem is a little bit different. So... I restart server every one or two week. It looks like server getting tired. When tired is not responding and then nginx getting 502 error. I am testing mobile app and getting and posting more datas and server get this error every 2 days. And i restart the server and then solve my problem for two days. I use ubuntu's service from /etc/systemd/system. I create webapi.service and webui.service for this dotnet crash problem. Dotnet automatically starting after crash but not healty and crashing more and more times. What is my problem. Please solve my problem. Thanks.

@muratg
@davidfowl

I am also using Ubuntu and running through the supervisor system.

@anthonywanted @sanishx @hasanoruc We've seen issues where the server seemingly hangs due to thread pool exhaustion. It is always the case that somewhere in the application a sync call is being made (e.g. SomethingAsync().GetAwaiter().GetResult()). This is unfortunately hidden in unexpected places sometimes, like some Azure SDKs. Have you checked if that could be the case?

@cesarbs i will check.

and. more information.
my system did not show this error when it was 12gb ram, 6cores processor.
but my system is now 2cores processors and 4gbram. this error look when my system has been smaller. under the load(looking 45 persons per second) %2 cpu usage(2cores). not problem. but i test my app and I proccess more operation and dotnet is crashing. it seems like sockets. but i dont know.
sorry for my bad english.

@hasanoruc do you use non-async (e.g. blocking) apis? May be some solutions in https://github.com/aspnet/KestrelHttpServer/issues/2104#issuecomment-337170742

With 2 cores the threadpool will have a lower minthreads value; so you may hit threadpool starvation (it will inject threads; but not as fast as the blocking apis are stopping them)

@benaadams No. My all api methods are async.
Like this:

    [HttpPost("postxmesajsil")]
    [Produces("application/json")]
    public async Task<bool> PostXamarinMesajsil([FromHeader]string token, [FromBody]JObject obj)
    {
        string identity = (string)obj["userId"];

        await TokenControl(token, identity);

        int userId = (int)obj["userId"];
        int friendId = (int)obj["friendId"];

        return await messagesDal.sp_mesajsilxamarin(userId, friendId);
    }

I'm also having the problem, I'm using dotnet core 2.0. It appears the whole dotnet process 'stops' rather than just kestrel as I have some background threads that stop doing anything also. This issue only happens when a reverse proxy server(HAProxy) is setting in front of kestrel and concurrency is high(above 50).

i got it, when HAProxy session sticky policy is on, kestrel will stop responding; but if I remove session sticky, kestrel works. Very strange.

You are not disposing the http client for a start. Was there an xmldocument.load () in there somewhere?

@sanishmistry resuse (e.g. static) HttpClient or Dispose it (e.g. using)? You may be hitting max outbound connections + keep alives on client?

Also pump up your outbound connection limits?

// Add at Main()
System.Net.ServicePointManager.DefaultConnectionLimit = 256; // Max concurrent outbound requests`

i will wrap the block in using statements and see if that fixes it, I think it that was likely the issue as it seemed to happen exactly 7 days after starting the process, and that block ran at a fixed time interval. I was not aware HttpClients needed to be disposed!

I was not aware HttpClients needed to be disposed!

You can also just use a single static one and not Dispose it. However you can't change things like DefaultHeaders per call as that wouldn't be thread safe

I solved my own problem for ubuntu 16.04.
I ran top command and then shift+m and then shift+h. And i saw many many dotnet actions loading the ram and it did not died after time by itself and it did not clear the ram by itself. And i looked to google and i add this line to crontab.(/etc/crontab):

*/5 * * * * root killall --older-than 40m -s SIGKILL dotnet

What is this?
Every 5 minutes run this command. What command? killall dotnet actions older than 40 minutes.
And extra note: Add this extra line to crontab for every 5 minutes to clear all buffer/cache. And restart crontab for this:

*/5 * * * * root sync && echo 3 > /proc/sys/vm/drop_caches

sudo /etc/init.d/cron reload
sudo /etc/init.d/cron restart

And see this command to clear ram every 45 minutes: watch -n 3 free -m

@hasanoruc What's starting your dotnet processes? When should be stopping the dotnet processes?

@halter73
If every request to the server and every transaction in the database causes a dotnet process in the ram in my ubuntu server, and this does not disappear by itself. I need to clear these operations. I can clean them with crontab. When you delete this information and the cache on the server, the user is connected a bit late. But if it remains within the time specified in the crontab, it reads again from the cache and fast.

I'm also trying to track down this issue with two of our sites. It seems to happen when there is load on the site and it doesn't fix itself. We use a lot of async redis calls / http client calls via the nest client

@niemyjski It could be threadpool starvation. Have you collected a memory dump when Kestrel isn't responding?

Same problem here. Randomly stops responding and consumes 1-2% CPU. dotnet core 2.0.3

@ilyabreev Any details?

Information that might be useful:

  1. OS version
  2. Whether or not your running under IIS/ANCM or behind another reverse proxy
  3. Frequency of occurence
  4. Amount of load the server was under when it stopped responding
  5. A core dump taken when the server is not responding
  6. Anything out of the ordinary that could be related.

Also, I recommend opening a new issue. Even if the behavior described in this issue is similar to what you're experiencing, this issue describes a bug that was fixed prior to the release of 2.0.0.

@halter73 yes, I have some memory dumps.

@niemyjski Can you share these? You can email me with a link at the address listed in my github profile. Thanks.

@halter73

  1. Ubuntu 14.04.5 LTS as host system. dotnet process runs inside docker container which based on ubuntu:xenial
  2. Running behind nginx as reverse proxy.
  3. 0-6 times in a day.
  4. Don't know exactly. I don't think it's constantly too high. Maybe some kind of CPU spikes occured before locking.
  5. Will try to email you link to memory dump.
  6. Nothing special.

As this issue has been closed, do we have another new issue to track down the "dotnet process stops" problem? We're facing the same issue as the dotnet process stops unexpectedly.

@mazong1123 please feel free to file another bug to discuss the current issues.

This blog post could help you increasing max open files and configure Nginx to close finished request connections

I guess i solved my own problem. A few days ago, i looked nginx error logs in /var/log/nginx/error.log. and saw error logs this: "upstream prematurely closed connection while reading response header from upstream". And i think about it and i looked my bad nginx config. And I changed the following config:

location / {
proxy_pass http://backend;  #throw error this line
}
location /api {
proxy_pass http://localhost:5021; #this is not throwing an error because config is ok.
}
upstream backend {
server 127.0.0.1:5020; # this throwing error. because i think coming request going to 2 port. and confusing.
server 127.0.0.1:19920;
}

my new config is:

location / {
proxy_pass http://localhost:5020;
}
location /api {
proxy_pass http://localhost:5021;
}

And not using the upstream because not necessary my system.

Sorry for my bad english.

@halter73 just a heads up I've been running 1.5 weeks in production without issues. I've seen high thread count and have been trying to get the profiler attached in azure to take a look (without success). Considering I've seen thread exhaustion in the past I want to ensure I'm not leaking any thing but the high thread count is normal for my load (but I never remember seeing this many of threads in the legacy asp.net app). I got a dump but all the threads were native and empty stack traces. Currently serving up 5+ billion requests a month across 4-5 large azure web app instances. Seeing 70-80% cpu and a couple hundred threads (600+). I was able to narrow down one of my issues to this: https://github.com/StackExchange/StackExchange.Redis/issues/759 .

Interesting note here is I wish Microsoft would throw some resources at this project (Redis client) as it is used by Microsoft for the distributed cache implementation and is used for various scenarios inside of asp.net ;)

I've made a BlockingDetector package to output warnings to the logs when you block:
https://github.com/benaadams/Ben.BlockingDetector

Only detects when blocking actually happens; so doesn't pick up calls that may block but don't or warn about coding practices that lead to blocking (or blocking that happens at OS rather than .NET level e.g. File.Read) - but it may help pick up things.

Same problem here, randomly no response on my asp.net core project on Azure app service (IIS and ASP.NET Core Module).
Another related issue: https://github.com/aspnet/Home/issues/1950

The typical metric of requests indicators as shown below, when no response error happens, the number of threads being used in the dotnet core will jump up to 400 or 500 rapidly.

image

Same problem in my environment, ASP NET CORE 2.0 published to ubuntu 16.04.
Application deployed in AWS with the following instances:
instance #1 contains our API which hang from time to time
instance #2 contains our static html files, nginx which host the static files and forward the requests to API (instance #1) and other instances which serves our other microservices.
(API Process hangs every 2-4 days)

I saw that this process has 650 dead threads, is it relevant?

(lldb) sos Threads
ThreadCount: 677
UnstartedThread: 0
BackgroundThread: 25
PendingThread: 0
DeadThread: 650
Hosted Runtime: no

Any Idea?

I had the same problem before, it's related to HttpClient library .

I wrote this blog post about this issue, it may help you https://medium.com/@mshanak/soved-dotnet-core-too-many-open-files-in-system-when-using-postgress-with-entity-framework-c6e30eeff6d1

If you update to 2.1 do you still run into the same problem? I am just curious if the switch from libuv to sockets inside of Kestrel fixes this issue?

I'm familiar with your post and we are using a single httpclient.

@shaykeren do use the HTTPclient to send requests periodically to an Url? if so, use cron job instead.

@mshanak @shaykeren when we experienced this issue it happened with async calls to the user identity and/or redis for cached user data within the razor view. The only solution was to bounce the application.

Edit: Forgot to add razor view

Catch it many times last week
NET Core 2.1, new sockets transport. Application stuck up

@ZOXEXIVO Are you running on 2.1.0 or one of the patches? It's possible you could be running into dotnet/corefx#29785 if you're still on 2.1.0.

@halter73 Facing same issue on Centos

dotnet --version:
2.1.302

I have used @benaadams 's blocking detector and saw plenty of blocking call, however i do not think that it could be the actual issue because the same app runs on windows for days without any problem.

With that said if i make following changes:

System.Net.ServicePointManager.DefaultConnectionLimit = 256; 
System.Threading.ThreadPool.GetMaxThreads(out int _, out int completionThreads);
System.Threading.ThreadPool.SetMinThreads(256, completionThreads);

The app remains running for a couple of more days..

any ideas?

Sorry to re-open this. We are experiencing this issue in our application as well, we run on CentOS in docker and behind nginx. Our app has had this issue since with 2.0 and also with 2.1. Increasing the thread pool seemed to help, but sometimes we experience kestrel hanging with very little load while for the majority of time we have long up times with high load, which makes it difficult to debug.

Has there been any update on this or any additional troubleshooting advice? We are planning to take a full core dump of kesterel/dotnet the next time this happens and look at what's happening, we suspect it's got to do with some deadlocks but we can't really see how our code would result in deadlocks, so maybe we are triggering some bug at the framework level.

@kyuz0 If increasing the number of threadpool threads helped, that's a good sign you're experiencing threadpool starvation. This could be caused by a deadlock like you suspect, but there could be other causes too like blocking I/O.

To find the exact cause, collecting a core dump is your best bet. Unfortunately, analyzing linux core dumps isn't yet quite as easy as it is on Windows, but you can find a guide here. "clrstack -all" should show all the managed stacks and what they're blocked on.

@halter73 thanks for your reply. Is there any debugging that we can enable in Kestrel to get some logs/alerts when the threadpool is full? The difficulty at the moment is that when the application reaches this state, there's simply nothing happening in the logs.

Something else I've seen in our code is that we make raw TCP connections to some endpoints like this:

using(var client = new TcpClient())
            {
                    try
                    {
                        var result = client.BeginConnect(IPAddress.Parse(externalIp), port, null, null);
                        var success = result.AsyncWaitHandle.WaitOne(TimeSpan.FromMilliseconds(TimeoutReadyCheckMilliseconds));
                        if (!success)
                        {
                            return false;
                        }

                        client.EndConnect(result);
                    }
                    catch (SocketException e)
                    {
                        return false;
                    }
            }

The using statement should ensure that the TcpClient gets disposed of in all cases, it should not matter that we don't explicitly call EndConnect() in all cases... at least this was my understanding, but maybe this is what's causing issues as the socket will not receive an explicit EndConnect in case success is false or in case of exceptions.

Blocking on the IAsyncResult returned by TcpClient.BeginConnect is a possible candidate for the cause of your threadpool starvation. To fix this, you'll want to call the newer TcpClient.ConnectAsync method and await it in an async method while being careful to avoid calls to Task.Wait(), Task.Result or anything else in your codebase that could block a thread waiting on a Task.

The BeginConnect/EndConnect method pair implement an older asynchronous API pattern called the Asynchronous Programing Model or APM. It's possible to use APM APIs without blocking, but it's much more difficult. You're almost always better off using a Task-returning API with async/await over using an APM API.

Also, did you leave some code out in the TcpClient initialization sample? The using block ends right after the call to TcpClient.EndConnect which means the connection has only just been established.

As for threadpool starvation events, you could look for the ThreadPoolWorkerThreadAdjustmentAdjustment event with "0x06 - Starvation" Reason using perfcollect and perfview. That being said, if this issues causes any health checks to fail, it's probably easier to have a watchdog process collect a core dump before killing the server process when the health check fails.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

chtbof picture chtbof  路  3Comments

txdv picture txdv  路  7Comments

pathcore-tasos picture pathcore-tasos  路  5Comments

halter73 picture halter73  路  5Comments

netcore-jroger picture netcore-jroger  路  7Comments