Kestrelhttpserver: Kestrel randomly locked and stop responding.

Created on 21 Dec 2016 · 76Comments · Source: aspnet/KestrelHttpServer

I run ASP.NET Core 1.1 on Ubuntu 16.04 in Docker container (with Ubuntu image).
This works fine, but 3 times in two weeks I catch error, that Kestrel stop responding.
No errors in Exception filter, in logs - only last record with start Action Executing, but not completed.
Why problem in Kestrel ? - Because it locked and not processing new requests.
At locked state, CPU usage is 4-5% (per core). I wait 4 hours, but not result.

TOP command result:

PID    USER      PR    NI    VIRT       RES          SHR S       %CPU      %MEM      TIME+       COMMAND
10788   root      20   0     11.396g    185060       53704 S     4.7       2.3       14:17.56     dotnet

Restarting docker container is solve issue.
I can't attach and trace it, because all tools, that I use - works only on Windows.

duplicate

Source

ZOXEXIVO

👍3

Most helpful comment

We're seeing similar. I can exec into the container and see the process still running. It's not responding to curl. It's sat at 4-6% CPU constantly.

I can 100% reproduce it. It happens randomly under heavy load from our protractor unit tests hitting our web app, so just takes anywhere from 0-20 minutes to trigger.

Our app listens on port 1924 and has nginx in front of it. Nginx returns a 502 or 504.

Curl inside container:
```[root@66e41d245901 tmp]# curl 127.0.0.1:1924 -v

Rebuilt URL to: 127.0.0.1:1924/
Trying 127.0.0.1...
Connected to 127.0.0.1 (127.0.0.1) port 1924 (#0)

GET / HTTP/1.1
Host: 127.0.0.1:1924
User-Agent: curl/7.50.1
Accept: /

Hangs and has to be killed.

netstat inside container:

[root@66e41d245901 vq]# Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 1976 0 localhost:xiip tcp 515 0 localhost:xiip tcp 0 0 localhost:49636 tcp 79 0 localhost:xiip tcp 79 0 localhost:xiip tcp 32 0 66e41d245901:54064 192.168.1.11:https CLOSE_WAIT
tcp 79 0 localhost:xiip tcp 0 0 localhost:postgres tcp 2076 0 localhost:xiip tcp 2002 0 localhost:xiip tcp 236 0 localhost:xiip tcp 0 0 66e41d245901:54308 192.168.1.11:https ESTABLISHED
tcp 0 0 localhost:postgres tcp 0 0 localhost:xiip tcp 79 0 localhost:xiip tcp 77 0 localhost:xiip tcp 0 0 66e41d245901:50248 mirrors.melbourne.:http TIME_WAIT
tcp 77 0 localhost:xiip tcp 0 0 localhost:56804 tcp 0 0 localhost:57050 tcp 0 0 localhost:56838 tcp 236 0 localhost:xiip tcp 0 0 localhost:57050 tcp 515 0 localhost:xiip tcp 79 0 localhost:xiip tcp 79 0 localhost:xiip tcp 489 0 localhost:xiip tcp 0 0 localhost:54525 tcp 2002 0 localhost:xiip tcp 79 0 localhost:xiip tcp 79 0 localhost:xiip udp6 0 0 localhost:56619 Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags Type State I-Node Path
unix 3 [ ] STREAM CONNECTED 22431
unix 3 [ ] STREAM CONNECTED 24418 /tmp/kestrel_0a937d9e657e40f98d00f6ec9e1d1388
unix 3 [ ] STREAM CONNECTED 28606
unix 3 [ ] STREAM CONNECTED 26748
``` netstat
localhost:48872 CLOSE_WAIT
localhost:48866 CLOSE_WAIT
localhost:57050 ESTABLISHED
localhost:48938 CLOSE_WAIT
localhost:48980 CLOSE_WAIT
localhost:48934 CLOSE_WAIT
localhost:56804 ESTABLISHED
localhost:48884 CLOSE_WAIT
localhost:48950 CLOSE_WAIT
localhost:49010 CLOSE_WAIT
localhost:56838 ESTABLISHED
localhost:48754 CLOSE_WAIT
localhost:48958 CLOSE_WAIT
localhost:48962 CLOSE_WAIT
localhost:48960 CLOSE_WAIT
localhost:postgres ESTABLISHED
localhost:54525 ESTABLISHED
localhost:postgres ESTABLISHED
localhost:49008 CLOSE_WAIT
localhost:49636 ESTABLISHED
localhost:48818 CLOSE_WAIT
localhost:48966 CLOSE_WAIT
localhost:48970 CLOSE_WAIT
localhost:48848 CLOSE_WAIT
localhost:57050 ESTABLISHED
localhost:48978 CLOSE_WAIT
localhost:48888 CLOSE_WAIT
localhost:48964 CLOSE_WAIT
localhost:56619 ESTABLISHED
/> />

philjones88 on 5 Jan 2017

👍3

All 76 comments

Just to be clear, are you running everything (host and containers) on Linux?

Are you doing remote access from a Windows console window? It could be this if so: https://github.com/aspnet/KestrelHttpServer/issues/1259

cesarblum on 21 Dec 2016

Yes, it's all running on Ubuntu. 2-core server with no server CPU (G4400) and 8GB RAM
I know problem with console locking in Windows, with mouse e.t.c, but this is not same issue, because docker run in background and I not access it directly. It stop responding without any user activity with server.
Kestrel doing something, but i don't understand what:
CPU load 4.7% per core (at locked state) is equals to 3-5 fast requests to main page in normal state.

It very hard to reproduce and i want help you to fix it, Can you advice trace tools for collect debug info at .NET Core in Linux environment ?

Issue always appear after few days without any user activity on server.

ZOXEXIVO on 21 Dec 2016

UPD. Just now server blocked again. 2 days after restarted container

ZOXEXIVO on 23 Dec 2016

@ZOXEXIVO are you passing a cancellation token to any of your response writing API calls (if you have any)

davidfowl on 4 Jan 2017

@davidfowl No. Only simple async requests to MongoDB without passing CancellationToken.

ZOXEXIVO on 4 Jan 2017

We're seeing similar. I can exec into the container and see the process still running. It's not responding to curl. It's sat at 4-6% CPU constantly.

I can 100% reproduce it. It happens randomly under heavy load from our protractor unit tests hitting our web app, so just takes anywhere from 0-20 minutes to trigger.

Our app listens on port 1924 and has nginx in front of it. Nginx returns a 502 or 504.

Curl inside container:
```[root@66e41d245901 tmp]# curl 127.0.0.1:1924 -v

Rebuilt URL to: 127.0.0.1:1924/
Trying 127.0.0.1...
Connected to 127.0.0.1 (127.0.0.1) port 1924 (#0)

GET / HTTP/1.1
Host: 127.0.0.1:1924
User-Agent: curl/7.50.1
Accept: /

Hangs and has to be killed.

netstat inside container:

[root@66e41d245901 vq]# netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 1976 0 localhost:xiip localhost:48872 CLOSE_WAIT
tcp 515 0 localhost:xiip localhost:48866 CLOSE_WAIT
tcp 0 0 localhost:49636 localhost:57050 ESTABLISHED
tcp 79 0 localhost:xiip localhost:48938 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48980 CLOSE_WAIT
tcp 32 0 66e41d245901:54064 192.168.1.11:https CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48934 CLOSE_WAIT
tcp 0 0 localhost:postgres localhost:56804 ESTABLISHED
tcp 2076 0 localhost:xiip localhost:48884 CLOSE_WAIT
tcp 2002 0 localhost:xiip localhost:48950 CLOSE_WAIT
tcp 236 0 localhost:xiip localhost:49010 CLOSE_WAIT
tcp 0 0 66e41d245901:54308 192.168.1.11:https ESTABLISHED
tcp 0 0 localhost:postgres localhost:56838 ESTABLISHED
tcp 0 0 localhost:xiip localhost:48754 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48958 CLOSE_WAIT
tcp 77 0 localhost:xiip localhost:48962 CLOSE_WAIT
tcp 0 0 66e41d245901:50248 mirrors.melbourne.:http TIME_WAIT
tcp 77 0 localhost:xiip localhost:48960 CLOSE_WAIT
tcp 0 0 localhost:56804 localhost:postgres ESTABLISHED
tcp 0 0 localhost:57050 localhost:54525 ESTABLISHED
tcp 0 0 localhost:56838 localhost:postgres ESTABLISHED
tcp 236 0 localhost:xiip localhost:49008 CLOSE_WAIT
tcp 0 0 localhost:57050 localhost:49636 ESTABLISHED
tcp 515 0 localhost:xiip localhost:48818 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48966 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48970 CLOSE_WAIT
tcp 489 0 localhost:xiip localhost:48848 CLOSE_WAIT
tcp 0 0 localhost:54525 localhost:57050 ESTABLISHED
tcp 2002 0 localhost:xiip localhost:48978 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48888 CLOSE_WAIT
tcp 79 0 localhost:xiip localhost:48964 CLOSE_WAIT
udp6 0 0 localhost:56619 localhost:56619 ESTABLISHED
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags Type State I-Node Path
unix 3 [ ] STREAM CONNECTED 22431
unix 3 [ ] STREAM CONNECTED 24418 /tmp/kestrel_0a937d9e657e40f98d00f6ec9e1d1388
unix 3 [ ] STREAM CONNECTED 27652 /tmp/kestrel_ac09a04d99044ce686585e802a31f7ed
unix 3 [ ] STREAM CONNECTED 28605
unix 3 [ ] STREAM CONNECTED 28606
unix 3 [ ] STREAM CONNECTED 26748
```

philjones88 on 5 Jan 2017

👍3

I work with @philjones88 and to confirm this is a self-contained app running against NetCore 1.0.1 on a CentOS 7 container.

jchannon on 5 Jan 2017

we enabled more verbose logging and captured:

2017-01-05 14:06:40.951911 Microsoft.AspNetCore.Hosting.Internal.WebHost Request starting HTTP/1.1 GET http://localhost/static/themes/base.css?cb=1483621285065 2017-01-05 14:06:40.951911 RequestId : 0HL1LEEL4VG15 Component : VQPortal Method : OWIN Pipeline Message : Request Start: /static/themes/base.css?cb=1483621285065 2017-01-05 14:06:40.951911 Microsoft.AspNetCore.StaticFiles.StaticFileMiddleware Sending file. Request path: '/themes/base.css'. Physical path: '/vq/portal/static/themes/base.css' 2017-01-05 14:06:40.977634 Microsoft.AspNetCore.Server.Kestrel Connection id "0HL1LEEL40UDH" communication error Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvException: Error -32 EPIPE broken pipe 2017-01-05 14:06:40.977634 Microsoft.AspNetCore.Server.Kestrel Connection id "0HL1LEEL40UDH" disconnecting.

philjones88 on 5 Jan 2017

👍1

After using NGINX before Kestrel, problem disappear (2 weeks uptime)

ZOXEXIVO on 19 Jan 2017

@ZOXEXIVO we are experiencing the same symptoms in the same containerized environment. Since deploying nginx before kestrel , have you had any locks?

euclid47 on 2 Feb 2017

@euclid47 behind nginx server, it works perfect! I
Here, https://github.com/aspnet/KestrelHttpServer/issues/1304 problem was in StaticFiles middleware.

ZOXEXIVO on 2 Feb 2017

Any idea when #1304 will be pushed out? I am running .net core 1.1 in an Azure web app and am similarly having the problem where it will crash and never start serving content again until the app is swapped or restarted

runxc1 on 3 Feb 2017

@ZOXEXIVO @runxc1 #1304 should be released in Feb.

muratg on 3 Feb 2017

👍1

Closing as a dup of https://github.com/aspnet/KestrelHttpServer/issues/1304.

Eilon on 8 Feb 2017

Hi,
Do we have any plans to ship v1.1.1 as we are also getting the same error while deploying dot net core web app on Linux docker - ubntu ?

tavisca-syedave on 15 Feb 2017

@tavisca-syedave we plan to have the patch out hopefully later this month.

Eilon on 15 Feb 2017

Kestrel 1.1.1 is now released. Is anyone still running into this issue?

cesarblum on 14 Mar 2017

I think I've just got it with 1.1.2 using unix socket (behind nginx). Will enable verbose logging and come back if I can reproduce it.

I have StaticFiles middleware disabled in production (static files are served by nginx).

pieceofsummer on 13 May 2017

👍1

Just got it too with 1.1.2 begind nginx.

ilyabreev on 13 Jul 2017

I also just got it on 1.1.2. Nginx reverse proxy to Kestrel in Debian. In my case it was working in production for more more than 1 month then suddenly this it stopped working. Now it stopped working every hour

jeff-pang on 25 Aug 2017

😕1

Best advice -> upgrade to v2.0

ZOXEXIVO on 25 Aug 2017

@ZOXEXIVO Did upgrading to v2 fixed your problem?

jeff-pang on 25 Aug 2017

@jeff-pang Problem disappeared when i host kestrel behind nginx

ZOXEXIVO on 25 Aug 2017

@ZOXEXIVO I am already hosting Kestrel behind Nginx, but I will give it a try anyway. Thanks

jeff-pang on 25 Aug 2017

Upgrading to 2.0 fixed the issue

jeff-pang on 26 Aug 2017

The same issue on 2.0.0, please check this
(kestrel hangs)
telnet port is oke, but curl http => hangs

(lldb) thread list
Process 4262 stopped
thread #1: tid = 4262, 0x00007f7cf6f106d5 libpthread.so.0`__pthread_cond_wait + 197, name = 'dotnet', stop reason = trace

thread #2: tid = 4263, 0x00007f7cf640fe2d libc.so.6, name = 'dotnet'
thread #3: tid = 4264, 0x00007f7cf6f10a82 libpthread.so.0__pthread_cond_timedwait + 306, name = 'dotnet' thread #4: tid = 4265, 0x00007f7cf6f13c3d libpthread.so.0, name = 'dotnet' thread #5: tid = 4266, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #6: tid = 4267, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #7: tid = 4268, 0x00007f7cf6f10a82 libpthread.so.0__pthread_cond_timedwait + 306, name = 'dotnet'
thread #8: tid = 4269, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #9: tid = 4270, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #10: tid = 4271, 0x00007f7cf6f1349d libpthread.so.0, name = 'dotnet'
thread #11: tid = 4272, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #12: tid = 4275, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #13: tid = 4276, 0x00007f7cf6f106d4 libpthread.so.0__pthread_cond_wait + 196, name = 'dotnet' thread #14: tid = 4277, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #15: tid = 4278, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #16: tid = 4279, 0x00007f7cf6f1349d libpthread.so.0, name = 'dotnet' thread #17: tid = 4280, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #18: tid = 4281, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #19: tid = 4282, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #20: tid = 4283, 0x00007f7cf6f1349d libpthread.so.0, name = 'dotnet'
thread #21: tid = 4284, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #22: tid = 4285, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #23: tid = 4286, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #24: tid = 4287, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #25: tid = 4288, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #26: tid = 4290, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #27: tid = 4291, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #28: tid = 4292, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #29: tid = 4293, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #30: tid = 4294, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #31: tid = 4295, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #32: tid = 4296, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #33: tid = 4297, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #34: tid = 4298, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #35: tid = 4299, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #36: tid = 4300, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #37: tid = 4301, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #38: tid = 4308, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #39: tid = 4309, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #40: tid = 4310, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #41: tid = 4311, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #42: tid = 4317, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #43: tid = 4326, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #44: tid = 4328, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #45: tid = 4332, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #46: tid = 4374, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #47: tid = 4375, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #48: tid = 4376, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #49: tid = 4377, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #50: tid = 4378, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #51: tid = 4379, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #52: tid = 4381, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #53: tid = 4382, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #54: tid = 4383, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #55: tid = 4384, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #56: tid = 4385, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #57: tid = 4386, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #58: tid = 4387, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #59: tid = 4392, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #60: tid = 4435, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #61: tid = 4436, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #62: tid = 4437, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #63: tid = 4438, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet' thread #64: tid = 4558, 0x00007f7cf6f106d5 libpthread.so.0__pthread_cond_wait + 197, name = 'dotnet'
thread #65: tid = 4559, 0x00007f7cf6f106d5 libpthread.so.0`__pthread_cond_wait + 197, name = 'dotnet'

(lldb) sos DumpStack
OS Thread Id: 0x10a7 (2)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libc.so.6 + 0xffffffff
Child-SP RetAddr Caller, Callee
00007F7CF4E4DDF0 00007f7cf5b70cb4 libcoreclr.so!CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(int, unsigned char, int) + 0x64, calling libcoreclr.so!poll
00007F7CF4E4DE40 00007f7cf5b70284 libcoreclr.so!CorUnix::CPalSynchronizationManager::WorkerThread(void) + 0x84, calling libcoreclr.so!CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(int, unsigned char, int)
00007F7CF4E4DE60 00007f7cf733a430 ld-linux-x86-64.so.2!_dl_runtime_resolve + 0x50, calling ld-linux-x86-64.so.2!_dl_fixup
00007F7CF4E4DEF0 00007f7cf5b79342 libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void) + 0x132
00007F7CF4E4DFB0 00007f7cf641a76d libc.so.6!__clone + 0x6d

vinhhrv on 27 Aug 2017

@vinhhrv Could you upload your dump somewhere and share it with us?

cesarblum on 8 Sep 2017

@cesarbs
i will send my dump to your email is [email protected] ?
please confirm, (my email is [email protected])

vinhhrv on 8 Sep 2017

Sounds good, though depending on the size you might not be able to email it.

cesarblum on 8 Sep 2017

@cesarbs, i sent to your email.
Thanks for support.

vinhhrv on 11 Sep 2017

@vinhhrv About this:

telnet port is oke, but curl http => hangs

Can you share more details about the specific request that causes it to hang? Or does it hang on any request?

cesarblum on 13 Sep 2017

@pieceofsummer @ilyabreev @jeff-pang Can you share more details about the issues you've ran into with Kestrel 1.1.2? In @vinhhrv's case, it appears he is running into the connection limit that we introduced in 2.0. But that limit isn't there in 1.1.2, so that can't be the cause of your issues.

Could any of you share a memory dump captured when Kestrel hangs?

cesarblum on 18 Sep 2017

i got a same problem and i send you the dump zipped
please give me some advice
Thanks

anthonywanted on 22 Sep 2017

I'm also having the problem, it appears the whole dotnet process 'stops' rather than just kestrel as I have some background threads that stop doing anything also. The dotnet memory usage goes down bigly when this happens also.

Sometimes I see this output

^[[40m^[[1m^[[33mwarn^[[39m^[[22m^[[49m: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "09/28/2017 18:00:58 +00:00".

But there does not seem to be a correlation between this message and the hangs.

sanishx on 28 Sep 2017

Hello. Same problem. But my problem is a little bit different. So... I restart server every one or two week. It looks like server getting tired. When tired is not responding and then nginx getting 502 error. I am testing mobile app and getting and posting more datas and server get this error every 2 days. And i restart the server and then solve my problem for two days. I use ubuntu's service from /etc/systemd/system. I create webapi.service and webui.service for this dotnet crash problem. Dotnet automatically starting after crash but not healty and crashing more and more times. What is my problem. Please solve my problem. Thanks.

@muratg
@davidfowl

hasanoruc on 1 Oct 2017

I am also using Ubuntu and running through the supervisor system.

sanishx on 1 Oct 2017

@anthonywanted @sanishx @hasanoruc We've seen issues where the server seemingly hangs due to thread pool exhaustion. It is always the case that somewhere in the application a sync call is being made (e.g. SomethingAsync().GetAwaiter().GetResult()). This is unfortunately hidden in unexpected places sometimes, like some Azure SDKs. Have you checked if that could be the case?

cesarblum on 6 Oct 2017

@cesarbs i will check.

and. more information.
my system did not show this error when it was 12gb ram, 6cores processor.
but my system is now 2cores processors and 4gbram. this error look when my system has been smaller. under the load(looking 45 persons per second) %2 cpu usage(2cores). not problem. but i test my app and I proccess more operation and dotnet is crashing. it seems like sockets. but i dont know.
sorry for my bad english.

hasanoruc on 25 Oct 2017

@hasanoruc do you use non-async (e.g. blocking) apis? May be some solutions in https://github.com/aspnet/KestrelHttpServer/issues/2104#issuecomment-337170742

With 2 cores the threadpool will have a lower minthreads value; so you may hit threadpool starvation (it will inject threads; but not as fast as the blocking apis are stopping them)

benaadams on 25 Oct 2017

@benaadams No. My all api methods are async.
Like this:

    [HttpPost("postxmesajsil")]
    [Produces("application/json")]
    public async Task<bool> PostXamarinMesajsil([FromHeader]string token, [FromBody]JObject obj)
    {
        string identity = (string)obj["userId"];

        await TokenControl(token, identity);

        int userId = (int)obj["userId"];
        int friendId = (int)obj["friendId"];

        return await messagesDal.sp_mesajsilxamarin(userId, friendId);
    }

hasanoruc on 25 Oct 2017

I'm also having the problem, I'm using dotnet core 2.0. It appears the whole dotnet process 'stops' rather than just kestrel as I have some background threads that stop doing anything also. This issue only happens when a reverse proxy server(HAProxy) is setting in front of kestrel and concurrency is high(above 50).

i got it, when HAProxy session sticky policy is on, kestrel will stop responding; but if I remove session sticky, kestrel works. Very strange.

ChenWenBrian on 30 Oct 2017

You are not disposing the http client for a start. Was there an xmldocument.load () in there somewhere?

Drawaes on 30 Oct 2017

👍1

@sanishmistry resuse (e.g. static) HttpClient or Dispose it (e.g. using)? You may be hitting max outbound connections + keep alives on client?

benaadams on 30 Oct 2017

Also pump up your outbound connection limits?

// Add at Main()
System.Net.ServicePointManager.DefaultConnectionLimit = 256; // Max concurrent outbound requests`

benaadams on 30 Oct 2017

i will wrap the block in using statements and see if that fixes it, I think it that was likely the issue as it seemed to happen exactly 7 days after starting the process, and that block ran at a fixed time interval. I was not aware HttpClients needed to be disposed!

sanishmistry on 30 Oct 2017

I was not aware HttpClients needed to be disposed!

You can also just use a single static one and not Dispose it. However you can't change things like DefaultHeaders per call as that wouldn't be thread safe

benaadams on 2 Nov 2017

I solved my own problem for ubuntu 16.04.
I ran top command and then shift+m and then shift+h. And i saw many many dotnet actions loading the ram and it did not died after time by itself and it did not clear the ram by itself. And i looked to google and i add this line to crontab.(/etc/crontab):

*/5 * * * * root killall --older-than 40m -s SIGKILL dotnet

What is this?
Every 5 minutes run this command. What command? killall dotnet actions older than 40 minutes.
And extra note: Add this extra line to crontab for every 5 minutes to clear all buffer/cache. And restart crontab for this:

*/5 * * * * root sync && echo 3 > /proc/sys/vm/drop_caches

sudo /etc/init.d/cron reload
sudo /etc/init.d/cron restart

And see this command to clear ram every 45 minutes: watch -n 3 free -m

hasanoruc on 3 Dec 2017

@hasanoruc What's starting your dotnet processes? When should be stopping the dotnet processes?

halter73 on 4 Dec 2017

@halter73
If every request to the server and every transaction in the database causes a dotnet process in the ram in my ubuntu server, and this does not disappear by itself. I need to clear these operations. I can clean them with crontab. When you delete this information and the cache on the server, the user is connected a bit late. But if it remains within the time specified in the crontab, it reads again from the cache and fast.

hasanoruc on 6 Dec 2017

I'm also trying to track down this issue with two of our sites. It seems to happen when there is load on the site and it doesn't fix itself. We use a lot of async redis calls / http client calls via the nest client

niemyjski on 15 Dec 2017

@niemyjski It could be threadpool starvation. Have you collected a memory dump when Kestrel isn't responding?

halter73 on 15 Dec 2017

Same problem here. Randomly stops responding and consumes 1-2% CPU. dotnet core 2.0.3

ilyabreev on 18 Dec 2017

@ilyabreev Any details?

Information that might be useful:

OS version
Whether or not your running under IIS/ANCM or behind another reverse proxy
Frequency of occurence
Amount of load the server was under when it stopped responding
A core dump taken when the server is not responding
Anything out of the ordinary that could be related.

Also, I recommend opening a new issue. Even if the behavior described in this issue is similar to what you're experiencing, this issue describes a bug that was fixed prior to the release of 2.0.0.

halter73 on 18 Dec 2017

@halter73 yes, I have some memory dumps.

niemyjski on 18 Dec 2017

@niemyjski Can you share these? You can email me with a link at the address listed in my github profile. Thanks.

halter73 on 18 Dec 2017

@halter73

Ubuntu 14.04.5 LTS as host system. dotnet process runs inside docker container which based on ubuntu:xenial
Running behind nginx as reverse proxy.
0-6 times in a day.
Don't know exactly. I don't think it's constantly too high. Maybe some kind of CPU spikes occured before locking.
Will try to email you link to memory dump.
Nothing special.

ilyabreev on 19 Dec 2017

As this issue has been closed, do we have another new issue to track down the "dotnet process stops" problem? We're facing the same issue as the dotnet process stops unexpectedly.

mazong1123 on 20 Dec 2017

@mazong1123 please feel free to file another bug to discuss the current issues.

muratg on 20 Dec 2017

This blog post could help you increasing max open files and configure Nginx to close finished request connections

mshanak on 24 Dec 2017

I guess i solved my own problem. A few days ago, i looked nginx error logs in /var/log/nginx/error.log. and saw error logs this: "upstream prematurely closed connection while reading response header from upstream". And i think about it and i looked my bad nginx config. And I changed the following config:

location / {
proxy_pass http://backend;  #throw error this line
}
location /api {
proxy_pass http://localhost:5021; #this is not throwing an error because config is ok.
}
upstream backend {
server 127.0.0.1:5020; # this throwing error. because i think coming request going to 2 port. and confusing.
server 127.0.0.1:19920;
}

my new config is:

location / {
proxy_pass http://localhost:5020;
}
location /api {
proxy_pass http://localhost:5021;
}

And not using the upstream because not necessary my system.

Sorry for my bad english.

hasanoruc on 29 Dec 2017

@halter73 just a heads up I've been running 1.5 weeks in production without issues. I've seen high thread count and have been trying to get the profiler attached in azure to take a look (without success). Considering I've seen thread exhaustion in the past I want to ensure I'm not leaking any thing but the high thread count is normal for my load (but I never remember seeing this many of threads in the legacy asp.net app). I got a dump but all the threads were native and empty stack traces. Currently serving up 5+ billion requests a month across 4-5 large azure web app instances. Seeing 70-80% cpu and a couple hundred threads (600+). I was able to narrow down one of my issues to this: https://github.com/StackExchange/StackExchange.Redis/issues/759 .

Interesting note here is I wish Microsoft would throw some resources at this project (Redis client) as it is used by Microsoft for the distributed cache implementation and is used for various scenarios inside of asp.net ;)

niemyjski on 3 Jan 2018

I've made a BlockingDetector package to output warnings to the logs when you block:
https://github.com/benaadams/Ben.BlockingDetector

Only detects when blocking actually happens; so doesn't pick up calls that may block but don't or warn about coding practices that lead to blocking (or blocking that happens at OS rather than .NET level e.g. File.Read) - but it may help pick up things.

benaadams on 10 Jan 2018

Same problem here, randomly no response on my asp.net core project on Azure app service (IIS and ASP.NET Core Module).
Another related issue: https://github.com/aspnet/Home/issues/1950

The typical metric of requests indicators as shown below, when no response error happens, the number of threads being used in the dotnet core will jump up to 400 or 500 rapidly.

Itoktsnhc on 1 Mar 2018

Same problem in my environment, ASP NET CORE 2.0 published to ubuntu 16.04.
Application deployed in AWS with the following instances:
instance #1 contains our API which hang from time to time
instance #2 contains our static html files, nginx which host the static files and forward the requests to API (instance #1) and other instances which serves our other microservices.
(API Process hangs every 2-4 days)

I saw that this process has 650 dead threads, is it relevant?

(lldb) sos Threads
ThreadCount: 677
UnstartedThread: 0
BackgroundThread: 25
PendingThread: 0
DeadThread: 650
Hosted Runtime: no

Any Idea?

shaykeren on 16 May 2018

I had the same problem before, it's related to HttpClient library .

I wrote this blog post about this issue, it may help you https://medium.com/@mshanak/soved-dotnet-core-too-many-open-files-in-system-when-using-postgress-with-entity-framework-c6e30eeff6d1

mshanak on 16 May 2018

If you update to 2.1 do you still run into the same problem? I am just curious if the switch from libuv to sockets inside of Kestrel fixes this issue?

runxc1 on 16 May 2018

😕1

I'm familiar with your post and we are using a single httpclient.

shaykeren on 16 May 2018

@shaykeren do use the HTTPclient to send requests periodically to an Url? if so, use cron job instead.

mshanak on 17 May 2018

@mshanak @shaykeren when we experienced this issue it happened with async calls to the user identity and/or redis for cached user data within the razor view. The only solution was to bounce the application.

Edit: Forgot to add razor view

euclid47 on 17 May 2018

Catch it many times last week
NET Core 2.1, new sockets transport. Application stuck up

ZOXEXIVO on 13 Jul 2018

@ZOXEXIVO Are you running on 2.1.0 or one of the patches? It's possible you could be running into dotnet/corefx#29785 if you're still on 2.1.0.

halter73 on 13 Jul 2018

👍2

@halter73 Facing same issue on Centos

dotnet --version:
2.1.302

I have used @benaadams 's blocking detector and saw plenty of blocking call, however i do not think that it could be the actual issue because the same app runs on windows for days without any problem.

With that said if i make following changes:

System.Net.ServicePointManager.DefaultConnectionLimit = 256; 
System.Threading.ThreadPool.GetMaxThreads(out int _, out int completionThreads);
System.Threading.ThreadPool.SetMinThreads(256, completionThreads);

The app remains running for a couple of more days..

any ideas?

HassanHashemi on 14 Aug 2018

👍1

Sorry to re-open this. We are experiencing this issue in our application as well, we run on CentOS in docker and behind nginx. Our app has had this issue since with 2.0 and also with 2.1. Increasing the thread pool seemed to help, but sometimes we experience kestrel hanging with very little load while for the majority of time we have long up times with high load, which makes it difficult to debug.

Has there been any update on this or any additional troubleshooting advice? We are planning to take a full core dump of kesterel/dotnet the next time this happens and look at what's happening, we suspect it's got to do with some deadlocks but we can't really see how our code would result in deadlocks, so maybe we are triggering some bug at the framework level.

kyuz0 on 15 Nov 2018

@kyuz0 If increasing the number of threadpool threads helped, that's a good sign you're experiencing threadpool starvation. This could be caused by a deadlock like you suspect, but there could be other causes too like blocking I/O.

To find the exact cause, collecting a core dump is your best bet. Unfortunately, analyzing linux core dumps isn't yet quite as easy as it is on Windows, but you can find a guide here. "clrstack -all" should show all the managed stacks and what they're blocked on.

halter73 on 16 Nov 2018

@halter73 thanks for your reply. Is there any debugging that we can enable in Kestrel to get some logs/alerts when the threadpool is full? The difficulty at the moment is that when the application reaches this state, there's simply nothing happening in the logs.

Something else I've seen in our code is that we make raw TCP connections to some endpoints like this:

using(var client = new TcpClient())
            {
                    try
                    {
                        var result = client.BeginConnect(IPAddress.Parse(externalIp), port, null, null);
                        var success = result.AsyncWaitHandle.WaitOne(TimeSpan.FromMilliseconds(TimeoutReadyCheckMilliseconds));
                        if (!success)
                        {
                            return false;
                        }

                        client.EndConnect(result);
                    }
                    catch (SocketException e)
                    {
                        return false;
                    }
            }

The using statement should ensure that the TcpClient gets disposed of in all cases, it should not matter that we don't explicitly call EndConnect() in all cases... at least this was my understanding, but maybe this is what's causing issues as the socket will not receive an explicit EndConnect in case success is false or in case of exceptions.

kyuz0 on 16 Nov 2018

Blocking on the IAsyncResult returned by TcpClient.BeginConnect is a possible candidate for the cause of your threadpool starvation. To fix this, you'll want to call the newer TcpClient.ConnectAsync method and await it in an async method while being careful to avoid calls to Task.Wait(), Task.Result or anything else in your codebase that could block a thread waiting on a Task.

The BeginConnect/EndConnect method pair implement an older asynchronous API pattern called the Asynchronous Programing Model or APM. It's possible to use APM APIs without blocking, but it's much more difficult. You're almost always better off using a Task-returning API with async/await over using an APM API.

Also, did you leave some code out in the TcpClient initialization sample? The using block ends right after the call to TcpClient.EndConnect which means the connection has only just been established.

As for threadpool starvation events, you could look for the ThreadPoolWorkerThreadAdjustmentAdjustment event with "0x06 - Starvation" Reason using perfcollect and perfview. That being said, if this issues causes any health checks to fail, it's probably easier to have a watchdog process collect a core dump before killing the server process when the health check fails.

halter73 on 16 Nov 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Microsoft.AspNetCore.Server.Kestrel[0]

chtbof · 3Comments

uv_buf_t size faulty on windows?

txdv · 7Comments

InvalidOperationException: Response Content-Length mismatch when streaming a bitmap

pathcore-tasos · 5Comments

Switch back to libuv as the default transport

halter73 · 5Comments

about pool recycling process.

netcore-jroger · 7Comments