1.1.5
raw sockets
macOS, linux
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode)
It seems like client tcp socket connections are not properly closed and leaking file descriptors. When testing the server with telnet no leak occurs. The issue is within the client code. The server part is only included for a complete, reproducable example.
Do the read and write channel need separate closing or is something else not handled correctly in the example?
The output is:
Server said: 'hello 0'
open file descriptors: 64
Server said: 'hello 1'
open file descriptors: 67
Server said: 'hello 2'
open file descriptors: 71
Example:
fun main() = runBlocking {
val osMxBean = ManagementFactory.getOperatingSystemMXBean() as UnixOperatingSystemMXBean
val address = InetSocketAddress("127.0.0.1", 2323)
val connectJob = launch {
// repeatedly connect to the server
repeat(3) {
delay(1_000)
aSocket(ActorSelectorManager(Dispatchers.IO)).tcp().connect(address).use { socket ->
val input = socket.openReadChannel()
val output = socket.openWriteChannel(autoFlush = true)
output.writeStringUtf8("hello $it\r\n")
val response = input.readUTF8Line()
println("Server said: '$response'")
}
println("open file descriptors: ${osMxBean.openFileDescriptorCount}")
}
}
// server - irrelevant for the issue
val serverJob = launch {
aSocket(ActorSelectorManager(Dispatchers.IO)).tcp().bind(address).use { server ->
while (true) {
val socket = server.accept()
launch {
socket.use {
val input = socket.openReadChannel()
val output = socket.openWriteChannel(autoFlush = true)
while (true) {
val line = input.readUTF8Line()
output.writeStringUtf8("$line\r\n")
}
}
}
}
}
}
connectJob.join()
serverJob.cancel()
}
Both input and output channels need to be closedt. In the example above output channel is not closed so native socket is not released since we are still waiting for output bytes
Unfortunately explicitly closing the output doesn't help either. Even when no input or output channels are opened at all, the leak is still there.
repeat(3) {
delay(1_000)
val socket = aSocket(ActorSelectorManager(Dispatchers.IO)).tcp().connect(address)
socket.close()
println("open file descriptors: ${osMxBean.openFileDescriptorCount}")
}
Aside: Wouldn't it make sense from an API usablity standpoint to automatically close the input and output channel when the socket is closed? I don't see much value in closing the socket but keeping the output channel open.
Also note that sockets are not closed immediately but they live in state TIME_WAIT for some system-dependant period of time (usually minutes). See TCP states for more details. This is driven by OS and out of our control.
Use ss | grep 2323 on Linux and netstat on macos to see socket states
Thanks for your quick replys. The sockets seem to be closed but there seems to be a file descriptor leak. On our production system we run out of os file descriptors after a certain amount of connections.
The system connects to port 3445, 3446, 3447, 3448 and 3449 on localhost. It drops the connection from time to time and reconnects later. There is no output channel opened, the app is just reading from the sockets.
Just 4 sockets show up in ss which is good.
$ ss -n | egrep ':3445|:3446|:3447|:3448|:3449'
tcp ESTAB 0 0 127.0.0.1:50814 127.0.0.1:3447
tcp ESTAB 0 0 127.0.0.1:49862 127.0.0.1:3449
tcp ESTAB 0 0 127.0.0.1:56700 127.0.0.1:3445
tcp ESTAB 0 0 127.0.0.1:35204 127.0.0.1:3446
In lsof everything looks good too. The first 3 are from incoming connections to the app which should be irrelevant.
$ lsof -P -i -a -p 31339
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 31339 xxxxx 36u IPv4 1065797 0t0 TCP *:3444 (LISTEN)
java 31339 xxxxx 55u IPv4 1080562 0t0 TCP xxxxx.xxxxxx.xxxxxxxxxxx:3444->xxxxx-dev.xxxxxx.xxxxxxxxxxx:43118 (ESTABLISHED)
java 31339 xxxxx 56u IPv4 1080565 0t0 TCP xxxxx.xxxxxx.xxxxxxxxxxx:3444->xxxxx.xxxxxx.xxxxxxxxxxx:34796 (ESTABLISHED)
java 31339 xxxxx 2465u IPv4 1391263 0t0 TCP localhost:50838->localhost:3449 (ESTABLISHED)
java 31339 xxxxx 2474u IPv4 1389362 0t0 TCP localhost:35204->localhost:3446 (ESTABLISHED)
java 31339 xxxxx 2477u IPv4 1389045 0t0 TCP localhost:56700->localhost:3445 (ESTABLISHED)
java 31339 xxxxx 2496u IPv4 1391272 0t0 TCP localhost:51774->localhost:3447 (ESTABLISHED)
However in /proc/PID/fd/ there are lots of entries (2518 at time of this writing). Most of the entries look like this.
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 981 -> anon_inode:[eventpoll]
lr-x------ 1 xxxxx xxxxx 64 14. Mai 09:19 982 -> pipe:[1170329]
l-wx------ 1 xxxxx xxxxx 64 14. Mai 09:19 983 -> pipe:[1170329]
lr-x------ 1 xxxxx xxxxx 64 14. Mai 09:19 984 -> pipe:[1170961]
l-wx------ 1 xxxxx xxxxx 64 14. Mai 09:19 985 -> pipe:[1170961]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 986 -> anon_inode:[eventpoll]
lr-x------ 1 xxxxx xxxxx 64 14. Mai 09:19 987 -> pipe:[1170962]
l-wx------ 1 xxxxx xxxxx 64 14. Mai 09:19 988 -> pipe:[1170962]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 989 -> anon_inode:[eventpoll]
lrwx------ 1 xxxxx xxxxx 64 13. Mai 20:52 99 -> socket:[1068003]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 990 -> anon_inode:[eventpoll]
lr-x------ 1 xxxxx xxxxx 64 14. Mai 09:19 991 -> pipe:[1175035]
l-wx------ 1 xxxxx xxxxx 64 14. Mai 09:19 992 -> pipe:[1170271]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 993 -> anon_inode:[eventpoll]
lr-x------ 1 xxxxx xxxxx 64 14. Mai 09:19 994 -> pipe:[1171071]
lr-x------ 1 xxxxx xxxxx 64 14. Mai 09:19 995 -> pipe:[1170424]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 996 -> anon_inode:[eventpoll]
lr-x------ 1 xxxxx xxxxx 64 14. Mai 09:19 997 -> pipe:[1170330]
l-wx------ 1 xxxxx xxxxx 64 14. Mai 09:19 998 -> pipe:[1170330]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 999 -> anon_inode:[eventpoll]
There are also 137 for to sockets, out of which 127 are for the same socket device!
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 874 -> socket:[1068003]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 891 -> socket:[1068003]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 909 -> socket:[1068003]
lrwx------ 1 xxxxx xxxxx 64 13. Mai 20:52 91 -> socket:[1068003]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 915 -> socket:[1068003]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 922 -> socket:[1068003]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 942 -> socket:[1068003]
lrwx------ 1 xxxxx xxxxx 64 13. Mai 20:52 95 -> socket:[1068003]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 954 -> socket:[1068003]
lrwx------ 1 xxxxx xxxxx 64 14. Mai 09:19 955 -> socket:[1068003]
lrwx------ 1 xxxxx xxxxx 64 13. Mai 20:52 99 -> socket:[1068003]
Do you have any idea how that could happen?
I will later try to reproduce this on my local macOS developement environment in the simple example.
But the symptoms (ever increasing file descriptor count) are the same on macOS with the simple example.
ss -na ?
-n is to not resolve service names. It's used to print raw port numbers.
-a is "all". Just added it to see if more shows up with it. it is the same result without this option.
Edited because the reply by e-mail feature messed up the response.
@cy6erGn0m Oops, I totally misinterpreted your comment reading it on my mobile phone before. Sorry about that.
The output doesn't change with the -a flag. After running it several times, I cought it once with one in TIME-WAIT state. Shortly after the TIME-WAIT one went away as expected.
tcp ESTAB 0 0 127.0.0.1:36574 127.0.0.1:3449
tcp ESTAB 0 0 127.0.0.1:43160 127.0.0.1:3445
tcp TIME-WAIT 0 0 127.0.0.1:36060 127.0.0.1:3449
tcp ESTAB 0 0 127.0.0.1:37036 127.0.0.1:3447
tcp ESTAB 0 0 127.0.0.1:48856 127.0.0.1:3446
After some digging I found the socket type file descriptors that show up in /proc/PID/fd/ also with ss.
$ ss -anpe | grep pid=31339
u_str ESTAB 0 0 * 1068003 * 0 users:(("java",pid=31339,fd=3474),("java",pid=31339,fd=3398),("java",pid=31339,fd=3322),("java",pid=31339,fd=3260),("java",pid=31339,fd=3248),("java",pid=31339,fd=3196),("java",pid=31339,fd=3121),("java",pid=31339,fd=3112),("java",pid=31339,fd=3109),("java",pid=31339,fd=3009),("java",pid=31339,fd=2855),("java",pid=31339,fd=2683),("java",pid=31339,fd=2637),("java",pid=31339,fd=2542),("java",pid=31339,fd=2530),("java",pid=31339,fd=2517),("java",pid=31339,fd=2514),("java",pid=3133,fd=2511),("java",pid=31339,fd=2508),("java",pid=31339,fd=2468),("java",pid=31339,fd=2347),("java",pid=31339,fd=2318),("java",pid=31339,fd=2309),("java",pid=31339,fd=2188),("java",pid=31339,fd=2061),("java",pid=31339,fd=1897),("java",pid=31339,fd=1888),("java",pid=31339,fd=1815),("java",pid=31339,fd=1751),("java",pid=31339,fd=1732),("java",pid=31339,fd=1683),("java",pid=31339,fd=1595),("java",pid=31339,fd=1531),("java",pid=31339,fd=1449),("java",pid=31339,fd=1414),("java",pid=31339,fd=1400),("java",pid=31339,fd=1399),("java",pid=31339,fd=1393),("java",pid=31339,fd=1385),("java",pid=31339,fd=1372),("java",pid=31339,fd=1360),("java",pid=31339,fd=1359),("java",pid=31339,fd=1341),("java",pid=31339,fd=1323),("java",pid=31339,fd=1266),("java",pid=31339,fd=1261),("java",pid=31339,fd=1244),("java",pid=31339,fd=1210),("java",pid=31339,fd=1173),("java",pid=31339,fd=1142),("java",pid=31339,fd=1139),("java",pid=31339,fd=1128),("java",pid=31339,fd=1120),("java",pid=31339,fd=1116),("java",pid=31339,fd=1104),("java",pid=31339,fd=1081),("java",pid=31339,fd=1069),("java",pid=31339,fd=1052),("java",pid=31339,fd=1028),("java",pid=31339,fd=1012),("java",pid=31339,fd=955),("java",pid=31339,fd=954),("java",pid=31339,fd=942),("java",pid=31339,fd=922),("java",pid=31339,fd=915),("java",pid=31339,fd=909),("java",pid=31339,fd=891),("java",pid=31339,fd=874),("java",pid=31339,fd=870),("java",pid=31339,fd=857),("java",pid=31339,fd=827),("java",pid=31339,fd=826),("java",pid=31339,fd=808),("java",pid=31339,fd=805),("java",pid=31339,fd=796),("java",pid=31339,fd=771),("java",pid=31339,fd=728),("java",pid=31339,fd=699),("java",pid=31339,fd=679),("java",pid=31339,fd=672),("java",pid=31339,fd=662),("java",pid=31339,fd=659),("java",pid=31339,fd=650),("java",pid=31339,fd=633),("java",pid=31339,fd=629),("java",pid=31339,fd=616),("java",pid=31339,fd=591),("java",pid=31339,fd=587),("java"pid=31339,fd=583),("java",pid=31339,fd=579),("java",pid=31339,fd=569),("java",pid=31339,fd=563),("java",pid=31339,fd=532),("java",pid=31339,fd=519),("java",pid=31339,fd=513),("java",pid=31339,fd=512),("java",pid=31339,fd=503),("java",pid=31339,fd=495),("java",pid=31339,fd=482),("java",pid=31339,fd=460),("java",pid=31339,fd=455),("java",pid=31339,fd=451),("java",pid=31339,fd=443),("java",pid=31339,fd=439),("java",pid=31339,fd=433),("java",pid=31339,fd=432),("java",pid=31339,fd=428),("java",pid=31339,fd=415),("java",pid=31339,fd=414),("java",pid=31339,fd=410),("java",pid=31339,fd=405),("java",pid=31339,fd=395),("java",pid=31339,fd=394),("java",pid=31339,fd=390),("java",pid=31339,fd=386),("java",pid=31339,fd=383),("java",pid=31339,fd=376),("java",pid=31339,fd=369),("java",pid=31339,fd=362),("java",pid=31339,fd=351),("java",pid=31339,fd=349),("java",pid=31339,fd=343),("java",pid=31339,fd=342),("java",pid=31339,fd=338),("java",pid=31339,fd=334),("java",pid=31339,fd=327),("java",pid=31339,fd=326),("java",pid=31339,fd=322),("java",pid=31339,fd=318),("java",pid=31339,fd=312),("java",pid=31339,fd=306),("java",pid=31339,fd=305),("java",pid=31339,fd=290),("java",pid=31339,fd=286),("java",pid=31339,fd=279),("java",pid=31339,fd=273),("java",pid=31339,fd=271),("java",pid=31339,fd=268),("java",pid=31339,fd=257),("java",pid=31339,fd=254),("java",pid=31339,fd=250),("java",pid=31339,fd=241),("java",pid=31339,fd=239),("java",pid=31339,fd=236),("java",pid=31339,fd=222),("java",pid=31339,fd=221),("java",pid=31339,fd=214),("java",pid=31339,fd=208),("java",pid=31339,fd=206),("java",pid=31339,fd=197),("java",pid=31339,fd=188),("java",pid=31339,fd=184),("java",pid=31339,fd=177),("java",pid=31339,fd=173),("java",pid=31339,fd=170),("java",pid=31339,fd=169),("java",pid=31339,fd=165),("java",pid=31339,fd=161),("java",pid=31339,fd=157),("java",pid=31339,fd=149),("java",pid=31339,fd=145),("java",pid=31339,fd=143),("java",pid=31339,fd=139),("java",pid=31339,fd=118),("java",pid=31339,fd=114),("java",pid=31339,fd=110),("java",pid=31339,fd=106),("java",pid=31339,fd=100),("java",pid=31339,fd=99),("java",pid=31339,fd=95),("java",pid=31339,fd=91),("java",pid=31339,fd=83),("java",pid=31339,fd=79),("java",pid=31339,fd=75),("java",pid=31339,fd=72),("java",pid=31339,fd=67),("java",pid=31339,fd=63),("java",pid=31339,fd=54),("java",pid=31339,fd=50),("java",pid=31339,fd=49),("java",pid=31339,fd=45),("java",pid=31339,fd=41)) ---
They are all open to the same u_str type socket.
Reproduced the issue also on macOS with a more reduced example without any server part.
The file descriptor count starts at 58 and grows up to 455 within the 100 loops. It never decreases during the "monitoring" phase.
open file descriptors: 58
open file descriptors: 62
…
open file descriptors: 455
open file descriptors: 455
The open sockets monitored with netstat during the "monitoring" phase start at 100 entries in state TIME_WAIT and quite quickly decreased to 0 over time.
$ netstat -n -p tcp | grep 88.99.67.18
tcp4 0 0 192.168.1.165.62767 88.99.67.18.80 TIME_WAIT
tcp4 0 0 192.168.1.165.62768 88.99.67.18.80 TIME_WAIT
…
With lsof I can see the following entries becoming more and more. The 123u line is the currently open connection. It changes from IPv6 to unix after the connection is closed and just looks like the 119u line.
$ lsof -p PID
…
java 3844 fgutmann 118u KQUEUE count=0, state=0x8
java 3844 fgutmann 119u unix 0xa94dae0f78700eaf 0t0 ->(none)
java 3844 fgutmann 120 PIPE 0xa94dae0f7c30b03f 16384 ->0xa94dae0f7c30c53f
java 3844 fgutmann 121 PIPE 0xa94dae0f7c30c53f 16384 ->0xa94dae0f7c30b03f
java 3844 fgutmann 122u KQUEUE count=0, state=0x8
java 3844 fgutmann 123u IPv6 0xa94dae0f70f4f6bf 0t0 TCP 192.168.1.165:63108->static.18.67.99.88.clients.your-server.de:http (ESTABLISHED)
java 3844 fgutmann 124 PIPE 0xa94dae0f7c30b6ff 16384 ->0xa94dae0f7c30bcff
java 3844 fgutmann 125 PIPE 0xa94dae0f7c30bcff 16384 ->0xa94dae0f7c30b6ff
java 3844 fgutmann 126u KQUEUE count=1, state=0x8
Here's the full code. Do you see any issues in that code?
Feel free to run it unmodified for testing. The IP used is one of our servers.
import com.sun.management.UnixOperatingSystemMXBean
import io.ktor.network.selector.ActorSelectorManager
import io.ktor.network.sockets.aSocket
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import java.lang.management.ManagementFactory
import java.net.InetSocketAddress
fun main() = runBlocking {
val osMxBean = ManagementFactory.getOperatingSystemMXBean() as UnixOperatingSystemMXBean
val address = InetSocketAddress("88.99.67.18", 80)
launch {
// repeatedly connect to the server
repeat(100) {
val socket = aSocket(ActorSelectorManager(Dispatchers.IO)).tcp().connect(address)
delay(100)
socket.close()
println("open file descriptors: ${osMxBean.openFileDescriptorCount}")
}
}
// monitor for some time to see if things change while the TIME_WAIT sockets go away
repeat(10) {
delay(10_000)
println("open file descriptors: ${osMxBean.openFileDescriptorCount}")
}
}
Found it! The ActorSelectorManager needs to be closed.
I think the documentation here is quite unclear of this fact. In none of the examples in it, it is closed.
I filed a separate issue to fix the documentation. Closed.
Most helpful comment
Found it! The ActorSelectorManager needs to be closed.
I think the documentation here is quite unclear of this fact. In none of the examples in it, it is closed.