Flux: Flux leaks file descriptors and runs out of file descriptors

Created on 8 Jan 2019 · 10 Comments · Source: fluxcd/flux

We had the weave-flux-agent running for a few days and noticed that it had stopped syncing to Weave Cloud.

There was an error message as follows in the logs:

ts=2019-01-08T12:59:51.36780816Z caller=upstream.go:113 component=upstream err="executing websocket wss://cloud.weave.works./api/flux/v10/daemon: dial tcp: lookup cloud.weave.works. on 172.20.0.10:53: dial udp 172.20.0.10:53: socket: too many open files"

We configured our AWS EKS AL2 Nodes to have the following ulimits:

 "default-ulimits": {
   "nofile": {
     "Name": "nofile",
     "Soft": 2048,
     "Hard": 8192
   }
 }
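
To see which limits a process actually ends up with, a small Go program can query RLIMIT_NOFILE from inside the container. This is only a sketch: nofileLimits is a hypothetical helper, not part of Flux, and the settings above are assumed to be the Docker daemon's default-ulimits. Note that, as far as I know, Go 1.19 and later raise the soft limit to the hard limit at startup, so the soft value reported here can differ from what `ulimit -n` shows in a shell.

```go
package main

import (
	"fmt"
	"syscall"
)

// nofileLimits returns the soft and hard limits on open file descriptors
// for the current process. Hypothetical helper for illustration only.
func nofileLimits() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	// The field types differ between platforms, so convert explicitly.
	return uint64(rl.Cur), uint64(rl.Max), nil
}

func main() {
	soft, hard, err := nofileLimits()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("nofile soft=%d hard=%d\n", soft, hard)
}
```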

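On our node, the fluxd process (PID 20190) had 2048 open file descriptors, which matches the soft limit (ls -l prints a leading "total" line, so wc reports 2049 lines):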

sudo ls -l /proc/20190/fd/ | wc
  2049   22530  131528

Output from lsof:

COMMAND   PID USER   FD      TYPE DEVICE SIZE/OFF     NODE NAME
fluxd   20190 root  cwd       DIR  0,347     4096  3539637 /home/flux
fluxd   20190 root  rtd       DIR  0,347     4096  3539897 /
fluxd   20190 root  txt       REG  0,347 42909823  3539827 /usr/local/bin/fluxd
fluxd   20190 root  mem       REG 202,80           3539827 /usr/local/bin/fluxd (stat: No such file or directory)
fluxd   20190 root    0u     sock    0,8      0t0 38846617 protocol: TCP
fluxd   20190 root    1w     FIFO   0,11      0t0    91816 pipe
fluxd   20190 root    2w     FIFO   0,11      0t0    91817 pipe
fluxd   20190 root    3u     sock    0,8      0t0    91967 protocol: TCP
fluxd   20190 root    4u  a_inode   0,12        0     7747 [eventpoll]
fluxd   20190 root    5u     sock    0,8      0t0   133739 protocol: TCPv6
fluxd   20190 root    6u     sock    0,8      0t0  3685001 protocol: TCP
fluxd   20190 root    7u     sock    0,8      0t0   178126 protocol: TCP
fluxd   20190 root    8u     sock    0,8      0t0   575783 protocol: TCP
fluxd   20190 root    9u     sock    0,8      0t0   178174 protocol: TCP
fluxd   20190 root   10u     sock    0,8      0t0   178549 protocol: TCP
fluxd   20190 root   11u     sock    0,8      0t0   179416 protocol: TCP
fluxd   20190 root   12u     sock    0,8      0t0  2386384 protocol: TCP
fluxd   20190 root   13u     sock    0,8      0t0   218841 protocol: TCPv6
fluxd   20190 root   14u     sock    0,8      0t0  1333669 protocol: TCP
fluxd   20190 root   15u     sock    0,8      0t0   181075 protocol: TCP
.....
fluxd   20190 root 2042u     sock    0,8      0t0 38742366 protocol: TCP
fluxd   20190 root 2043u     sock    0,8      0t0 38894759 protocol: TCP
fluxd   20190 root 2044u     sock    0,8      0t0 38764757 protocol: TCP
fluxd   20190 root 2045u     sock    0,8      0t0 38782203 protocol: TCP
fluxd   20190 root 2046u     sock    0,8      0t0 38793189 protocol: TCP
fluxd   20190 root 2047u     sock    0,8      0t0 39311101 protocol: TCP

We connect to Bitbucket repos.

I had to delete the pod to get Flux going again and unblock our pipeline.

Possibly related to: http://github.com/weaveworks/flux/issues/1602

bug

All 10 comments

Netstat showed no connected sockets.

@agcooke thanks for reporting this, we will look into it.

Netstat showed no connected sockets.

It would have been nice to see more details about the sockets, though.

@agcooke have you managed to reproduce it? what version of flux was the pod running?

Duplicate of #1602 ?

@2opremio I do not think so. I was away for some weeks, but we did see it happen again. I will see if I can find logs for that.

I've had another report of this and have some logs I might be able to share to shed some light on it.

@foot Can you DM me those logs? Ta

In my case the problem was caused by unreachable registries.
Since I don't use "Automated deployment of new container images", I added the "- --registry-exclude-image=*" option and the unclosed socket problem was solved.

So, the problem was caused by registries not being reachable and (probably) the registry client leaking sockets (probably HTTP response bodies). Has anyone seen this problem recently?

@indrekh Would you be so kind as to re-test this? (Assuming you are still using Flux.)

@squaremo do you recall what happened with this?
