We had the weave-flux-agent running for a few days and noticed that it had stopped syncing to Weave Cloud.
The logs contained the following error message:
ts=2019-01-08T12:59:51.36780816Z caller=upstream.go:113 component=upstream err="executing websocket wss://cloud.weave.works./api/flux/v10/daemon: dial tcp: lookup cloud.weave.works. on 172.20.0.10:53: dial udp 172.20.0.10:53: socket: too many open files"
We configured our AWS EKS AL2 Nodes to have the following ulimits:
"default-ulimits": {
  "nofile": {
    "Name": "nofile",
    "Soft": 2048,
    "Hard": 8192
  }
}
On our node we had the following output:
sudo ls -l /proc/20190/fd/ | wc
2049 22530 131528
Output from lsof:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
fluxd 20190 root cwd DIR 0,347 4096 3539637 /home/flux
fluxd 20190 root rtd DIR 0,347 4096 3539897 /
fluxd 20190 root txt REG 0,347 42909823 3539827 /usr/local/bin/fluxd
fluxd 20190 root mem REG 202,80 3539827 /usr/local/bin/fluxd (stat: No such file or directory)
fluxd 20190 root 0u sock 0,8 0t0 38846617 protocol: TCP
fluxd 20190 root 1w FIFO 0,11 0t0 91816 pipe
fluxd 20190 root 2w FIFO 0,11 0t0 91817 pipe
fluxd 20190 root 3u sock 0,8 0t0 91967 protocol: TCP
fluxd 20190 root 4u a_inode 0,12 0 7747 [eventpoll]
fluxd 20190 root 5u sock 0,8 0t0 133739 protocol: TCPv6
fluxd 20190 root 6u sock 0,8 0t0 3685001 protocol: TCP
fluxd 20190 root 7u sock 0,8 0t0 178126 protocol: TCP
fluxd 20190 root 8u sock 0,8 0t0 575783 protocol: TCP
fluxd 20190 root 9u sock 0,8 0t0 178174 protocol: TCP
fluxd 20190 root 10u sock 0,8 0t0 178549 protocol: TCP
fluxd 20190 root 11u sock 0,8 0t0 179416 protocol: TCP
fluxd 20190 root 12u sock 0,8 0t0 2386384 protocol: TCP
fluxd 20190 root 13u sock 0,8 0t0 218841 protocol: TCPv6
fluxd 20190 root 14u sock 0,8 0t0 1333669 protocol: TCP
fluxd 20190 root 15u sock 0,8 0t0 181075 protocol: TCP
.....
fluxd 20190 root 2042u sock 0,8 0t0 38742366 protocol: TCP
fluxd 20190 root 2043u sock 0,8 0t0 38894759 protocol: TCP
fluxd 20190 root 2044u sock 0,8 0t0 38764757 protocol: TCP
fluxd 20190 root 2045u sock 0,8 0t0 38782203 protocol: TCP
fluxd 20190 root 2046u sock 0,8 0t0 38793189 protocol: TCP
fluxd 20190 root 2047u sock 0,8 0t0 39311101 protocol: TCP
We connect to Bitbucket repos.
I had to delete the pod to get Flux going again and unblock our pipeline.
Possibly related to: http://github.com/weaveworks/flux/issues/1602
Netstat showed no connected sockets.
@agcooke thanks for reporting this, we will look into it.
> Netstat showed no connected sockets.
It would have been nice to see more details about the sockets though
@agcooke have you managed to reproduce it? what version of flux was the pod running?
Duplicate of #1602 ?
@2opremio I do not think so. I was away for some weeks, but we did see it happen again. I will see if I can find logs for that.
I've had another report of this and have some logs I might be able to share to shed some light on it.
@foot Can you DM me those logs? Ta
In my case the problem was caused by unreachable registries.
Since I don't use "Automated deployment of new container images", I added the "- --registry-exclude-image=*" option and the unclosed socket problem went away.
So, the problem was caused by registries not being reachable and (probably) the registry client leaking sockets (probably HTTP response bodies). Has anyone seen this problem recently?
@indrekh Would you be so kind as to re-test this? (Assuming you are still using Flux.)
@squaremo do you recall what happened with this?