Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)
We have a script that periodically calls the Nomad API (we query a Nomad server, not a client) to gather statistics about launched jobs. Our Nomad setup is a federation of 4 regions (in Nomad terminology). In this script we call the Nomad API and, if the response takes too long (timeout), close the connection.
The code below demonstrates what we do (using the Python requests library):
l_allocStatusParams = {"region": _region, "node_id": _allocation['NodeID']}
l_allocStatus = _session.get(NOMAD_ADDR + "/v1/client/allocation/" + _allocation["ID"] + "/stats", params=l_allocStatusParams, headers={"X-Nomad-Token": NOMAD_TOKEN}, timeout=2)
So when a timeout happens, sockets accumulate inside the Nomad process in the CLOSE_WAIT state (our side has closed the connection, but Nomad never closes its end), and they live forever until the Nomad server is restarted.
After some investigation with DEBUG logging enabled, we found the following in the logs:
Aug 20 17:38:02 vol-cl-control-01 systemd[1]: Stopping Nomad agent...
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: ==> Caught signal: interrupt
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.478993 [DEBUG] http: Shutting down http server
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.479026 [INFO] agent: requesting shutdown
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.479030 [INFO] nomad: shutting down server
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02 [WARN] serf: Shutdown without a Leave
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.513798 [ERR] http: Request /v1/client/allocation/a22564b0-bd6c-b303-23f9-75efbbfbe5a1/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.513807 [ERR] http: Request /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.513819 [ERR] http: Request /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.513931 [DEBUG] http: Request GET /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (22m59.270636301s)
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.513940 [DEBUG] http: Request GET /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (10m1.296302508s)
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.513989 [DEBUG] http: Request GET /v1/client/allocation/a22564b0-bd6c-b303-23f9-75efbbfbe5a1/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (9m54.212359036s)
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.513993 [ERR] http: Request /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.514006 [ERR] http: Request /v1/client/allocation/a22564b0-bd6c-b303-23f9-75efbbfbe5a1/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.514011 [ERR] http: Request /v1/client/allocation/2388d8b4-2564-7dc0-42d4-42a5315378c9/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.513990 [ERR] http: Request /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.514104 [DEBUG] http: Request GET /v1/client/allocation/a22564b0-bd6c-b303-23f9-75efbbfbe5a1/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (27m54.332350804s)
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.514124 [DEBUG] http: Request GET /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (23m58.527153171s)
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.514131 [ERR] http: Request /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.514008 [ERR] http: Request /v1/client/allocation/2388d8b4-2564-7dc0-42d4-42a5315378c9/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: 2018/08/20 17:38:02.514150 [DEBUG] http: Request GET /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (25m58.826122779s)
As you can see, only after Nomad begins to stop do the hung HTTP API connections come back to life, and we see huge API request times, more than 10 minutes.
It seems that func (p *ConnPool) RPC(region string, addr net.Addr, version int, method string, args interface{}, reply interface{}) error in helper/pool/pool.go hangs forever when an RPC is made (in particular when it is proxied to another region).
The same happens on client nodes, except the logs are slightly different, because the RPC pool is used on the client (we don't enable debug logs on client nodes, so the request-duration info is not present):
Aug 21 13:35:14 vol-h-docker-04 systemd[1]: Stopping Nomad agent...
Aug 21 13:35:14 vol-h-docker-04 nomad.sh[32669]: ==> Caught signal: interrupt
Aug 21 13:35:14 vol-h-docker-04 nomad.sh[32669]: 2018/08/21 13:35:14.448612 [ERR] nomad: "ClientAllocations.Stats" RPC failed to server 172.16.9.87:4647:
rpc error: EOF
Aug 21 13:35:14 vol-h-docker-04 nomad.sh[32669]: 2018/08/21 13:35:14.448626 [ERR] http: Request /v1/client/allocation/a22564b0-bd6c-b303-23f9-75efbbfbe5a
1/stats?region=aav01&node_id=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 21 13:35:14 vol-h-docker-04 nomad.sh[32669]: 2018/08/21 13:35:14.448856 [ERR] nomad: "Node.GetClientAllocs" RPC failed to server 172.16.9.35:4647: rp
c error: EOF
Aug 21 13:35:14 vol-h-docker-04 nomad.sh[32669]: 2018/08/21 13:35:14.448873 [ERR] nomad: "ClientAllocations.Stats" RPC failed to server 172.16.9.89:4647:
rpc error: EOF
If we make many HTTP requests to the Nomad API that hit the timeout, we see many sockets like the following (lsof -p <nomad pid>):
nomad 23374 root 72u IPv4 2999079758 0t0 TCP 127.0.0.1:4646->127.0.0.1:39826 (CLOSE_WAIT)
In our case it looks like this:

The "Tcp Sockets Allocated" series in the graph above mostly consists of sockets in the CLOSE_WAIT state. Obviously these sockets also consume file handles.
This happens again and again. After attaching the Delve debugger we found this (tons of goroutines):
Goroutine 68321 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:308 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).keepalive (0xbb5aae)
Goroutine 68335 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:207 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).AcceptStream (0xbb5028)
Goroutine 68386 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 68484 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 68503 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 68837 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch_few.go:16 github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.watchFew (0xecb2ba)
Goroutine 68882 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 69415 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 69552 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 69949 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 70141 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 70523 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 70570 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 70693 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 71081 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 71232 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 71352 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 71384 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 71650 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:207 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).AcceptStream (0xbb5028)
Goroutine 71651 - User: /usr/local/go/src/runtime/netpoll.go:173 internal/poll.runtime_pollWait (0x4297c7)
Goroutine 71652 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:392 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).send (0xbb66a0)
Goroutine 71653 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:308 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).keepalive (0xbb5aae)
Goroutine 71654 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch_few.go:16 github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.watchFew (0xecb2ba)
Goroutine 71661 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 71786 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 71835 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 72247 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 72777 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 73032 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 73059 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 73884 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
Goroutine 74056 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
The stack trace for one of them is:
(dlv) bt 30
0 0x000000000042ee2a in runtime.gopark
at /usr/local/go/src/runtime/proc.go:292
1 0x000000000043eb40 in runtime.selectgo
at /usr/local/go/src/runtime/select.go:392
2 0x0000000000bb84d1 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133
3 0x00000000004f93be in bufio.(*Reader).fill
at /usr/local/go/src/bufio/bufio.go:100
4 0x00000000004f9c49 in bufio.(*Reader).ReadByte
at /usr/local/go/src/bufio/bufio.go:242
5 0x00000000007ab4da in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*ioDecReader).readn1
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:90
6 0x00000000007bb425 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).initReadNext
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/msgpack.go:540
7 0x00000000007ae6b9 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).decode
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:635
8 0x00000000007ae63d in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).Decode
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:630
9 0x0000000000d478a7 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).read
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:121
10 0x0000000000d47511 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).ReadResponseHeader
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:74
11 0x0000000000d46b2f in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.CallWithCodec
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/client.go:29
12 0x0000000000d4bcbb in github.com/hashicorp/nomad/helper/pool.(*ConnPool).RPC
at /opt/gopath/src/github.com/hashicorp/nomad/helper/pool/pool.go:422
13 0x0000000000f9a612 in github.com/hashicorp/nomad/nomad.(*Server).forwardRegion
at /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:465
14 0x0000000000f99b7b in github.com/hashicorp/nomad/nomad.(*Server).forward
at /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:361
15 0x0000000000f438fa in github.com/hashicorp/nomad/nomad.(*ClientAllocations).Stats
at /opt/gopath/src/github.com/hashicorp/nomad/nomad/client_alloc_endpoint.go:129
16 0x0000000000459fcb in runtime.call64
at /usr/local/go/src/runtime/asm_amd64.s:574
17 0x00000000004c0e79 in reflect.Value.call
at /usr/local/go/src/reflect/value.go:447
18 0x00000000004c03f4 in reflect.Value.Call
at /usr/local/go/src/reflect/value.go:308
19 0x00000000007a728e in net/rpc.(*service).call
at /usr/local/go/src/net/rpc/server.go:384
20 0x00000000007a8155 in net/rpc.(*Server).ServeRequest
at /usr/local/go/src/net/rpc/server.go:504
21 0x0000000000fa6ae4 in github.com/hashicorp/nomad/nomad.(*Server).RPC
at /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:1394
22 0x00000000014d52ef in github.com/hashicorp/nomad/command/agent.(*HTTPServer).allocStats
at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:213
23 0x00000000014d426c in github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest
at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:94
24 0x000000000150e6d8 in github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest-fm
at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:168
25 0x000000000150b0ac in github.com/hashicorp/nomad/command/agent.(*HTTPServer).wrap.func1
at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:283
26 0x00000000007067b4 in net/http.HandlerFunc.ServeHTTP
at /usr/local/go/src/net/http/server.go:1947
27 0x00000000014ab47e in github.com/hashicorp/nomad/vendor/github.com/rs/cors.(*Cors).Handler.func1
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/rs/cors/cors.go:200
28 0x00000000007067b4 in net/http.HandlerFunc.ServeHTTP
at /usr/local/go/src/net/http/server.go:1947
29 0x0000000000708520 in net/http.(*ServeMux).ServeHTTP
at /usr/local/go/src/net/http/server.go:2337
30 0x00000000014a8769 in github.com/hashicorp/nomad/vendor/github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/NYTimes/gziphandler/gzip.go:269
For now we have added a 10-second timeout to RPC calls to prevent dead stuck connections:
sc.stream.SetDeadline(time.Now().Add(10 * time.Second))
in the func (p *ConnPool) RPC(region string, addr net.Addr, version int, method string, args interface{}, reply interface{}) error method. But this is a very coarse workaround.
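To illustrate why a deadline unblocks the stuck goroutines, here is a minimal, self-contained sketch in plain Go (not Nomad code): without a deadline, a Read on a socket whose peer never answers parks the goroutine forever; with one, the read fails after the timeout and the caller can close the socket instead of leaking it in CLOSE_WAIT.

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// A listener that accepts but never writes, mimicking a hung RPC peer.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		time.Sleep(time.Hour) // never answer
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// The moral equivalent of sc.stream.SetDeadline above (2s here to keep the demo short).
	conn.SetDeadline(time.Now().Add(2 * time.Second))

	buf := make([]byte, 1)
	_, err = conn.Read(buf) // returns an i/o timeout instead of blocking forever
	fmt.Println("read error:", err)
}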
@dadgar @preetapan
This happened again, but now we have more info about it.
In the remote region to which the request is proxied, one of the Nomad servers (ip-172-20-4-194) has a growing goroutine count. And that Nomad server can't handle requests itself, so, for example, nomad node status 0d87a811 hangs on it.

We then attached the debugger to the running process of that server and saw a huge number (as expected) of goroutines (output is truncated):
Goroutine 111246 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
Goroutine 111257 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
Goroutine 111314 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
Goroutine 111315 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
Goroutine 111321 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
Goroutine 111383 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
Goroutine 111384 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
Goroutine 111389 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
Goroutine 111395 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
Goroutine 111995 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
Goroutine 111999 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
Goroutine 112000 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
with the following backtrace on each:
0 0x000000000042ee2a in runtime.gopark
at /usr/local/go/src/runtime/proc.go:292
1 0x000000000043eb40 in runtime.selectgo
at /usr/local/go/src/runtime/select.go:392
2 0x0000000000bdcb4f in github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159
3 0x0000000000bdc9eb in github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).Open
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:142
4 0x0000000000f4d19d in github.com/hashicorp/nomad/nomad.NodeRpc
at /opt/gopath/src/github.com/hashicorp/nomad/nomad/client_rpc.go:220
5 0x0000000000f48128 in github.com/hashicorp/nomad/nomad.(*ClientAllocations).Stats
at /opt/gopath/src/github.com/hashicorp/nomad/nomad/client_alloc_endpoint.go:174
6 0x0000000000459fcb in runtime.call64
at /usr/local/go/src/runtime/asm_amd64.s:574
7 0x00000000004c0e09 in reflect.Value.call
at /usr/local/go/src/reflect/value.go:447
8 0x00000000004c0384 in reflect.Value.Call
at /usr/local/go/src/reflect/value.go:308
9 0x00000000007a6f2e in net/rpc.(*service).call
at /usr/local/go/src/net/rpc/server.go:384
10 0x00000000007a7df5 in net/rpc.(*Server).ServeRequest
at /usr/local/go/src/net/rpc/server.go:504
11 0x0000000000f9cf13 in github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn
at /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:241
So when we apply timeouts to RPCs, we remove only part of the problem (the huge number of goroutines disappears from the server that originates the request; every minute we collect stats from all allocations across all the regions that we use).
We monitor allocations with the following Python function (provided only as context, to better understand what exactly we are doing):
def getAlocationResourceUsage(_session, _region, _stats, _allocation):
    l_allocStatusParams = {"region": _region, "node_id": _allocation['NodeID']}
    l_allocStatus = _session.get(NOMAD_ADDR + "/v1/client/allocation/" + _allocation["ID"] + "/stats", params=l_allocStatusParams, headers={"X-Nomad-Token": NOMAD_TOKEN}, timeout=1)
    l_resource = l_allocStatus.json()["ResourceUsage"]

    _stats.timing("mem.Cache", l_resource["MemoryStats"]["Cache"])
    _stats.timing("mem.RSS", l_resource["MemoryStats"]["RSS"])
    _stats.timing("mem.MaxUsage", l_resource["MemoryStats"]["MaxUsage"])

    _stats.timing("cpu.Percent", l_resource["CpuStats"]["Percent"])
    _stats.timing("cpu.ThrottledPeriods", l_resource["CpuStats"]["ThrottledPeriods"])
    _stats.timing("cpu.TotalTicks", l_resource["CpuStats"]["TotalTicks"])
I must also say that this happens only when we query allocations of one particular client, not all of them. And when we query allocations of that client through other Nomad servers in the same region, we don't see any issues.
Noting this looks closely related to #6620 so I've assigned this to myself.
@notnoop @tgross
After some time we patched yamux a little (https://github.com/hashicorp/yamux/pull/79), and in the end we think the problem is not in yamux itself (yes, it has minor bugs, but they can't lead to this issue).
Debugging the affected server nodes showed us that the yamux keepalive loop was working perfectly (so there is no reason to suspect any data-flow degradation), but we also saw that, despite this, creating a yamux stream did not go well: the synCh channel (https://github.com/hashicorp/yamux/blob/master/session.go#L102) grew constantly until it was completely full. So I concluded that the other side doesn't call Accept (https://github.com/hashicorp/yamux/blob/master/session.go#L201), and I think the reason is the following lines of code:
https://github.com/hashicorp/nomad/blob/master/helper/pool/pool.go#L277-L281
if p.connListener != nil {
	select {
	case p.connListener <- c.session:
	default:
	}
}
If p.connListener is full, the new session will not be reported to https://github.com/hashicorp/nomad/blob/master/client/rpc.go#L242-L262:
// rpcConnListener is a long lived function that listens for new connections
// being made on the connection pool and starts an RPC listener for each
// connection.
func (c *Client) rpcConnListener() {
	// Make a channel for new connections.
	conns := make(chan *yamux.Session, 4) // <<<<<<<<<<<<<< Buffer here is too small
	c.connPool.SetConnListener(conns)

	for {
		select {
		case <-c.shutdownCh:
			return
		case session, ok := <-conns:
			if !ok {
				continue
			}

			go c.listenConn(session)
		}
	}
}
and we can get the behavior described here (because the buffer of the conns channel is very easy to overfill). For now we have increased this buffer 10x (to 40) and continue to watch (in our case this issue happens about once a week). Additionally, we added a small hack to the keepalive loop in the yamux library to check for overflow of the synCh channel; if that happens we simply close the parent Session object: https://github.com/hashicorp/yamux/pull/79/files#diff-f2fccfb96064097b73fbe8eeae3703eeR389-R407
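As a minimal, self-contained sketch of the suspected drop (plain Go, not Nomad code): a buffered channel fed through select/default silently discards values once the consumer falls behind, which is how new yamux sessions could be lost before ever reaching rpcConnListener.

package main

import "fmt"

func main() {
	conns := make(chan int, 4) // the same small buffer size as in client/rpc.go

	// Ten "sessions" arrive while the consumer is not draining the channel.
	dropped := 0
	for s := 1; s <= 10; s++ {
		select {
		case conns <- s:
		default:
			dropped++ // this session would never be handled
		}
	}
	fmt.Printf("queued=%d dropped=%d\n", len(conns), dropped) // queued=4 dropped=6
}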
In 1-2 weeks I will write here whether our changes led to a positive result.
@tantra35 Thank you for digging into it and following up with the report. I mistakenly thought this was addressed by the leadership flapping issue! Sorry! I'll dig into this one a bit more now.
As for the channel buffer being 4 - we can increase the limit. Though I would have expected a small buffer to be sufficient here - the loop should dequeue and start goroutines very fast, hopefully faster than the server issues RPC calls.
Also, I may have noticed a question about RpcMultiplexV2 - we are switching to RpcMultiplexV2 in 0.11.0 for server-to-server RPC communication. We may have missed a spot in the switch - let us know if you see something odd.
Also, I suspect the loop is buggy - it currently goes into a very tight loop when conns gets closed - it should return instead:
case session, ok := <-conns:
	if !ok {
		return
	}

	go c.listenConn(session)
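A tiny self-contained check of that claim (plain Go, not Nomad code): a receive from a closed channel is always immediately ready with ok == false, so with continue the loop never blocks again and just burns CPU; with return it would exit after the first closed receive.

package main

import "fmt"

func main() {
	conns := make(chan int)
	close(conns)

	spins := 0
	for spins < 1_000_000 { // bound added only so the demo terminates
		_, ok := <-conns // a closed channel is always ready: ok == false immediately
		if !ok {
			spins++
			continue // with `continue` the original loop keeps spinning like this
		}
	}
	fmt.Println("spun", spins, "times without blocking once")
}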
@notnoop Increasing the buffer doesn't help (so our conclusion was wrong; in any case a warning should be emitted in the loop that consumes p.connListener), but we found that the real hang happens between servers, not in client-server communication:
+------------------Allocations.Stats-----------------------+ +------------------------+
| | Proxied RPC | | | |
| server1 |+------------>| server2 | | server3 |
| | ^ | | | |
+-------------------+ | +-----------------------+ +------------------------+
| |
| |
| |
| |
| |
| |
| Allocations.Stats
| RPC
| |
| |
+ | +----------------------+
here hung | | |
| | |
+-------->| nomad client node |
| |
+----------------------+
Also, can you clarify the following code in the NodeRpc method (https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/client_rpc.go#L226-L230)?
// Write the RpcNomad byte to set the mode
if _, err := stream.Write([]byte{byte(pool.RpcNomad)}); err != nil {
	stream.Close()
	return err
}
This implies that communication happens in RpcMultiplexV2 mode, but in versions before https://github.com/hashicorp/nomad/commit/7144efe0c503540b1a18905c572c92c573fe58ce the communication protocol in ConnPool is RpcMultiplex (https://github.com/hashicorp/nomad/blob/v0.10.4/helper/pool/pool.go#L334-L338):
// Write the multiplex byte to set the mode
if _, err := conn.Write([]byte{byte(RpcMultiplex)}); err != nil {
	conn.Close()
	return nil, err
}
How does this actually work in that case?
That's a very good question, and it does look suspicious - when working on RpcMultiplexV2, I also witnessed a hang when I didn't get the initial bytes correct. I'll test and update you today!
So I dug in, and this might be a red herring. NodeRpc [1] opens a new session to the Nomad client on top of an existing connection. No connection pooling or helper/pool is involved.
On the Nomad client, the RPC handler expects every new yamux session to start with RpcNomad or RpcStreaming. The conns/streams handling in there is a bit confusing.
I have added a more explicit test in https://github.com/hashicorp/nomad/commit/a421a841b09369608c3f6afc5bb9ec1f7c2227ef and confirmed that it passes: https://circleci.com/gh/hashicorp/nomad/46421.
One thing I'm noticing is that our handling for RPC errors misses closing the connection in some failures. I wonder if we ought to port the equivalent of https://github.com/hashicorp/nomad/pull/7045 here.
Can you see if there is a correlation between the client.rpc.request_error or client.streaming_rpc.request_error metrics and the problem?
[1] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/client_rpc.go#L218-L230
[2] https://github.com/hashicorp/nomad/blob/v0.10.4/client/rpc.go#L245-L309
Hm, I'm a bit confused and perhaps missing something, but I thought that session.Open() opens a new Stream inside the yamux TCP connection (Session), which is managed by helper/pool, and that this session is handled in [1] on the other side. For now this looks like yamux inside yamux, which is madness, looks like overhead, and is absolutely unclear.
So the real stream handling happens in [2], and the passing test (to be honest, I don't fully understand what it does) looks confusing to me. In any case I won't argue with you, and I must dig further (it's just unclear where).
Also, we aren't collecting client.rpc... metrics right now, so we can't trace correlations.
[1] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/rpc.go#L236-L330
[2] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/rpc.go#L251, https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/rpc.go#L335-L374
It's indeed pretty confusing, and I could be wrong, but I love the additional data points and hypotheses you are raising.
I think clients are special and we should document it. Clients use helper/pool to start a connection to a server, to heartbeat with RpcNomad [1]. The server stores the underlying connection when the client connects [2], then the server uses yamux's bidirectional session/stream creation support to open a session to the client [3], which is handled by the functions referenced from rpcConnListener [4]. This server->client path doesn't go through the pool's magic-byte handling the same way as server<->server. On the server->client path, the server opens sessions manually and always needs to write the RpcNomad or RpcStreaming byte, without writing an RpcMultiplex or RpcMultiplexV2 byte.
I'm suspicious of the handling when the underlying connection goes bad. I'll dig into that path further.
[1] https://github.com/hashicorp/nomad/blob/v0.10.4/client/rpc.go#L49-L73
[2] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/node_endpoint.go#L379
[3] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/client_rpc.go#L30-L53
[4] https://github.com/hashicorp/nomad/blob/v0.10.4/client/rpc.go#L245-L262
Also, we aren't collecting client.rpc... metrics right now, so we can't trace correlations
What about logs? "error performing RPC" or "error performing streaming RPC", "streaming RPC error", or "RPC error" are the significant log messages
Nothing interesting in the logs,

but this hang, as I wrote earlier, is correlated with a leader switch, and it seems the hung connections appear on the previous leader. For example, today it happened again:

As you can see, the leader switched from 172.27.3.86 to 172.27.0.212, and after that, hung connections in the CLOSE_WAIT state began to appear on server node 172.27.3.86 (we request allocation stats only from server nodes):
root@ip-172-27-3-86:/home/ruslan# ss -nt | grep 'CLOSE-WAIT'
CLOSE-WAIT 0 0 ::ffff:172.27.3.86:4646 ::ffff:172.27.0.133:48214
CLOSE-WAIT 0 0 ::ffff:172.27.3.86:4646 ::ffff:172.27.0.133:47316
CLOSE-WAIT 0 0 ::ffff:172.27.3.86:4646 ::ffff:172.27.0.133:44368
CLOSE-WAIT 0 0 ::ffff:172.27.3.86:4646 ::ffff:172.27.0.133:35578
CLOSE-WAIT 0 0 ::ffff:172.27.3.86:4646 ::ffff:172.27.0.133:55764
CLOSE-WAIT 0 0 ::ffff:172.27.3.86:4646 ::ffff:172.27.0.133:38524
CLOSE-WAIT 0 0 ::ffff:172.27.3.86:4646 ::ffff:172.27.0.133:43874
CLOSE-WAIT 0 0 ::ffff:172.27.3.86:4646 ::ffff:172.27.0.133:45620
CLOSE-WAIT 0 0 ::ffff:172.27.3.86:4646 ::ffff:172.27.0.133:49618
CLOSE-WAIT 0 0 ::ffff:172.27.3.86:4646 ::ffff:172.27.0.133:34646
CLOSE-WAIT 0 0 ::ffff:172.27.3.86:4646 ::ffff:172.27.0.133:47366
If we debug this server, we discover that all the hangs are in NodeRpc, and it looks like Accept is not called on the yamux session on the other side, because we see synCh growing inside yamux.
We have the following backtrace of a hanging goroutine:
[235 goroutines]
(dlv) goroutine 7622384
Switched from 0 to 7622384 (thread 11251)
(dlv) bt 30
0 0x000000000043304b in runtime.gopark
at /usr/local/go/src/runtime/proc.go:303
1 0x0000000000443316 in runtime.selectgo
at /usr/local/go/src/runtime/select.go:313
2 0x0000000000856877 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).read
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:141
3 0x00000000008565ce in github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:88
4 0x000000000051fcef in bufio.(*Reader).fill
at /usr/local/go/src/bufio/bufio.go:100
5 0x0000000000520589 in bufio.(*Reader).ReadByte
at /usr/local/go/src/bufio/bufio.go:242
6 0x0000000000ae7b0a in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*ioDecReader).readn1
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:90
7 0x0000000000af8ef8 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).initReadNext
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/msgpack.go:540
8 0x0000000000aeb3bf in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).decode
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:635
9 0x0000000000aeb32d in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).Decode
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:630
10 0x0000000001291065 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).read
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:121
11 0x0000000001290cd1 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).ReadResponseHeader
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:74
12 0x00000000012902fa in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.CallWithCodec
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/client.go:29
13 0x0000000001463aa5 in github.com/hashicorp/nomad/nomad.NodeRpc
at /opt/gopath/src/github.com/hashicorp/nomad/nomad/client_rpc.go:231
14 0x000000000145d8ad in github.com/hashicorp/nomad/nomad.(*ClientAllocations).Stats
at /opt/gopath/src/github.com/hashicorp/nomad/nomad/client_alloc_endpoint.go:272
15 0x000000000045f9eb in runtime.call64
at /usr/local/go/src/runtime/asm_amd64.s:523
16 0x00000000004c8224 in reflect.Value.call
at /usr/local/go/src/reflect/value.go:447
17 0x00000000004c7cb4 in reflect.Value.Call
at /usr/local/go/src/reflect/value.go:308
18 0x000000000087c6be in net/rpc.(*service).call
at /usr/local/go/src/net/rpc/server.go:384
19 0x000000000087d5c2 in net/rpc.(*Server).ServeRequest
at /usr/local/go/src/net/rpc/server.go:505
20 0x00000000014d1269 in github.com/hashicorp/nomad/nomad.(*Server).RPC
at /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:1426
21 0x00000000019d405b in github.com/hashicorp/nomad/command/agent.(*HTTPServer).allocStats
at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:355
22 0x00000000019d1fb0 in github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest
at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:147
23 0x0000000001a11f68 in github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest-fm
at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:177
24 0x0000000001a0e28c in github.com/hashicorp/nomad/command/agent.(*HTTPServer).wrap.func1
at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:298
25 0x0000000000759014 in net/http.HandlerFunc.ServeHTTP
at /usr/local/go/src/net/http/server.go:1964
26 0x00000000019b6672 in github.com/hashicorp/nomad/vendor/github.com/rs/cors.(*Cors).Handler.func1
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/rs/cors/cors.go:200
27 0x0000000000759014 in net/http.HandlerFunc.ServeHTTP
at /usr/local/go/src/net/http/server.go:1964
28 0x000000000075adc7 in net/http.(*ServeMux).ServeHTTP
at /usr/local/go/src/net/http/server.go:2361
29 0x00000000019b3950 in github.com/hashicorp/nomad/vendor/github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1
at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/NYTimes/gziphandler/gzip.go:269
30 0x0000000000759014 in net/http.HandlerFunc.ServeHTTP
at /usr/local/go/src/net/http/server.go:1964
OK, let's examine frame 13, the NodeRpc function:
(dlv) frame 13
(dlv) print stream.session.synCh
chan struct {} {
qcount: 69,
dataqsiz: 256,
buf: *[256]struct struct {} [
.....
As you can see, synCh holds 69 items, and this number only grows.
OK, let's examine the session inside the stream and look at the connection info:
(dlv) print stream.session.conn.fd
*net.netFD {
pfd: internal/poll.FD {
fdmu: (*"internal/poll.fdMutex")(0xc000dbe580),
Sysfd: 29,
pd: (*"internal/poll.pollDesc")(0xc000dbe598),
iovecs: *[]syscall.Iovec nil,
csema: 0,
isBlocking: 0,
IsStream: true,
ZeroReadIsEOF: true,
isFile: false,},
family: 10,
sotype: 1,
isConnected: false,
net: "tcp",
laddr: net.Addr(*net.TCPAddr) *{
IP: net.IP len: 16, cap: 16, [0,0,0,0,0,0,0,0,0,0,255,255,172,27,3,86],
Port: 4647,
Zone: "",},
raddr: net.Addr(*net.TCPAddr) *{
IP: net.IP len: 16, cap: 16, [0,0,0,0,0,0,0,0,0,0,255,255,172,27,0,212],
Port: 37268,
Zone: "",},}
As you can see, the connection is between servers 172.27.0.212 and 172.27.3.86.
We also found the following: when we replace https://github.com/hashicorp/nomad/blob/v0.10.4/helper/pool/pool.go#L271-L277
with the following lines of code (exactly equivalent, but without the if):
select {
case p.connListener <- c.session:
default:
	p.logger.Printf("[WARN] Skip connListener")
}
We constantly see the warning message in the logs, nearly every 2 minutes:

And this looks like nonsense, but when the code looks like this:
select {
case p.connListener <- c.session:
default:
	if p.connListener != nil {
		p.logger.Printf("[WARN] Skip connListener")
	}
}
there are no warning messages in the logs. So I conclude that p.connListener is sometimes nil, and this is confirmed by the fact that the suspended connections look as if Accept is not called on the yamux Session on the other side.
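The Go semantics behind that conclusion, as a tiny self-contained check: a send on a nil channel is never ready inside a select, so the default branch fires. That is why the first variant logs the warning whenever p.connListener is nil, while the second variant (with the nil check inside default) stays silent.

package main

import "fmt"

func main() {
	var connListener chan int // nil, as when no listener is registered

	select {
	case connListener <- 1:
		fmt.Println("sent") // never taken: a send on a nil channel is never ready
	default:
		fmt.Println("default taken: channel is nil (or full)")
	}
}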
Interesting. I find it very odd that NodeRPC is used to connect to another server! If a server doesn't have a direct connection to that client, it should go through the findNodeConnAndForward path. Do you happen to have Nomad servers that also run as Nomad clients?
Also, thanks for the p.connListener tip - I'll investigate this too.
We have fully separate Nomad servers and clients, so servers function only as servers.
OK - I suspect that's the bug: if NodeRPC is used to contact a server, the server indeed isn't set up to handle that bidirectionally and will hang.
Looking at https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/node_endpoint.go#L373-L383, I suspect the following events took place:
a NodeRPC queries S2 and hangs forever. This race should happen rarely, but considering the frequency of your leadership elections and of your alloc stats calls, I suspect you are more likely to hit the problem.
Does that seem plausible to you?
Sounds plausible, but it doesn't explain why synCh grows inside the yamux session; if I understand correctly, that is only possible when the other side of the yamux Session doesn't call Accept.
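As a self-contained illustration of that mechanism (plain yamux over net.Pipe, not Nomad code; the backlog is shrunk to 2 so the effect shows immediately, error handling elided): when the peer never calls Accept, no ACK ever comes back, the opener's synCh fills up, and OpenStream blocks exactly like the goroutines parked in Session.OpenStream in the backtraces above.

package main

import (
	"fmt"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func main() {
	clientConn, serverConn := net.Pipe()

	cfg := yamux.DefaultConfig()
	cfg.AcceptBacklog = 2 // default is 256, matching the dataqsiz seen in delve
	cfg.EnableKeepAlive = false

	opener, _ := yamux.Client(clientConn, cfg)
	_, _ = yamux.Server(serverConn, cfg) // never calls Accept, like the hung peer

	for i := 1; i <= 3; i++ {
		done := make(chan struct{})
		go func(n int) {
			_, err := opener.OpenStream()
			fmt.Printf("OpenStream #%d returned (err=%v)\n", n, err)
			close(done)
		}(i)

		select {
		case <-done:
		case <-time.After(2 * time.Second):
			fmt.Printf("OpenStream #%d is blocked: synCh is full, no ACKs received\n", i)
		}
	}
}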
And about p.connListener: I think that warning happens on the server side (https://github.com/hashicorp/nomad/blob/master/nomad/server.go#L309), and this is normal due to https://github.com/hashicorp/nomad/blob/master/nomad/server.go#L413.
synCh grows inside the yamux session; if I understand correctly, that is only possible when the other side of the yamux Session doesn't call Accept
Servers don't call Accept on the Yamux session - only nomad clients do IIUC. Nomad servers only accept the raw connection - so that would make sense.
I will still do a bit of digging today and post my findings as well as steps to reproduce them.
@notnoop Seems this is a full victory: for the last 12 days we haven't seen any problems with stats calls (no timeouts or anything else), and no leaked goroutines. Thanks