Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source): v0.25.0
ABCI app (name for built-in, URL for self-written if it's publicly available): https://github.com/MinterTeam/minter-go-node
Environment:
What happened: After some time under load, nodes stop responding. http://localhost:26657/status and some other RPC endpoints become unavailable, with a huge timeout (more than 60 seconds). It seems like ConsensusState's (or ConsensusReactor's) mutex is deadlocked. Restarting the node solves the problem.
The strange thing is that this bug happens after a block is committed and before a new block has even started (there is no BeginBlock call to the application).
What you expected to happen: Tendermint should be working normally.
Have you tried the latest version: yes
How to reproduce it (as minimally and precisely as possible): Download the Minter node, launch and synchronize it. Then send some transactions (50 txs in a block will be sufficient).
Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file): There are no errors or warnings in the logs, except that the node starts to lose connections to other nodes after consensus stops.
Config (you can paste only the changes you've made): Default config
node command runtime flags: -
/dump_consensus_state output for consensus bugs: dump_consensus_state is unavailable due to the timeout :(
Anything else we need to know: Tendermint is running in in-process mode. There are Local RPC calls to Tendermint. The bug happens only under load. I debugged our BeginBlock, EndBlock, Commit and DeliverTx implementations; they finish normally just before the bug, so it happens somewhere else. Also, we used v0.23.0 in our testnet before and everything was fine.
Is this reproducible on a single node or just with a network?
@ebuchman, I'm trying to reproduce the bug on a single node, but it seems it only occurs in a network.
Here are the last logs before consensus deadlocked: https://pastebin.com/3n7xMwFe
I have not restarted the node since the last freeze, so you can see that many RPC endpoints are not accessible: http://minter-node-2.testnet.minter.network:26657/
Would be great to get the output of kill -6 <PID> once a node deadlocks.
@melekes sure: https://pastebin.com/sb4fb9sV
After a restart the node is alive; you can see that all RPC methods work normally.
Additional information:
Compiled on MacOS 10.14 (18A391) using go version go1.10.4 darwin/amd64
Tendermint version v0.26.0-rc0, but the same bug was present on v0.25.0 as well
Cool! Thank you
Raw notes:
goroutine 636358 [select, 122 minutes]:
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p.(*MultiplexTransport).acceptPeers.func1(0xc4200ad860, 0x1082fa0, 0xc424d07bf0)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p/transport.go:266 +0x15e
created by github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p.(*MultiplexTransport).acceptPeers
A lot of goroutines like ^^^ are blocked on writing to mt.acceptc. Why? Corresponding code: https://github.com/tendermint/tendermint/blob/fb91ef7462b421349a56c32733724d920fce3ad4/p2p/transport.go#L250-L274
goroutine 67 [semacquire, 227 minutes]:
sync.runtime_SemacquireMutex(0xc42016c76c, 0x0)
/usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc42016c768)
/usr/local/go/src/sync/mutex.go:134 +0x108
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p/pex.(*addrBook).MarkGood(0xc42016c700, 0xc4208ae180)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p/pex/addrbook.go:293 +0x3a
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p.(*Switch).MarkPeerAsGood(0xc4201df7a0, 0x1089b40, 0xc422d281a0)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p/switch.go:370 +0x7d
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusReactor).peerStatsRoutine(0xc4210c9580)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/reactor.go:829 +0x38f
created by github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusReactor).OnStart
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/reactor.go:76 +0x109
The consensus peerStatsRoutine is blocked on marking a peer as good.
goroutine 631520 [semacquire, 224 minutes]:
sync.runtime_SemacquireMutex(0xc420168694, 0x42a100)
/usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc420168690)
/usr/local/go/src/sync/mutex.go:134 +0x108
sync.(*RWMutex).Lock(0xc420168690)
/usr/local/go/src/sync/rwmutex.go:93 +0x2d
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).GetLastHeight(0xc420168600, 0x0)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:210 +0x46
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/core.Status(0x0, 0xc423717868, 0x42a194)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/core/status.go:73 +0x555
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/client.Local.Status(0xc4200950e0, 0x10, 0xc427154090, 0x0)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/client/localclient.go:52 +0x22
github.com/MinterTeam/minter-go-node/api.Status(0x107c340, 0xc420228c40, 0xc4200bf100)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/api/status.go:23 +0x45
github.com/MinterTeam/minter-go-node/api.wrapper.func1(0x107c340, 0xc420228c40, 0xc4200bf100)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/api/api.go:105 +0xd7
net/http.HandlerFunc.ServeHTTP(0xc42024aa10, 0x107c340, 0xc420228c40, 0xc4200bf100)
/usr/local/go/src/net/http/server.go:1947 +0x44
github.com/MinterTeam/minter-go-node/api.RateLimit.func2.1(0x107c340, 0xc420228c40, 0xc4200bf100)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/api/api.go:170 +0x326
net/http.HandlerFunc.ServeHTTP(0xc42414e140, 0x107c340, 0xc420228c40, 0xc4200bf100)
/usr/local/go/src/net/http/server.go:1947 +0x44
github.com/MinterTeam/minter-go-node/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc4213b6310, 0x107c340, 0xc420228c40, 0xc4200bf100)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/gorilla/mux/mux.go:162 +0xed
github.com/MinterTeam/minter-go-node/vendor/github.com/rs/cors.(*Cors).Handler.func1(0x107c340, 0xc420228c40, 0xc4200bea00)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/rs/cors/cors.go:200 +0xfe
net/http.HandlerFunc.ServeHTTP(0xc4232aea40, 0x107c340, 0xc420228c40, 0xc4200bea00)
/usr/local/go/src/net/http/server.go:1947 +0x44
net/http.serverHandler.ServeHTTP(0xc4232e0340, 0x107c340, 0xc420228c40, 0xc4200bea00)
/usr/local/go/src/net/http/server.go:2697 +0xbc
net/http.(*conn).serve(0xc426424000, 0x107d2c0, 0xc42414e000)
/usr/local/go/src/net/http/server.go:1830 +0x651
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2798 +0x27b
Two /status HTTP requests are blocked on the consensus state mutex.
A lot of goroutines like ^^^ are blocked on writing to mt.acceptc. Why? Corresponding code: https://github.com/tendermint/tendermint/blob/fb91ef7462b421349a56c32733724d920fce3ad4/p2p/transport.go#L250-L274
It looks like the acceptRoutine may have exited, since it doesn't appear in the stack dump. That would explain why these routines are piling up, since there'd be nothing reading from them. Not sure why it exited, we'd need to see more logs. But we probably need to be smarter about bubbling up an error and/or restarting the acceptRoutine, rather than just exiting it. This wouldn't have been a problem in 0.23, as the p2p transport refactor came in 0.25, though I'm not sure exactly how/why this would lead to the consensus deadlock. @xla any thoughts here?
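To make the failure mode concrete, here is a standalone toy example (plain Go, not the actual transport code; acceptc is just a stand-in for mt.acceptc): once the single reader of an unbuffered channel exits, every later sender blocks forever, which is exactly the kind of pile-up visible in the dump.

package main

import (
    "fmt"
    "time"
)

func main() {
    acceptc := make(chan int) // stand-in for the unbuffered mt.acceptc

    // The single reader exits early, e.g. after an unexpected error,
    // playing the role of an acceptRoutine that silently returned.
    go func() {
        fmt.Println("reader: got", <-acceptc)
        // the reader returns here; nothing ever reads acceptc again
    }()

    // Per-connection goroutines keep being spawned and keep trying to send.
    for i := 0; i < 3; i++ {
        go func(i int) {
            acceptc <- i // only one send can ever succeed; the rest block forever
            fmt.Println("writer", i, "delivered")
        }(i)
    }

    time.Sleep(500 * time.Millisecond)
    fmt.Println("the remaining writers are now stuck on the send (goroutine leak)")
}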
The consensus peerStatsRoutine is blocked on marking a peer as good.
This sounds concerning. Suggests there might be a deadlock in the addrBook somewhere but so far I can't find it, and I don't see anything in the stack trace that it would be contending with. Also not seeing the consensus state blocked trying to send something to the peerStatsRoutine. Could it be we just caught it at the moment it was grabbing this mutex, but it isn't actually blocked there? Seems unlikely but I'm not seeing what it's contending with.
It looks like the acceptRoutine may have exited, since it doesn't appear in the stack dump. That would explain why these routines are piling up, since there'd be nothing reading from them. Not sure why it exited, we'd need to see more logs.
Yeah, I had the same thoughts.
Could it be we just caught it at the moment it was grabbing this mutex, but it isn't actually blocked there?
Could be
So, the number one thing is to find out why the acceptRoutine exited (silently).
Had the same issue again today.
Here are the new logs: https://pastebin.com/asUda6gn
Maybe they will help.
[54310]: E[11-02|11:59:39.851] Connection failed @ sendRoutine module=p2p peer=0xb78f00 conn=MConn{74.207.236.148:26656} err="pong timeout"
TODO: fix peer formatting (output its address instead of the pointer)
[54310]: goroutine 51800 [semacquire, 24 minutes]:
[54310]: sync.runtime_Semacquire(0xc42016869c)
[54310]: /usr/local/go/src/runtime/sema.go:56 +0x39
[54310]: sync.(*RWMutex).RLock(0xc420168690)
[54310]: /usr/local/go/src/sync/rwmutex.go:50 +0x49
[54310]: github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).GetRoundState(0xc420168600, 0x0)
[54310]: /Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:218 +0x46
[54310]: github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusReactor).gossipVotesRoutine(0xc420151c00, 0x1089b40, 0xc427522000, 0xc4275220d0)
[54310]: /Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/reactor.go:608 +0x15a
[54310]: created by github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusReactor).AddPeer
[54310]: /Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/reactor.go:171 +0x205
The consensus reactor can't add a peer because something is holding the consensus state mutex.
Also, --log_level="*:debug" logs could help us find the culprit.
@melekes, here, just before the crash: https://pastebin.com/dSPsg6Qj
"interface conversion: interface {} is nil, not *consensus.PeerState" - that's something.
Note that the last block (at height 46025) was processed at 14:05:26, but this error occurred only at 14:07:07. So before the interface conversion error the node had already missed 18 blocks (1.5 min).
Yep, I noticed that. Nevertheless, it should not happen.
@danil-lashin started some work here
@danil-lashin Tried to run a Minter node, but got:
I[11-05|14:56:22.312] ABCI Handshake App Info module=consensus height=0 hash=0000000000000000000000000000000000000000 software-version= protocol-version=0
I[11-05|14:56:22.312] ABCI Replay Blocks module=consensus appHeight=0 storeHeight=0 stateHeight=0
I[11-05|14:56:22.316] Completed ABCI Handshake - Tendermint and App are synced module=consensus appHeight=0 appHash=0000000000000000000000000000000000000000
I[11-05|14:56:22.318] This node is not a validator module=consensus addr=088F3A9B2D8CB88EB8D532D2EC3DABE0F71BD0B7 pubKey=PubKeyEd25519{EA089868813887495D5A0E18B0E35FC2E67E0CEBF5FA77FE4DBE9725D052CAC6}
I[11-05|14:56:22.354] Starting ConsensusReactor module=consensus impl=ConsensusReactor
I[11-05|14:56:22.354] ConsensusReactor module=consensus fastSync=true
E[11-05|14:56:22.594] Dialing failed module=p2p [email protected]:26656 err="dial tcp 74.207.236.148:26656: connect: connection refused" attempts=0
E[11-05|14:56:52.565] Dialing failed module=p2p [email protected]:26656 err="dial tcp 74.207.236.148:26656: connect: connection refused" attempts=1
^Ccaptured interrupt, exiting...
I[11-05|14:56:55.552] Stopping ConsensusReactor module=consensus impl=ConsensusReactor
E[11-05|14:56:55.553] Not stopping ConsensusState -- have not been started yet module=consensus impl=ConsensusState
E[11-05|14:56:55.554] Stopped accept routine, as transport is closed module=p2p numPeers=0
root@f26985dca600:/go/src/github.com/MinterTeam/minter-go-node# ping 74.207.236.148
PING 74.207.236.148 (74.207.236.148) 56(84) bytes of data.
64 bytes from 74.207.236.148: icmp_seq=1 ttl=37 time=163 ms
64 bytes from 74.207.236.148: icmp_seq=2 ttl=37 time=165 ms
^C
--- 74.207.236.148 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1004ms
rtt min/avg/max/mdev = 163.473/164.708/165.943/1.235 ms
Also, created https://github.com/MinterTeam/minter-go-node/issues/128
@melekes, Hi! Thanks for the issue.
Seed nodes were down due to this bug. I've restarted them so you can try again: https://minter-node-1.testnet.minter.network:3000/
Also you can use this addrbook.json to sync node: https://gist.github.com/danil-lashin/56ca5a04ddfd4634c1f7bed6e425b63f
Just put it in ~/.minter/config/addrbook.json and start the node.
How do I send transactions? Do you have a script or something else so I can create "50 transactions"?
@melekes, I will prepare such a script.
Btw, now my nodes are freezing even without transaction load. At least 4-5 times an hour. I think this bug happens under all kinds of load, not only transaction execution. Maybe it correlates with some consensus timeout issues?
My node is syncing right now. We'll see.
H: 117092. No issues so far.
I've started sending transactions. My local nodes went down. How is yours?
Looking at that last log:
14:05:55 node-2: D[11-02|14:05:55.512] Flush module=p2p peer=0xb79150 conn=MConn{74.207.236.148:26656}
14:06:22 node-2: D[11-02|14:06:22.619] Send Ping module=p2p peer=0xb79150
14:06:22 node-2: D[11-02|14:06:22.620] Starting pong timer module=p2p peer=0xb79150 dur=45s
14:06:22 node-2: D[11-02|14:06:22.620] Flush module=p2p peer=0xb79150 conn=MConn{74.207.236.148:26656}
14:07:07 node-2: D[11-02|14:07:07.621] Pong timeout module=p2p peer=0xb79150
14:07:07 node-2: E[11-02|14:07:07.621] Connection failed @ sendRoutine module=p2p peer=0xb79150 conn=MConn{74.207.236.148:26656} err="pong timeout"
14:07:07 node-2: I[11-02|14:07:07.621] Stopping MConnection module=p2p peer=0xb79150 impl=MConn{74.207.236.148:26656}
14:07:07 node-2: E[11-02|14:07:07.621] Stopping peer for error module=p2p peer="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}" err="pong timeout"
14:07:07 node-2: I[11-02|14:07:07.621] Stopping Peer module=p2p peer=0xb79150 impl="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}"
14:07:07 node-2: D[11-02|14:07:07.621] Stopping MConnection (ignoring: already stopped) module=p2p peer=0xb79150 impl=MConn{74.207.236.148:26656}
14:07:07 node-2: I[11-02|14:07:07.622] Reconnecting to peer module=p2p [email protected]:26656
14:07:07 node-2: I[11-02|14:07:07.622] Dialing peer module=p2p [email protected]:26656
14:07:07 node-2: I[11-02|14:07:07.887] Starting Peer module=p2p peer=0xb79150 impl="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}"
14:07:07 node-2: I[11-02|14:07:07.888] Starting MConnection module=p2p peer=0xb79150 impl=MConn{74.207.236.148:26656}
14:07:07 node-2: D[11-02|14:07:07.989] Read PacketMsg module=p2p peer=0xb79150 conn=MConn{74.207.236.148:26656} packet="PacketMsg{40:5A433AB908DBE702 T:1}"
14:07:07 node-2: D[11-02|14:07:07.989] Received bytes module=p2p peer=0xb79150 chID=64 msgBytes=5A433AB908DBE702
14:07:07 node-2: D[11-02|14:07:07.990] Receive module=blockchain src="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}" chID=64 msg="[bcStatusResponseMessage 46043]"
14:07:07 node-2: D[11-02|14:07:07.990] Read PacketMsg module=p2p peer=0xb79150 conn=MConn{74.207.236.148:26656} packet="PacketMsg{20:C96A6FA808DCE7021804 T:1}"
14:07:07 node-2: D[11-02|14:07:07.990] Received bytes module=p2p peer=0xb79150 chID=32 msgBytes=C96A6FA808DCE7021804
14:07:07 node-2: D[11-02|14:07:07.990] Receive module=consensus src="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}" chId=32 msg="[NewRoundStep H:46044 R:0 S:RoundStepPrevote LCR:0]"
14:07:07 node-2: I[11-02|14:07:07.990] Stopping MConnection module=p2p peer=0xb79150 impl=MConn{74.207.236.148:26656}
14:07:07 node-2: E[11-02|14:07:07.990] Stopping peer for error module=p2p peer="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}" err="interface conversion: interface {} is nil, not *consensus.PeerState"
14:07:07 node-2: I[11-02|14:07:07.990] Stopping Peer module=p2p peer=0xb79150 impl="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}"
So we don't receive anything from the peer, so we ping it, then time out waiting for the pong, so we disconnect. Then we try to reconnect, succeed, and then encounter the "interface conversion: interface {} is nil, not *consensus.PeerState" issue.
Can you give us more info on the network connectivity? Wondering why it's not receiving anything from the peer and then timing out waiting for the pong in the first place. Maybe there are corresponding logs from the peer it's talking to that would give us a clue? Also, anything special about the config being used here?
The pong timeout seems to be a recurring issue here. Could it be pointing to some kind of networking issue? Where are these nodes running?
@ebuchman, there were 50+ nodes across the globe, in different data centers, connected to this particular one. They are all failing, and the logs are identical on all machines, so I don't think the issue is caused by bad network conditions. But if I can give you some specific information, let me know.
Another interesting observation: when a node freezes, I can terminate it only with SIGKILL.
Ok, noted. A couple of clarifications about things you said:
Note that the last block (at height 46025) was processed at 14:05:26, but this error occurred only at 14:07:07. So before the interface conversion error the node had already missed 18 blocks (1.5 min).
So this means one of the nodes froze, but other nodes kept making blocks? Is it possible to see logs from one of those nodes that was still making blocks and talking to the one that froze?
Btw, now my nodes are freezing even without transaction load. At least 4-5 times an hour.
What does it mean that a node freezes multiple times an hour? Do they unfreeze at some point and then re-freeze, every ~10-15 minutes?
Can we see a config file?
So this means one of the nodes froze, but other nodes kept making blocks?
Right, there are nodes that continue making blocks. Nodes are freezing independently.
Is it possible to see logs from one of those nodes that was still making blocks and talking to the one that froze?
I will try to collect them.
What does it mean that a node freezes multiple times an hour? Do they unfreeze at some point and then re-freeze, every ~10-15 minutes?
It means that nodes run for only about 10 minutes and then freeze. They do not unfreeze without being terminated with SIGKILL. I've managed to build a "restart script", so I can see that the issue occurs very frequently.
Can we see a config file?
Thanks! We'll keep investigating
My node is still running (H: 120345)
@melekes, can you give me the IP of your node? I will send some transactions directly to it to bring it down more quickly :)
178.128.196.40
@melekes, done. You will see some panic errors, but that's just because of a tx decode issue. If you want to get rid of it you can compile Minter from the dev branch.
Nevertheless, Tendermint is deadlocked (http://178.128.196.40:26657/status) and I suppose that should not be happening.
Damn, I forgot to set up log redirection. Could you provide the script you're using? Thank you!
Here: https://github.com/danil-lashin/minter-mass-check-redeemer
You can use it like this:
go run main.go ea90fc03739cf8fe86c3b05d296c002d37a19303a3f3425a6384da6cbbd21686 {N} 178.128.196.40
N is the nonce. Start from 0 and increment it on each run.
Two panics. Probably not related:
2018/11/07 15:55:39 http: panic serving 23.108.213.28:54436: runtime error: invalid memory address or nil pointer dereference
goroutine 112956 [running]:
net/http.(*conn).serve.func1(0xc000ff6960)
/usr/local/go/src/net/http/server.go:1746 +0xd0
panic(0xdab900, 0x18606e0)
/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/MinterTeam/minter-go-node/core/state.(*StateDB).getStateValidators(0x0, 0x1f92a)
/root/go/src/github.com/MinterTeam/minter-go-node/core/state/statedb.go:364 +0x26
github.com/MinterTeam/minter-go-node/core/state.(*StateDB).GetStateValidators(0x0, 0x0)
/root/go/src/github.com/MinterTeam/minter-go-node/core/state/statedb.go:358 +0x2b
github.com/MinterTeam/minter-go-node/api.GetValidators(0x1103220, 0xc0051e88c0, 0xc00c746b00)
/root/go/src/github.com/MinterTeam/minter-go-node/api/validators.go:74 +0x9f
github.com/MinterTeam/minter-go-node/api.wrapper.func1(0x1103220, 0xc0051e88c0, 0xc00c746b00)
/root/go/src/github.com/MinterTeam/minter-go-node/api/api.go:105 +0xd7
net/http.HandlerFunc.ServeHTTP(0xc0001de840, 0x1103220, 0xc0051e88c0, 0xc00c746b00)
/usr/local/go/src/net/http/server.go:1964 +0x44
github.com/MinterTeam/minter-go-node/api.RateLimit.func2.1(0x1103220, 0xc0051e88c0, 0xc00c746b00)
/root/go/src/github.com/MinterTeam/minter-go-node/api/api.go:170 +0x31a
net/http.HandlerFunc.ServeHTTP(0xc006486580, 0x1103220, 0xc0051e88c0, 0xc00c746b00)
/usr/local/go/src/net/http/server.go:1964 +0x44
github.com/MinterTeam/minter-go-node/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000c41ea0, 0x1103220, 0xc0051e88c0, 0xc00c746b00)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/gorilla/mux/mux.go:162 +0xf1
github.com/MinterTeam/minter-go-node/vendor/github.com/rs/cors.(*Cors).Handler.func1(0x1103220, 0xc0051e88c0, 0xc00c746500)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/rs/cors/cors.go:200 +0xf2
net/http.HandlerFunc.ServeHTTP(0xc002a69640, 0x1103220, 0xc0051e88c0, 0xc00c746500)
/usr/local/go/src/net/http/server.go:1964 +0x44
net/http.serverHandler.ServeHTTP(0xc000219ad0, 0x1103220, 0xc0051e88c0, 0xc00c746500)
/usr/local/go/src/net/http/server.go:2741 +0xab
net/http.(*conn).serve(0xc000ff6960, 0x11041a0, 0xc0064864c0)
/usr/local/go/src/net/http/server.go:1847 +0x646
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2851 +0x2f5
E[11-07|15:55:25.011] Panic in RPC HTTP handler module=rpc-server err="runtime error: index out of range" stack="goroutine 111044 [running]:\nruntime/debug.Stack(0xc007812658, 0xdab900, 0x1860720)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0xa7\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server.RecoverAndLogHandler.func1.1(0xc0035efa80, 0x1104e20, 0xc002d8a760, 0xbef0e09b40837f90, 0xb7c7026903, 0x1878280, 0xc001a16200)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server/http_server.go:174 +0x50c\npanic(0xdab900, 0x1860720)\n\t/usr/local/go/src/runtime/panic.go:513 +0x1b9\ngithub.com/MinterTeam/minter-go-node/crypto.SigToPub(0xc0078128d8, 0x20, 0x20, 0xc007a26b40, 0x40, 0x40, 0x40, 0x0, 0x40)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/crypto/signature_nocgo.go:43 +0x124\ngithub.com/MinterTeam/minter-go-node/crypto.Ecrecover(0xc0078128d8, 0x20, 0x20, 0xc007a26b40, 0x40, 0x40, 0xc0078128f8, 0x597d70, 0xc0018c26f0, 0x1, ...)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/crypto/signature_nocgo.go:31 +0x6a\ngithub.com/MinterTeam/minter-go-node/core/check.(*Check).LockPubKey(0xc006e1df40, 0xc00663ec60, 0x0, 0xc007812a01, 0x3, 0x20)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/core/check/check.go:38 +0x9c\ngithub.com/MinterTeam/minter-go-node/core/transaction.RedeemCheckData.Run(0xc001038280, 0x9a, 0x9a, 0xcadcf736a6ef4312, 0xf68fcbc45d847a42, 0xb3845dfe542d3bf7, 0xbc7a1a695a2578d4, 0xe97640cdfa9a3228, 0x605abfcc37865ca9, 0x66986f56ab23afde, ...)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/core/transaction/redeem_check.go:92 +0x514\ngithub.com/MinterTeam/minter-go-node/core/transaction.RunTx(0xc005029400, 0xa4506ceb90beff01, 0xc00af33400, 0x13e, 0x280, 0x0, 0x1f927, 0x0, 0x0, 0x0, ...)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/core/transaction/executor.go:128 +0x34e\ngithub.com/MinterTeam/minter-go-node/core/minter.(*Blockchain).CheckTx(0xc000126080, 0xc00af33400, 0x13e, 0x280, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/core/minter/minter.go:289 +0xb3\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/abci/client.(*localClient).CheckTxAsync(0xc0001fb0e0, 0xc00af33400, 0x13e, 0x280, 0x0)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/abci/client/local_client.go:86 +0x88\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/proxy.(*appConnMempool).CheckTxAsync(0xc00002f720, 0xc00af33400, 0x13e, 0x280, 0x1)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/proxy/app_conn.go:114 +0x51\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/mempool.(*Mempool).CheckTx(0xc002d3a320, 0xc00af33400, 0x13e, 0x280, 0xc0072352c0, 0x0, 0x0)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/mempool/mempool.go:316 +0x17b\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/core.BroadcastTxSync(0xc00af33400, 0x13e, 0x280, 0x0, 0x0, 0x0)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/core/mempool.go:93 +0xb8\nreflect.Value.call(0xd85560, 0x10326c0, 0x13, 0xec7b8b, 0x4, 0xc0035efb00, 0x1, 0x1, 0xc0035efb00, 0xc0035efb08, ...)\n\t/usr/local/go/src/reflect/value.go:447 +0x449\nreflect.Value.Call(0xd85560, 0x10326c0, 0x13, 0xc0035efb00, 0x1, 0x1, 0x0, 
0x0, 0xc005cc8dc4)\n\t/usr/local/go/src/reflect/value.go:308 +0xa4\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server.makeHTTPHandler.func2(0x1102060, 0xc0035efa80, 0xc001a16200)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server/handlers.go:269 +0x188\nnet/http.HandlerFunc.ServeHTTP(0xc002c81f20, 0x1102060, 0xc0035efa80, 0xc001a16200)\n\t/usr/local/go/src/net/http/server.go:1964 +0x44\nnet/http.(*ServeMux).ServeHTTP(0xc002c81b60, 0x1102060, 0xc0035efa80, 0xc001a16200)\n\t/usr/local/go/src/net/http/server.go:2361 +0x127\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server.maxBytesHandler.ServeHTTP(0x10f8a40, 0xc002c81b60, 0xf4240, 0x1102060, 0xc0035efa80, 0xc001a16200)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server/http_server.go:219 +0xcf\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server.RecoverAndLogHandler.func1(0x1103220, 0xc00121e540, 0xc001a16200)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server/http_server.go:192 +0x394\nnet/http.HandlerFunc.ServeHTTP(0xc002c06ea0, 0x1103220, 0xc00121e540, 0xc001a16200)\n\t/usr/local/go/src/net/http/server.go:1964 +0x44\nnet/http.serverHandler.ServeHTTP(0xc001a1aa90, 0x1103220, 0xc00121e540, 0xc001a16200)\n\t/usr/local/go/src/net/http/server.go:2741 +0xab\nnet/http.(*conn).serve(0xc00785a3c0, 0x11041a0, 0xc000f844c0)\n\t/usr/local/go/src/net/http/server.go:1847 +0x646\ncreated by net/http.(*Server).Serve\n\t/usr/local/go/src/net/http/server.go:2851 +0x2f5\n"
A lot of "Send failed" messages:
E[11-07|15:55:16.383] Send failed module=p2p peer=0xbb4f60 channel=32 conn=MConn{103.29.68.156:26656} msgBytes=1919B3D508A5F2071802200F
The first one is in our API; it's definitely not related. The second one is in the CheckTx method (fixed in the dev branch).
Btw, maybe this issue is related to #846?
OK, I might have fixed the issue in https://github.com/tendermint/tendermint/pull/2748/commits/f69df639f6551ff93d74d10d7ad124f47b7472c9.
@danil-lashin feel free to try it out (branch: 2721-unresponsive-tm)
@melekes, the problem is still there :(
New logs: https://pastebin.com/eKkNRJHJ
Was worth trying anyway, thank you!
I think this is what is happening: the application panics for a particular transaction when executing CheckTx. The panic bubbles up all the way until we recover in the HTTP handler. Because we don't use defer in our localClient, its mutex is never released -> the consensus state is locked -> consensus is locked.
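For illustration, here is a stripped-down sketch of that pattern (a toy stand-in, not the real localClient code): when the unlock is written after the application call instead of being deferred, a panic that gets recovered higher up leaves the mutex held, and every subsequent caller blocks on it.

package main

import (
    "fmt"
    "sync"
)

// Toy stand-in for the ABCI local client: one mutex guarding calls into the app.
type localClient struct {
    mtx sync.Mutex
    app func() // stand-in for the application's CheckTx
}

// The problematic pattern: Unlock is written after the call instead of deferred,
// so it never runs if the app panics.
func (c *localClient) checkTxNoDefer() {
    c.mtx.Lock()
    c.app()
    c.mtx.Unlock()
}

// The fix described above: defer the Unlock so it runs even on panic.
func (c *localClient) checkTxWithDefer() {
    c.mtx.Lock()
    defer c.mtx.Unlock()
    c.app()
}

func main() {
    c := &localClient{app: func() { panic("bad tx") }}

    func() {
        defer func() { recover() }() // plays the role of the RPC handler's recover
        c.checkTxNoDefer()
    }()

    // The mutex is still held here; any other caller (consensus committing a
    // block, another RPC request, ...) would now block on it forever.
    fmt.Println("mutex free after recovered panic (no defer):", c.mtx.TryLock()) // prints false; TryLock needs Go 1.18+
}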
I can't reproduce the original issue with https://github.com/tendermint/tendermint/pull/2748/commits/1b036bb895e2e5a6112d541466f871f5cc9e4162 anymore.
Should we do the same for every method? It's certainly needed for methods exposed to the user (API).
@melekes, I still have the same issue.
A few notes:
😥
Are you using BroadcastTxCommit to send transactions? I've fixed one more possible issue https://github.com/tendermint/tendermint/pull/2748/commits/611268aaf5e92b46ffd7f0aaff878f846404976e
Are you using BroadcastTxCommit to send transactions?
Yes, we are. I tried this fix, but it does not help.
Maybe I can give you some more debug information? Or give you direct access to a server with a failing node?
I am rushing a bit, as always; the previous fix was incomplete. https://github.com/tendermint/tendermint/pull/2748/commits/f4f310903bea82c28e7320c42456922199f6c7d1 should work. How do I upgrade the TM version? I've tried pointing dep to my branch, but then the other nodes are on the old version, so everything breaks.
Looks like https://github.com/tendermint/tendermint/commit/f4f310903bea82c28e7320c42456922199f6c7d1 solved the issue! My node is running and seems stable! 🎉
Also, I asked our community to update their nodes. I will monitor until morning and let you know for sure.
How do I upgrade TM version?
It's not easy. I did not expect that there would be breaking changes after 0.26.0-rc0. For now I manually apply the patch to Tendermint locally; I just don't want to create a fork of the repo.
Fantastic 👍 Will be waiting for the update.
@melekes, the node is much more stable, but the error is still there. The good news is that now we know what causes it. I sent a lot of transactions through broadcastTxCommit and saw these errors:
E[11-08|21:53:04.034] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.035] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.035] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.036] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.038] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.038] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.039] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.040] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.040] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.043] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.043] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.046] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.048] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
There are no errors when sending through broadcastTxSync/broadcastTxAsync. I think we are on the right track!
Good news! The deal with /broadcast_commit is that it uses pubsub to subscribe for every transaction result, and there's an implicit limit on how many subscribers & events pubsub can handle. Note it's better to use the sync/async methods if you don't need 100% confirmation. We can think about how to improve things, but I'd prefer to create a separate issue for that.
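To illustrate where the "failed to subscribe to tx: context deadline exceeded" errors come from, here is a toy model (not the real tmlibs pubsub): every /broadcast_tx_commit call subscribes for its tx result through one bounded command queue, and once that queue is saturated the subscribe call hits its context timeout.

package main

import (
    "context"
    "fmt"
    "time"
)

// Toy pubsub server: subscriptions go through a single bounded command channel.
type server struct{ cmds chan string }

func (s *server) Subscribe(ctx context.Context, query string) error {
    select {
    case s.cmds <- "subscribe: " + query:
        return nil
    case <-ctx.Done():
        return fmt.Errorf("failed to subscribe to tx: %w", ctx.Err())
    }
}

func main() {
    // Capacity 2 and nothing draining the queue, standing in for an overloaded node.
    s := &server{cmds: make(chan string, 2)}

    for i := 0; i < 5; i++ {
        ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
        err := s.Subscribe(ctx, fmt.Sprintf("tx.hash='%d'", i))
        cancel()
        fmt.Printf("tx %d: %v\n", i, err)
    }
    // The first two subscriptions succeed; the rest time out with
    // "failed to subscribe to tx: context deadline exceeded".
}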
And I'll ask you to confirm that there's no "freeze" anymore. Thank you for the swift answers!
@melekes, thank you! 🎉
For now everything looks great, except when using /broadcast_commit for several thousand txs. In that case Tendermint "freezes".
So I think we just will not use /broadcast_commit. Btw, do you have a specific value of how many simultaneous txs it can handle?
For now everything looks great, except when using /broadcast_commit for several thousand txs. In that case Tendermint "freezes".
Wait, so if you push too many txs, TM stops producing blocks? I'll try to replicate locally and see what I can do.
Btw, do you have a specific value of how many simultaneous txs it can handle?
This depends largely on how long we wait for a tx to get into a block, which depends on the block size, transaction propagation time, and the number of validators. There's no universal number, but we can certainly do some benchmarks here.
Wait, so if you push too many txs, TM stops producing blocks? I'll try to replicate locally and see what I can do.
Right, but only when using /broadcast_commit. Errors like this appear and TM does not produce blocks anymore:
Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
And /status is not available, as usual.
Also I've caught several panics like:
panic: Peer Peer{MConn{185.162.92.124:26656} dbfdcd7f4ae82d5d77b64b3fee59a264087b1beb out} has no state
Can you share a stack trace? with the logs if possible
Also I've caught several panics like:
panic: Peer Peer{MConn{185.162.92.124:26656} dbfdcd7f4ae82d5d77b64b3fee59a264087b1beb out} has no state
Fixed in the last commit https://github.com/tendermint/tendermint/pull/2748/commits/223b069038862967b6dcd53e5fc509f47bd006b2
Thanks, applied! 👍 No more panics for now.
The last thing to solve here: in /broadcast_commit we use pubsub to wait for the transaction result. When we have too many subscribers (say 10000), time to propagate a msg to the subscriber starts to grow. And more importantly, time to publish a msg starts to grow (because pubsub is a queue and publishing is synchronous) => consensus slows down! This will be addressed in pubsub 2.0 https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-033-pubsub.md.
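A rough illustration of that effect (again a toy, not the real event bus): if a publish synchronously delivers the event to every subscriber, the time spent per publish grows with the number of subscribers, and since the publisher here is the consensus code, block production slows down with it.

package main

import (
    "fmt"
    "time"
)

func main() {
    for _, n := range []int{100, 1000, 10000} {
        // One buffered channel per subscriber, as a stand-in for per-query channels.
        subs := make([]chan string, n)
        for i := range subs {
            subs[i] = make(chan string, 1)
        }

        start := time.Now()
        // Synchronous publish: the caller (think: consensus finalizing a block)
        // does not return until every subscriber channel has been written to.
        for _, ch := range subs {
            ch <- "EventTx"
        }
        fmt.Printf("%5d subscribers: publish took %v\n", n, time.Since(start))
    }
}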
Great, thank you again!
For now we will switch to BroadcastTxSync and wait for the updates.
Have you tried my latest commits? Specifically https://github.com/tendermint/tendermint/pull/2748/commits/8db23b097da52d6ac3e7fe0e83ab19a7ea418ed5. You should see a lot of "Timed out waiting ..." errors. I was not able to reproduce the "freezing" part, but I think my theory above is correct.
Yep, they are appearing:
{
  "jsonrpc": "2.0",
  "id": "",
  "error": {
    "code": -32603,
    "message": "Internal error",
    "data": "Timed out waiting for tx to be included in a block"
  }
}
And it seems to me that the problem is gone; I can't reproduce it anymore.
Maybe now I need to send many more txs to hit this problem.
@danil-lashin do you need us to maintain a patch with this based on the v0.26.0-rc0, or are you ok to upgrade to the v0.26 series proper?
Sorry again for making a breaking change after tagging the release candidate - won't happen again.
@ebuchman, it's totally fine for us to switch to v0.26. We plan to launch a new testnet next week, so we will upgrade to v0.26.1 in the new release of our project. Thank you!
Amazing.
I posted a summary of my understanding in https://github.com/tendermint/tendermint/pull/2748#pullrequestreview-173704860
Does that seem correct? And can we close this issue once that PR is merged? If there are any other unresolved problems here, we should open new issues for them.
Thanks for all your help on this one.
https://github.com/tendermint/tendermint/pull/2748#pullrequestreview-173704860 is totally correct.
I think that we all agree that we can close this issue.
Anyway, we still have the issue with BroadcastTxCommit (which you mentioned in the 3rd bullet of the PR review), but I think it is better to open a new issue for that.
Thank you again!
The PR with the fix/feature was merged into develop. It will be shipped with the 0.26.2 release (check the changelog).
@melekes, I caught this bug again in v0.27.0 :(
The node freezes and /status is unresponsive. Here are the logs: https://pastebin.com/PkzgSWPj
😥 There are no panics / errors in the log. Could you attach the output of kill -6 <PID> (goroutine dump)?
Sure, I'm trying to reproduce it now!
We might have a deadlock somewhere in the p2p/addrbook.go.
Are you sure you're running 0.27.0? Because in 0.27.0 there's no such error E[11126-12-11|17:45:59.786] Stopping peer for error module=p2p peer="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}" err="Peer Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out} has no state", only panic.
Are you sure you're running 0.27.0?
Yes, I am. Here's version info:
https://minter-node-2.testnet.minter.network:8841/status
"version": "0.27.0"
BTW, the frozen node was running in seed mode. I have switched that off now, so I can check whether it is reproducible in normal mode.
Also, since you've only pasted part of the log, I want to ask: Are there any panics / errors other than "Stopping peer for ... pong timeout" in the log?
Ooh, good to know! There are a number of issues with seeds currently: https://github.com/tendermint/tendermint/issues/2955, https://github.com/tendermint/tendermint/issues/2967, https://github.com/tendermint/tendermint/issues/2821
There are some interesting errors reported by our community. Don't know if they are related.
Dec 12 14:08:03 sentry4 minter[4352]: E[12126-12-12|14:08:03.017] Stopping peer for error module=p2p peer="Peer{MConn{144.76.140.208:14796} 83a56da94cd26aec30d3323e8bcae6ead18cf83d in}" err="Read overflow, maxSize is 1047 but this amino binary object is 6589 bytes."
Dec 12 13:48:30 sentry4 minter[4352]: E[12126-12-12|13:48:30.191] Error attempting to add vote module=consensus err="Expected 35373/1/2, but got 35373/0/2: Unexpected step"
After scanning pex/addrbook.go and pex/file.go, I did not find anything obvious (I even dove into the netaddr String() implementation to see if it would panic, and looked for places where addr or knownAddr might be nil), but it turns out I may have missed something.
Based on the latest pastebin by Danil, I noticed that TrapSignal was called 373 minutes ago, so likely the node started then. I also noticed that the address book was already blocking some goroutines since 373 minutes ago, so I suspected that it might have to do with adding local addresses. So I took a look at AddOurAddress() one more time. Looking at addrBook.AddOurAddress(nodeInfo.NetAddress()) in node.go line 452, it looks like nodeInfo.NetAddress() may return a nil address if lookup of a host fails.
// in p2p/node_info.go:
func (info DefaultNodeInfo) NetAddress() *NetAddress {
    idAddr := IDAddressString(info.ID(), info.ListenAddr)
    netAddr, err := NewNetAddressString(idAddr)
    if err != nil {
        switch err.(type) {
        case ErrNetAddressLookup:
            // XXX If the peer provided a host name and the lookup fails here
            // we're out of luck.
            // TODO: use a NetAddress in DefaultNodeInfo
        default:
            panic(err) // everything should be well formed by now
        }
    }
    return netAddr
}
ErrNetAddressLookup gets returned from p2p/netaddress.go:
ip := net.ParseIP(host)
if ip == nil {
    ips, err := net.LookupIP(host)
    if err != nil {
        return nil, ErrNetAddressLookup{host, err}
    }
    ip = ips[0]
}
And while NodeInfo.ValidateBasic() does validate the listen address, it only gets called upon handshake w/ the remote peer's NodeInfo in p2p/transport. The local node's info.ListenAddr gets set from the config, and never gets validated AFAIK.
// node/node.go
...
    lAddr := config.P2P.ExternalAddress
    if lAddr == "" {
        lAddr = config.P2P.ListenAddress
    }
    nodeInfo.ListenAddr = lAddr
    return nodeInfo
}
We should be validating the NodeInfo there before returning, and panicking if it fails validation.
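As a sketch of what that could look like (illustrative only; validateListenAddr is a made-up helper here, and the real fix belongs in Tendermint's node setup): resolve the configured listen address up front and fail fast, instead of letting NetAddress() silently return nil later.

package main

import (
    "fmt"
    "net"
    "strings"
)

// validateListenAddr resolves the configured listen address up front so a bad
// hostname is caught at startup rather than producing a nil NetAddress later.
func validateListenAddr(addr string) error {
    host := addr
    if i := strings.LastIndex(host, ":"); i >= 0 {
        host = host[:i] // strip the port, if any
    }
    host = strings.TrimPrefix(host, "tcp://")
    if net.ParseIP(host) != nil {
        return nil // a literal IP needs no lookup
    }
    if _, err := net.LookupIP(host); err != nil {
        return fmt.Errorf("cannot resolve listen address %q: %v", addr, err)
    }
    return nil
}

func main() {
    listenAddr := "tcp://0.0.0.0:26656" // value taken from the config, as in node.go
    if err := validateListenAddr(listenAddr); err != nil {
        panic(err) // fail fast at startup, as suggested above
    }
    fmt.Println("listen address OK:", listenAddr)
}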
@danil-lashin Does your config.P2P.ExternalAddress or config.P2P.ListenAddress have a value that can't be resolved?
Released the changes in v0.27.2 - it now validates the NodeInfo on startup (which will catch any unreachable addrs) and uses defers in the addrbook, so we shouldn't get caught holding mutexes.
Addressing your comment https://github.com/tendermint/tendermint/issues/2721#issuecomment-446900825
Dec 12 14:08:03 sentry4 minter[4352]: E[12126-12-12|14:08:03.017] Stopping peer for error module=p2p peer="Peer{MConn{144.76.140.208:14796} 83a56da94cd26aec30d3323e8bcae6ead18cf83d in}" err="Read overflow, maxSize is 1047 but this amino binary object is 6589 bytes."
This looks like it's due to one of the nodes tampering with the max_packet_msg_payload_size config setting, which actually needs to be consistent across nodes or they will send each other invalid packets. This setting defaults to 1024 bytes, but with amino overhead it comes to 1047 bytes. So it appears another node set it much higher.
Dec 12 13:48:30 sentry4 minter[4352]: E[12126-12-12|13:48:30.191] Error attempting to add vote module=consensus err="Expected 35373/1/2, but got 35373/0/2: Unexpected step"
I believe this can just happen due to network latency - we received a vote from a previous round (0) instead of the round we're on (1).
@danil-lashin can you confirm whether it's possible that your config.P2P.ExternalAddress or config.P2P.ListenAddress has a value that can't be resolved? If so, that would explain the deadlock, and it should be fixed in v0.27.2.
@ebuchman, they are the defaults:
laddr = "tcp://0.0.0.0:26656"
external_address = ""
Thanks @danil-lashin . Is the dump from https://pastebin.com/TfKN8hb9 from a seed node?
Is the dump from https://pastebin.com/TfKN8hb9 from a seed node?
Yes, it is.
Ok thanks. Have you ever seen this issue on non-seeds, or only seeds?
Can you explain how you're detecting the problem? The logs you've provided only cover a short amount of time (a couple of minutes) - are you catching the halt this quickly? Is it possible to get logs that go a little further back in time?
@danil-lashin could the issues here be related to use of BroadcastTxCommit? It's clear now that high use of that endpoint can lead to deadlocks. We're working on addressing this now.
Also there's a resource leak that may impact seeds in particular - #2967
@ebuchman The first time, this issue was related to BroadcastTxCommit, so we decided not to use it. Now it is only reproducible on seed nodes.
Ok thanks. I forgot to ask - are you running your seeds with p2p.seed_mode = true?
Seems that the issue was resolved! Thank you!