Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source): v0.25.0
ABCI app (name for built-in, URL for self-written if it's publicly available): https://github.com/MinterTeam/minter-go-node
Environment:
What happened: After some time under load, nodes stop responding. http://localhost:26657/status and some other RPC endpoints become unavailable, with a huge timeout (more than 60 seconds). It seems like ConsensusState's (or ConsensusReactor's) mutex is deadlocked. Restarting the node solves the problem.
The strange thing is that this bug happens after a block is committed and before a new block has even started (there is no BeginBlock call to the application).
What you expected to happen: Tendermint should be working normally.
Have you tried the latest version: yes
How to reproduce it (as minimally and precisely as possible): Download the Minter node, launch and synchronize it. Then send some transactions (50 txs in a block will be sufficient).
Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file): There are no errors or warnings in the logs, except that the node starts to lose connections to other nodes after consensus stops.
Config (you can paste only the changes you've made): Default config
node command runtime flags: -
/dump_consensus_state output for consensus bugs: dump_consensus_state is unavailable due to the timeout :(
Anything else we need to know: Tendermint is running in in-process mode. There are Local RPC calls to Tendermint. The bug happens only under load. I debugged our BeginBlock, EndBlock, Commit and DeliverTx implementations; they finish normally just before the bug, so it happens somewhere else. Also, we used v0.23.0 in our testnet before and everything was fine.
Is this reproducible on a single node or just with a network?
@ebuchman, I'm trying to reproduce the bug on a single node, but it seems it only occurs in a network.
Here are the last logs before consensus deadlocked: https://pastebin.com/3n7xMwFe
I have not restarted the node since the last freeze, so you can see that many RPC endpoints are not accessible: http://minter-node-2.testnet.minter.network:26657/
Would be great to get the output of kill -6 <PID> once a node deadlocks.
@melekes sure: https://pastebin.com/sb4fb9sV
After a restart the node is alive; you can see that all RPC methods work normally.
Additional information:
Compiled on MacOS 10.14 (18A391) using go version go1.10.4 darwin/amd64
Tendermint version v0.26.0-rc0, but the same bug was present on v0.25.0 as well
Cool! Thank you
Raw notes:
goroutine 636358 [select, 122 minutes]:
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p.(*MultiplexTransport).acceptPeers.func1(0xc4200ad860, 0x1082fa0, 0xc424d07bf0)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p/transport.go:266 +0x15e
created by github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p.(*MultiplexTransport).acceptPeers
A lot of goroutines like ^^^ are blocked on writing to mt.acceptc. Why? Corresponding code: https://github.com/tendermint/tendermint/blob/fb91ef7462b421349a56c32733724d920fce3ad4/p2p/transport.go#L250-L274
goroutine 67 [semacquire, 227 minutes]:
sync.runtime_SemacquireMutex(0xc42016c76c, 0x0)
/usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc42016c768)
/usr/local/go/src/sync/mutex.go:134 +0x108
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p/pex.(*addrBook).MarkGood(0xc42016c700, 0xc4208ae180)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p/pex/addrbook.go:293 +0x3a
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p.(*Switch).MarkPeerAsGood(0xc4201df7a0, 0x1089b40, 0xc422d281a0)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/p2p/switch.go:370 +0x7d
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusReactor).peerStatsRoutine(0xc4210c9580)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/reactor.go:829 +0x38f
created by github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusReactor).OnStart
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/reactor.go:76 +0x109
The consensus peerStatsRoutine is blocked on marking a peer as good.
goroutine 631520 [semacquire, 224 minutes]:
sync.runtime_SemacquireMutex(0xc420168694, 0x42a100)
/usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc420168690)
/usr/local/go/src/sync/mutex.go:134 +0x108
sync.(*RWMutex).Lock(0xc420168690)
/usr/local/go/src/sync/rwmutex.go:93 +0x2d
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).GetLastHeight(0xc420168600, 0x0)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:210 +0x46
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/core.Status(0x0, 0xc423717868, 0x42a194)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/core/status.go:73 +0x555
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/client.Local.Status(0xc4200950e0, 0x10, 0xc427154090, 0x0)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/client/localclient.go:52 +0x22
github.com/MinterTeam/minter-go-node/api.Status(0x107c340, 0xc420228c40, 0xc4200bf100)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/api/status.go:23 +0x45
github.com/MinterTeam/minter-go-node/api.wrapper.func1(0x107c340, 0xc420228c40, 0xc4200bf100)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/api/api.go:105 +0xd7
net/http.HandlerFunc.ServeHTTP(0xc42024aa10, 0x107c340, 0xc420228c40, 0xc4200bf100)
/usr/local/go/src/net/http/server.go:1947 +0x44
github.com/MinterTeam/minter-go-node/api.RateLimit.func2.1(0x107c340, 0xc420228c40, 0xc4200bf100)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/api/api.go:170 +0x326
net/http.HandlerFunc.ServeHTTP(0xc42414e140, 0x107c340, 0xc420228c40, 0xc4200bf100)
/usr/local/go/src/net/http/server.go:1947 +0x44
github.com/MinterTeam/minter-go-node/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc4213b6310, 0x107c340, 0xc420228c40, 0xc4200bf100)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/gorilla/mux/mux.go:162 +0xed
github.com/MinterTeam/minter-go-node/vendor/github.com/rs/cors.(*Cors).Handler.func1(0x107c340, 0xc420228c40, 0xc4200bea00)
/Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/rs/cors/cors.go:200 +0xfe
net/http.HandlerFunc.ServeHTTP(0xc4232aea40, 0x107c340, 0xc420228c40, 0xc4200bea00)
/usr/local/go/src/net/http/server.go:1947 +0x44
net/http.serverHandler.ServeHTTP(0xc4232e0340, 0x107c340, 0xc420228c40, 0xc4200bea00)
/usr/local/go/src/net/http/server.go:2697 +0xbc
net/http.(*conn).serve(0xc426424000, 0x107d2c0, 0xc42414e000)
/usr/local/go/src/net/http/server.go:1830 +0x651
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2798 +0x27b
Two /status HTTP requests are blocked on the consensus state mutex.
A lot of goroutines like ^^^ are blocked on writing to mt.acceptc. Why? Corresponding code: https://github.com/tendermint/tendermint/blob/fb91ef7462b421349a56c32733724d920fce3ad4/p2p/transport.go#L250-L274
It looks like the acceptRoutine may have exited, since it doesn't appear in the stack dump. That would explain why these routines are piling up, since there'd be nothing reading from them. Not sure why it exited, we'd need to see more logs. But we probably need to be smarter about bubbling up an error and/or restarting the acceptRoutine, rather than just exiting it. This wouldn't have been a problem in 0.23, as the p2p transport refactor came in 0.25, though I'm not sure exactly how/why this would lead to the consensus deadlock. @xla any thoughts here?
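To make the failure mode concrete, here is a standalone toy example (plain Go, not the actual transport code; acceptc is just a stand-in for mt.acceptc): once the single reader of an unbuffered channel exits, every later sender blocks forever, which is exactly the kind of pile-up visible in the dump.

package main

import (
    "fmt"
    "time"
)

func main() {
    acceptc := make(chan int) // stand-in for the unbuffered mt.acceptc

    // The single reader exits early, e.g. after an unexpected error,
    // playing the role of an acceptRoutine that silently returned.
    go func() {
        fmt.Println("reader: got", <-acceptc)
        // the reader returns here; nothing ever reads acceptc again
    }()

    // Per-connection goroutines keep being spawned and keep trying to send.
    for i := 0; i < 3; i++ {
        go func(i int) {
            acceptc <- i // only one send can ever succeed; the rest block forever
            fmt.Println("writer", i, "delivered")
        }(i)
    }

    time.Sleep(500 * time.Millisecond)
    fmt.Println("the remaining writers are now stuck on the send (goroutine leak)")
}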
The consensus peerStatsRoutine is blocked on marking a peer as good.
This sounds concerning. Suggests there might be a deadlock in the addrBook somewhere but so far I can't find it, and I don't see anything in the stack trace that it would be contending with. Also not seeing the consensus state blocked trying to send something to the peerStatsRoutine. Could it be we just caught it at the moment it was grabbing this mutex, but it isn't actually blocked there? Seems unlikely but I'm not seeing what it's contending with.
It looks like the acceptRoutine may have exited, since it doesn't appear in the stack dump. That would explain why these routines are piling up, since there'd be nothing reading from them. Not sure why it exited, we'd need to see more logs.
Yeah, I had the same thoughts.
Could it be we just caught it at the moment it was grabbing this mutex, but it isn't actually blocked there?
Could be
So, the number one thing is to find out why the acceptRoutine exited (silently).
Had the same issue again today.
Here are the new logs: https://pastebin.com/asUda6gn
Maybe they will help.
[54310]: E[11-02|11:59:39.851] Connection failed @ sendRoutine module=p2p peer=0xb78f00 conn=MConn{74.207.236.148:26656} err="pong timeout"
TODO: fix peer formatting (output its address instead of the pointer)
[54310]: goroutine 51800 [semacquire, 24 minutes]:
[54310]: sync.runtime_Semacquire(0xc42016869c)
[54310]: /usr/local/go/src/runtime/sema.go:56 +0x39
[54310]: sync.(*RWMutex).RLock(0xc420168690)
[54310]: /usr/local/go/src/sync/rwmutex.go:50 +0x49
[54310]: github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).GetRoundState(0xc420168600, 0x0)
[54310]: /Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:218 +0x46
[54310]: github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusReactor).gossipVotesRoutine(0xc420151c00, 0x1089b40, 0xc427522000, 0xc4275220d0)
[54310]: /Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/reactor.go:608 +0x15a
[54310]: created by github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusReactor).AddPeer
[54310]: /Users/daniillashin/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/reactor.go:171 +0x205
The consensus reactor can't add a peer because something is holding the consensus state mutex.
Also, --log_level="*:debug" logs could help us find the culprit.
@melekes, here, just before the crash: https://pastebin.com/dSPsg6Qj
"interface conversion: interface {} is nil, not *consensus.PeerState" - that's something.
Note that the last block (at height 46025) was processed at 14:05:26, but this error occurred only at 14:07:07. So before the interface conversion error the node had already missed 18 blocks (1.5 min).
Yep, I noticed that. Nevertheless, it should not happen.
@danil-lashin started some work here
@danil-lashin Tried to run a Minter node, but got:
I[11-05|14:56:22.312] ABCI Handshake App Info module=consensus height=0 hash=0000000000000000000000000000000000000000 software-version= protocol-version=0
I[11-05|14:56:22.312] ABCI Replay Blocks module=consensus appHeight=0 storeHeight=0 stateHeight=0
I[11-05|14:56:22.316] Completed ABCI Handshake - Tendermint and App are synced module=consensus appHeight=0 appHash=0000000000000000000000000000000000000000
I[11-05|14:56:22.318] This node is not a validator module=consensus addr=088F3A9B2D8CB88EB8D532D2EC3DABE0F71BD0B7 pubKey=PubKeyEd25519{EA089868813887495D5A0E18B0E35FC2E67E0CEBF5FA77FE4DBE9725D052CAC6}
I[11-05|14:56:22.354] Starting ConsensusReactor module=consensus impl=ConsensusReactor
I[11-05|14:56:22.354] ConsensusReactor module=consensus fastSync=true
E[11-05|14:56:22.594] Dialing failed module=p2p [email protected]:26656 err="dial tcp 74.207.236.148:26656: connect: connection refused" attempts=0
E[11-05|14:56:52.565] Dialing failed module=p2p [email protected]:26656 err="dial tcp 74.207.236.148:26656: connect: connection refused" attempts=1
^Ccaptured interrupt, exiting...
I[11-05|14:56:55.552] Stopping ConsensusReactor module=consensus impl=ConsensusReactor
E[11-05|14:56:55.553] Not stopping ConsensusState -- have not been started yet module=consensus impl=ConsensusState
E[11-05|14:56:55.554] Stopped accept routine, as transport is closed module=p2p numPeers=0
root@f26985dca600:/go/src/github.com/MinterTeam/minter-go-node# ping 74.207.236.148
PING 74.207.236.148 (74.207.236.148) 56(84) bytes of data.
64 bytes from 74.207.236.148: icmp_seq=1 ttl=37 time=163 ms
64 bytes from 74.207.236.148: icmp_seq=2 ttl=37 time=165 ms
^C
--- 74.207.236.148 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1004ms
rtt min/avg/max/mdev = 163.473/164.708/165.943/1.235 ms
Also, created https://github.com/MinterTeam/minter-go-node/issues/128
@melekes, Hi! Thanks for the issue.
Seed nodes were down due to this bug. I've restarted them so you can try again: https://minter-node-1.testnet.minter.network:3000/
Also you can use this addrbook.json to sync node: https://gist.github.com/danil-lashin/56ca5a04ddfd4634c1f7bed6e425b63f
Just put it in ~/.minter/config/addrbook.json and start the node.
How do I send transactions? Do you have a script or something else so I can create "50 transactions"?
@melekes, I will prepare such a script.
Btw, now my nodes are freezing even without transaction load. At least 4-5 times an hour. I think this bug happens under all kinds of load, not only transaction execution. Maybe it correlates with some consensus timeout issues?
My node is syncing right now. We'll see.
H: 117092. No issues so far.
I've started sending transactions. My local nodes went down. How is yours?
Looking at that last log:
14:05:55 node-2: D[11-02|14:05:55.512] Flush module=p2p peer=0xb79150 conn=MConn{74.207.236.148:26656}
14:06:22 node-2: D[11-02|14:06:22.619] Send Ping module=p2p peer=0xb79150
14:06:22 node-2: D[11-02|14:06:22.620] Starting pong timer module=p2p peer=0xb79150 dur=45s
14:06:22 node-2: D[11-02|14:06:22.620] Flush module=p2p peer=0xb79150 conn=MConn{74.207.236.148:26656}
14:07:07 node-2: D[11-02|14:07:07.621] Pong timeout module=p2p peer=0xb79150
14:07:07 node-2: E[11-02|14:07:07.621] Connection failed @ sendRoutine module=p2p peer=0xb79150 conn=MConn{74.207.236.148:26656} err="pong timeout"
14:07:07 node-2: I[11-02|14:07:07.621] Stopping MConnection module=p2p peer=0xb79150 impl=MConn{74.207.236.148:26656}
14:07:07 node-2: E[11-02|14:07:07.621] Stopping peer for error module=p2p peer="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}" err="pong timeout"
14:07:07 node-2: I[11-02|14:07:07.621] Stopping Peer module=p2p peer=0xb79150 impl="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}"
14:07:07 node-2: D[11-02|14:07:07.621] Stopping MConnection (ignoring: already stopped) module=p2p peer=0xb79150 impl=MConn{74.207.236.148:26656}
14:07:07 node-2: I[11-02|14:07:07.622] Reconnecting to peer module=p2p [email protected]:26656
14:07:07 node-2: I[11-02|14:07:07.622] Dialing peer module=p2p [email protected]:26656
14:07:07 node-2: I[11-02|14:07:07.887] Starting Peer module=p2p peer=0xb79150 impl="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}"
14:07:07 node-2: I[11-02|14:07:07.888] Starting MConnection module=p2p peer=0xb79150 impl=MConn{74.207.236.148:26656}
14:07:07 node-2: D[11-02|14:07:07.989] Read PacketMsg module=p2p peer=0xb79150 conn=MConn{74.207.236.148:26656} packet="PacketMsg{40:5A433AB908DBE702 T:1}"
14:07:07 node-2: D[11-02|14:07:07.989] Received bytes module=p2p peer=0xb79150 chID=64 msgBytes=5A433AB908DBE702
14:07:07 node-2: D[11-02|14:07:07.990] Receive module=blockchain src="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}" chID=64 msg="[bcStatusResponseMessage 46043]"
14:07:07 node-2: D[11-02|14:07:07.990] Read PacketMsg module=p2p peer=0xb79150 conn=MConn{74.207.236.148:26656} packet="PacketMsg{20:C96A6FA808DCE7021804 T:1}"
14:07:07 node-2: D[11-02|14:07:07.990] Received bytes module=p2p peer=0xb79150 chID=32 msgBytes=C96A6FA808DCE7021804
14:07:07 node-2: D[11-02|14:07:07.990] Receive module=consensus src="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}" chId=32 msg="[NewRoundStep H:46044 R:0 S:RoundStepPrevote LCR:0]"
14:07:07 node-2: I[11-02|14:07:07.990] Stopping MConnection module=p2p peer=0xb79150 impl=MConn{74.207.236.148:26656}
14:07:07 node-2: E[11-02|14:07:07.990] Stopping peer for error module=p2p peer="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}" err="interface conversion: interface {} is nil, not *consensus.PeerState"
14:07:07 node-2: I[11-02|14:07:07.990] Stopping Peer module=p2p peer=0xb79150 impl="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}"
So we don't receive anything from the peer, so we ping it, then time out waiting for the pong, so we disconnect. Then we try to reconnect, succeed, and then encounter the "interface conversion: interface {} is nil, not *consensus.PeerState" issue.
Can you give us more info on the network connectivity? Wondering why it's not receiving anything from the peer and then timing out waiting for the pong in the first place. Maybe there are corresponding logs from the peer it's talking to that would give us a clue? Also, anything special about the config being used here?
The pong timeout seems to be a recurring issue here. Could it be pointing to some kind of networking issue? Where are these nodes running?
@ebuchman, there were 50+ nodes across the globe, in different data centers, connected to this particular one. They are all failing, and the logs are identical on all machines, so I don't think the issue is caused by bad network conditions. But if I can give you some specific information, let me know.
Another interesting observation: when a node freezes, I can terminate it only with SIGKILL.
Ok, noted. A couple of clarifications about things you said:
Note that the last block (at height 46025) was processed at 14:05:26, but this error occurred only at 14:07:07. So before the interface conversion error the node had already missed 18 blocks (1.5 min).
So this means one of the nodes froze, but other nodes kept making blocks? Is it possible to see logs from one of those nodes that was still making blocks and talking to the one that froze?
Btw, now my nodes are freezing even without transaction load. At least 4-5 times an hour.
What does it mean that a node freezes multiple times an hour? Do they unfreeze at some point and then re-freeze, every ~10-15 minutes?
Can we see a config file?
So this means one of the nodes froze, but other nodes kept making blocks?
Right, there are nodes that continue making blocks. Nodes are freezing independently.
Is it possible to see logs from one of those nodes that was still making blocks and talking to the one that froze?
I will try to collect them.
What does it mean that a node freezes multiple times an hour? Do they unfreeze at some point and then re-freeze, every ~10-15 minutes?
It means that nodes run for only about 10 minutes and then freeze. They do not unfreeze without being terminated with SIGKILL. I've managed to build a "restart script", so I can see that the issue occurs very frequently.
Can we see a config file?
Thanks! We'll keep investigating
My node is still running (H: 120345)
@melekes, can you give me the IP of your node? I will send some transactions directly to it to bring it down more quickly :)
178.128.196.40
@melekes, done. You will see some panic errors, but that's just because of a tx decode issue. If you want to get rid of it you can compile Minter from the dev branch.
Nevertheless, Tendermint is deadlocked (http://178.128.196.40:26657/status) and I suppose that should not be happening.
Damn, I forgot to set up log redirection. Could you provide the script you're using? Thank you!
Here: https://github.com/danil-lashin/minter-mass-check-redeemer
You can use it like this:
go run main.go ea90fc03739cf8fe86c3b05d296c002d37a19303a3f3425a6384da6cbbd21686 {N} 178.128.196.40
N is the nonce. Start from 0 and increment it on each run.
Two panics. Probably not related:
2018/11/07 15:55:39 http: panic serving 23.108.213.28:54436: runtime error: invalid memory address or nil pointer dereference
goroutine 112956 [running]:
net/http.(*conn).serve.func1(0xc000ff6960)
/usr/local/go/src/net/http/server.go:1746 +0xd0
panic(0xdab900, 0x18606e0)
/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/MinterTeam/minter-go-node/core/state.(*StateDB).getStateValidators(0x0, 0x1f92a)
/root/go/src/github.com/MinterTeam/minter-go-node/core/state/statedb.go:364 +0x26
github.com/MinterTeam/minter-go-node/core/state.(*StateDB).GetStateValidators(0x0, 0x0)
/root/go/src/github.com/MinterTeam/minter-go-node/core/state/statedb.go:358 +0x2b
github.com/MinterTeam/minter-go-node/api.GetValidators(0x1103220, 0xc0051e88c0, 0xc00c746b00)
/root/go/src/github.com/MinterTeam/minter-go-node/api/validators.go:74 +0x9f
github.com/MinterTeam/minter-go-node/api.wrapper.func1(0x1103220, 0xc0051e88c0, 0xc00c746b00)
/root/go/src/github.com/MinterTeam/minter-go-node/api/api.go:105 +0xd7
net/http.HandlerFunc.ServeHTTP(0xc0001de840, 0x1103220, 0xc0051e88c0, 0xc00c746b00)
/usr/local/go/src/net/http/server.go:1964 +0x44
github.com/MinterTeam/minter-go-node/api.RateLimit.func2.1(0x1103220, 0xc0051e88c0, 0xc00c746b00)
/root/go/src/github.com/MinterTeam/minter-go-node/api/api.go:170 +0x31a
net/http.HandlerFunc.ServeHTTP(0xc006486580, 0x1103220, 0xc0051e88c0, 0xc00c746b00)
/usr/local/go/src/net/http/server.go:1964 +0x44
github.com/MinterTeam/minter-go-node/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000c41ea0, 0x1103220, 0xc0051e88c0, 0xc00c746b00)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/gorilla/mux/mux.go:162 +0xf1
github.com/MinterTeam/minter-go-node/vendor/github.com/rs/cors.(*Cors).Handler.func1(0x1103220, 0xc0051e88c0, 0xc00c746500)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/rs/cors/cors.go:200 +0xf2
net/http.HandlerFunc.ServeHTTP(0xc002a69640, 0x1103220, 0xc0051e88c0, 0xc00c746500)
/usr/local/go/src/net/http/server.go:1964 +0x44
net/http.serverHandler.ServeHTTP(0xc000219ad0, 0x1103220, 0xc0051e88c0, 0xc00c746500)
/usr/local/go/src/net/http/server.go:2741 +0xab
net/http.(*conn).serve(0xc000ff6960, 0x11041a0, 0xc0064864c0)
/usr/local/go/src/net/http/server.go:1847 +0x646
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2851 +0x2f5
E[11-07|15:55:25.011] Panic in RPC HTTP handler module=rpc-server err="runtime error: index out of range" stack="goroutine 111044 [running]:\nruntime/debug.Stack(0xc007812658, 0xdab900, 0x1860720)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0xa7\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server.RecoverAndLogHandler.func1.1(0xc0035efa80, 0x1104e20, 0xc002d8a760, 0xbef0e09b40837f90, 0xb7c7026903, 0x1878280, 0xc001a16200)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server/http_server.go:174 +0x50c\npanic(0xdab900, 0x1860720)\n\t/usr/local/go/src/runtime/panic.go:513 +0x1b9\ngithub.com/MinterTeam/minter-go-node/crypto.SigToPub(0xc0078128d8, 0x20, 0x20, 0xc007a26b40, 0x40, 0x40, 0x40, 0x0, 0x40)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/crypto/signature_nocgo.go:43 +0x124\ngithub.com/MinterTeam/minter-go-node/crypto.Ecrecover(0xc0078128d8, 0x20, 0x20, 0xc007a26b40, 0x40, 0x40, 0xc0078128f8, 0x597d70, 0xc0018c26f0, 0x1, ...)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/crypto/signature_nocgo.go:31 +0x6a\ngithub.com/MinterTeam/minter-go-node/core/check.(*Check).LockPubKey(0xc006e1df40, 0xc00663ec60, 0x0, 0xc007812a01, 0x3, 0x20)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/core/check/check.go:38 +0x9c\ngithub.com/MinterTeam/minter-go-node/core/transaction.RedeemCheckData.Run(0xc001038280, 0x9a, 0x9a, 0xcadcf736a6ef4312, 0xf68fcbc45d847a42, 0xb3845dfe542d3bf7, 0xbc7a1a695a2578d4, 0xe97640cdfa9a3228, 0x605abfcc37865ca9, 0x66986f56ab23afde, ...)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/core/transaction/redeem_check.go:92 +0x514\ngithub.com/MinterTeam/minter-go-node/core/transaction.RunTx(0xc005029400, 0xa4506ceb90beff01, 0xc00af33400, 0x13e, 0x280, 0x0, 0x1f927, 0x0, 0x0, 0x0, ...)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/core/transaction/executor.go:128 +0x34e\ngithub.com/MinterTeam/minter-go-node/core/minter.(*Blockchain).CheckTx(0xc000126080, 0xc00af33400, 0x13e, 0x280, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/core/minter/minter.go:289 +0xb3\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/abci/client.(*localClient).CheckTxAsync(0xc0001fb0e0, 0xc00af33400, 0x13e, 0x280, 0x0)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/abci/client/local_client.go:86 +0x88\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/proxy.(*appConnMempool).CheckTxAsync(0xc00002f720, 0xc00af33400, 0x13e, 0x280, 0x1)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/proxy/app_conn.go:114 +0x51\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/mempool.(*Mempool).CheckTx(0xc002d3a320, 0xc00af33400, 0x13e, 0x280, 0xc0072352c0, 0x0, 0x0)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/mempool/mempool.go:316 +0x17b\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/core.BroadcastTxSync(0xc00af33400, 0x13e, 0x280, 0x0, 0x0, 0x0)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/core/mempool.go:93 +0xb8\nreflect.Value.call(0xd85560, 0x10326c0, 0x13, 0xec7b8b, 0x4, 0xc0035efb00, 0x1, 0x1, 0xc0035efb00, 0xc0035efb08, ...)\n\t/usr/local/go/src/reflect/value.go:447 +0x449\nreflect.Value.Call(0xd85560, 0x10326c0, 0x13, 0xc0035efb00, 0x1, 0x1, 0x0, 
0x0, 0xc005cc8dc4)\n\t/usr/local/go/src/reflect/value.go:308 +0xa4\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server.makeHTTPHandler.func2(0x1102060, 0xc0035efa80, 0xc001a16200)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server/handlers.go:269 +0x188\nnet/http.HandlerFunc.ServeHTTP(0xc002c81f20, 0x1102060, 0xc0035efa80, 0xc001a16200)\n\t/usr/local/go/src/net/http/server.go:1964 +0x44\nnet/http.(*ServeMux).ServeHTTP(0xc002c81b60, 0x1102060, 0xc0035efa80, 0xc001a16200)\n\t/usr/local/go/src/net/http/server.go:2361 +0x127\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server.maxBytesHandler.ServeHTTP(0x10f8a40, 0xc002c81b60, 0xf4240, 0x1102060, 0xc0035efa80, 0xc001a16200)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server/http_server.go:219 +0xcf\ngithub.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server.RecoverAndLogHandler.func1(0x1103220, 0xc00121e540, 0xc001a16200)\n\t/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server/http_server.go:192 +0x394\nnet/http.HandlerFunc.ServeHTTP(0xc002c06ea0, 0x1103220, 0xc00121e540, 0xc001a16200)\n\t/usr/local/go/src/net/http/server.go:1964 +0x44\nnet/http.serverHandler.ServeHTTP(0xc001a1aa90, 0x1103220, 0xc00121e540, 0xc001a16200)\n\t/usr/local/go/src/net/http/server.go:2741 +0xab\nnet/http.(*conn).serve(0xc00785a3c0, 0x11041a0, 0xc000f844c0)\n\t/usr/local/go/src/net/http/server.go:1847 +0x646\ncreated by net/http.(*Server).Serve\n\t/usr/local/go/src/net/http/server.go:2851 +0x2f5\n"
A lot of "Send failed" messages:
E[11-07|15:55:16.383] Send failed module=p2p peer=0xbb4f60 channel=32 conn=MConn{103.29.68.156:26656} msgBytes=1919B3D508A5F2071802200F
The first one is in our API; it's definitely not related. The second one is in the CheckTx method (fixed in the dev branch).
Btw, maybe this issue is related to #846?
OK, I might have fixed the issue in https://github.com/tendermint/tendermint/pull/2748/commits/f69df639f6551ff93d74d10d7ad124f47b7472c9.
@danil-lashin feel free to try it out (branch: 2721-unresponsive-tm)
@melekes, the problem is still there :(
New logs: https://pastebin.com/eKkNRJHJ
Was worth trying anyway, thank you!
I think this is what is happening: the application panics for a particular transaction when executing CheckTx. The panic bubbles up all the way until we recover in the HTTP handler. Because we don't use defer in our localClient, its mutex is never released -> the consensus state is locked -> consensus is locked.
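For illustration, here is a stripped-down sketch of that pattern (a toy stand-in, not the real localClient code): when the unlock is written after the application call instead of being deferred, a panic that gets recovered higher up leaves the mutex held, and every subsequent caller blocks on it.

package main

import (
    "fmt"
    "sync"
)

// Toy stand-in for the ABCI local client: one mutex guarding calls into the app.
type localClient struct {
    mtx sync.Mutex
    app func() // stand-in for the application's CheckTx
}

// The problematic pattern: Unlock is written after the call instead of deferred,
// so it never runs if the app panics.
func (c *localClient) checkTxNoDefer() {
    c.mtx.Lock()
    c.app()
    c.mtx.Unlock()
}

// The fix described above: defer the Unlock so it runs even on panic.
func (c *localClient) checkTxWithDefer() {
    c.mtx.Lock()
    defer c.mtx.Unlock()
    c.app()
}

func main() {
    c := &localClient{app: func() { panic("bad tx") }}

    func() {
        defer func() { recover() }() // plays the role of the RPC handler's recover
        c.checkTxNoDefer()
    }()

    // The mutex is still held here; any other caller (consensus committing a
    // block, another RPC request, ...) would now block on it forever.
    fmt.Println("mutex free after recovered panic (no defer):", c.mtx.TryLock()) // prints false; TryLock needs Go 1.18+
}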
I can't reproduce the original issue with https://github.com/tendermint/tendermint/pull/2748/commits/1b036bb895e2e5a6112d541466f871f5cc9e4162 anymore.
Should we do the same for every method? It's certainly needed for methods exposed to the user (API).
@melekes, I still have the same issue.
A few notes:
😥
Are you using BroadcastTxCommit to send transactions? I've fixed one more possible issue https://github.com/tendermint/tendermint/pull/2748/commits/611268aaf5e92b46ffd7f0aaff878f846404976e
Are you using BroadcastTxCommit to send transactions?
Yes, we are. I tried this fix, but it does not help.
Maybe I can give you some more debug information? Or give you direct access to a server with a failing node?
I am rushing a bit, as always; the previous fix was incomplete. https://github.com/tendermint/tendermint/pull/2748/commits/f4f310903bea82c28e7320c42456922199f6c7d1 should work. How do I upgrade the TM version? I've tried pointing dep to my branch, but then the other nodes are on the old version, so everything breaks.
Looks like https://github.com/tendermint/tendermint/commit/f4f310903bea82c28e7320c42456922199f6c7d1 solved the issue! My node is running and seems stable! 🎉
Also, I asked our community to update their nodes. I will monitor until morning and let you know for sure.
How do I upgrade TM version?
It's not easy. I did not expect that there would be breaking changes after 0.26.0-rc0. For now I manually apply the patch to Tendermint locally; I just don't want to create a fork of the repo.
Fantastic 👍 Will be waiting for the update.
@melekes, the node is much more stable, but the error is still there. The good news is that now we know what causes it. I sent a lot of transactions through broadcastTxCommit and saw these errors:
E[11-08|21:53:04.034] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.035] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.035] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.036] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.038] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.038] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.039] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.040] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.040] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.043] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.043] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.046] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
E[11-08|21:53:04.048] Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
There are no errors when sending through broadcastTxSync/broadcastTxAsync. I think we are on the right track!
Good news! The deal with /broadcast_commit is that it uses pubsub to subscribe for every transaction result, and there's an implicit limit on how many subscribers & events pubsub can handle. Note it's better to use the sync/async methods if you don't need 100% confirmation. We can think about how to improve things, but I'd prefer to create a separate issue for that.
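To illustrate where the "failed to subscribe to tx: context deadline exceeded" errors come from, here is a toy model (not the real tmlibs pubsub): every /broadcast_tx_commit call subscribes for its tx result through one bounded command queue, and once that queue is saturated the subscribe call hits its context timeout.

package main

import (
    "context"
    "fmt"
    "time"
)

// Toy pubsub server: subscriptions go through a single bounded command channel.
type server struct{ cmds chan string }

func (s *server) Subscribe(ctx context.Context, query string) error {
    select {
    case s.cmds <- "subscribe: " + query:
        return nil
    case <-ctx.Done():
        return fmt.Errorf("failed to subscribe to tx: %w", ctx.Err())
    }
}

func main() {
    // Capacity 2 and nothing draining the queue, standing in for an overloaded node.
    s := &server{cmds: make(chan string, 2)}

    for i := 0; i < 5; i++ {
        ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
        err := s.Subscribe(ctx, fmt.Sprintf("tx.hash='%d'", i))
        cancel()
        fmt.Printf("tx %d: %v\n", i, err)
    }
    // The first two subscriptions succeed; the rest time out with
    // "failed to subscribe to tx: context deadline exceeded".
}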
And I'll ask you to confirm that there's no "freeze" anymore. Thank you for the swift answers!
@melekes, thank you! 🎉
For now everything looks great, except when using /broadcast_commit for several thousand txs. In that case Tendermint "freezes".
So I think we just will not use /broadcast_commit. Btw, do you have a specific value of how many simultaneous txs it can handle?
For now everything looks great, except when using /broadcast_commit for several thousand txs. In that case Tendermint "freezes".
Wait, so if you push too many txs, TM stops producing blocks? I'll try to replicate locally and see what I can do.
Btw, do you have a specific value of how many simultaneous txs it can handle?
This depends largely on how long we wait for a tx to get into a block, which depends on the block size, transaction propagation time, and the number of validators. There's no universal number, but we can certainly do some benchmarks here.
Wait, so if you push too many txs, TM stops producing blocks? I'll try to replicate locally and see what I can do.
Right, but only when using /broadcast_commit. Errors like this appear and TM does not produce blocks anymore:
Error on broadcastTxCommit module=rpc err="failed to subscribe to tx: context deadline exceeded"
And /status is not available, as usual.
Also I've caught several panics like:
panic: Peer Peer{MConn{185.162.92.124:26656} dbfdcd7f4ae82d5d77b64b3fee59a264087b1beb out} has no state
Can you share a stack trace? with the logs if possible
Also I've caught several panics like:
panic: Peer Peer{MConn{185.162.92.124:26656} dbfdcd7f4ae82d5d77b64b3fee59a264087b1beb out} has no state
Fixed in the last commit https://github.com/tendermint/tendermint/pull/2748/commits/223b069038862967b6dcd53e5fc509f47bd006b2
Thanks, applied! 👍 No more panics for now.
The last thing to solve here: in /broadcast_commit we use pubsub to wait for the transaction result. When we have too many subscribers (say 10000), time to propagate a msg to the subscriber starts to grow. And more importantly, time to publish a msg starts to grow (because pubsub is a queue and publishing is synchronous) => consensus slows down! This will be addressed in pubsub 2.0 https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-033-pubsub.md.
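A rough illustration of that effect (again a toy, not the real event bus): if a publish synchronously delivers the event to every subscriber, the time spent per publish grows with the number of subscribers, and since the publisher here is the consensus code, block production slows down with it.

package main

import (
    "fmt"
    "time"
)

func main() {
    for _, n := range []int{100, 1000, 10000} {
        // One buffered channel per subscriber, as a stand-in for per-query channels.
        subs := make([]chan string, n)
        for i := range subs {
            subs[i] = make(chan string, 1)
        }

        start := time.Now()
        // Synchronous publish: the caller (think: consensus finalizing a block)
        // does not return until every subscriber channel has been written to.
        for _, ch := range subs {
            ch <- "EventTx"
        }
        fmt.Printf("%5d subscribers: publish took %v\n", n, time.Since(start))
    }
}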
Great, thank you again!
For now we will switch to BroadcastTxSync and wait for the updates.
Have you tried my latest commits? Specifically https://github.com/tendermint/tendermint/pull/2748/commits/8db23b097da52d6ac3e7fe0e83ab19a7ea418ed5. You should see a lot of "Timed out waiting ..." errors. I was not able to reproduce the "freezing" part, but I think my theory above is correct.
Yep, they are appearing:
{
  "jsonrpc": "2.0",
  "id": "",
  "error": {
    "code": -32603,
    "message": "Internal error",
    "data": "Timed out waiting for tx to be included in a block"
  }
}
And it seems to me that the problem is gone; I can't reproduce it anymore.
Maybe now I need to send many more txs to hit this problem.
@danil-lashin do you need us to maintain a patch with this based on the v0.26.0-rc0, or are you ok to upgrade to the v0.26 series proper?
Sorry again for making a breaking change after tagging the release candidate - won't happen again.
@ebuchman, it's totally fine for us to switch to v0.26. We plan to launch a new testnet next week, so we will upgrade to v0.26.1 in the new release of our project. Thank you!
Amazing.
I posted a summary of my understanding in https://github.com/tendermint/tendermint/pull/2748#pullrequestreview-173704860
Does that seem correct? And can we close this issue once that PR is merged? If there are any other unresolved problems here, we should open new issues for them.
Thanks for all your help on this one.
https://github.com/tendermint/tendermint/pull/2748#pullrequestreview-173704860 is totally correct.
I think that we all agree that we can close this issue.
Anyway, we still have the issue with BroadcastTxCommit (which you mentioned in the 3rd bullet of the PR review), but I think it is better to open a new issue for that.
Thank you again!
The PR with the fix/feature was merged into develop. It will be shipped with the 0.26.2 release (check the changelog).
@melekes, I caught this bug again in v0.27.0 :(
The node freezes and /status is unresponsive. Here are the logs: https://pastebin.com/PkzgSWPj
😥 There are no panics / errors in the log. Could you attach the output of kill -6 <PID> (goroutine dump)?
Sure, I'm trying to reproduce it now!
We might have a deadlock somewhere in the p2p/addrbook.go.
Are you sure you're running 0.27.0? Because in 0.27.0 there's no such error E[11126-12-11|17:45:59.786] Stopping peer for error module=p2p peer="Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out}" err="Peer Peer{MConn{74.207.236.148:26656} 647e32df3b9c54809b5aca2877d9ba60900bc2d9 out} has no state", only panic.
Are you sure you're running 0.27.0?
Yes, I am. Here's version info:
https://minter-node-2.testnet.minter.network:8841/status
"version": "0.27.0"
BTW, the frozen node was running in seed mode. I have switched that off now, so I can check whether it is reproducible in normal mode.
Also, since you've only pasted part of the log, I want to ask: Are there any panics / errors other than "Stopping peer for ... pong timeout" in the log?
Ooh, good to know! There are a number of issues with seeds currently: https://github.com/tendermint/tendermint/issues/2955, https://github.com/tendermint/tendermint/issues/2967, https://github.com/tendermint/tendermint/issues/2821
There are some interesting errors reported by our community. Don't know if they are related.
Dec 12 14:08:03 sentry4 minter[4352]: E[12126-12-12|14:08:03.017] Stopping peer for error module=p2p peer="Peer{MConn{144.76.140.208:14796} 83a56da94cd26aec30d3323e8bcae6ead18cf83d in}" err="Read overflow, maxSize is 1047 but this amino binary object is 6589 bytes."
Dec 12 13:48:30 sentry4 minter[4352]: E[12126-12-12|13:48:30.191] Error attempting to add vote module=consensus err="Expected 35373/1/2, but got 35373/0/2: Unexpected step"
After scanning pex/addrbook.go and pex/file.go, I did not find anything obvious (I even dove into the netaddr String() implementation to see if it would panic, and looked for places where addr or knownAddr might be nil), but it turns out I may have missed something.
Based on the latest pastebin by Danil, I noticed that TrapSignal was called 373 minutes ago, so likely the node started then. I also noticed that the address book was already blocking some goroutines since 373 minutes ago, so I suspected that it might have to do with adding local addresses. So I took a look at AddOurAddress() one more time. Looking at addrBook.AddOurAddress(nodeInfo.NetAddress()) in node.go line 452, it looks like nodeInfo.NetAddress() may return a nil address if lookup of a host fails.
// in p2p/node_info.go:
func (info DefaultNodeInfo) NetAddress() *NetAddress {
    idAddr := IDAddressString(info.ID(), info.ListenAddr)
    netAddr, err := NewNetAddressString(idAddr)
    if err != nil {
        switch err.(type) {
        case ErrNetAddressLookup:
            // XXX If the peer provided a host name and the lookup fails here
            // we're out of luck.
            // TODO: use a NetAddress in DefaultNodeInfo
        default:
            panic(err) // everything should be well formed by now
        }
    }
    return netAddr
}
ErrNetAddressLookup gets returned from p2p/netaddress.go:
ip := net.ParseIP(host)
if ip == nil {
    ips, err := net.LookupIP(host)
    if err != nil {
        return nil, ErrNetAddressLookup{host, err}
    }
    ip = ips[0]
}
And while NodeInfo.ValidateBasic() does validate the listen address, it only gets called upon handshake w/ the remote peer's NodeInfo in p2p/transport. The local node's info.ListenAddr gets set from the config, and never gets validated AFAIK.
// node/node.go
...
    lAddr := config.P2P.ExternalAddress
    if lAddr == "" {
        lAddr = config.P2P.ListenAddress
    }
    nodeInfo.ListenAddr = lAddr
    return nodeInfo
}
We should be validating the NodeInfo there before returning, and panicking if it fails validation.
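As a sketch of what that could look like (illustrative only; validateListenAddr is a made-up helper here, and the real fix belongs in Tendermint's node setup): resolve the configured listen address up front and fail fast, instead of letting NetAddress() silently return nil later.

package main

import (
    "fmt"
    "net"
    "strings"
)

// validateListenAddr resolves the configured listen address up front so a bad
// hostname is caught at startup rather than producing a nil NetAddress later.
func validateListenAddr(addr string) error {
    host := addr
    if i := strings.LastIndex(host, ":"); i >= 0 {
        host = host[:i] // strip the port, if any
    }
    host = strings.TrimPrefix(host, "tcp://")
    if net.ParseIP(host) != nil {
        return nil // a literal IP needs no lookup
    }
    if _, err := net.LookupIP(host); err != nil {
        return fmt.Errorf("cannot resolve listen address %q: %v", addr, err)
    }
    return nil
}

func main() {
    listenAddr := "tcp://0.0.0.0:26656" // value taken from the config, as in node.go
    if err := validateListenAddr(listenAddr); err != nil {
        panic(err) // fail fast at startup, as suggested above
    }
    fmt.Println("listen address OK:", listenAddr)
}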
@danil-lashin Does your config.P2P.ExternalAddress or config.P2P.ListenAddress have a value that can't be resolved?
Released the changes in v0.27.2 - it now validates the NodeInfo on startup (which will catch any unreachable addrs) and uses defers in the addrbook, so we shouldn't get caught holding mutexes.
Addressing your comment https://github.com/tendermint/tendermint/issues/2721#issuecomment-446900825
Dec 12 14:08:03 sentry4 minter[4352]: E[12126-12-12|14:08:03.017] Stopping peer for error module=p2p peer="Peer{MConn{144.76.140.208:14796} 83a56da94cd26aec30d3323e8bcae6ead18cf83d in}" err="Read overflow, maxSize is 1047 but this amino binary object is 6589 bytes."
This looks like it's due to one of the nodes tampering with the max_packet_msg_payload_size config setting, which actually needs to be consistent across nodes or they will send each other invalid packets. This setting defaults to 1024 bytes, but with amino overhead it comes to 1047 bytes. So it appears another node set it much higher.
Dec 12 13:48:30 sentry4 minter[4352]: E[12126-12-12|13:48:30.191] Error attempting to add vote module=consensus err="Expected 35373/1/2, but got 35373/0/2: Unexpected step"
I believe this can just happen due to network latency - we received a vote from a previous round (0) instead of the round we're on (1).
@danil-lashin can you confirm whether it's possible that your config.P2P.ExternalAddress or config.P2P.ListenAddress has a value that can't be resolved? If so, that would explain the deadlock, and it should be fixed in v0.27.2.
@ebuchman, they are the defaults:
laddr = "tcp://0.0.0.0:26656"
external_address = ""
Thanks @danil-lashin . Is the dump from https://pastebin.com/TfKN8hb9 from a seed node?
Is the dump from https://pastebin.com/TfKN8hb9 from a seed node?
Yes, it is.
Ok thanks. Have you ever seen this issue on non-seeds, or only seeds?
Can you explain how you're detecting the problem? The logs you've provided only cover a short amount of time (a couple of minutes) - are you catching the halt this quickly? Is it possible to get logs that go a little further back in time?
@danil-lashin could the issues here be related to use of BroadcastTxCommit? It's clear now that high use of that endpoint can lead to deadlocks. We're working on addressing this now.
Also there's a resource leak that may impact seeds in particular - #2967
@ebuchman The first time, this issue was related to BroadcastTxCommit, so we decided not to use it. Now it is only reproducible on seed nodes.
Ok thanks. I forgot to ask - are you running your seeds with p2p.seed_mode = true?
Seems that the issue was resolved! Thank you!