The application starts normally, but as soon as the IPFS js node gets initialized, the application throws a critical error with the following:
App starting..
events.js:165
throw er; // Unhandled 'error' event
^
Error: websocket error
at WS.Transport.onError (/path/to/application/node_modules/engine.io-client/lib/transport.js:64:13)
at WebSocket.ws.onerror (/path/to/application/node_modules/engine.io-client/lib/transports/websocket.js:150:10)
at WebSocket.onError (/path/to/application/node_modules/ws/lib/EventTarget.js:109:16)
at WebSocket.emit (events.js:180:13)
at WebSocket.finalize (/path/to/application/node_modules/ws/lib/WebSocket.js:182:41)
at ClientRequest._req.on (/path/to/application/node_modules/ws/lib/WebSocket.js:653:12)
at ClientRequest.emit (events.js:180:13)
at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:539:21)
at HTTPParser.parserOnHeadersComplete (_http_common.js:117:17)
at TLSSocket.socketOnData (_http_client.js:444:20)
Emitted 'error' event at:
at done (/path/to/application/node_modules/ipfs/src/core/boot.js:58:19)
at /path/to/application/node_modules/async/internal/parallel.js:39:9
at /path/to/application/node_modules/async/internal/once.js:12:16
at iterateeCallback (/path/to/application/node_modules/async/internal/eachOfLimit.js:44:17)
at /path/to/application/node_modules/async/internal/onlyOnce.js:12:16
at /path/to/application/node_modules/async/internal/parallel.js:36:13
at done (/path/to/application/node_modules/ipfs/src/core/components/start.js:15:16)
at series (/path/to/application/node_modules/ipfs/src/core/components/start.js:39:25)
at /path/to/application/node_modules/async/internal/parallel.js:39:9
at /path/to/application/node_modules/async/internal/once.js:12:16
For context on the setup: I have a Node.js application that uses IPFS-js. The application is dockerized, as is a full dockerized IPFS node that connects to the IPFS-js app as a peer over websockets. This has worked perfectly fine and I had no problems until last night, when every single one of my environments started throwing the same error out of nowhere. Now no one on my team can spin up the API because it hits the same issue. One odd thing: in the window of time before it crashes, we can hit the API with curl and get back IPFS results for a quick second; otherwise it fails. I have confirmed it is not a websocket issue on my end and it is in fact something with the IPFS-js package in the Node.js application. Guidance is much appreciated. The relevant configuration:
let options = {
config: {
Addresses: {
Swarm: [
// '/dns4/wrtc-star.discovery.libp2p.io/tcp/443/wss/p2p-webrtc-star'
'/dns4/ws-star.discovery.libp2p.io/tcp/443/wss/p2p-websocket-star'
]
}
}
}
const ipfsNode = new IPFS(options)
I just found this morning that the ws-star server was hung. It responded to normal HTTP requests properly, meaning that our health checks thought everything was fine, but the actual websocket endpoints didn't work.
I've restarted the server and confirmed it to be working again, and I'm adding better health checks to it now. Could you please retry and report back whether it's working?
@VictorBjelkholm It is resolved now due to that fix, thank you so much! Much appreciated.
It does seem like we could surface a better error here, though, saying something like could not connect to websocket server <address> or something. It wouldn’t fix the problem of a lot of people all depending on a particular server that went down, but it would at least make it more clear what happened.
@Mr0grog That's a good point; it has been raised before in: https://github.com/ipfs/js-ipfs/issues/804#issuecomment-370750313
Basically, the current implementation crashes when the node can't start listening on a swarm address (because the endpoint isn't responding), which leads to this crash. While I agree that the error message could be better, I would also say that it should be a warning instead of an error, and the daemon should continue booting even if a swarm address fails to open.
I think I’d agree with all those points 😄
@VictorBjelkholm I am getting this same error again across all my applications. You may need to restart your ws-star server again.
Coworker of shessenauer here: we're still getting this error. Are there other servers we can add that would be more stable? Even better, is there an option that would allow automatic addition of swarm nodes connected to the ones listed in options.config.Addresses.Swarm, so that the list of servers is dynamic and thus perhaps more resilient? That would be fantastic. Is that congruent with the idea of a Swarm?
Question: will this happen if any one of the endpoints in options.config.Addresses.Swarm is down? Put another way: do all of the servers in options.config.Addresses.Swarm have to be up to avoid this error?
Thanks guys.
Ok so after looking at it a little more closely, the problem is that IPFS never gets out of the 'starting' state after a websocket connection error is thrown. Is there/can there be a reconnect option for this sort of thing?
Re-opening (and re-titling) this to track things a little more clearly. There are two things to address here:
1. This error message needs to be more clear (e.g. "failed to connect to websocket ${address}" or something).
2. A connection failure (probably for any swarm connection?) should not be fatal for IPFS.
These are tied in with #1325, but are concrete enough we should be able to address them sooner and more directly.
This was resolved by https://github.com/ipfs/js-ipfs/pull/1793