Js-ipfs: Better error handling in daemon + gateway

Created on 24 Apr 2018 · 38 Comments · Source: ipfs/js-ipfs

Today, if there is an error in a stream, or any other error within the daemon or the gateway, the entire process crashes. While not a problem for testing, it's a problem when deploying and trying to run a long-term daemon (for example, the js-ipfs daemons).

I raised this before in https://github.com/ipfs/js-ipfs/pull/1243

Basically, we should be able to gracefully handle errors down the stack without having the daemon and gateway crash.

Current issue is the following error that happens every ~10 minutes:

/usr/src/app/node_modules/pull-pair/index.js:11
      throw new Error('already piped')
      ^

Error: already piped
    at sink (/usr/src/app/node_modules/pull-pair/index.js:11:13)
    at consume (/usr/src/app/node_modules/pull-defer/sink.js:7:17)
    at consume (/usr/src/app/node_modules/pull-defer/sink.js:7:17)
    at Connection.consume (/usr/src/app/node_modules/pull-defer/sink.js:7:17)
    at pull (/usr/src/app/node_modules/pull-stream/pull.js:43:9)
    at Dialer.handle (/usr/src/app/node_modules/multistream-select/src/dialer/index.js:47:5)
    at attemptMuxerUpgrade (/usr/src/app/node_modules/libp2p-switch/src/dial.js:164:10)
    at gotWarmedUpConn (/usr/src/app/node_modules/libp2p-switch/src/dial.js:55:7)
    at Switch.dial (/usr/src/app/node_modules/libp2p-switch/src/dial.js:41:9)
    at _getPeerInfo (/usr/src/app/node_modules/libp2p/src/index.js:238:19)

I'm not saying that we should (or should not) fix this particular error, but rather when this error (and similar) happens, the daemon and gateway should not die but continue running.


All 38 comments

@fsdiogo as you're the current OKR owner for this, I've assigned this to you

Another error that happens directly after starting daemon right now:

Initializing daemon...
Using wrtc for webrtc support
/usr/src/app/src/cli/commands/daemon.js:40
        throw err
        ^

Error: websocket error
    at WS.Transport.onError (/usr/src/app/node_modules/engine.io-client/lib/transport.js:64:13)
    at WebSocket.ws.onerror (/usr/src/app/node_modules/engine.io-client/lib/transports/websocket.js:150:10)
    at WebSocket.onError (/usr/src/app/node_modules/engine.io-client/node_modules/ws/lib/EventTarget.js:109:16)
    at WebSocket.emit (events.js:160:13)
    at WebSocket.finalize (/usr/src/app/node_modules/engine.io-client/node_modules/ws/lib/WebSocket.js:182:41)
    at ClientRequest._req.on (/usr/src/app/node_modules/engine.io-client/node_modules/ws/lib/WebSocket.js:653:12)
    at ClientRequest.emit (events.js:160:13)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:539:21)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:112:17)
    at TLSSocket.socketOnData (_http_client.js:439:20)

@VictorBjelkholm I can't seem to reproduce the error in your last comment, can you share the steps or configs you have that produce the error?

@fsdiogo this issue (#1325) is not about a specific error but rather about having an error handler for uncaught exceptions from within the daemon. Currently, any error that happens down the stack (in js-bitswap, for example) crashes the daemon, which is why it's so unstable. We need to make it so the daemon won't crash and need a restart just because of an error in one stream or one operation.

Alright, so it's best to start off on the daemon itself, i.e. try to reproduce an error and make it so that the daemon doesn't crash?

@fsdiogo I think just adding if (Math.random() > 0.5) throw new Error('Test error') in various levels in the stack and testing manually will let you discover one or two places where we can catch the errors appropriately.
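A throwaway fault-injection helper along these lines makes the suggestion concrete (the names here are hypothetical, not part of the js-ipfs codebase; the `rng` parameter is injectable so tests can stay deterministic):

```javascript
// maybeThrow: sprinkle calls to this at various levels of the stack to
// simulate random failures. `rate` is the failure probability and `rng`
// is injectable so a test can force either outcome deterministically.
function maybeThrow (rate = 0.5, rng = Math.random) {
  if (rng() > rate) {
    throw new Error('Test error')
  }
}

// Hypothetical usage inside a code path under test:
function fetchBlock (cid) {
  maybeThrow() // simulated fault; remove after the error handling is verified
  return { cid, data: 'ok' }
}
```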

I've been thinking about a better approach to this problem. Instead of a band-aid approach, we should really understand how we're dealing with errors and have a robust solution for propagating them.

We are using async throughout the project, but check this:

/*
  With async the stack trace gets lost: it only tells the line
  where we throw the error, but it came from one of the calls
  in the series, we don't have a way to know from which one.
 */

const async = require('async')

async.series([
  (cb) => cb(null, 'async x'),
  (cb) => cb('async error'),
  (cb) => cb(null, 'async xxx'),
], (err, res) => {
  if (err) {
    throw new Error(err)
  }
})

/*
> node async.js
  Users/diogo/async-promises/async.js:15
    throw new Error(err);
    ^

  Error: async error
    at async.series (/Users/diogo/async-promises/async.js:15:10)
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:3874:9
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:473:16
    at iterateeCallback (/Users/diogo/async-promises/node_modules/async/dist/async.js:976:17)
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:958:16
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:3871:13
    at async.series (/Users/diogo/async-promises/async.js:11:11)
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:3866:24
    at replenish (/Users/diogo/async-promises/node_modules/async/dist/async.js:998:17)
    at iterateeCallback (/Users/diogo/async-promises/node_modules/async/dist/async.js:983:17)
*/

A solution for this is Promises:

/*
  With promises the stack trace is intact, being possible
  to observe all the lines and modules that lead to the error.
 */

new Promise((resolve, reject) => {
  setTimeout(() => resolve('promise x'), 100)
})
.then((res) => res + 'x')
.then((res) => randomFn())
.then((res) => res + 'xxx')
.catch((err) => console.log(err))

const randomFn = () => {
  throw new Error('randomFn error')
}

/*
> node promises.js
  Error: randomFn error
    at randomFn (/Users/diogo/async-promises/promises.js:14:9)
    at Promise.then.then (/Users/diogo/async-promises/promises.js:9:16)
    at <anonymous>
*/

Promises will vastly improve our debugging capability and therefore help us handle future errors. Moreover, we'd standardise the way we handle errors and improve the codebase by handling errors that are currently being lost along the way.

PR https://github.com/ipfs/js-ipfs/pull/1335 by @hugomrdias is a good example of that.

I had already opened the issue https://github.com/ipfs/js-ipfs/issues/1311 to start using promises, but I think this KR is the perfect case of why we should really do that. It's not a matter of callbacks vs promises, but of what works vs what doesn't.

I've been thinking of possible solutions but this is the best way that I've found, I've talked with @vasco-santos and @hugomrdias and they agree with this approach.

What do you guys think? @VictorBjelkholm @diasdavid

@dryajov want to chime in as you've dealt with this too?

@fsdiogo so the issue here is not really with async or Promise. The reason the stack is lost in your async example is that you are instantiating the Error object inside the series done callback:

async.series([
  (cb) => cb(null, 'async x'),
  (cb) => cb('async error'),
  (cb) => cb(null, 'async xxx'),
], (err, res) => {
  if (err) {
    throw new Error(err) // --> error instantiated here
  }
})

The Error object will capture the stack at the place of instantiation, so if we change your example to:

const async = require('async')

async.series([
  (cb) => cb(null, 'async x'),
  (cb) => cb(new Error('async error')),  // --> error instantiated here
  (cb) => cb(null, 'async xxx'),
], (err, res) => {
  if (err) {
    throw err
  }
})

We get the correct stacktrace:

/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/test.js:11
    throw err
    ^

Error: async error
    at async.series (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/test.js:7:14)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:3866:24
    at replenish (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:998:17)
    at iterateeCallback (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:983:17)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:958:16
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:3871:13
    at async.series (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/test.js:6:11)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:3866:24
    at replenish (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:998:17)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:1002:9

Note that in your promise example you do instantiate the error at the place of rejection, so the stack comes up correctly as well:

/*
  With promises the stack trace is intact, being possible
  to observe all the lines and modules that lead to the error.
 */

new Promise((resolve, reject) => {
  setTimeout(() => resolve('promise x'), 100)
})
.then((res) => res + 'x')
.then((res) => randomFn())
.then((res) => res + 'xxx')
.catch((err) => console.log(err))

const randomFn = () => {
  throw new Error('randomFn error')
}

/*
> node promises.js
  Error: randomFn error
    at randomFn (/Users/diogo/async-promises/promises.js:14:9)
    at Promise.then.then (/Users/diogo/async-promises/promises.js:9:16)
    at <anonymous>
*/

@dryajov sure, you're correct, but the problem isn't where you create the errors. It's propagation: with async, error instances are just passed around (or teleported, as I call it). With promises, errors are automatically re-thrown and stack traces follow the code path, with a lot less noise from async internals.

Not having full stack traces is really painful because we have to manually search the code base, commenting things out until we find the line we need. In contrast, with complete stacks we can cmd+click and BOOM, the editor jumps to the line we need: perfect developer experience and full productivity.

Aside from my comment above, this is still obviously a legitimate issue overall. Promises do help with that somewhat, but they are IMO not the final solution for this particular issue, and I would argue that bringing them up here as an end-all solution for this problem might end up doing more harm than good.

I thought about it for some time and I think some of the issues we have boil down to mixing error handling styles. We rely heavily on callback-style async constructs in our code base, which occasionally get mixed with try/catch somewhere deep in the codebase or a third-party module; add EventEmitter error event types and Promises on top of that, and we get a pretty wild compound of things that can and do go wrong.

IMO, in order to handle errors correctly we need to:

  • find all the places where we're not handling errors appropriately and fix them
  • Standardise error handling across our code base

    • look for throws and return them as callback errors where possible

    • ditto for promises

  • EventEmitter errors should also be appropriately handled - non-fatal errors should be emitted as warnings/user errors, since error is a special case and throws if there are no listeners attached
  • Finally, have a global error handler that catches unhandled errors and terminates the process gracefully
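That last point could look something like the sketch below. This is a hedged sketch, not the actual daemon code; `cleanup()` is a hypothetical stand-in for whatever teardown js-ipfs needs (closing the repo, libp2p connections, HTTP servers, and so on):

```javascript
// Last-resort handlers: log the failure, attempt a graceful teardown, then
// exit. State after an uncaught exception is suspect, so we always terminate.
process.on('uncaughtException', (err) => {
  console.error('Uncaught exception, shutting down:', err)
  cleanup()
    .catch(() => {}) // teardown failures shouldn't block the exit
    .then(() => process.exit(1))
})

process.on('unhandledRejection', (err) => {
  console.error('Unhandled rejection, shutting down:', err)
  process.exit(1)
})

// Hypothetical teardown hook, assumed to return a promise.
function cleanup () {
  return Promise.resolve()
}
```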

This stuff below is also tricky because the callback might not be called; in some places we even have setImmediate(() => self.emit('error', err)) to solve that.

const done = (err, res) => {
  if (err) {
    self.emit('error', err)
    return callback(err)
  }

  self.state.initialized()
  self.emit('init')
  callback(null, res)
}

What do you think about removing event emitters? Moving stuff like boot() inside the ipfs constructor to an instance method or a static factory method, where we can return a promise or use callbacks.

Ah, yeah - setImmediate(() => self.emit('error', err)) is mostly there because emit is sync, and the handler might actually never return, which breaks the flow, hence the setImmediate to force it async. I think the same thing would happen with promises too?


Yeah, I think removing EventEmitter can be a good thing; we use it in js-ipfs (js-libp2p might still be appropriate?) primarily because we want to propagate errors during instantiation, so a factory method seems like the right way to go.

Yes, it's true for promises too, that's why I mention removing event emitters. IMO this should be one of the first things to do, then all the stuff you said, and finally add promises to help with stacks, debugging, back-traces, cancelables, retry/timeouts etc.

Thanks for the input @dryajov and @hugomrdias!

  • [ ] Find where we're not handling errors appropriately and fix them
  • [ ] Standardise error handling across our code base

    • look for throws and return them as callback errors where possible

    • same for promises

  • [ ] Have a global error handler that catches unhandled errors and terminates the process gracefully
  • [ ] Remove EventEmitter and move boot to a factory method (requires a breaking change)

So, I realize that removing event emitting can simplify things (no argument from me that any critical error during startup should translate to an error in the callback/rejection in a promise), but we should keep in mind that the events are part of the public API. Various users may be relying on them. They at least deserve notice of the removal well before it happens, and ideally a release or two where the events still work but log a warning when used.

(That said, event emitters should all be wrapped in setImmediate, nextTick, or try/catch to ensure that a listener cannot break the emitting code.)
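That wrapping might look like the sketch below (`safeEmit` is a hypothetical helper, not an existing API): the try/catch keeps a throwing listener from breaking the emitter's own flow, while the setImmediate re-raise preserves the fact that the listener's error is still unhandled:

```javascript
// Emit without letting a misbehaving listener break the emitting code.
function safeEmit (emitter, event, ...args) {
  try {
    emitter.emit(event, ...args)
  } catch (err) {
    // The listener's error is not ours to swallow: re-raise it
    // asynchronously so our own control flow stays intact.
    setImmediate(() => { throw err })
  }
}
```

The emitting code continues normally, and the listener's exception surfaces on a later tick where a global handler (or a crash) can deal with it.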

@Mr0grog agree, this change should be treated as a breaking change and possibly a major version bump (0.3.x?).

@fsdiogo before refactoring the code, make sure to have tests that check for the situations where the daemon errors out today. Try things like opening a connection and write trash at different levels of the dialing (transport -> secio -> stream muxing -> protocols).

Yeah @Mr0grog @dryajov, it will be a breaking change and apps using ipfs will probably have to be refactored, so that's why in its first iteration we should offer both versions. I've opened a small WIP PR with that: https://github.com/ipfs/js-ipfs/pull/1359.

But as @diasdavid pointed out, I shall start with a different task. I've updated https://github.com/ipfs/js-ipfs/issues/1325#issuecomment-389819805 with the tasks order.

I have a couple more suggestions for your consideration:

  • When dealing with Node.js streams, use pump whenever possible to ensure errors are caught and handled
  • Add context to errors and allow our tests to assert on error messages we control in our modules by using explain-error

To address @VictorBjelkholm's original request:

In Node.js the best practice on uncaught exceptions is to crash the process. We need to track down and fix any issues that cause the process to crash. In Node.js you can't meaningfully reason about the state of the system in the case of an unhandled exception so you need to restart the process to avoid leaking memory, holding open connections unnecessarily, corrupting shared state etc. etc.

It's worth having a read of https://www.joyent.com/node-js/production/design/errors - relevant section:

The best way to recover from programmer errors is to crash immediately.

The only downside to crashing on programmer errors is that connected clients may be temporarily disrupted, but remember:

  • By definition, these errors are always bugs. We're not talking about legitimate system or network failures, but actual bugs in the program. They should be rare in production, and the top priority has to be to debug and fix them.
  • For all the cases described above (and many more), the requests in flight are not necessarily going to complete successfully anyway. They may complete successfully, they may crash the server again, they may complete incorrectly in obvious ways, or they may complete wrongly in very subtle ways that are very hard to debug.
  • In a reliable distributed system, clients must be able to deal with server failure by reconnecting and retrying requests. Network and system failure are a reality, whether or not the Node.js program itself is allowed to crash.
  • If your production program is crashing so often that these disconnections are a problem, then the real problem is that the server is so buggy, not that it crashes in the case of a bug.

If disconnecting clients is a frequent problem because a server crashes so often, you should focus on the bugs that cause the service to crash (and make those exceptional) rather than trying to avoid crashing in cases where the code is obviously wrong.

I'm not clear on the benefits of removing EventEmitter. It's used for boot, but IPFS also emits init, start and stop events in addition to ready and error (note that error is used for startup failures as well as boot failures). People are probably using all of those.

To avoid crashing the process on an error event you can check and emit the error event only if there are listeners for it.

I get the point about handlers throwing and not allowing the callback to be called, but they _are_ unhandled exceptions - you don't really want to catch them (what would you do? - it's not your code that err'd) and if you put them on the next tick then it's just going to be an unhandled exception there too. Our code should be doing any cleanup and/or running critical code paths _before_ emit is called.

BTW I'm +1 on all the other points that @fsdiogo has listed.

When dealing with Node.js streams, use pump whenever possible

👍 pump and pumpify are both excellent and hugely useful little tools for streams. It’s a lot of work to make sure you do handle errors right without them.

In Node.js you can't meaningfully reason about the state of the system in the case of an unhandled exception so you need to restart the process to avoid leaking memory, holding open connections unnecessarily, corrupting shared state etc. etc.

I guess I read this issue more as “there are a lot of errors we could be catching and handling so as not to leave ourselves in a bad state, so we don’t need to crash the daemon as often.” Because otherwise I totally agree that simply ignoring any old error is a recipe for really confusing and possibly more dangerous disasters, like data loss, sometime later in the process’s lifetime.

I get the point about handlers throwing and not allowing the callback to be called, but they are unhandled exceptions - you don't really want to catch them (what would you do? - it's not your code that err'd)

In any other large system I’ve worked on, we’ve generally followed the logic that, if you catch them, you do your clean-up and then re-throw (or just use a finally without a catch). The point isn’t to stop the error; it’s to stop an external error from breaking your internal state. (e.g. you want to be as recoverable as possible if the external error is caught and recovered in some other way, like with a global error handler.)

and if you put them on the next tick then it's just going to be an unhandled exception there too.

Exactly! Same as above; that’s the point. It’s not your error, so you shouldn’t be preventing/stopping it. You should only be preventing it from breaking your internal state because, again, it’s not your error.

Our code should be doing any cleanup and/or running critical code paths before emit is called.

Wherever possible, 👍 ! It’s just not always reasonable/possible. Sometimes it’s important that the ordering of the event come before some other code (e.g. before[action]/will[action] events, which I know we don’t have here, but in the abstract, it still applies).

👍 pump and pumpify are both excellent and hugely useful little tools for streams. It’s a lot of work to make sure you do handle errors right without them.

I would actually abstain from using Node streams at all and instead use pull-stream (converting Node streams to pull-streams with stream-to-pull-stream); the parts of our codebase that are still using Node streams are most likely legacy leftovers that need to be reworked anyway.
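For readers unfamiliar with the model: a pull-stream source is just a function the sink repeatedly calls for the next value, which makes error propagation explicit. The following is a minimal from-scratch illustration of that protocol, not the pull-stream library itself (whose API differs in details):

```javascript
// source: a function (end, cb) that the sink calls for each value.
// cb(null, value) delivers data, cb(true) signals end, cb(err) signals failure.
function values (items) {
  let i = 0
  return (end, cb) => {
    if (end) return cb(end)
    if (i >= items.length) return cb(true)
    cb(null, items[i++])
  }
}

// sink: pulls until end or error, then calls done(err, results) exactly once.
function collect (done) {
  return (source) => {
    const results = []
    const next = () => source(null, (end, value) => {
      if (end === true) return done(null, results)
      if (end) return done(end) // any source error lands in one place
      results.push(value)
      next()
    })
    next()
  }
}

// usage: an error from the source arrives in collect's callback
// instead of being thrown somewhere it could crash the process.
collect((err, res) => {
  if (err) return console.error('failed:', err.message)
  console.log(res)
})(values([1, 2, 3]))
```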

I would actually abstain of using node stream at all and instead use pull-stream

Yeah, you’re probably right for the case of all the stuff here. I think pump/pumpify are good light-weight tools for making built-in streams sane, while pull-stream is a heavier-weight replacement for built-in streams that already solves most of those problems :)

Also, we made the decision to use pull-stream exclusively a while back. Plus, although full-featured, pull-streams are still lightweight compared to Node streams in general.

@dryajov point taken! The sentiment was to ensure stream errors do not cause the daemon to crash so as long as we're ensuring that I'm happy 😄

@fsdiogo during the last js-ipfs weekly, there were some questions with regard to how to approach this. Do you feel it would be good to sync?

Well, I was going to follow these steps, if you feel that's the way to go, awesome! If not, we should sync 🙂

What is the approach to achieve this:

[screenshot of the proposed approach]

Ok, I see, so that is where we diverge. You should first of all create the tests that cause the crashes.

You want that approach so we don't start changing the code blindly, right? I get it.

I just think that way the process will be a lot slower, no?

@fsdiogo we are not optimizing for shipping fast here, the target should be correctness and ensuring that the same problems don't come back to haunt us.

Agree, we should have a test for every failure we run into 👍

@fsdiogo a good source of problems to solve is the list of bugs -- https://github.com/ipfs/js-ipfs/labels/bug --. Creating tests that reproduce those will help you nail down where the problems are coming from.

@alanshaw, @pgte and @satazor have a vast amount of experience in making large Node.js services robust. They can give a ton of tips on how to move forward the best way possible, reach out!

Alright, thanks @diasdavid! I'll reach out then!

No more "already piped" with 0.31 - https://github.com/ipfs/js-ipfs/issues/1458

Duplicate/Same as https://github.com/ipfs/js-ipfs/issues/1406. Let's continue there
