Kibana: [Ingest Manager] Error handling in server APIs

Created on 15 May 2020  路  6Comments  路  Source: elastic/kibana

Currently, various parts of the APIs provided by Ingest Manager have implemented error handling and logging in different levels of completeness.

Overall, we should do the following when an error happens:

  • [x] in the place where the error happens, use the Kibana Logger to log a descriptive error message. The logger is available since https://github.com/elastic/kibana/pull/66017~~
  • [x] in the place where the error happens, throw a Boom error with a suitable status code that is not 500, and the same error message that was logged
  • [ ] In the place where the error happens, throw an IngestManagerError with a suitable IngestManagerErrorType and a helpful and descriptive error message.
  • [x] in the request handler, if the caught error is a Boom error, use its status code in res.customError().
  • [x] In the request handler, if the caught error is instanceof IngestManagerError, use res.customError() to return the error to the caller. Pass the message, and get a suitable HTTP response code from the getHTTPResponseCode() helper.
  • [x] In the request handler, also use the Kibana Logger to log the error message to the Kibana log.
  • [x] In the request handler, if the error is not IngestManagerError, use status code 500. In that case, log an error to the console with the full error message, and also log the stack trace of the error.

For an example how this looks in implementation see https://github.com/elastic/kibana/pull/66541~~

Implementation example: https://github.com/elastic/kibana/pull/67278

Reasoning

  • Kibana platform code logs a stack trace whenever a request handler returns a status code 500. This stack trace comes from within platform code and is not helpful in debugging the error. I find it also confusing, because it implies that the error hasn't been caught and handled correctly, which is not entirely true in our code. (There is https://github.com/elastic/kibana/issues/65291 open for that.)
  • We should therefore use informative HTTP response codes whenever possible, or at least 400 bad request instead.
  • In cloud, customers can't inspect the Kibana log, so as much information as is practical should be provided through API error messages. This way, we can ask users e.g. on https://discuss.elastic.co/ to inspect their browser dev tools network tab and find out what happened. (Once we're in production, customers can ask support, and support agents will be able to inspect the Kibana logs.)
  • The UI can still decide not to show too much error information if that is not desired.

Tracking

This task is to go through each of these APIs and ensure it handles and reports errors properly:

  • [x] setup
  • [x] agent_config
  • [x] enrollment_api_key
  • [x] agent
  • [x] epm
  • [x] datasource
  • [x] data_streams
  • [x] install_script
  • [x] output
  • [x] settings
  • [x] app
beta1 Meta Ingest Management

Most helpful comment

  • in the place where the error happens, use the Kibana Logger to log a descriptive error message. The logger is available since #66017
  • in the place where the error happens, throw a Boom error with a suitable status code that is not 500, and the same error message that was logged
  • in the request handler, if the caught error is a Boom error, use its status code in res.customError().
  • in the request handler, only use status code 500 if there is no further information in the error. In that case, also log an error to the console with the full error.

A couple things here:

  • In general, we are moving away from any Hapi-specific APIs, including the Boom package. I do not suggest using this package.
  • Baking HTTP status codes deep in your business logic isn't generally recommended

    • We prefer not to couple our business logic to a specific consumer, like HTTP. I would instead recommend describing the exceptions in terms of your domain (for example, PackageMissingError) and then translating these domain-specific errors to the appropriate HTTP status code in your HTTP response handlers. This makes your business logic more flexible by allowing different endpoints to have different status code responses for the same error (which often makes sense in many situations).

    • You can also implement a generic error handler that you can wrap all of your HTTP endpoints in. This handler can be responsible for translating your domain-specific errors to HTTP responses.

  • Logging doesn't have to occur at the error throw site.

    • There should be no issues with moving your error logging up to the HTTP handler. Stack traces should be preserved so you can locate the original throw site. Logging at the throw site would require that you pass the logger everywhere, which is not recommended as it couples your code more tightly to the Core APIs.

Example Code

Business logic layer

class MyService {
  public async findPackage(packageName: string) {
    try {
      const package = await this.doApiCall(`/my-api`, { packageName });
      if (!package) {
        throw new PackageMissingError({ packageName });
      }
      return package;
     } catch (e) {
       // This is all made up for this example, but you get the idea
       if (e.timedOut) {
         throw new ElasticsearchUnavailableError(e);
       } else if (e.unauthenticated)
         throw new UnauthorizedError(e);
       }

       throw new UnknownError(e);
    }
  }
}

HTTP handler

router.get(
  { path: '/api/ingest/package', validation: { body: ... } },
  async (context, req, res) => {
    const { packageName } = req.body;
    try {
      const package = myService.findPackage(packageName);
      return res.ok({ package });
    } catch (e) {
      // This could be extracted into a generic `translateIngestError` util
      if (e instanceof PackageMissingError) {
        return res.notFound(); // don't log 'expected' or unexceptional errors
      } else if (e instanceof ElasticsearchUnavailableError) {
        log.warning(e);
        return res.unavailable();
      } else if (e instanceof UnauthorizedError) {
        return res.unauthorized();
      } else {
        log.error(e);
        return res.internalError();
      }
    }
  }
);

All 6 comments

Pinging @elastic/ingest-management (Team:Ingest Management)

  • in the place where the error happens, use the Kibana Logger to log a descriptive error message. The logger is available since #66017
  • in the place where the error happens, throw a Boom error with a suitable status code that is not 500, and the same error message that was logged
  • in the request handler, if the caught error is a Boom error, use its status code in res.customError().
  • in the request handler, only use status code 500 if there is no further information in the error. In that case, also log an error to the console with the full error.

A couple things here:

  • In general, we are moving away from any Hapi-specific APIs, including the Boom package. I do not suggest using this package.
  • Baking HTTP status codes deep in your business logic isn't generally recommended

    • We prefer not to couple our business logic to a specific consumer, like HTTP. I would instead recommend describing the exceptions in terms of your domain (for example, PackageMissingError) and then translating these domain-specific errors to the appropriate HTTP status code in your HTTP response handlers. This makes your business logic more flexible by allowing different endpoints to have different status code responses for the same error (which often makes sense in many situations).

    • You can also implement a generic error handler that you can wrap all of your HTTP endpoints in. This handler can be responsible for translating your domain-specific errors to HTTP responses.

  • Logging doesn't have to occur at the error throw site.

    • There should be no issues with moving your error logging up to the HTTP handler. Stack traces should be preserved so you can locate the original throw site. Logging at the throw site would require that you pass the logger everywhere, which is not recommended as it couples your code more tightly to the Core APIs.

Example Code

Business logic layer

class MyService {
  public async findPackage(packageName: string) {
    try {
      const package = await this.doApiCall(`/my-api`, { packageName });
      if (!package) {
        throw new PackageMissingError({ packageName });
      }
      return package;
     } catch (e) {
       // This is all made up for this example, but you get the idea
       if (e.timedOut) {
         throw new ElasticsearchUnavailableError(e);
       } else if (e.unauthenticated)
         throw new UnauthorizedError(e);
       }

       throw new UnknownError(e);
    }
  }
}

HTTP handler

router.get(
  { path: '/api/ingest/package', validation: { body: ... } },
  async (context, req, res) => {
    const { packageName } = req.body;
    try {
      const package = myService.findPackage(packageName);
      return res.ok({ package });
    } catch (e) {
      // This could be extracted into a generic `translateIngestError` util
      if (e instanceof PackageMissingError) {
        return res.notFound(); // don't log 'expected' or unexceptional errors
      } else if (e instanceof ElasticsearchUnavailableError) {
        log.warning(e);
        return res.unavailable();
      } else if (e instanceof UnauthorizedError) {
        return res.unauthorized();
      } else {
        log.error(e);
        return res.internalError();
      }
    }
  }
);

@joshdover Thanks for your comments! I'll adapt the original proposal.

What is the time frame for getting rid of Boom?

What is the time frame for getting rid of Boom?

No timeline right now, but we may consider removing Hapi altogether in the future to simplify our HTTP stack. Don't see this happening until some 8.x version.

I've updated the description, and opened https://github.com/elastic/kibana/pull/67278 with a minimal implementation example following @joshdover 's suggestions. (Thanks!)

I mostly agree. One notable exception is that I implemented the different error types as a typescript enum instead of having a class for each one. We don't really work with inheritance like this in the rest of the Ingest Manager code base and I didn't want to introduce yet another pattern.

Also, if we want to add more logic to the handlers (e.g like supporting different log levels) this should be easy to add later.

With https://github.com/elastic/kibana/pull/77975 & https://github.com/elastic/kibana/pull/74409 every route has the consistent logging and return values from this PR description.

Re-opened https://github.com/elastic/kibana/issues/65837 to track replacing our usage of Boom

Was this page helpful?
0 / 5 - 0 ratings

Related issues

bhavyarm picture bhavyarm  路  3Comments

timroes picture timroes  路  3Comments

treussart picture treussart  路  3Comments

bradvido picture bradvido  路  3Comments

MaartenUreel picture MaartenUreel  路  3Comments