Kibana: Non-deterministic behavior while installing a package

Created on 25 Aug 2020 · 30 comments · Source: elastic/kibana

This PR https://github.com/elastic/package-storage/pull/370 introduces an updated AWS package. The update might be faulty, so Kibana rejects it with HTTP 500. Unfortunately it says nothing but "Bad Request" (side issue: error reporting for developer purposes should be improved).

I clicked the Install button a second time and it worked without any problems. See the screenshots; I'm not sure whether it actually worked or not.

Screenshot 2020-08-25 at 13 22 31
Screenshot 2020-08-25 at 13 22 46

/cc @ruflin @kaiyan-sheng

Ingest Management bug

All 30 comments

Pinging @elastic/ingest-management (Team:Ingest Management)

@jfsiii You might know what's going on here, as you spent some time on improving the error handling / error messaging.

We have a ticket to improve error handling https://github.com/elastic/kibana/issues/66688

The issue description says a 500 code but I believe 'Bad Request' is a 400.

I thought it might be an outdated package, but I see we are using 400 for unknown errors as well (for now, unknown errors will be 500 again soon)
https://github.com/elastic/kibana/blob/fec0d515b3fe9fcaa185ccb77191afd0baaaead6/x-pack/plugins/ingest_manager/server/errors.ts#L22-L26

I'll look into it, especially regarding the error message, but @neptunian would know better what might be going on with the installation.

@jfsiii I put more details about the error in this PR: https://github.com/elastic/package-storage/pull/370#issuecomment-679955429. It looks like it's reported as HTTP 500.

I think the relevant parts of the proposal in https://github.com/elastic/kibana/issues/66688 for this situation are:

  • Create a descriptive error message as close as possible to the place where the error happens
  • Log that error message in the route handler
  • Return that error message in the HTTP response (because cloud users can't access the Kibana log)
  • As the stack trace for all 500 errors will come from within platform code, log the stack trace of the error before passing a 500 error to the platform, e.g. by also logging the stack trace in the route handler.
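A rough sketch of what that pattern could look like in a route handler (hypothetical names, not the actual ingest_manager code):

import { Logger, RequestHandler } from 'src/core/server';

// Assumed to be provided by plugin setup / the install service; hypothetical declarations.
declare const logger: Logger;
declare function installPackage(pkgkey: string): Promise<unknown>;

export const installPackageHandler: RequestHandler<{ pkgkey: string }> = async (context, request, response) => {
  try {
    const result = await installPackage(request.params.pkgkey);
    return response.ok({ body: result });
  } catch (error) {
    // The descriptive message was created close to the failure; log it (and the stack trace) here,
    // because a plain 500 handed to the platform only shows the platform's own stack trace.
    logger.error(error.message);
    logger.error(error.stack ?? '');
    // Return the same message in the HTTP response, since cloud users can't read the Kibana log.
    return response.customError({
      statusCode: error.statusCode ?? 500,
      body: { message: error.message },
    });
  }
};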

@skh @jfsiii Creating a package policy calls ensureInstalledPackage, which calls installLatestPackage, which calls installPackage. It doesn't use the API, so it doesn't benefit from the error handling in installPackageHandler. This also occurs in /setup and a few other places. I mentioned this concern to @ruflin yesterday. Is there any reason why we can't call our API directly instead of installPackage?

@neptunian But there must be some route handler handling the request triggering the error, is there a reason why we can't add error handling to that handler?

@skh Yes, the handler is createPackagePolicyHandler.

We have all this error handling specific to installing a package in installPackageHandler, such as removing the package if the install failed or reinstalling the previous package if the update failed.
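Roughly, that rollback behaviour looks like this (a simplified sketch with hypothetical helpers, not the actual handler code):

declare function installPackage(pkgkey: string): Promise<void>;
declare function removeInstallation(pkgkey: string): Promise<void>;
declare function getInstallation(pkgName: string): Promise<{ version: string } | undefined>;

async function installWithRollback(pkgName: string, newVersion: string) {
  const previous = await getInstallation(pkgName);
  try {
    await installPackage(`${pkgName}-${newVersion}`);
  } catch (error) {
    if (previous) {
      // the update failed: reinstall the previously installed version
      await installPackage(`${pkgName}-${previous.version}`);
    } else {
      // the fresh install failed: remove the partially installed package
      await removeInstallation(`${pkgName}-${newVersion}`);
    }
    throw error;
  }
}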

@skh agreed, I have a WIP at #74409 which does that

@neptunian indeed, I've been in that code recently with #75372 and #74409.

We don't want to incur the cost of creating/handling those new connections. We may also need to pass requests around to access current user info, etc. It'd be a big change in approach. We can talk about it more outside this ticket if you want.

@mtojek I ran the mage -debug testIntegration command and got the failing result. Can you post the steps here, or find me on Slack/Zoom, so I can POST to that route and see the failure myself? Maybe we edit a test file?

Steps to reproduce it:

  1. Git checkout the branch behind the PR: https://github.com/elastic/package-storage/pull/370 (the problem is related to the AWS packages)
  2. mage build
  3. mage -debug testIntegration - this command spins up a stack with Kibana, ES and the package-registry.

Actually you don't need to do anything special; just serve the latest aws package and try to install it, even via the Kibana UI.

@mtojek thanks, those are the steps I took before posting. I can see the tests failing w/500 vs 200, but I don't know how to stand up the same stack the tests are using and run HTTP / ES requests.

_Update_: I just found https://github.com/elastic/package-storage/blob/master/testing/environments/README.md

I ran those docker compose commands and now I am executing requests. This will be very helpful!

Steps for me or anyone else following along:

  1. Git checkout the branch behind the PR: https://github.com/elastic/package-storage/pull/370 (the problem is related to the AWS packages)
  2. mage build
  3. cd testing/environments/
  4. docker-compose commands from the README:
    docker-compose -f snapshot.yml pull
    docker-compose -f snapshot.yml -f local.yml up --force-recreate
  5. in a separate window, run HTTP requests (or view Kibana at http://localhost:5601/, etc.)

I can now reproduce this error with

$ curl --user elastic:changeme -X POST http://localhost:5601/api/ingest_manager/epm/packages/aws-0.2.7 -H 'kbn-xsrf: xyz'
{"statusCode":500,"error":"Internal Server Error","message":"Bad Request"}

Now I'll try to run a local kibana branch like https://github.com/jfsiii/kibana/tree/ingest-default-error-handler against this local registry and do what I can to improve the error reporting. I'll report back later today (I'm afk for a few more hours)

Maybe I'm not afk :)

I ran that registry locally in my https://github.com/jfsiii/kibana/tree/ingest-default-error-handler branch

I ran this in one window

curl --user elastic:changeme -X POST http://localhost:5601/api/ingest_manager/epm/packages/aws-0.2.7 -H 'kbn-xsrf: xyz'
{"statusCode":500,"error":"Internal Server Error","message":"Bad Request"}

and saw this in the logs

server    log   [10:11:36.617] [error][ingestManager][plugins] Error: error deleting pipeline logs-aws.cloudtrail-0.2.7
    at deletePipeline (/Users/jfsiii/work/kibana/x-pack/plugins/ingest_manager/server/services/epm/elasticsearch/ingest_pipeline/remove.ts:63:13)
    at process._tickCallback (internal/process/next_tick.js:68:7)
server    log   [10:11:37.515] [error][ingestManager][plugins] { Error: Bad Request
    at respond (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/transport.js:349:15)
    at checkRespForFailure (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/transport.js:306:7)
    at HttpConnector.<anonymous> (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:173:7)
    at IncomingMessage.wrapper (/Users/jfsiii/work/kibana/node_modules/lodash/lodash.js:4949:19)
    at IncomingMessage.emit (events.js:203:15)
    at endReadableNT (_stream_readable.js:1145:12)
    at process._tickCallback (internal/process/next_tick.js:63:19)
  status: 400,
  displayName: 'BadRequest',
  message: 'Bad Request',
  path: '/_ingest/pipeline/logs-aws.cloudtrail-0.2.7',
  query: undefined,
  body:
   '---\nerror:\n  root_cause:\n  - type: "parse_exception"\n    reason: "Failed to parse content to map"\n  type: "parse_exception"\n  reason: "Failed to parse content to map"\n  caused_by:\n    type: "json_parse_exception"\n    reason: "Duplicate field \'ListGroupsForUser\'\\n at [Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper);\\\n      \\ line: 489, column: 26]"\nstatus: 400\n',
  statusCode: 400,
  response:
   '---\nerror:\n  root_cause:\n  - type: "parse_exception"\n    reason: "Failed to parse content to map"\n  type: "parse_exception"\n  reason: "Failed to parse content to map"\n  caused_by:\n    type: "json_parse_exception"\n    reason: "Duplicate field \'ListGroupsForUser\'\\n at [Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper);\\\n      \\ line: 489, column: 26]"\nstatus: 400\n',
  toString: [Function],
  toJSON: [Function] }
server   error  [10:11:31.963]  Error: Internal Server Error
    at HapiResponseAdapter.toError (/Users/jfsiii/work/kibana/src/core/server/http/router/response_adapter.ts:129:19)
    at HapiResponseAdapter.toHapiResponse (/Users/jfsiii/work/kibana/src/core/server/http/router/response_adapter.ts:79:19)
    at HapiResponseAdapter.handle (/Users/jfsiii/work/kibana/src/core/server/http/router/response_adapter.ts:74:17)
    at Router.handle (/Users/jfsiii/work/kibana/src/core/server/http/router/router.ts:273:34)
    at process._tickCallback (internal/process/next_tick.js:68:7)

Root cause aside, what status code and message would be most helpful/appropriate for the response to /api/ingest_manager/epm/packages/aws-0.2.7?

@leehinman Does this error message look familiar? ^^

I think I've seen this message before with packages that contained pipelines in YAML format that were not accepted by elasticsearch for formatting reasons. This can be tested by trying to add the pipelines manually with curl.

What I've seen is that elasticsearch insists on the YAML file starting with ---, but there are lots of ways to invalidate YAML so it might be something else.
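If anyone wants to test a pipeline manually without curl, here is a minimal sketch using Node 18+'s built-in fetch (the pipeline id and file path are just examples):

import { readFile } from 'node:fs/promises';

async function tryInstallPipeline() {
  // example path following the dataset/<dataset>/elasticsearch/ingest_pipeline layout
  const yamlBody = await readFile('dataset/cloudtrail/elasticsearch/ingest_pipeline/default.yml', 'utf8');
  const res = await fetch('http://localhost:9200/_ingest/pipeline/test-aws-cloudtrail', {
    method: 'PUT',
    headers: {
      authorization: 'Basic ' + Buffer.from('elastic:changeme').toString('base64'),
      'content-type': 'application/yaml', // ES accepts YAML request bodies when told so
      accept: 'application/json', // ask for the error body as JSON rather than YAML
    },
    body: yamlBody,
  });
  // a broken pipeline should come back as a 400 with a parse_exception, like the one in the logs above
  console.log(res.status, await res.text());
}

tryInstallPipeline();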

@jfsiii if you don't mind, I have assigned it to you.

For the status code, this does indeed seem to be a bad request (400), aka malformed request syntax. Could the elasticsearch error shown here be propagated back?

@ycombinator in the package-spec, are we validating pipelines in the YAML format?

in the package-spec, are we validating pipelines in the YAML format?

Not at the moment but we plan to. The spec does define constraints for YAML files in the dataset/<dataset>/elasticsearch/ingest_pipeline folder. We just haven't had the time yet to _implement the enforcement_ of these constraints.

That being said, elastic-package offers a format sub-command that should've failed parsing any YAML files as it tried to format them. And this command is automatically being run as part of CI on the elastic/integrations repo. So I wonder why this was not caught there. @mtojek any thoughts?

That being said, elastic-package offers a format sub-command that should've failed parsing any YAML files as it tried to format them.

I will verify this. Thanks for the heads up!

And this command is automatically being run as part of CI on the elastic/integrations repo. So I wonder why this was not caught there.

I wonder whether the PR was green at the time or was merged despite errors.

@jfsiii

Root cause aside, what status code and message would be most helpful/appropriate for the response to /api/ingest_manager/epm/packages/aws-0.2.7?

I would say the most meaningful one: parsing exception around ---.

EDIT:

To fully understand the problem: is the line --- incorrect in terms of YAML syntax? Doesn't it mark the start of a YAML document?

To fully understand the problem: is the line --- incorrect in terms of YAML syntax? Doesn't it mark the start of a YAML document?

I have not been able to find out if elasticsearch is overly picky here, but if I'm not completely mistaken it does reject pipelines without it, so if this is valid YAML it would be a bug in elasticsearch.
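For what it's worth, the --- document-start marker is optional in YAML itself; a quick check with js-yaml (just a convenient parser, not what elasticsearch uses internally) shows both forms parse to the same value:

import { load } from 'js-yaml';

const withMarker = '---\ndescription: test\nprocessors: []\n';
const withoutMarker = 'description: test\nprocessors: []\n';

// both log { description: 'test', processors: [] }, so '---' is not required by YAML syntax
console.log(load(withMarker));
console.log(load(withoutMarker));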

Please note that in Ingest Manager code on the Kibana side we don't look at the YAML at all. We can respond with "error adding pipeline to elasticsearch", but we will not know if the error was caused by a missing --- at the beginning of the file, or some other YAML formatting problem. All we see is a (not entirely useful) error message from elasticsearch (the one in https://github.com/elastic/kibana/issues/75862#issuecomment-680057522 )

@mtojek (and others) how about this:

$ curl --user elastic:changeme -X POST http://localhost:5601/api/ingest_manager/epm/packages/aws-0.2.7 -H 'kbn-xsrf: xyz' | jq
{
  "statusCode": 400,
  "error": "Bad Request",
  "message": "HTTP 400 response from ES at /_ingest/pipeline/logs-aws.elb_logs-0.2.7: [parse_exception] Failed to parse content to map"
}

Another alternative is:

$ curl --user elastic:changeme -X POST http://localhost:5601/api/ingest_manager/epm/packages/aws-0.2.7 -H 'kbn-xsrf: xyz' | jq
{
  "statusCode": 400,
  "error": "Bad Request",
  "message": "400 Bad Request from /_ingest/pipeline/logs-aws.cloudtrail-0.2.7. response body: ---\nerror:\n  root_cause:\n  - type: \"parse_exception\"\n    reason: \"Failed to parse content to map\"\n  type: \"parse_exception\"\n  reason: \"Failed to parse content to map\"\n  caused_by:\n    type: \"json_parse_exception\"\n    reason: \"Duplicate field 'ListGroupsForUser'\\n at [Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper);\\\n      \\ line: 489, column: 26]\"\nstatus: 400\n"
}

More info on the ES error responses

The ES client failure currently returns YAML inside the body string:

  body:
   '---\nerror:\n  root_cause:\n  - type: "parse_exception"\n    reason: "Failed to parse content to map"\n  type: "parse_exception"\n  reason: "Failed to parse content to map"\n  caused_by:\n    type: "json_parse_exception"\n    reason: "Duplicate field \'ListGroupsForUser\'\\n at [Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper);\\\n      \\ line: 489, column: 26]"\nstatus: 400\n',
  response:
   '---\nerror:\n  root_cause:\n  - type: "parse_exception"\n    reason: "Failed to parse content to map"\n  type: "parse_exception"\n  reason: "Failed to parse content to map"\n  caused_by:\n    type: "json_parse_exception"\n    reason: "Duplicate field \'ListGroupsForUser\'\\n at [Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper);\\\n      \\ line: 489, column: 26]"\nstatus: 400\n',
  toString: [Function],

I believe this is because ES considers YAML appropriate for the response, since we sent a YAML request. I added an Accept: 'application/json' header to the params and got a JSON blob for the error instead.

which put this in the Kibana logs:

server    log   [17:12:29.375] [error][ingestManager][plugins] { Error: [parse_exception] Failed to parse content to map
    at respond (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/transport.js:349:15)
    at checkRespForFailure (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/transport.js:306:7)
    at HttpConnector.<anonymous> (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:173:7)
    at IncomingMessage.wrapper (/Users/jfsiii/work/kibana/node_modules/lodash/lodash.js:4949:19)
    at IncomingMessage.emit (events.js:203:15)
    at endReadableNT (_stream_readable.js:1145:12)
    at process._tickCallback (internal/process/next_tick.js:63:19)
  status: 400,
  displayName: 'BadRequest',
  message: '[parse_exception] Failed to parse content to map',
  path: '/_ingest/pipeline/logs-aws.cloudtrail-0.2.7',
  query: undefined,
  body:
   { error:
      { root_cause: [Array],
        type: 'parse_exception',
        reason: 'Failed to parse content to map',
        caused_by: [Object] },
     status: 400 },
  statusCode: 400,
  response:
   '{"error":{"root_cause":[{"type":"parse_exception","reason":"Failed to parse content to map"}],"type":"parse_exception","reason":"Failed to parse content to map","caused_by":{"type":"json_parse_exception","reason":"Duplicate field \'ListGroupsForUser\'\\n at [Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 489, column: 26]"}},"status":400}',
  toString: [Function],
  toJSON: [Function] }

The first option is a bit vague:

  1. The user might not correlate the endpoint /_ingest/pipeline/logs-aws.elb_logs-0.2.7 with the correct ingest pipeline (there can be a few of them).
  2. There is no reference to the map that was invalid (the whole document, or some part of it?).

The second option gives more details regarding the exception's reason, which is ultimately the root cause (the field is duplicated). If it doesn't leak any confidential details, I would vote for this one. One more possible improvement would be providing the filename of the affected ingest pipeline.

Please note that in Ingest Manager code on the Kibana side we don't look at the YAML at all. We can respond with "error adding pipeline to elasticsearch", but we will not know if the error was caused by a missing --- at the beginning of the file, or some other YAML formatting problem. All we see is a (not entirely useful) error message from elasticsearch (the one in #75862 (comment) )

(that's rather a side issue)

@skh I'm not convinced it isn't a security threat, as the user might pass or upload a malformed document, which can blow up the ES instance. We plan to support installing .zip files in the future, which may contain huge but compressed files that can cause OOM errors. I believe there should be some basic restrictions (file size, pipeline format verification, etc.).

@mtojek This is touching several aspects.

I'm not convinced it isn't a security threat, as the user might pass or upload a malformed document, which can blow up the ES instance.

A user who can install packages has enough privileges to upload documents through the dev console, or just use curl. So, if an attacker has come to the point where they can install a maliciously crafted package, we're already in a lot of trouble. Still, I agree we should always try to prevent users from breaking their own setup, and give them helpful error messages.

We plan to support installing .zip files in the future, which may contain huge but compressed files that can cause OOM errors. I believe there should be some basic restrictions (file size, pipeline format verification, etc.).

Definitely. The feature to allow direct package upload is tracked here: https://github.com/elastic/kibana/issues/70582

pipeline format verification

The final authority on whether a pipeline (or any other elasticsearch asset) is valid is elasticsearch itself. I'm not sure how much effort we should put into validation code on the Kibana side, for the following reasons:

  • This is duplicate effort. I hope we can trust elasticsearch not to accept broken / invalid assets, so trying to install them, catching the errors, and reporting them back to the user in an understandable way is the best validation we can currently get.
  • Our validation code (should we add it) and elasticsearch's opinion on what is a valid asset and what isn't might diverge over time.

That said, basic validation that assets are valid JSON, YAML, etc. is of course useful. I'm talking more about the cases where the file syntax is valid, but the asset is still not acceptable to elasticsearch.

@mtojek

I added the full response body to the message. It's now ${error.message} from ES at ${error.path}: ${error.response};
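In code, that is roughly (a sketch over the legacy client's error shape, as seen in the log excerpts above):

// minimal typing of the fields we read off the legacy elasticsearch client error
interface LegacyESClientError extends Error {
  path?: string;
  response?: string;
  statusCode?: number;
}

function toUserFacingMessage(error: LegacyESClientError): string {
  return `${error.message} from ES at ${error.path}: ${error.response}`;
}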

In this example, that becomes:

> curl --user elastic:changeme -X POST http://localhost:5601/api/ingest_manager/epm/packages/aws-0.2.7 -H 'kbn-xsrf: xyz' | jq
{
  "statusCode": 400,
  "error": "Bad Request",
  "message": "[parse_exception] Failed to parse content to map from ES at /_ingest/pipeline/logs-aws.cloudtrail-0.2.7: {\"error\":{\"root_cause\":[{\"type\":\"parse_exception\",\"reason\":\"Failed to parse content to map\"}],\"type\":\"parse_exception\",\"reason\":\"Failed to parse content to map\",\"caused_by\":{\"type\":\"json_parse_exception\",\"reason\":\"Duplicate field 'ListGroupsForUser'\\n at [Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 489, column: 26]\"}},\"status\":400}"
}

The endpoint that had the failure (/_ingest/pipeline/logs-aws.cloudtrail-0.2.7 in this case) is the only info I have at this point (an error that made it up to the response handler). We also log the error as

server    log   [09:12:04.554] [error][ingestManager][plugins] { Error: [parse_exception] Failed to parse content to map
    at respond (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/transport.js:349:15)
    at checkRespForFailure (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/transport.js:306:7)
    at HttpConnector.<anonymous> (/Users/jfsiii/work/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:173:7)
    at IncomingMessage.wrapper (/Users/jfsiii/work/kibana/node_modules/lodash/lodash.js:4949:19)
    at IncomingMessage.emit (events.js:203:15)
    at endReadableNT (_stream_readable.js:1145:12)
    at process._tickCallback (internal/process/next_tick.js:63:19)
  status: 400,
  displayName: 'BadRequest',
  message: '[parse_exception] Failed to parse content to map',
  path: '/_ingest/pipeline/logs-aws.cloudtrail-0.2.7',
  query: undefined,
  body:
   { error:
      { root_cause: [Array],
        type: 'parse_exception',
        reason: 'Failed to parse content to map',
        caused_by: [Object] },
     status: 400 },
  statusCode: 400,
  response:
   '{"error":{"root_cause":[{"type":"parse_exception","reason":"Failed to parse content to map"}],"type":"parse_exception","reason":"Failed to parse content to map","caused_by":{"type":"json_parse_exception","reason":"Duplicate field \'ListGroupsForUser\'\\n at [Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 489, column: 26]"}},"status":400}',
  toString: [Function],
  toJSON: [Function] }

We could add some try/catches closer to the error in separate PRs, but the erroring endpoint & the response seem like a reasonable default.

@jfsiii The returned error message looks good, thanks! I don't have any other comments regarding phrasing, I think you can proceed.

What is the status of this issue, @jfsiii?

@ph in progress (see zube). It had to wait for https://github.com/elastic/kibana/pull/74409 to land.

I have the code that shows the error, and now I am working on the tests.

#76865 is ready for review

@mtojek & @kaiyan-sheng I developed it using the steps from this PR & the assets from elastic/package-storage#370, so it "should" work as discussed. I'm not sure how easy it is for you to test that PR, but let me know if you have any questions or comments.
