Kibana: [Ingest][EPM] Ingest Manager start lifecycle didn't finish in 30 secs

Created on 11 May 2020 · 24 comments · Source: elastic/kibana

(Screenshot: "Screen Shot 2020-05-11 at 7 29 39 AM")

Seen this a few times in Cloud QA, but not _every_ time. When it has happened, it's been the first deployment I did that day.

Labels: EPM, beta1, Ingest Management, bug

All 24 comments

Pinging @elastic/ingest-management (Feature:EPM)

I've got a branch where I just started logging actions and times so I can get a baseline before making changes.

I suspect we'll add a few `Promise.all`s and remove some `await`s, but we can discuss that in a PR.
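As a rough illustration of the kind of change being discussed (a minimal sketch only; the step names below are placeholders, not the actual Ingest Manager internals), independent setup steps that were awaited one after another can instead be kicked off together:

```ts
// Hypothetical setup steps, standing in for the real Ingest Manager routines.
const installDefaultPackages = async (): Promise<void> => { /* ... */ };
const ensureDefaultConfig = async (): Promise<void> => { /* ... */ };
const ensureDefaultOutput = async (): Promise<void> => { /* ... */ };

// Before: each step waits for the previous one to finish.
async function setupSerial() {
  await installDefaultPackages();
  await ensureDefaultConfig();
  await ensureDefaultOutput();
}

// After: steps that don't depend on each other run concurrently.
// Promise.all rejects as soon as any step fails, so errors still surface.
async function setupParallel() {
  await Promise.all([
    installDefaultPackages(),
    ensureDefaultConfig(),
    ensureDefaultOutput(),
  ]);
}
```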

It occurred to me that we could also use APM to get more information.

I will try that on a Cloud instance and report back.

@jfsiii is this still a valid issue?

@ph I think so, but hopefully not for long. @neptunian has two PRs which should help a lot (https://github.com/elastic/kibana/pull/67868 & https://github.com/elastic/kibana/pull/67893)

I just saw that https://github.com/elastic/kibana/issues/66301 was resolved, but we haven't checked in Cloud or with @mtojek, who saw it recently: https://github.com/elastic/kibana/issues/67743#issue-627136821

OK, I will assign it to @neptunian, assuming that the parallel changes will solve it.

@jfsiii @neptunian @nchaulet Looking at reports, this seems to still be an issue. Could we investigate another solution? I believe @nchaulet or @jfsiii mentioned background tasks?

Some info in case it's helpful: I recently encountered this after migrating to a new kibana index on our dev cluster; the /setup request is slow enough against that cluster that it always times out. Since the 30s timeout happens at the kibana level, I'm also not able to e.g. make that POST independently (via curl etc).

@ph I believe @skh also mentioned something similar. Thanks for adding this to this week's agenda


I also ran into the 30s timeout yesterday on two different MBPs. One of those failures was on master directly, and the other was a branch that was up-to-date with master.

@andrew-goldstein just to clarify: this only happens when Ingest Manager is "enabled", which it should be on the dev cluster?


Yes, I was pointing to the dev cluster, and I happened to also specify a new index name (that didn't yet exist) for the kibana.index setting in kibana.dev.yml.

Is there any workaround, or a way to extend the timeout? I've had to disable ingestManager (which disables SIEM) in order to get Kibana to load. When it times out, I'm seeing exceptions like this in ES. I'm running a snapshot build from https://github.com/elastic/kibana/commits/0493978a6e5a5c8155aea1b27a2aee7453e10ba1.

path: /_index_template/metrics-system.socket_summary, params: {name=metrics-system.socket_summary}
org.elasticsearch.transport.RemoteTransportException: [HOSTNAME][10.200.0.10:9300][indices:admin/index_template/put]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template-v2 [metrics-system.socket_summary], cause [api]) within 30s

I wonder if, as a temporary solution, we could run the setup async so it doesn't block Kibana?

@andrewkroh @andrew-goldstein @rylnd Can you give any info about the memory and cpu on those systems?

We saw a timeout in another issue that seemed related to memory in the Docker container: https://github.com/elastic/kibana/issues/66301

https://github.com/elastic/kibana/pull/67868 & https://github.com/elastic/kibana/pull/67893 were mostly about reducing HTTP calls and doing things in parallel vs. serially in a few spots. They should have improved memory & CPU somewhat, but that wasn't their goal.

I'm not sure about your workflow, but can you make a note to follow up here after https://github.com/elastic/kibana/pull/68221 lands on your system? That PR does have big mem & CPU improvements. I'd love to know if they resolve your issue.

We don't have traces or any telemetry to check for these failures (even in Cloud), but I wonder if we're seeing a failure mode we didn't expect/test/encounter rather than simply a slower version of the happy path. There seems to be some overlap with new/missing indices.

@neptunian Perhaps an ES (or some other) connection is erroring and we don't capture that? There are certainly some subtleties around managing Promise errors.
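As a generic illustration of the kind of subtlety meant here (not code from the plugin): a rejection from a promise that is started but never awaited or given a `.catch` handler doesn't surface where you might expect; in Node it only shows up later as an unhandled-rejection warning.

```ts
// Generic example of a swallowed promise error (illustrative only).
async function connectToEs(): Promise<void> {
  throw new Error('connection refused');
}

function startSetup() {
  // Fire-and-forget: the rejection is never caught here, so this function
  // appears to succeed even though connectToEs failed. Node only emits an
  // unhandledRejection warning later.
  connectToEs();
}

function startSetupSafely(): Promise<void> {
  // Returning (or awaiting) the promise lets callers observe the failure.
  return connectToEs().catch((err) => {
    console.error('setup step failed', err);
    throw err;
  });
}

startSetup();
startSetupSafely().catch(() => { /* caller decides how to handle it */ });
```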

@ruflin If async is possible I'd love to move to it long-term; let's not block Kibana if we don't need to. However, right now all the services are guaranteed that the code in setupIngestManager has run and succeeded: things like installing default packages, configs, indices, etc. We might be able to break these up and use something like background tasks to track them, but I would prefer not to do that before alpha. I think we need to fix this issue before alpha, so I'd like to try something less risky to start. As I wrote above, I'm curious about the similarities around the index change.

You could still track the status using a promise, just not let that promise block all of Kibana.

One way would be to turn the success property in your API into a promise which a consumer needs to await before calling registerDatasource. But it's probably better to make registerDatasource an async function that internally first awaits setupIngestManager before performing the registration and returning.
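A minimal sketch of that second option, under a deliberately simplified shape (the names and types below are placeholders, not the actual Ingest Manager plugin contract): the plugin keeps the in-flight setup promise and `registerDatasource` awaits it internally before doing its work.

```ts
// Simplified sketch; not the real Ingest Manager contract.
type Datasource = { name: string };

let setupIngestManagerPromise: Promise<void> | undefined;

function ensureSetup(): Promise<void> {
  // Kick off setup once and reuse the same promise for later callers.
  if (!setupIngestManagerPromise) {
    setupIngestManagerPromise = setupIngestManager();
  }
  return setupIngestManagerPromise;
}

async function registerDatasource(datasource: Datasource): Promise<void> {
  // Consumers no longer await a separate `success` promise themselves;
  // the registration waits for setup internally.
  await ensureSetup();
  // ...perform the actual registration here...
}

// Placeholder for the real setup routine.
async function setupIngestManager(): Promise<void> { /* ... */ }
```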

@rudolf good suggestion. Thanks! registerDatasource isn't the only thing that depends on the function succeeding, but storing the setup request as a promise that we can await in different places is helpful.

I can look into this or yield to anyone else who's interested/available


@rylnd were you ever able to reproduce this scenario and see if it happens again?

@jfsiii Looking at https://github.com/elastic/kibana/blob/1af927aacaf5d4f2b73877ae0f256fe2f98d7298/x-pack/plugins/ingest_manager/server/services/setup.ts#L33 I think there isn't anything in there which should be blocking for Kibana, only for using Ingest Manager and Endpoint. My current thinking is to trigger all of it to run in the background and only block users/APIs from using Ingest Manager and Endpoint until it has completed. I expect the setup to be idempotent, which means that even if we didn't see an error and need to run it again, things should still be OK.

@ruflin agreed. I started a branch last night where we store the promise and should have a PR to discuss later today

> I think there isn't anything in there which should be blocking for Kibana

👍 we can remove the `await`, which will kick off the request and stop us from blocking Kibana and causing the error shown in the description.

> (only blocking for) Ingest Manager and Endpoint. My current thinking is to trigger all of it to run in the background and only block users/APIs from using Ingest Manager and Endpoint until it has completed

That's not how it works today

More details:

There's an unknown amount of time before setupIngestManager is called, and in an API/CLI-only workflow the app might never be opened at all. That's why some scripts get around this by calling /api/ingest_manager/setup directly.

Today, most handlers do nothing to check if Ingest Manager setup has completed. The Fleet setup handler is the only handler or service which does.

The Endpoint UI will display a notification if Ingest Manager had an error on setup, but there's no guarantee that setupIngestManager has run.

In summary, I think we can fix this issue by removing the await. That might highlight some places in Ingest Manager which need an explicit await setupIngestManager, or some other way to deal with the dependency, which is what I'm exploring in https://github.com/elastic/kibana/pull/68631
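To make the shape of that fix concrete, here is a minimal sketch under assumed names (the real Kibana plugin lifecycle types and route handlers differ): setup is kicked off in start without awaiting it, the promise is kept, and it is awaited only where a handler or service actually depends on setup having finished.

```ts
// Sketch of "remove the await" plus explicit awaits where needed.
// Names are illustrative; not the actual plugin code.
let setupPromise: Promise<void> | undefined;

async function setupIngestManager(): Promise<void> {
  // ...install default packages, configs, index templates, etc...
}

// Plugin start: trigger setup but don't block Kibana on it.
function start() {
  setupPromise = setupIngestManager();
  // Log failures so they aren't silent; handlers that await the stored
  // promise will still observe the rejection themselves.
  setupPromise.catch((err) => console.error('Ingest Manager setup failed', err));
}

// A route handler or service that genuinely depends on setup:
async function someIngestManagerHandler() {
  if (setupPromise) {
    await setupPromise; // wait for setup (or re-throw its error) first
  }
  // ...handle the request...
}
```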

I opened https://github.com/elastic/kibana/pull/69089 to avoid blocking Kibana in the start method.
