Kibana: Kibana throws esaggs error when timeout is hit

Created on 11 Feb 2020  Â·  18Comments  Â·  Source: elastic/kibana

Kibana version: 7.6.0 BC4

Elasticsearch version: 7.6.0 BC4

Server OS version: ECK 1.0.0GA on GKE 1.13.11-gke.14

Browser version: Version 80.0.3987.87 (Official Build) (64-bit)

Browser OS version: MacOS 10.14.6

Original install method (e.g. download page, yum, from source, etc.): docker image

Describe the bug:

When viewing a dashboard, Kibana becomes unusable for all users for around 10-20 minutes. It eventually recovers. Link: https://kibana.greysector.inf.elasticnet.co/app/kibana#/dashboard/96cc3580-f69b-11e9-8344-2f4cf656fceb

Dashboard object: https://github.com/elastic/endgame-evrelay/blob/dev/etc/kibana-dashboard.ndjson

Steps to reproduce:

  1. Click on the above dashboard link
  2. Kibana becomes unusable for all users for 10-20 minutes

Expected behavior:

  1. Kibana is still usable.

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

I 2020-02-07T18:29:35.617340561Z POST /api/saved_objects/_bulk_get 200 79ms - 9.0B I 2020-02-07T18:29:35.995383829Z Request validation error: child "panels" fails because ["panels" at position 0 fails because [child "series" fails because ["series" at position 0 fails because [child "fill" fails because ["fill" must be a number]]]]] (saved object id: 74cf4680-f66d-11e9-8344-2f4cf656fceb). This most likely means your TSVB visualization contains outdated configuration. You can report this problem under https://github.com/elastic/kibana/issues/new?template=Bug_report.md I 2020-02-07T18:29:43.498397891Z GET /login 200 16ms - 9.0B I 2020-02-07T18:29:51.187508719Z POST /api/ui_metric/report 200 8953ms - 9.0B I 2020-02-07T18:29:53.495998306Z GET /login 200 15ms - 9.0B I 2020-02-07T18:30:03.505812060Z GET /login 200 23ms - 9.0B E 2020-02-07T18:30:06.175500400Z Debug: internal, implementation, error E 2020-02-07T18:30:06.175561888Z SyntaxError: Unexpected token u in JSON at position 0 at JSON.parse (<anonymous>) at server.route.handler (/usr/share/kibana/src/legacy/core_plugins/elasticsearch/lib/create_proxy.js:85:21) at process._tickCallback (internal/process/next_tick.js:68:7) I 2020-02-07T18:30:06.175845815Z Unexpected token u in JSON at position 0

TSVB error appears to be unrelated? @stacey-gammon ?
Any additional context:

This dashboard is against data from the Endgame (Elastic Endpoint Security) product. @jonathan-buttner is a good POC on the data involved. @stiltz and @ajrios33 can provide access to the Kibana involved.

Search AppServices bug medium hours triaged

All 18 comments

Pinging @elastic/kibana-app (Team:KibanaApp)

Some more context from slack convo:

Only is triggered during certain date ranges:

  • I can do Nov. 3 to Nov. 9 but Nov. 3 to Nov. 16 fails
  • There wasn't a combination of visualizations I removed that made it work.

On that last point, @stiltz, if you remove _all_ visualizations, it works, right? At that point there should be no searches being sent out. If you pare the dashboard down to a single visualization (non-TSVB), does it still occur, over the same time range of Nov. 3rd to Nov 16th?

Can you do that and send the new logs over if it still occurs? I want to see if you still get the same error Unexpected token u in JSON at position 0 at JSON.parse and want to make sure it's not a TSVB issue, since those visualizations have more Kibana server logic. If it's blocking usage for all users it has to be bogging down Kibana server or ES server. When this happens can you query ES and get back responses in a timely manner?

On that last point, @stiltz, if you remove _all_ visualizations, it works, right? At that point there should be no searches being sent out.

Correct - no issue when I remove all the panels.

If you pare the dashboard down to a single visualization (non-TSVB), does it still occur, over the same time range of Nov. 3rd to Nov 16th?

I removed all but the Hosts with Highest Event Count which is a vertical bar chart and set the date range for Nov 3 to Nov 16 and the error occurred. I saw the popup with this error Unexpected token u in JSON at position 0 at JSON.parse . I don't see it in the kibana logs from GCP though.

Can you do that and send the new logs over if it still occurs?
k8s_cluster_location_us-central1-a_cluster_name_grey-sector-v2__logs__2020-02-11T11-12.csv.zip. GCP doesn't allow for great export so if you want it in a different format we'll have to have @ajrios33 grab them for us.

I want to see if you still get the same error Unexpected token u in JSON at position 0 at JSON.parse and want to make sure it's not a TSVB issue, since those visualizations have more Kibana server logic. If it's blocking usage for all users it has to be bogging down Kibana server or ES server. When this happens can you query ES and get back responses in a timely manner?

This interrupts usage for about 10 minutes sometimes but everything at the Elasticsearch layer continues to operate. Sometimes it resolves more quickly.

Some investigation:

Dashboard with just the metric Endpoints currently streaming pointing to endgame-*works with 24hr range over:

  • Nov 9, 10, 11, 12, 13, 14, 15, 16, 17,
  • Entire range worked Nov 8th - Nov17th

With a dashboard with only Events by type (visualize bar chart), worked:

  • Nov 9, 10, 11, 12, 13 (got slower here - much more data), 15, 16, 17
  • Each 24 hr range with 13, 14, 15, 16, is quite slow, but loaded.
  • Each 24hr range had 180million hits
  • Nov 16 - 18th -> worked (354mill hits)
  • Nov 15th - 18th -> worked
  • Nov 14th - 18th -> worked (900mill hits)
  • Nov 8th - 18th -> worked

ES is probably caching the 24hr requests, making subsequent requests faster, which might be why we can't repo

  • Nov 19th - 26th, full week of data that we didn't search for earlier -> Hit it!!

    [essaggs] > Unexpected token u in JSON at position 0 at JSON.parse
    ... I suspect ESaggs might not be handling the timeout response correctly?

A refresh of the page caused Kibana to load as usual, so this didn't bog down the Kibana server.

We tried to bog down the Kibana server - went back to the full dashboard, queried "Oct 26th - Now". Esaggs error for every panel, but didn't see the gateway timeout errors again.

Pinging @elastic/kibana-app-arch (Team:AppArch)

Yea, I can repro the error by dropping the request timeout to like 500ms.

Screen Shot 2020-02-12 at 10 43 18 AM

@ppisljar and me investigated a similar (or maybe the same problem) a way back: https://github.com/elastic/kibana/issues/50275

I believe the problem here is that we're hitting the elasticsearch.requestTimeout (not to be confused with the shardTimeout which corresponds to the timeout Elasticsearch _search parameter). When that timeout is hit, the format of the error is different than when we get back an Elasticsearch error, and we're trying to JSON.parse a parameter on the error that doesn't exist (i.e. is undefined), which is why we're getting that specific error.

I am using 7.6, and noticed I was also getting this issue as well. Noticed the network requests are cancelled after 30 seconds, which is the default for kibana server. I dunno who is the dev working on the ticket, but I would love a better timeout message to give insight to user.

We are having the same issue, except the Unexpected token u in JSON at position 0 part. See below.

If I recall correctly, it started happening after we migrated from 7.5.2 to 7.6.0. Though, I will do a rollback and report here if it's true. The dashboard I'm getting this is the Kubernetes Overview dashboard of Metricbeat. When I enable polling/refreshing for the dashboard, it'll lock after raising these errors and stop refreshing.

Not sure how to tackle this, since our logging dashboard works fine with refresh in any rate.

2020-02-24T08:02:00.409Z    {"type":"log","@timestamp":"2020-02-24T08:02:00Z","tags":["warning","plugins","metrics","visTypeTimeseries"],"pid":6,"message":"Request validation error: child \"panels\" fails because [\"panels\" at position 0 fails because [child \"series\" fails because [\"series\" at position 0 fails because [child \"metrics\" fails because [\"metrics\" at position 2 fails because [\"unit\" is not allowed]]]]]] (saved object id: a4c9d360-30df-11e7-8df8-6d3604a72912-ecs). This most likely means your TSVB visualization contains outdated configuration. You can report this problem under https://github.com/elastic/kibana/issues/new?template=Bug_report.md"}
2020-02-24T08:02:00.270Z    {"type":"log","@timestamp":"2020-02-24T08:02:00Z","tags":["warning","plugins","metrics","visTypeTimeseries"],"pid":6,"message":"Request validation error: child \"panels\" fails because [\"panels\" at position 0 fails because [child \"series\" fails because [\"series\" at position 0 fails because [child \"metrics\" fails because [\"metrics\" at position 1 fails because [\"unit\" is not allowed]]]]]] (saved object id: 58e644f0-30d6-11e7-8df8-6d3604a72912-ecs). This most likely means your TSVB visualization contains outdated configuration. You can report this problem under https://github.com/elastic/kibana/issues/new?template=Bug_report.md"}
2020-02-24T08:02:00.923Z    {"type":"log","@timestamp":"2020-02-24T08:02:00Z","tags":["warning","plugins","metrics","visTypeTimeseries"],"pid":6,"message":"Request validation error: child \"panels\" fails because [\"panels\" at position 0 fails because [child \"series\" fails because [\"series\" at position 0 fails because [child \"fill\" fails because [\"fill\" must be a number]]]]] (saved object id: 022a54c0-2bf5-11e7-859b-f78b612cde28-ecs). This most likely means your TSVB visualization contains outdated configuration. You can report this problem under https://github.com/elastic/kibana/issues/new?template=Bug_report.md"}

...
Error goes on for all the visualizations on the dashboard

The request validation errors are unrelated to the timeout problem and can be ignored as they are not having any negative effect (besides the message itself) - they will be fixed in 7.6.1

I have noticed that as more machines are allocated or search is improved,
it will not occur. I am interested in knowing if it is a recent thing.
I've also increased timeouts on kibana to the point where it is elastic
which times out. I think the magic number was when I increase from 30 sec
to 3000 sec.

Not sure if it is a recent addition? Maybe timeout limits changed by
default for 7.5.2 to 7.6.0

On Mon, Feb 24, 2020, 13:42 Oguzhan Ozdemir notifications@github.com
wrote:

We are having the same issue, except the Unexpected token u in JSON at
position 0 part. See below.

If I recall correctly, it started happening after we migrated from 7.5.2
to 7.6.0. Though, I will do a rollback and report here if it's true. The
dashboard I'm getting this is the Kubernetes Overview dashboard of
Metricbeat. When I enable polling/refreshing for the dashboard, it'll
lock after raising these errors and stop refreshing.

Not sure how to tackle this, since our logging dashboard works fine with
refresh in any rate.

2020-02-24T08:02:00.409Z {"type":"log","@timestamp":"2020-02-24T08:02:00Z","tags":["warning","plugins","metrics","visTypeTimeseries"],"pid":6,"message":"Request validation error: child "panels" fails because ["panels" at position 0 fails because [child "series" fails because ["series" at position 0 fails because [child "metrics" fails because ["metrics" at position 2 fails because ["unit" is not allowed]]]]]] (saved object id: a4c9d360-30df-11e7-8df8-6d3604a72912-ecs). This most likely means your TSVB visualization contains outdated configuration. You can report this problem under https://github.com/elastic/kibana/issues/new?template=Bug_report.md"}
2020-02-24T08:02:00.270Z {"type":"log","@timestamp":"2020-02-24T08:02:00Z","tags":["warning","plugins","metrics","visTypeTimeseries"],"pid":6,"message":"Request validation error: child "panels" fails because ["panels" at position 0 fails because [child "series" fails because ["series" at position 0 fails because [child "metrics" fails because ["metrics" at position 1 fails because ["unit" is not allowed]]]]]] (saved object id: 58e644f0-30d6-11e7-8df8-6d3604a72912-ecs). This most likely means your TSVB visualization contains outdated configuration. You can report this problem under https://github.com/elastic/kibana/issues/new?template=Bug_report.md"}
2020-02-24T08:02:00.923Z {"type":"log","@timestamp":"2020-02-24T08:02:00Z","tags":["warning","plugins","metrics","visTypeTimeseries"],"pid":6,"message":"Request validation error: child "panels" fails because ["panels" at position 0 fails because [child "series" fails because ["series" at position 0 fails because [child "fill" fails because ["fill" must be a number]]]]] (saved object id: 022a54c0-2bf5-11e7-859b-f78b612cde28-ecs). This most likely means your TSVB visualization contains outdated configuration. You can report this problem under https://github.com/elastic/kibana/issues/new?template=Bug_report.md"}

...
Error goes on for all the visualizations on the dashboard

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/elastic/kibana/issues/57327?email_source=notifications&email_token=AACM7Z23HOJCFZGLDYHTRSLREN6OZA5CNFSM4KTDNHEKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMW5N7Q#issuecomment-590206718,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AACM7Z5PZU473QZ2FZPSY3LREN6OZANCNFSM4KTDNHEA
.

@flash1293 I see, thanks. That's good to know. One negative effect I see though is auto-refresh stops working, which I thought might be related to timeout issue. Anyway, if 7.6.1 is on the way with a fix I'm not gonna spent time on a rollback for testing and move on until the fix come out.

@aoozdemir The auto-refresh bug is an independent bug and will also be fixed in 7.6.1 - seems like you are hitting them all.

@flash1293 Ah, that's unfortunate. I guess today is not my lucky day :). Thanks.

I believe this has been fixed, as I'm now properly seeing a "Request timed out" error message, and it seems like the Kibana server doesn't have any issues. If you're still seeing this error, feel free to re-open.

image

@lukasolson I have a number of customers with this problem. Glad to see it's fixed, but I am not clear which versions the fix is in. Can you clarify? Thanks!

I believe it was fixed in 7.7 or 7.7.1.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

spalger picture spalger  Â·  3Comments

MaartenUreel picture MaartenUreel  Â·  3Comments

celesteking picture celesteking  Â·  3Comments

stacey-gammon picture stacey-gammon  Â·  3Comments

socialmineruser1 picture socialmineruser1  Â·  3Comments